* [PATCH] sched/fair: check for idle core
@ 2020-10-20 16:37 Julia Lawall
  2020-10-21  7:29 ` Vincent Guittot
  2020-10-21 11:20 ` Mel Gorman
  0 siblings, 2 replies; 68+ messages in thread

From: Julia Lawall @ 2020-10-20 16:37 UTC (permalink / raw)
To: Ingo Molnar
Cc: kernel-janitors, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
    Gilles Muller

On a thread wakeup, the change [1] from runnable load average to load
average for comparing candidate cores means that recent short-running
daemons on the core where a thread ran previously can be considered to
have a higher load than the core performing the wakeup, even when the
core where the thread ran previously is currently idle.  This can
cause a thread to migrate, taking the place of some other thread that
is about to wake up, and so on.  To avoid unnecessary migrations,
extend wake_affine_idle to check whether the core where the thread
previously ran is currently idle, and if so return that core as the
target.

[1] commit 11f10e5420f6ce ("sched/fair: Use load instead of runnable
load in wakeup path")

This particularly has an impact when using passive (intel_cpufreq)
power management, where kworkers run every 0.004 seconds on all cores,
increasing the likelihood that an idle core will be considered to have
a load.

The following numbers were obtained with the benchmarking tool
hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
benchmarks (https://www.nas.nasa.gov/publications/npb.html).  The
tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
2.10GHz.  Active (intel_pstate) and passive (intel_cpufreq) power
management were used.  Times are in seconds.  All experiments use all
160 hardware threads.

        v5.9/active           v5.9+patch/active
bt.C.c  24.725724+-0.962340   23.349608+-1.607214
lu.C.x  29.105952+-4.804203   25.249052+-5.561617
sp.C.x  31.220696+-1.831335   30.227760+-2.429792
ua.C.x  26.606118+-1.767384   25.778367+-1.263850

        v5.9/passive          v5.9+patch/passive
bt.C.c  25.330360+-1.028316   23.544036+-1.020189
lu.C.x  35.872659+-4.872090   23.719295+-3.883848
sp.C.x  32.141310+-2.289541   29.125363+-0.872300
ua.C.x  29.024597+-1.667049   25.728888+-1.539772

On the smaller data sets (A and B) and on the other NAS benchmarks
there is no impact on performance.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>

---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa4c6227cd6d..9b23dad883ee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5804,6 +5804,9 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	if (sync && cpu_rq(this_cpu)->nr_running == 1)
 		return this_cpu;

+	if (available_idle_cpu(prev_cpu))
+		return prev_cpu;
+
 	return nr_cpumask_bits;
 }

^ permalink raw reply related	[flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core
  2020-10-20 16:37 [PATCH] sched/fair: check for idle core Julia Lawall
@ 2020-10-21  7:29 ` Vincent Guittot
  2020-10-21 11:13   ` Peter Zijlstra
  2020-10-21 12:27   ` Vincent Guittot
  2020-10-21 11:20 ` Mel Gorman
  1 sibling, 2 replies; 68+ messages in thread

From: Vincent Guittot @ 2020-10-21  7:29 UTC (permalink / raw)
To: Julia Lawall
Cc: Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
    Gilles Muller

Hi Julia,

On Tue, 20 Oct 2020 at 19:21, Julia Lawall <Julia.Lawall@inria.fr> wrote:
>
> On a thread wakeup, the change [1] from runnable load average to load
> average for comparing candidate cores means that recent short-running
> daemons on the core where a thread ran previously can be considered to
> have a higher load than the core performing the wakeup, even when the
> core where the thread ran previously is currently idle.  This can
> cause a thread to migrate, taking the place of some other thread that
> is about to wake up, and so on.  To avoid unnecessary migrations,
> extend wake_affine_idle to check whether the core where the thread
> previously ran is currently idle, and if so return that core as the
> target.
>
> [1] commit 11f10e5420f6ce ("sched/fair: Use load instead of runnable
> load in wakeup path")
>
> This particularly has an impact when using passive (intel_cpufreq)
> power management, where kworkers run every 0.004 seconds on all cores,
> increasing the likelihood that an idle core will be considered to have
> a load.
>
> The following numbers were obtained with the benchmarking tool
> hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
> benchmarks (https://www.nas.nasa.gov/publications/npb.html).  The
> tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
> 2.10GHz.  Active (intel_pstate) and passive (intel_cpufreq) power
> management were used.  Times are in seconds.  All experiments use all
> 160 hardware threads.
>
>         v5.9/active           v5.9+patch/active
> bt.C.c  24.725724+-0.962340   23.349608+-1.607214
> lu.C.x  29.105952+-4.804203   25.249052+-5.561617
> sp.C.x  31.220696+-1.831335   30.227760+-2.429792
> ua.C.x  26.606118+-1.767384   25.778367+-1.263850
>
>         v5.9/passive          v5.9+patch/passive
> bt.C.c  25.330360+-1.028316   23.544036+-1.020189
> lu.C.x  35.872659+-4.872090   23.719295+-3.883848
> sp.C.x  32.141310+-2.289541   29.125363+-0.872300
> ua.C.x  29.024597+-1.667049   25.728888+-1.539772
>
> On the smaller data sets (A and B) and on the other NAS benchmarks
> there is no impact on performance.
>
> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

>
> ---
>  kernel/sched/fair.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aa4c6227cd6d..9b23dad883ee 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5804,6 +5804,9 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>  	if (sync && cpu_rq(this_cpu)->nr_running == 1)
>  		return this_cpu;
>
> +	if (available_idle_cpu(prev_cpu))
> +		return prev_cpu;
> +
>  	return nr_cpumask_bits;
>  }
>

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core
  2020-10-21  7:29 ` Vincent Guittot
@ 2020-10-21 11:13   ` Peter Zijlstra
  2020-10-21 12:27   ` Vincent Guittot
  1 sibling, 0 replies; 68+ messages in thread

From: Peter Zijlstra @ 2020-10-21 11:13 UTC (permalink / raw)
To: Vincent Guittot
Cc: Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
    Gilles Muller

On Wed, Oct 21, 2020 at 09:29:46AM +0200, Vincent Guittot wrote:
> Hi Julia,
>
> On Tue, 20 Oct 2020 at 19:21, Julia Lawall <Julia.Lawall@inria.fr> wrote:
> >
> > On a thread wakeup, the change [1] from runnable load average to load
> > average for comparing candidate cores means that recent short-running
> > daemons on the core where a thread ran previously can be considered to
> > have a higher load than the core performing the wakeup, even when the
> > core where the thread ran previously is currently idle.  This can
> > cause a thread to migrate, taking the place of some other thread that
> > is about to wake up, and so on.  To avoid unnecessary migrations,
> > extend wake_affine_idle to check whether the core where the thread
> > previously ran is currently idle, and if so return that core as the
> > target.
> >
> > [1] commit 11f10e5420f6ce ("sched/fair: Use load instead of runnable
> > load in wakeup path")
> >
> > This particularly has an impact when using passive (intel_cpufreq)
> > power management, where kworkers run every 0.004 seconds on all cores,
> > increasing the likelihood that an idle core will be considered to have
> > a load.
> >
> > The following numbers were obtained with the benchmarking tool
> > hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
> > benchmarks (https://www.nas.nasa.gov/publications/npb.html).  The
> > tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
> > 2.10GHz.  Active (intel_pstate) and passive (intel_cpufreq) power
> > management were used.  Times are in seconds.  All experiments use all
> > 160 hardware threads.
> >
> >         v5.9/active           v5.9+patch/active
> > bt.C.c  24.725724+-0.962340   23.349608+-1.607214
> > lu.C.x  29.105952+-4.804203   25.249052+-5.561617
> > sp.C.x  31.220696+-1.831335   30.227760+-2.429792
> > ua.C.x  26.606118+-1.767384   25.778367+-1.263850
> >
> >         v5.9/passive          v5.9+patch/passive
> > bt.C.c  25.330360+-1.028316   23.544036+-1.020189
> > lu.C.x  35.872659+-4.872090   23.719295+-3.883848
> > sp.C.x  32.141310+-2.289541   29.125363+-0.872300
> > ua.C.x  29.024597+-1.667049   25.728888+-1.539772
> >
> > On the smaller data sets (A and B) and on the other NAS benchmarks
> > there is no impact on performance.
> >
> > Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

Thanks!

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core
  2020-10-21  7:29 ` Vincent Guittot
  2020-10-21 11:13   ` Peter Zijlstra
@ 2020-10-21 12:27   ` Vincent Guittot
  1 sibling, 0 replies; 68+ messages in thread

From: Vincent Guittot @ 2020-10-21 12:27 UTC (permalink / raw)
To: Julia Lawall
Cc: Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
    Gilles Muller

On Wed, 21 Oct 2020 at 09:29, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> Hi Julia,
>
> On Tue, 20 Oct 2020 at 19:21, Julia Lawall <Julia.Lawall@inria.fr> wrote:
> >
> > On a thread wakeup, the change [1] from runnable load average to load
> > average for comparing candidate cores means that recent short-running
> > daemons on the core where a thread ran previously can be considered to
> > have a higher load than the core performing the wakeup, even when the
> > core where the thread ran previously is currently idle.  This can
> > cause a thread to migrate, taking the place of some other thread that
> > is about to wake up, and so on.  To avoid unnecessary migrations,
> > extend wake_affine_idle to check whether the core where the thread
> > previously ran is currently idle, and if so return that core as the
> > target.
> >
> > [1] commit 11f10e5420f6ce ("sched/fair: Use load instead of runnable
> > load in wakeup path")
> >
> > This particularly has an impact when using passive (intel_cpufreq)
> > power management, where kworkers run every 0.004 seconds on all cores,
> > increasing the likelihood that an idle core will be considered to have
> > a load.
> >
> > The following numbers were obtained with the benchmarking tool
> > hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
> > benchmarks (https://www.nas.nasa.gov/publications/npb.html).  The
> > tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
> > 2.10GHz.  Active (intel_pstate) and passive (intel_cpufreq) power
> > management were used.  Times are in seconds.  All experiments use all
> > 160 hardware threads.
> >
> >         v5.9/active           v5.9+patch/active
> > bt.C.c  24.725724+-0.962340   23.349608+-1.607214
> > lu.C.x  29.105952+-4.804203   25.249052+-5.561617
> > sp.C.x  31.220696+-1.831335   30.227760+-2.429792
> > ua.C.x  26.606118+-1.767384   25.778367+-1.263850
> >
> >         v5.9/passive          v5.9+patch/passive
> > bt.C.c  25.330360+-1.028316   23.544036+-1.020189
> > lu.C.x  35.872659+-4.872090   23.719295+-3.883848
> > sp.C.x  32.141310+-2.289541   29.125363+-0.872300
> > ua.C.x  29.024597+-1.667049   25.728888+-1.539772
> >
> > On the smaller data sets (A and B) and on the other NAS benchmarks
> > there is no impact on performance.
> >

A fix tag is missing:

Fixes: 11f10e5420f6 ("sched/fair: Use load instead of runnable load in wakeup path")

> > Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> >
> > ---
> >  kernel/sched/fair.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index aa4c6227cd6d..9b23dad883ee 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5804,6 +5804,9 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> >  	if (sync && cpu_rq(this_cpu)->nr_running == 1)
> >  		return this_cpu;
> >
> > +	if (available_idle_cpu(prev_cpu))
> > +		return prev_cpu;
> > +
> >  	return nr_cpumask_bits;
> >  }
> >

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core
  2020-10-20 16:37 [PATCH] sched/fair: check for idle core Julia Lawall
  2020-10-21  7:29 ` Vincent Guittot
@ 2020-10-21 11:20 ` Mel Gorman
  2020-10-21 11:56   ` Julia Lawall
  2020-10-21 12:25   ` Vincent Guittot
  1 sibling, 2 replies; 68+ messages in thread

From: Mel Gorman @ 2020-10-21 11:20 UTC (permalink / raw)
To: Julia Lawall
Cc: Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli,
    Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
    Gilles Muller

On Tue, Oct 20, 2020 at 06:37:59PM +0200, Julia Lawall wrote:
> On a thread wakeup, the change [1] from runnable load average to load
> average for comparing candidate cores means that recent short-running
> daemons on the core where a thread ran previously can be considered to
> have a higher load than the core performing the wakeup, even when the
> core where the thread ran previously is currently idle.  This can
> cause a thread to migrate, taking the place of some other thread that
> is about to wake up, and so on.  To avoid unnecessary migrations,
> extend wake_affine_idle to check whether the core where the thread
> previously ran is currently idle, and if so return that core as the
> target.
>
> [1] commit 11f10e5420f6ce ("sched/fair: Use load instead of runnable
> load in wakeup path")
>
> This particularly has an impact when using passive (intel_cpufreq)
> power management, where kworkers run every 0.004 seconds on all cores,
> increasing the likelihood that an idle core will be considered to have
> a load.
>
> The following numbers were obtained with the benchmarking tool
> hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
> benchmarks (https://www.nas.nasa.gov/publications/npb.html).  The
> tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
> 2.10GHz.  Active (intel_pstate) and passive (intel_cpufreq) power
> management were used.  Times are in seconds.  All experiments use all
> 160 hardware threads.
>
>         v5.9/active           v5.9+patch/active
> bt.C.c  24.725724+-0.962340   23.349608+-1.607214
> lu.C.x  29.105952+-4.804203   25.249052+-5.561617
> sp.C.x  31.220696+-1.831335   30.227760+-2.429792
> ua.C.x  26.606118+-1.767384   25.778367+-1.263850
>
>         v5.9/passive          v5.9+patch/passive
> bt.C.c  25.330360+-1.028316   23.544036+-1.020189
> lu.C.x  35.872659+-4.872090   23.719295+-3.883848
> sp.C.x  32.141310+-2.289541   29.125363+-0.872300
> ua.C.x  29.024597+-1.667049   25.728888+-1.539772
>
> On the smaller data sets (A and B) and on the other NAS benchmarks
> there is no impact on performance.
>
> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>

I suspect that the benefit of this patch is due to avoiding the overhead
of the wake_affine_weight() check, because the following check exists in
select_idle_sibling:

	/*
	 * If the previous CPU is cache affine and idle, don't be stupid:
	 */
	if (prev != target && cpus_share_cache(prev, target) &&
	    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
		return prev;

Still, the concept makes some sense to avoid wake_affine_weight but look
at the earlier part of wake_affine_idle():

	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

This thing is almost completely useless because this_cpu is only going to
be idle if it's a wakeup from interrupt context when the CPU was otherwise
idle *but* it takes care to only use the CPU if this and prev share LLC.

The patch as it stands may leave a task on a remote node when it should
have been pulled local to the waker because prev happened to be idle.  This
is not guaranteed because a node could have multiple LLCs and prev is
still appropriate, but that's a different problem entirely and requires
much deeper surgery.  Still, not pulling a task from a remote node is
a change in expected behaviour.  While it's possible that NUMA domains
will not even reach this path, it depends on the NUMA distance as can
be seen in sd_init() for the setting of SD_WAKE_AFFINE, so I think the
cpus_share_cache check is necessary.

I think it would be more appropriate to rework that block that checks
this_cpu to instead check if the CPUs share cache first and then return one
of them (preference to prev based on the comment above it about avoiding
a migration) if either one is idle.

I see Vincent already agreed with the patch so I could be wrong.  Vincent,
did I miss something stupid?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 11:20 ` Mel Gorman
@ 2020-10-21 11:56   ` Julia Lawall
  2020-10-21 12:19     ` Peter Zijlstra
  2020-10-21 12:28     ` Mel Gorman
  2020-10-21 12:25   ` Vincent Guittot
  1 sibling, 2 replies; 68+ messages in thread

From: Julia Lawall @ 2020-10-21 11:56 UTC (permalink / raw)
To: Mel Gorman
Cc: Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli,
    Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
    Gilles Muller

On Wed, 21 Oct 2020, Mel Gorman wrote:

> On Tue, Oct 20, 2020 at 06:37:59PM +0200, Julia Lawall wrote:
> > On a thread wakeup, the change [1] from runnable load average to load
> > average for comparing candidate cores means that recent short-running
> > daemons on the core where a thread ran previously can be considered to
> > have a higher load than the core performing the wakeup, even when the
> > core where the thread ran previously is currently idle.  This can
> > cause a thread to migrate, taking the place of some other thread that
> > is about to wake up, and so on.  To avoid unnecessary migrations,
> > extend wake_affine_idle to check whether the core where the thread
> > previously ran is currently idle, and if so return that core as the
> > target.
> >
> > [1] commit 11f10e5420f6ce ("sched/fair: Use load instead of runnable
> > load in wakeup path")
> >
> > This particularly has an impact when using passive (intel_cpufreq)
> > power management, where kworkers run every 0.004 seconds on all cores,
> > increasing the likelihood that an idle core will be considered to have
> > a load.
> >
> > The following numbers were obtained with the benchmarking tool
> > hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
> > benchmarks (https://www.nas.nasa.gov/publications/npb.html).  The
> > tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
> > 2.10GHz.  Active (intel_pstate) and passive (intel_cpufreq) power
> > management were used.  Times are in seconds.  All experiments use all
> > 160 hardware threads.
> >
> >         v5.9/active           v5.9+patch/active
> > bt.C.c  24.725724+-0.962340   23.349608+-1.607214
> > lu.C.x  29.105952+-4.804203   25.249052+-5.561617
> > sp.C.x  31.220696+-1.831335   30.227760+-2.429792
> > ua.C.x  26.606118+-1.767384   25.778367+-1.263850
> >
> >         v5.9/passive          v5.9+patch/passive
> > bt.C.c  25.330360+-1.028316   23.544036+-1.020189
> > lu.C.x  35.872659+-4.872090   23.719295+-3.883848
> > sp.C.x  32.141310+-2.289541   29.125363+-0.872300
> > ua.C.x  29.024597+-1.667049   25.728888+-1.539772
> >
> > On the smaller data sets (A and B) and on the other NAS benchmarks
> > there is no impact on performance.
> >
> > Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
>
> I suspect that the benefit of this patch is due to avoiding the overhead
> of wake_affine_weight() check because the following check exists in
> select_idle_sibling

I'm running 160 threads on 160 cores (80 physical cores).  All of the
threads are thus best off to just stay where they are.  If one thread
moves to the socket of prev, then they will be displacing other threads
that also expect to return to where they were previously located.  You
can see this in the traces shown here:

https://pages.lip6.fr/Julia.Lawall/uas.pdf

Prior to 5.8, my machine was using intel_pstate and had few background
tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
the kernel decided that intel_cpufreq would be more appropriate, which
introduced kworkers every 0.004 seconds on all cores.  In the graphs
for early versions, sometimes the whole benchmark runs with the threads
just staying on their cores, or a few migrations.  Starting with 5.8,
after 5 seconds where there are a number of synchronizations, all of
the threads move around between all of the cores.  Typically, one bad
placement leads to 10-15 threads moving around, until one ends up on
the idle core where the original thread was intended to be.

>
> 	/*
> 	 * If the previous CPU is cache affine and idle, don't be stupid:
> 	 */
> 	if (prev != target && cpus_share_cache(prev, target) &&
> 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
> 		return prev;

This isn't triggered in the problematic case, because the problematic
case is where the prev core and the waker core are on different sockets.

To my understanding, when the runnable load was used and prev was idle,
wake_affine_weight would fail, and then wake_affine would return prev.
With the load average, in the case where there is a thread on the waker
core and there has recently been a daemon on the prev core, the
comparison between the cores is a bit random.  The patch thus tries to
restore the previous behavior.

julia

> Still, the concept makes some sense to avoid wake_affine_weight but look
> at the earlier part of wake_affine_idle()
>
> 	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
> 		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
>
> This thing is almost completely useless because this_cpu is only going to
> be idle if it's a wakeup from interrupt context when the CPU was otherwise
> idle *but* it takes care to only use the CPU if this and prev share LLC.
>
> The patch as it stands may leave a task on a remote node when it should
> have been pulled local to the waker because prev happened to be idle.  This
> is not guaranteed because a node could have multiple LLCs and prev is
> still appropriate but that's a different problem entirely and requires
> much deeper surgery.  Still, not pulling a task from a remote node is
> a change in expected behaviour.  While it's possible that NUMA domains
> will not even reach this path, it depends on the NUMA distance as can
> be seen in sd_init() for the setting of SD_WAKE_AFFINE so I think the
> cpus_share_cache check is necessary.
>
> I think it would be more appropriate to rework that block that checks
> this_cpu to instead check if the CPUs share cache first and then return one
> of them (preference to prev based on the comment above it about avoiding
> a migration) if either one is idle.
>
> I see Vincent already agreed with the patch so I could be wrong.  Vincent,
> did I miss something stupid?
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 11:56   ` Julia Lawall
@ 2020-10-21 12:19     ` Peter Zijlstra
  2020-10-21 12:42       ` Julia Lawall
  2020-10-21 13:10       ` Peter Zijlstra
  2020-10-21 12:28     ` Mel Gorman
  1 sibling, 2 replies; 68+ messages in thread

From: Peter Zijlstra @ 2020-10-21 12:19 UTC (permalink / raw)
To: Julia Lawall
Cc: Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli,
    Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
    Gilles Muller, Rafael J. Wysocki, viresh.kumar, srinivas.pandruvada

On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> Prior to 5.8, my machine was using intel_pstate and had few background
> tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> the kernel decided that intel_cpufreq would be more appropriate, which
> introduced kworkers every 0.004 seconds on all cores.

That still doesn't make any sense.  Are you running the legacy on-demand
thing or something?

Rafael, Srinivas, Viresh, how come it defaults to that?

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 12:19     ` Peter Zijlstra
@ 2020-10-21 12:42       ` Julia Lawall
  2020-10-21 12:52         ` Peter Zijlstra
  2020-10-21 18:15         ` Rafael J. Wysocki
  2020-10-21 13:10       ` Peter Zijlstra
  1 sibling, 2 replies; 68+ messages in thread

From: Julia Lawall @ 2020-10-21 12:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors,
    Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
    Ben Segall, Daniel Bristot de Oliveira, linux-kernel,
    Valentin Schneider, Gilles Muller, Rafael J. Wysocki,
    viresh.kumar, srinivas.pandruvada

On Wed, 21 Oct 2020, Peter Zijlstra wrote:

> On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > Prior to 5.8, my machine was using intel_pstate and had few background
> > tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> > the kernel decided that intel_cpufreq would be more appropriate, which
> > introduced kworkers every 0.004 seconds on all cores.
>
> That still doesn't make any sense.  Are you running the legacy on-demand
> thing or something?
>
> Rafael, Srinivas, Viresh, how come it defaults to that?

The relevant commits are 33aa46f252c7, and 39a188b88332 that fixes a
small bug.  I have an Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz that
does not have the HWP feature, even though the cores seemed to be able
to change their frequencies at the hardware level.

julia

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 12:42       ` Julia Lawall
@ 2020-10-21 12:52         ` Peter Zijlstra
  2020-10-21 18:18           ` Rafael J. Wysocki
  1 sibling, 1 reply; 68+ messages in thread

From: Peter Zijlstra @ 2020-10-21 12:52 UTC (permalink / raw)
To: Julia Lawall
Cc: Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli,
    Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider,
    Gilles Muller, Rafael J. Wysocki, viresh.kumar, srinivas.pandruvada

On Wed, Oct 21, 2020 at 02:42:20PM +0200, Julia Lawall wrote:
>
> On Wed, 21 Oct 2020, Peter Zijlstra wrote:
>
> > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > introduced kworkers every 0.004 seconds on all cores.
> >
> > That still doesn't make any sense.  Are you running the legacy on-demand
> > thing or something?
> >
> > Rafael, Srinivas, Viresh, how come it defaults to that?
>
> The relevant commits are 33aa46f252c7, and 39a188b88332 that fixes a
> small bug.  I have an Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz that
> does not have the HWP feature, even though the cores seemed to be able
> to change their frequencies at the hardware level.

That just makes intel_pstate not prefer active mode.  With the clear
intent that it should then go use schedutil, but somehow it looks like
you landed on ondemand, which is absolutely atrocious.

What's:

  $ for i in /sys/devices/system/cpu/cpu0/cpufreq/scaling_*; do echo -n $i ": "; cat $i; done

say, for you?

And if you do:

  $ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo schedutil > $i; done

Are the kworkers gone?

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 12:52         ` Peter Zijlstra
@ 2020-10-21 18:18           ` Rafael J. Wysocki
  0 siblings, 0 replies; 68+ messages in thread

From: Rafael J. Wysocki @ 2020-10-21 18:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors,
    Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
    Ben Segall, Daniel Bristot de Oliveira, linux-kernel,
    Valentin Schneider, Gilles Muller, viresh.kumar, srinivas.pandruvada

On Wednesday, October 21, 2020 2:52:03 PM CEST Peter Zijlstra wrote:
> On Wed, Oct 21, 2020 at 02:42:20PM +0200, Julia Lawall wrote:
> >
> > On Wed, 21 Oct 2020, Peter Zijlstra wrote:
> >
> > > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > > tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> > > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > > introduced kworkers every 0.004 seconds on all cores.
> > >
> > > That still doesn't make any sense.  Are you running the legacy on-demand
> > > thing or something?
> > >
> > > Rafael, Srinivas, Viresh, how come it defaults to that?
> >
> > The relevant commits are 33aa46f252c7, and 39a188b88332 that fixes a
> > small bug.  I have an Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz that
> > does not have the HWP feature, even though the cores seemed to be able
> > to change their frequencies at the hardware level.
>
> That just makes intel_pstate not prefer active mode.  With the clear
> intent that it should then go use schedutil, but somehow it looks like
> you landed on ondemand, which is absolutely atrocious.

Probably, the "default governor" setting was inherited from the old
config.  It didn't matter when the active mode was used by default,
because it only recognizes the "powersave" and "performance" settings
(and "ondemand" causes it to fall back to "powersave" IIRC).

Cheers!

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 12:42       ` Julia Lawall
  2020-10-21 12:52         ` Peter Zijlstra
@ 2020-10-21 18:15         ` Rafael J. Wysocki
  2020-10-21 19:47           ` Julia Lawall
  1 sibling, 1 reply; 68+ messages in thread

From: Rafael J. Wysocki @ 2020-10-21 18:15 UTC (permalink / raw)
To: Julia Lawall
Cc: Peter Zijlstra, Mel Gorman, Ingo Molnar, kernel-janitors,
    Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
    Ben Segall, Daniel Bristot de Oliveira, linux-kernel,
    Valentin Schneider, Gilles Muller, viresh.kumar, srinivas.pandruvada

On Wednesday, October 21, 2020 2:42:20 PM CEST Julia Lawall wrote:
>
> On Wed, 21 Oct 2020, Peter Zijlstra wrote:
>
> > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > introduced kworkers every 0.004 seconds on all cores.
> >
> > That still doesn't make any sense.  Are you running the legacy on-demand
> > thing or something?
> >
> > Rafael, Srinivas, Viresh, how come it defaults to that?
>
> The relevant commits are 33aa46f252c7, and 39a188b88332 that fixes a
> small bug.  I have an Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz that
> does not have the HWP feature, even though the cores seemed to be able
> to change their frequencies at the hardware level.

That's in the range of "turbo" P-states (if a P-state above a certain
threshold is requested by the governor, the processor has a license to
choose P-states in the range above this threshold by itself).

Cheers!

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core
  2020-10-21 18:15         ` Rafael J. Wysocki
@ 2020-10-21 19:47           ` Julia Lawall
  2020-10-21 20:25             ` Rafael J. Wysocki
  0 siblings, 1 reply; 68+ messages in thread

From: Julia Lawall @ 2020-10-21 19:47 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Julia Lawall, Peter Zijlstra, Mel Gorman, Ingo Molnar,
    kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
    linux-kernel, Valentin Schneider, Gilles Muller, viresh.kumar,
    srinivas.pandruvada

On Wed, 21 Oct 2020, Rafael J. Wysocki wrote:

> On Wednesday, October 21, 2020 2:42:20 PM CEST Julia Lawall wrote:
> >
> > On Wed, 21 Oct 2020, Peter Zijlstra wrote:
> >
> > > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > > tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> > > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > > introduced kworkers every 0.004 seconds on all cores.
> > >
> > > That still doesn't make any sense.  Are you running the legacy on-demand
> > > thing or something?
> > >
> > > Rafael, Srinivas, Viresh, how come it defaults to that?
> >
> > The relevant commits are 33aa46f252c7, and 39a188b88332 that fixes a
> > small bug.  I have an Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz that
> > does not have the HWP feature, even though the cores seemed to be able
> > to change their frequencies at the hardware level.
>
> That's in the range of "turbo" P-states (if a P-state above a certain
> threshold is requested by the governor, the processor has a license to
> choose P-states in the range above this threshold by itself).

Sorry, but I don't understand this answer at all.

thanks,
julia

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 19:47 ` Julia Lawall @ 2020-10-21 20:25 ` Rafael J. Wysocki 0 siblings, 0 replies; 68+ messages in thread From: Rafael J. Wysocki @ 2020-10-21 20:25 UTC (permalink / raw) To: Julia Lawall Cc: Peter Zijlstra, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, viresh.kumar, srinivas.pandruvada On Wednesday, October 21, 2020 9:47:51 PM CEST Julia Lawall wrote: > > On Wed, 21 Oct 2020, Rafael J. Wysocki wrote: > > > On Wednesday, October 21, 2020 2:42:20 PM CEST Julia Lawall wrote: > > > > > > On Wed, 21 Oct 2020, Peter Zijlstra wrote: > > > > > > > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote: > > > > > Prior to 5.8, my machine was using intel_pstate and had few background > > > > > tasks. Thus the problem wasn't visible in practice. Starting with 5.8 > > > > > the kernel decided that intel_cpufreq would be more appropriate, which > > > > > introduced kworkers every 0.004 seconds on all cores. > > > > > > > > That still doesn't make any sense. Are you running the legacy on-demand > > > > thing or something? > > > > > > > > Rafael, Srinivas, Viresh, how come it defaults to that? > > > > > > The relevant commits are 33aa46f252c7, and 39a188b88332 that fixes a small > > > bug. I have a Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz that does not > > > have the HWP feature, even though the cores seemed to be able to change > > > their frequencies at the hardware level. > > > > That's in the range of "turbo" P-states (if a P-state above a certain threshold > > is requested by the governor, the processor has a license to choose P-states > > in the range above this threshold by itself). > > Sorry, but I don't understand this answer at all. Well, sorry about that and let me rephrase then. 
Contemporary CPUs have two ranges of P-states: the so-called "guaranteed performance" range and the "turbo" range.

In the "guaranteed performance" range the CPU runs in the P-state requested by the governor, unless a higher P-state has been requested for another CPU in its frequency domain (usually covering the entire processor package), in which case that higher P-state will be used (the effective P-state for all CPUs in the frequency domain is the maximum of all P-states requested for individual CPUs).

However, if the governor requests a P-state from the "turbo" range, the processor is not required to take that request literally and the PM unit in it may override the governor's choice and cause the CPU to run in a different P-state (also from the "turbo" range), even if lower P-states have been requested for the other CPUs in the processor package.

This is also explained in Documentation/admin-guide/pm/intel_pstate.rst (in the Turbo P-states Support section), in more detail and hopefully more clearly.

Cheers!
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 12:19 ` Peter Zijlstra 2020-10-21 12:42 ` Julia Lawall @ 2020-10-21 13:10 ` Peter Zijlstra 2020-10-21 18:11 ` Rafael J. Wysocki 1 sibling, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2020-10-21 13:10 UTC (permalink / raw)
To: Julia Lawall
Cc: Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, Rafael J. Wysocki, viresh.kumar, srinivas.pandruvada

On Wed, Oct 21, 2020 at 02:19:50PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > Prior to 5.8, my machine was using intel_pstate and had few background
> > tasks. Thus the problem wasn't visible in practice. Starting with 5.8
> > the kernel decided that intel_cpufreq would be more appropriate, which
> > introduced kworkers every 0.004 seconds on all cores.
>
> That still doesn't make any sense. Are you running the legacy on-demand
> thing or something?
>
> Rafael, Srinivas, Viresh, how come it defaults to that?

Do we want something like this?
--- arch/x86/configs/i386_defconfig | 3 +-- arch/x86/configs/x86_64_defconfig | 3 +-- drivers/cpufreq/Kconfig | 7 +++++-- 3 files changed, 7 insertions(+), 6 deletions(-) diff --git a/arch/x86/configs/i386_defconfig b/arch/x86/configs/i386_defconfig index 78210793d357..c343ad459737 100644 --- a/arch/x86/configs/i386_defconfig +++ b/arch/x86/configs/i386_defconfig @@ -41,8 +41,7 @@ CONFIG_PM_DEBUG=y CONFIG_PM_TRACE_RTC=y CONFIG_ACPI_DOCK=y CONFIG_ACPI_BGRT=y -CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y -CONFIG_CPU_FREQ_GOV_ONDEMAND=y +CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y CONFIG_X86_ACPI_CPUFREQ=y CONFIG_EFI_VARS=y CONFIG_KPROBES=y diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig index 9936528e1939..23e1ea85c1ec 100644 --- a/arch/x86/configs/x86_64_defconfig +++ b/arch/x86/configs/x86_64_defconfig @@ -38,8 +38,7 @@ CONFIG_PM_DEBUG=y CONFIG_PM_TRACE_RTC=y CONFIG_ACPI_DOCK=y CONFIG_ACPI_BGRT=y -CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y -CONFIG_CPU_FREQ_GOV_ONDEMAND=y +CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y CONFIG_X86_ACPI_CPUFREQ=y CONFIG_IA32_EMULATION=y CONFIG_EFI_VARS=y diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index 2c7171e0b001..8dfca6e9b836 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -37,8 +37,7 @@ config CPU_FREQ_STAT choice prompt "Default CPUFreq governor" default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP + default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if SMP default CPU_FREQ_DEFAULT_GOV_PERFORMANCE help This option sets which CPUFreq governor shall be loaded at @@ -71,6 +70,7 @@ config CPU_FREQ_DEFAULT_GOV_USERSPACE config CPU_FREQ_DEFAULT_GOV_ONDEMAND bool "ondemand" + depends on !SMP select CPU_FREQ_GOV_ONDEMAND select CPU_FREQ_GOV_PERFORMANCE help @@ -83,6 +83,7 @@ config CPU_FREQ_DEFAULT_GOV_ONDEMAND config 
CPU_FREQ_DEFAULT_GOV_CONSERVATIVE bool "conservative" + depends on !SMP select CPU_FREQ_GOV_CONSERVATIVE select CPU_FREQ_GOV_PERFORMANCE help @@ -144,6 +145,7 @@ config CPU_FREQ_GOV_USERSPACE config CPU_FREQ_GOV_ONDEMAND tristate "'ondemand' cpufreq policy governor" + depends on !SMP select CPU_FREQ_GOV_COMMON help 'ondemand' - This driver adds a dynamic cpufreq policy governor. @@ -163,6 +165,7 @@ config CPU_FREQ_GOV_ONDEMAND config CPU_FREQ_GOV_CONSERVATIVE tristate "'conservative' cpufreq governor" depends on CPU_FREQ + depends on !SMP select CPU_FREQ_GOV_COMMON help 'conservative' - this driver is rather similar to the 'ondemand' ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 13:10 ` Peter Zijlstra @ 2020-10-21 18:11 ` Rafael J. Wysocki 2020-10-22 4:53 ` Viresh Kumar 2020-10-22 7:11 ` Peter Zijlstra 0 siblings, 2 replies; 68+ messages in thread
From: Rafael J. Wysocki @ 2020-10-21 18:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, viresh.kumar, srinivas.pandruvada

On Wednesday, October 21, 2020 3:10:26 PM CEST Peter Zijlstra wrote:
> On Wed, Oct 21, 2020 at 02:19:50PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > tasks. Thus the problem wasn't visible in practice. Starting with 5.8
> > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > introduced kworkers every 0.004 seconds on all cores.
> >
> > That still doesn't make any sense. Are you running the legacy on-demand
> > thing or something?
> >
> > Rafael, Srinivas, Viresh, how come it defaults to that?
>
> Do we want something like this?
> > --- > arch/x86/configs/i386_defconfig | 3 +-- > arch/x86/configs/x86_64_defconfig | 3 +-- > drivers/cpufreq/Kconfig | 7 +++++-- > 3 files changed, 7 insertions(+), 6 deletions(-) > > diff --git a/arch/x86/configs/i386_defconfig b/arch/x86/configs/i386_defconfig > index 78210793d357..c343ad459737 100644 > --- a/arch/x86/configs/i386_defconfig > +++ b/arch/x86/configs/i386_defconfig > @@ -41,8 +41,7 @@ CONFIG_PM_DEBUG=y > CONFIG_PM_TRACE_RTC=y > CONFIG_ACPI_DOCK=y > CONFIG_ACPI_BGRT=y > -CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y > -CONFIG_CPU_FREQ_GOV_ONDEMAND=y > +CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y > CONFIG_X86_ACPI_CPUFREQ=y > CONFIG_EFI_VARS=y > CONFIG_KPROBES=y > diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig > index 9936528e1939..23e1ea85c1ec 100644 > --- a/arch/x86/configs/x86_64_defconfig > +++ b/arch/x86/configs/x86_64_defconfig > @@ -38,8 +38,7 @@ CONFIG_PM_DEBUG=y > CONFIG_PM_TRACE_RTC=y > CONFIG_ACPI_DOCK=y > CONFIG_ACPI_BGRT=y > -CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y > -CONFIG_CPU_FREQ_GOV_ONDEMAND=y > +CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y > CONFIG_X86_ACPI_CPUFREQ=y > CONFIG_IA32_EMULATION=y > CONFIG_EFI_VARS=y > diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig > index 2c7171e0b001..8dfca6e9b836 100644 > --- a/drivers/cpufreq/Kconfig > +++ b/drivers/cpufreq/Kconfig > @@ -37,8 +37,7 @@ config CPU_FREQ_STAT > choice > prompt "Default CPUFreq governor" > default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ > - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM > - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP > + default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if SMP > default CPU_FREQ_DEFAULT_GOV_PERFORMANCE > help > This option sets which CPUFreq governor shall be loaded at > @@ -71,6 +70,7 @@ config CPU_FREQ_DEFAULT_GOV_USERSPACE > > config CPU_FREQ_DEFAULT_GOV_ONDEMAND > bool "ondemand" > + depends on !SMP > select CPU_FREQ_GOV_ONDEMAND > select 
CPU_FREQ_GOV_PERFORMANCE > help > @@ -83,6 +83,7 @@ config CPU_FREQ_DEFAULT_GOV_ONDEMAND > > config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE > bool "conservative" > + depends on !SMP > select CPU_FREQ_GOV_CONSERVATIVE > select CPU_FREQ_GOV_PERFORMANCE > help The changes above should work. > @@ -144,6 +145,7 @@ config CPU_FREQ_GOV_USERSPACE > > config CPU_FREQ_GOV_ONDEMAND > tristate "'ondemand' cpufreq policy governor" > + depends on !SMP But I don't think that we can do this and the one below. > select CPU_FREQ_GOV_COMMON > help > 'ondemand' - This driver adds a dynamic cpufreq policy governor. > @@ -163,6 +165,7 @@ config CPU_FREQ_GOV_ONDEMAND > config CPU_FREQ_GOV_CONSERVATIVE > tristate "'conservative' cpufreq governor" > depends on CPU_FREQ > + depends on !SMP > select CPU_FREQ_GOV_COMMON > help > 'conservative' - this driver is rather similar to the 'ondemand' > ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 18:11 ` Rafael J. Wysocki @ 2020-10-22 4:53 ` Viresh Kumar 0 siblings, 0 replies; 68+ messages in thread
From: Viresh Kumar @ 2020-10-22 4:53 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Peter Zijlstra, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada

On 21-10-20, 20:11, Rafael J. Wysocki wrote:
> On Wednesday, October 21, 2020 3:10:26 PM CEST Peter Zijlstra wrote:
> > On Wed, Oct 21, 2020 at 02:19:50PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > > > Prior to 5.8, my machine was using intel_pstate and had few background
> > > > tasks. Thus the problem wasn't visible in practice. Starting with 5.8
> > > > the kernel decided that intel_cpufreq would be more appropriate, which
> > > > introduced kworkers every 0.004 seconds on all cores.
> > >
> > > That still doesn't make any sense. Are you running the legacy on-demand
> > > thing or something?
> > >
> > > Rafael, Srinivas, Viresh, how come it defaults to that?
> >
> > Do we want something like this?
> > > > --- > > arch/x86/configs/i386_defconfig | 3 +-- > > arch/x86/configs/x86_64_defconfig | 3 +-- > > drivers/cpufreq/Kconfig | 7 +++++-- > > 3 files changed, 7 insertions(+), 6 deletions(-) > > > > diff --git a/arch/x86/configs/i386_defconfig b/arch/x86/configs/i386_defconfig > > index 78210793d357..c343ad459737 100644 > > --- a/arch/x86/configs/i386_defconfig > > +++ b/arch/x86/configs/i386_defconfig > > @@ -41,8 +41,7 @@ CONFIG_PM_DEBUG=y > > CONFIG_PM_TRACE_RTC=y > > CONFIG_ACPI_DOCK=y > > CONFIG_ACPI_BGRT=y > > -CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y > > -CONFIG_CPU_FREQ_GOV_ONDEMAND=y > > +CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y > > CONFIG_X86_ACPI_CPUFREQ=y > > CONFIG_EFI_VARS=y > > CONFIG_KPROBES=y > > diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig > > index 9936528e1939..23e1ea85c1ec 100644 > > --- a/arch/x86/configs/x86_64_defconfig > > +++ b/arch/x86/configs/x86_64_defconfig > > @@ -38,8 +38,7 @@ CONFIG_PM_DEBUG=y > > CONFIG_PM_TRACE_RTC=y > > CONFIG_ACPI_DOCK=y > > CONFIG_ACPI_BGRT=y > > -CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y > > -CONFIG_CPU_FREQ_GOV_ONDEMAND=y > > +CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y > > CONFIG_X86_ACPI_CPUFREQ=y > > CONFIG_IA32_EMULATION=y > > CONFIG_EFI_VARS=y > > diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig > > index 2c7171e0b001..8dfca6e9b836 100644 > > --- a/drivers/cpufreq/Kconfig > > +++ b/drivers/cpufreq/Kconfig > > @@ -37,8 +37,7 @@ config CPU_FREQ_STAT > > choice > > prompt "Default CPUFreq governor" > > default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ > > - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM > > - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP > > + default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if SMP > > default CPU_FREQ_DEFAULT_GOV_PERFORMANCE > > help > > This option sets which CPUFreq governor shall be loaded at > > @@ -71,6 +70,7 @@ config CPU_FREQ_DEFAULT_GOV_USERSPACE > > > > config 
CPU_FREQ_DEFAULT_GOV_ONDEMAND > > bool "ondemand" > > + depends on !SMP > > select CPU_FREQ_GOV_ONDEMAND > > select CPU_FREQ_GOV_PERFORMANCE > > help > > @@ -83,6 +83,7 @@ config CPU_FREQ_DEFAULT_GOV_ONDEMAND > > > > config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE > > bool "conservative" > > + depends on !SMP > > select CPU_FREQ_GOV_CONSERVATIVE > > select CPU_FREQ_GOV_PERFORMANCE > > help > > The changes above should work. > > > @@ -144,6 +145,7 @@ config CPU_FREQ_GOV_USERSPACE > > > > config CPU_FREQ_GOV_ONDEMAND > > tristate "'ondemand' cpufreq policy governor" > > + depends on !SMP > > But I don't think that we can do this and the one below. I have exactly the same comments. > > select CPU_FREQ_GOV_COMMON > > help > > 'ondemand' - This driver adds a dynamic cpufreq policy governor. > > @@ -163,6 +165,7 @@ config CPU_FREQ_GOV_ONDEMAND > > config CPU_FREQ_GOV_CONSERVATIVE > > tristate "'conservative' cpufreq governor" > > depends on CPU_FREQ > > + depends on !SMP > > select CPU_FREQ_GOV_COMMON > > help > > 'conservative' - this driver is rather similar to the 'ondemand' > > > > > -- viresh ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 18:11 ` Rafael J. Wysocki 2020-10-22 4:53 ` Viresh Kumar @ 2020-10-22 7:11 ` Peter Zijlstra 2020-10-22 10:59 ` Viresh Kumar 2020-10-22 11:21 ` AW: " Walter Harms 1 sibling, 2 replies; 68+ messages in thread From: Peter Zijlstra @ 2020-10-22 7:11 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, viresh.kumar, srinivas.pandruvada On Wed, Oct 21, 2020 at 08:11:59PM +0200, Rafael J. Wysocki wrote: > > @@ -144,6 +145,7 @@ config CPU_FREQ_GOV_USERSPACE > > > > config CPU_FREQ_GOV_ONDEMAND > > tristate "'ondemand' cpufreq policy governor" > > + depends on !SMP > > But I don't think that we can do this and the one below. Well, but we need to do something to force people onto schedutil, otherwise we'll get more crap like this thread. Can we take the choice away? Only let Kconfig select which governors are available and then set the default ourselves? I mean, the end goal being to not have selectable governors at all, this seems like a good step anyway. --- diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index 2c7171e0b001..3a9f54b382c0 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -34,77 +34,6 @@ config CPU_FREQ_STAT If in doubt, say N. -choice - prompt "Default CPUFreq governor" - default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP - default CPU_FREQ_DEFAULT_GOV_PERFORMANCE - help - This option sets which CPUFreq governor shall be loaded at - startup. If in doubt, use the default setting. 
- -config CPU_FREQ_DEFAULT_GOV_PERFORMANCE - bool "performance" - select CPU_FREQ_GOV_PERFORMANCE - help - Use the CPUFreq governor 'performance' as default. This sets - the frequency statically to the highest frequency supported by - the CPU. - -config CPU_FREQ_DEFAULT_GOV_POWERSAVE - bool "powersave" - select CPU_FREQ_GOV_POWERSAVE - help - Use the CPUFreq governor 'powersave' as default. This sets - the frequency statically to the lowest frequency supported by - the CPU. - -config CPU_FREQ_DEFAULT_GOV_USERSPACE - bool "userspace" - select CPU_FREQ_GOV_USERSPACE - help - Use the CPUFreq governor 'userspace' as default. This allows - you to set the CPU frequency manually or when a userspace - program shall be able to set the CPU dynamically without having - to enable the userspace governor manually. - -config CPU_FREQ_DEFAULT_GOV_ONDEMAND - bool "ondemand" - select CPU_FREQ_GOV_ONDEMAND - select CPU_FREQ_GOV_PERFORMANCE - help - Use the CPUFreq governor 'ondemand' as default. This allows - you to get a full dynamic frequency capable system by simply - loading your cpufreq low-level hardware driver. - Be aware that not all cpufreq drivers support the ondemand - governor. If unsure have a look at the help section of the - driver. Fallback governor will be the performance governor. - -config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE - bool "conservative" - select CPU_FREQ_GOV_CONSERVATIVE - select CPU_FREQ_GOV_PERFORMANCE - help - Use the CPUFreq governor 'conservative' as default. This allows - you to get a full dynamic frequency capable system by simply - loading your cpufreq low-level hardware driver. - Be aware that not all cpufreq drivers support the conservative - governor. If unsure have a look at the help section of the - driver. Fallback governor will be the performance governor. 
- -config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL - bool "schedutil" - depends on SMP - select CPU_FREQ_GOV_SCHEDUTIL - select CPU_FREQ_GOV_PERFORMANCE - help - Use the 'schedutil' CPUFreq governor by default. If unsure, - have a look at the help section of that governor. The fallback - governor will be 'performance'. - -endchoice - config CPU_FREQ_GOV_PERFORMANCE tristate "'performance' governor" help @@ -145,6 +74,7 @@ config CPU_FREQ_GOV_USERSPACE config CPU_FREQ_GOV_ONDEMAND tristate "'ondemand' cpufreq policy governor" select CPU_FREQ_GOV_COMMON + select CPU_FREQ_GOV_PERFORMANCE help 'ondemand' - This driver adds a dynamic cpufreq policy governor. The governor does a periodic polling and @@ -164,6 +94,7 @@ config CPU_FREQ_GOV_CONSERVATIVE tristate "'conservative' cpufreq governor" depends on CPU_FREQ select CPU_FREQ_GOV_COMMON + select CPU_FREQ_GOV_PERFORMANCE help 'conservative' - this driver is rather similar to the 'ondemand' governor both in its source code and its purpose, the difference is @@ -188,6 +119,7 @@ config CPU_FREQ_GOV_SCHEDUTIL bool "'schedutil' cpufreq policy governor" depends on CPU_FREQ && SMP select CPU_FREQ_GOV_ATTR_SET + select CPU_FREQ_GOV_PERFORMANCE select IRQ_WORK help This governor makes decisions based on the utilization data provided diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 1877f5e2e5b0..6848e3337b65 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -626,6 +626,49 @@ static struct cpufreq_governor *find_governor(const char *str_governor) return NULL; } +static struct cpufreq_governor *cpufreq_default_governor(void) +{ + struct cpufreq_governor *gov = NULL; + +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL + gov = find_governor("schedutil"); + if (gov) + return gov; +#endif + +#ifdef CONFIG_CPU_FREQ_GOV_ONDEMAND + gov = find_governor("ondemand"); + if (gov) + return gov; +#endif + +#ifdef CONFIG_CPU_FREQ_GOV_CONSERVATIVE + gov = find_governor("conservative"); + if (gov) + return gov; +#endif + 
+#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE + gov = find_governor("userspace"); + if (gov) + return gov; +#endif + +#ifdef CONFIG_CPU_FREQ_GOV_POWERSAVE + gov = find_governor("powersave"); + if (gov) + return gov; +#endif + +#ifdef CONFIG_CPU_FREQ_GOV_PERFORMANCE + gov = find_governor("performance"); + if (gov) + return gov; +#endif + + return gov; +} + static struct cpufreq_governor *get_governor(const char *str_governor) { struct cpufreq_governor *t; diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c index aa39ff31ec9f..25449a1ee954 100644 --- a/drivers/cpufreq/cpufreq_conservative.c +++ b/drivers/cpufreq/cpufreq_conservative.c @@ -330,12 +330,5 @@ MODULE_DESCRIPTION("'cpufreq_conservative' - A dynamic cpufreq governor for " "optimised for use in a battery environment"); MODULE_LICENSE("GPL"); -#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE -struct cpufreq_governor *cpufreq_default_governor(void) -{ - return &CPU_FREQ_GOV_CONSERVATIVE; -} -#endif - cpufreq_governor_init(CPU_FREQ_GOV_CONSERVATIVE); cpufreq_governor_exit(CPU_FREQ_GOV_CONSERVATIVE); diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c index ac361a8b1d3b..2a7fd645e1fb 100644 --- a/drivers/cpufreq/cpufreq_ondemand.c +++ b/drivers/cpufreq/cpufreq_ondemand.c @@ -467,12 +467,5 @@ MODULE_DESCRIPTION("'cpufreq_ondemand' - A dynamic cpufreq governor for " "Low Latency Frequency Transition capable processors"); MODULE_LICENSE("GPL"); -#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND -struct cpufreq_governor *cpufreq_default_governor(void) -{ - return &CPU_FREQ_GOV_ONDEMAND; -} -#endif - cpufreq_governor_init(CPU_FREQ_GOV_ONDEMAND); cpufreq_governor_exit(CPU_FREQ_GOV_ONDEMAND); diff --git a/drivers/cpufreq/cpufreq_performance.c b/drivers/cpufreq/cpufreq_performance.c index 71c1d9aba772..c6c7473a3a73 100644 --- a/drivers/cpufreq/cpufreq_performance.c +++ b/drivers/cpufreq/cpufreq_performance.c @@ -23,12 +23,6 @@ static struct cpufreq_governor 
cpufreq_gov_performance = { .limits = cpufreq_gov_performance_limits, }; -#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE -struct cpufreq_governor *cpufreq_default_governor(void) -{ - return &cpufreq_gov_performance; -} -#endif #ifndef CONFIG_CPU_FREQ_GOV_PERFORMANCE_MODULE struct cpufreq_governor *cpufreq_fallback_governor(void) { diff --git a/drivers/cpufreq/cpufreq_powersave.c b/drivers/cpufreq/cpufreq_powersave.c index 7749522355b5..8b9555f70e9d 100644 --- a/drivers/cpufreq/cpufreq_powersave.c +++ b/drivers/cpufreq/cpufreq_powersave.c @@ -27,12 +27,5 @@ MODULE_AUTHOR("Dominik Brodowski <linux@brodo.de>"); MODULE_DESCRIPTION("CPUfreq policy governor 'powersave'"); MODULE_LICENSE("GPL"); -#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE -struct cpufreq_governor *cpufreq_default_governor(void) -{ - return &cpufreq_gov_powersave; -} -#endif - cpufreq_governor_init(cpufreq_gov_powersave); cpufreq_governor_exit(cpufreq_gov_powersave); diff --git a/drivers/cpufreq/cpufreq_userspace.c b/drivers/cpufreq/cpufreq_userspace.c index 50a4d7846580..48ce53038ac0 100644 --- a/drivers/cpufreq/cpufreq_userspace.c +++ b/drivers/cpufreq/cpufreq_userspace.c @@ -131,12 +131,5 @@ MODULE_AUTHOR("Dominik Brodowski <linux@brodo.de>, " MODULE_DESCRIPTION("CPUfreq policy governor 'userspace'"); MODULE_LICENSE("GPL"); -#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE -struct cpufreq_governor *cpufreq_default_governor(void) -{ - return &cpufreq_gov_userspace; -} -#endif - cpufreq_governor_init(cpufreq_gov_userspace); cpufreq_governor_exit(cpufreq_gov_userspace); diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 5ae7b4e6e8d6..3b77e408377a 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -888,13 +888,6 @@ struct cpufreq_governor schedutil_gov = { .limits = sugov_limits, }; -#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL -struct cpufreq_governor *cpufreq_default_governor(void) -{ - return &schedutil_gov; -} -#endif - 
cpufreq_governor_init(schedutil_gov); #ifdef CONFIG_ENERGY_MODEL ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-22 7:11 ` Peter Zijlstra @ 2020-10-22 10:59 ` Viresh Kumar 2020-10-22 11:45 ` Rafael J. Wysocki 2020-10-22 11:21 ` AW: " Walter Harms 1 sibling, 1 reply; 68+ messages in thread From: Viresh Kumar @ 2020-10-22 10:59 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada On 22-10-20, 09:11, Peter Zijlstra wrote: > Well, but we need to do something to force people onto schedutil, > otherwise we'll get more crap like this thread. > > Can we take the choice away? Only let Kconfig select which governors are > available and then set the default ourselves? I mean, the end goal being > to not have selectable governors at all, this seems like a good step > anyway. Just to clarify and complete the point a bit here, the users can still pass the default governor from cmdline using cpufreq.default_governor=, which will take precedence over the one the below code is playing with. And later once the kernel is up, they can still choose a different governor from userspace. > --- > > diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig > index 2c7171e0b001..3a9f54b382c0 100644 > --- a/drivers/cpufreq/Kconfig > +++ b/drivers/cpufreq/Kconfig > @@ -34,77 +34,6 @@ config CPU_FREQ_STAT > > If in doubt, say N. > > -choice > - prompt "Default CPUFreq governor" > - default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ > - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM > - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP > - default CPU_FREQ_DEFAULT_GOV_PERFORMANCE > - help > - This option sets which CPUFreq governor shall be loaded at > - startup. If in doubt, use the default setting. 
> - > -config CPU_FREQ_DEFAULT_GOV_PERFORMANCE > - bool "performance" > - select CPU_FREQ_GOV_PERFORMANCE > - help > - Use the CPUFreq governor 'performance' as default. This sets > - the frequency statically to the highest frequency supported by > - the CPU. > - > -config CPU_FREQ_DEFAULT_GOV_POWERSAVE > - bool "powersave" > - select CPU_FREQ_GOV_POWERSAVE > - help > - Use the CPUFreq governor 'powersave' as default. This sets > - the frequency statically to the lowest frequency supported by > - the CPU. > - > -config CPU_FREQ_DEFAULT_GOV_USERSPACE > - bool "userspace" > - select CPU_FREQ_GOV_USERSPACE > - help > - Use the CPUFreq governor 'userspace' as default. This allows > - you to set the CPU frequency manually or when a userspace > - program shall be able to set the CPU dynamically without having > - to enable the userspace governor manually. > - > -config CPU_FREQ_DEFAULT_GOV_ONDEMAND > - bool "ondemand" > - select CPU_FREQ_GOV_ONDEMAND > - select CPU_FREQ_GOV_PERFORMANCE > - help > - Use the CPUFreq governor 'ondemand' as default. This allows > - you to get a full dynamic frequency capable system by simply > - loading your cpufreq low-level hardware driver. > - Be aware that not all cpufreq drivers support the ondemand > - governor. If unsure have a look at the help section of the > - driver. Fallback governor will be the performance governor. > - > -config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE > - bool "conservative" > - select CPU_FREQ_GOV_CONSERVATIVE > - select CPU_FREQ_GOV_PERFORMANCE > - help > - Use the CPUFreq governor 'conservative' as default. This allows > - you to get a full dynamic frequency capable system by simply > - loading your cpufreq low-level hardware driver. > - Be aware that not all cpufreq drivers support the conservative > - governor. If unsure have a look at the help section of the > - driver. Fallback governor will be the performance governor. 
> - > -config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL > - bool "schedutil" > - depends on SMP > - select CPU_FREQ_GOV_SCHEDUTIL > - select CPU_FREQ_GOV_PERFORMANCE > - help > - Use the 'schedutil' CPUFreq governor by default. If unsure, > - have a look at the help section of that governor. The fallback > - governor will be 'performance'. > - > -endchoice > - > config CPU_FREQ_GOV_PERFORMANCE > tristate "'performance' governor" > help > @@ -145,6 +74,7 @@ config CPU_FREQ_GOV_USERSPACE > config CPU_FREQ_GOV_ONDEMAND > tristate "'ondemand' cpufreq policy governor" > select CPU_FREQ_GOV_COMMON > + select CPU_FREQ_GOV_PERFORMANCE > help > 'ondemand' - This driver adds a dynamic cpufreq policy governor. > The governor does a periodic polling and > @@ -164,6 +94,7 @@ config CPU_FREQ_GOV_CONSERVATIVE > tristate "'conservative' cpufreq governor" > depends on CPU_FREQ > select CPU_FREQ_GOV_COMMON > + select CPU_FREQ_GOV_PERFORMANCE > help > 'conservative' - this driver is rather similar to the 'ondemand' > governor both in its source code and its purpose, the difference is > @@ -188,6 +119,7 @@ config CPU_FREQ_GOV_SCHEDUTIL > bool "'schedutil' cpufreq policy governor" > depends on CPU_FREQ && SMP > select CPU_FREQ_GOV_ATTR_SET > + select CPU_FREQ_GOV_PERFORMANCE And I am not really sure why we always wanted this backup performance governor to be there unless the said governors are built as module. 
> select IRQ_WORK
> help
> This governor makes decisions based on the utilization data provided
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 1877f5e2e5b0..6848e3337b65 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -626,6 +626,49 @@ static struct cpufreq_governor *find_governor(const char *str_governor)
> return NULL;
> }
>
> +static struct cpufreq_governor *cpufreq_default_governor(void)
> +{
> + struct cpufreq_governor *gov = NULL;
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> + gov = find_governor("schedutil");
> + if (gov)
> + return gov;
> +#endif
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_ONDEMAND
> + gov = find_governor("ondemand");
> + if (gov)
> + return gov;
> +#endif
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_CONSERVATIVE
> + gov = find_governor("conservative");
> + if (gov)
> + return gov;
> +#endif
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
> + gov = find_governor("userspace");
> + if (gov)
> + return gov;
> +#endif
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_POWERSAVE
> + gov = find_governor("powersave");
> + if (gov)
> + return gov;
> +#endif
> +
> +#ifdef CONFIG_CPU_FREQ_GOV_PERFORMANCE
> + gov = find_governor("performance");
> + if (gov)
> + return gov;
> +#endif
> +
> + return gov;
> +}

I think that would be fine with me. Though we would be required to update a couple of defconfigs here to make sure they keep working the way they wanted to.

--
viresh
* Re: [PATCH] sched/fair: check for idle core 2020-10-22 10:59 ` Viresh Kumar @ 2020-10-22 11:45 ` Rafael J. Wysocki 2020-10-22 12:02 ` default cpufreq gov, was: " Peter Zijlstra 2020-10-23 6:24 ` Viresh Kumar 0 siblings, 2 replies; 68+ messages in thread From: Rafael J. Wysocki @ 2020-10-22 11:45 UTC (permalink / raw) To: Viresh Kumar Cc: Peter Zijlstra, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote: > On 22-10-20, 09:11, Peter Zijlstra wrote: > > Well, but we need to do something to force people onto schedutil, > > otherwise we'll get more crap like this thread. > > > > Can we take the choice away? Only let Kconfig select which governors are > > available and then set the default ourselves? I mean, the end goal being > > to not have selectable governors at all, this seems like a good step > > anyway. > > Just to clarify and complete the point a bit here, the users can still > pass the default governor from cmdline using > cpufreq.default_governor=, which will take precedence over the one the > below code is playing with. And later once the kernel is up, they can > still choose a different governor from userspace. Right. Also some people simply set "performance" as the default governor and then don't touch cpufreq otherwise (the idea is to get everything to the max freq right away and stay in that mode forever). This still needs to be possible IMO. > > --- > > > > diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig > > index 2c7171e0b001..3a9f54b382c0 100644 > > --- a/drivers/cpufreq/Kconfig > > +++ b/drivers/cpufreq/Kconfig > > @@ -34,77 +34,6 @@ config CPU_FREQ_STAT > > > > If in doubt, say N. 
> > > > -choice > > - prompt "Default CPUFreq governor" > > - default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ > > - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM > > - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP > > - default CPU_FREQ_DEFAULT_GOV_PERFORMANCE > > - help > > - This option sets which CPUFreq governor shall be loaded at > > - startup. If in doubt, use the default setting. > > - > > -config CPU_FREQ_DEFAULT_GOV_PERFORMANCE > > - bool "performance" > > - select CPU_FREQ_GOV_PERFORMANCE > > - help > > - Use the CPUFreq governor 'performance' as default. This sets > > - the frequency statically to the highest frequency supported by > > - the CPU. > > - > > -config CPU_FREQ_DEFAULT_GOV_POWERSAVE > > - bool "powersave" > > - select CPU_FREQ_GOV_POWERSAVE > > - help > > - Use the CPUFreq governor 'powersave' as default. This sets > > - the frequency statically to the lowest frequency supported by > > - the CPU. > > - > > -config CPU_FREQ_DEFAULT_GOV_USERSPACE > > - bool "userspace" > > - select CPU_FREQ_GOV_USERSPACE > > - help > > - Use the CPUFreq governor 'userspace' as default. This allows > > - you to set the CPU frequency manually or when a userspace > > - program shall be able to set the CPU dynamically without having > > - to enable the userspace governor manually. > > - > > -config CPU_FREQ_DEFAULT_GOV_ONDEMAND > > - bool "ondemand" > > - select CPU_FREQ_GOV_ONDEMAND > > - select CPU_FREQ_GOV_PERFORMANCE > > - help > > - Use the CPUFreq governor 'ondemand' as default. This allows > > - you to get a full dynamic frequency capable system by simply > > - loading your cpufreq low-level hardware driver. > > - Be aware that not all cpufreq drivers support the ondemand > > - governor. If unsure have a look at the help section of the > > - driver. Fallback governor will be the performance governor. 
> > - > > -config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE > > - bool "conservative" > > - select CPU_FREQ_GOV_CONSERVATIVE > > - select CPU_FREQ_GOV_PERFORMANCE > > - help > > - Use the CPUFreq governor 'conservative' as default. This allows > > - you to get a full dynamic frequency capable system by simply > > - loading your cpufreq low-level hardware driver. > > - Be aware that not all cpufreq drivers support the conservative > > - governor. If unsure have a look at the help section of the > > - driver. Fallback governor will be the performance governor. > > - > > -config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL > > - bool "schedutil" > > - depends on SMP > > - select CPU_FREQ_GOV_SCHEDUTIL > > - select CPU_FREQ_GOV_PERFORMANCE > > - help > > - Use the 'schedutil' CPUFreq governor by default. If unsure, > > - have a look at the help section of that governor. The fallback > > - governor will be 'performance'. > > - > > -endchoice > > - > > config CPU_FREQ_GOV_PERFORMANCE > > tristate "'performance' governor" > > help > > @@ -145,6 +74,7 @@ config CPU_FREQ_GOV_USERSPACE > > config CPU_FREQ_GOV_ONDEMAND > > tristate "'ondemand' cpufreq policy governor" > > select CPU_FREQ_GOV_COMMON > > + select CPU_FREQ_GOV_PERFORMANCE > > help > > 'ondemand' - This driver adds a dynamic cpufreq policy governor. 
> > The governor does a periodic polling and > > @@ -164,6 +94,7 @@ config CPU_FREQ_GOV_CONSERVATIVE > > tristate "'conservative' cpufreq governor" > > depends on CPU_FREQ > > select CPU_FREQ_GOV_COMMON > > + select CPU_FREQ_GOV_PERFORMANCE > > help > > 'conservative' - this driver is rather similar to the 'ondemand' > > governor both in its source code and its purpose, the difference is > > @@ -188,6 +119,7 @@ config CPU_FREQ_GOV_SCHEDUTIL > > bool "'schedutil' cpufreq policy governor" > > depends on CPU_FREQ && SMP > > select CPU_FREQ_GOV_ATTR_SET > > + select CPU_FREQ_GOV_PERFORMANCE > > And I am not really sure why we always wanted this backup performance > governor to be there unless the said governors are built as module. Apparently, some old drivers had problems with switching frequencies fast enough for ondemand to be used with them and the fallback was for those cases. AFAICS. > > select IRQ_WORK > > help > > This governor makes decisions based on the utilization data provided > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c > > index 1877f5e2e5b0..6848e3337b65 100644 > > --- a/drivers/cpufreq/cpufreq.c > > +++ b/drivers/cpufreq/cpufreq.c > > @@ -626,6 +626,49 @@ static struct cpufreq_governor *find_governor(const char *str_governor) > > return NULL; > > } > > > > +static struct cpufreq_governor *cpufreq_default_governor(void) > > +{ > > + struct cpufreq_governor *gov = NULL; > > + > > +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL > > + gov = find_governor("schedutil"); > > + if (gov) > > + return gov; > > +#endif > > + > > +#ifdef CONFIG_CPU_FREQ_GOV_ONDEMAND > > + gov = find_governor("ondemand"); > > + if (gov) > > + return gov; > > +#endif > > + > > +#ifdef CONFIG_CPU_FREQ_GOV_CONSERVATIVE > > + gov = find_governor("conservative"); > > + if (gov) > > + return gov; > > +#endif > > + > > +#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE > > + gov = find_governor("userspace"); > > + if (gov) > > + return gov; > > +#endif > > + > > +#ifdef 
CONFIG_CPU_FREQ_GOV_POWERSAVE > > + gov = find_governor("powersave"); > > + if (gov) > > + return gov; > > +#endif > > + > > +#ifdef CONFIG_CPU_FREQ_GOV_PERFORMANCE > > + gov = find_governor("performance"); > > + if (gov) > > + return gov; > > +#endif > > + > > + return gov; > > +} > > I think that would be fine with me. Though we would be required to > update couple of defconfigs here to make sure they keep working the > way they wanted to. Generally agreed, but see the point about the "performance" governor above. ^ permalink raw reply [flat|nested] 68+ messages in thread
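The overrides Viresh refers to — the boot-time cpufreq.default_governor= parameter and the per-policy sysfs knob — look roughly like this (a sketch; the policy number and the set of available governors depend on the system):

```shell
# Boot time: force the default governor regardless of what the
# Kconfig/compiled-in default would pick, via the kernel command line:
#     cpufreq.default_governor=performance

# Run time: list what this policy can use, then switch it.
cat /sys/devices/system/cpu/cpufreq/policy0/scaling_available_governors
echo schedutil > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
```

This is why removing the Kconfig choice only changes the built-in default: users who set "performance" at boot and never touch cpufreq again keep that workflow.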
* default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 11:45 ` Rafael J. Wysocki @ 2020-10-22 12:02 ` Peter Zijlstra 2020-10-22 12:19 ` Rafael J. Wysocki ` (2 more replies) 2020-10-23 6:24 ` Viresh Kumar 1 sibling, 3 replies; 68+ messages in thread From: Peter Zijlstra @ 2020-10-22 12:02 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Viresh Kumar, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada On Thu, Oct 22, 2020 at 01:45:25PM +0200, Rafael J. Wysocki wrote: > On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote: > > On 22-10-20, 09:11, Peter Zijlstra wrote: > > > Well, but we need to do something to force people onto schedutil, > > > otherwise we'll get more crap like this thread. > > > > > > Can we take the choice away? Only let Kconfig select which governors are > > > available and then set the default ourselves? I mean, the end goal being > > > to not have selectable governors at all, this seems like a good step > > > anyway. > > > > Just to clarify and complete the point a bit here, the users can still > > pass the default governor from cmdline using > > cpufreq.default_governor=, which will take precedence over the one the > > below code is playing with. And later once the kernel is up, they can > > still choose a different governor from userspace. > > Right. > > Also some people simply set "performance" as the default governor and then > don't touch cpufreq otherwise (the idea is to get everything to the max > freq right away and stay in that mode forever). This still needs to be > possible IMO. Performance/powersave make sense to keep. However I do want to retire ondemand, conservative and also very much intel_pstate/active mode. I also have very little sympathy for userspace. 
We should start by making it hard to use them and eventually just delete them outright. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 12:02 ` default cpufreq gov, was: " Peter Zijlstra @ 2020-10-22 12:19 ` Rafael J. Wysocki 2020-10-22 12:29 ` Peter Zijlstra 2020-10-22 16:23 ` [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate Rafael J. Wysocki 2020-10-27 11:11 ` default cpufreq gov, was: [PATCH] sched/fair: check for idle core Qais Yousef 2 siblings, 1 reply; 68+ messages in thread From: Rafael J. Wysocki @ 2020-10-22 12:19 UTC (permalink / raw) To: Peter Zijlstra Cc: Viresh Kumar, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown [CC linux-pm and Len] On Thursday, October 22, 2020 2:02:13 PM CEST Peter Zijlstra wrote: > On Thu, Oct 22, 2020 at 01:45:25PM +0200, Rafael J. Wysocki wrote: > > On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote: > > > On 22-10-20, 09:11, Peter Zijlstra wrote: > > > > Well, but we need to do something to force people onto schedutil, > > > > otherwise we'll get more crap like this thread. > > > > > > > > Can we take the choice away? Only let Kconfig select which governors are > > > > available and then set the default ourselves? I mean, the end goal being > > > > to not have selectable governors at all, this seems like a good step > > > > anyway. > > > > > > Just to clarify and complete the point a bit here, the users can still > > > pass the default governor from cmdline using > > > cpufreq.default_governor=, which will take precedence over the one the > > > below code is playing with. And later once the kernel is up, they can > > > still choose a different governor from userspace. > > > > Right. 
> > > > Also some people simply set "performance" as the default governor and then > > don't touch cpufreq otherwise (the idea is to get everything to the max > > freq right away and stay in that mode forever). This still needs to be > > possible IMO. > > Performance/powersave make sense to keep. > > However I do want to retire ondemand, conservative and also very much > intel_pstate/active mode. I agree in general, but IMO it would not be prudent to do that without making schedutil provide the same level of performance in all of the relevant use cases. > I also have very little sympathy for userspace. That I completely agree with. > We should start by making it hard to use them and eventually just delete > them outright. Right, but see above: IMO step 0 should be to ensure that schedutil is a viable replacement for all users. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 12:19 ` Rafael J. Wysocki @ 2020-10-22 12:29 ` Peter Zijlstra 2020-10-22 14:52 ` Mel Gorman 2020-10-22 15:45 ` A L 0 siblings, 2 replies; 68+ messages in thread From: Peter Zijlstra @ 2020-10-22 12:29 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Viresh Kumar, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote: > > However I do want to retire ondemand, conservative and also very much > > intel_pstate/active mode. > > I agree in general, but IMO it would not be prudent to do that without making > schedutil provide the same level of performance in all of the relevant use > cases. Agreed; I thought to have understood we were there already. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 12:29 ` Peter Zijlstra @ 2020-10-22 14:52 ` Mel Gorman 2020-10-22 14:58 ` Colin Ian King 2020-10-22 15:25 ` Peter Zijlstra 2020-10-22 15:45 ` A L 1 sibling, 2 replies; 68+ messages in thread From: Mel Gorman @ 2020-10-22 14:52 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Giovanni Gherdovich, Viresh Kumar, Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown On Thu, Oct 22, 2020 at 02:29:49PM +0200, Peter Zijlstra wrote: > On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote: > > > However I do want to retire ondemand, conservative and also very much > > > intel_pstate/active mode. > > > > I agree in general, but IMO it would not be prudent to do that without making > > schedutil provide the same level of performance in all of the relevant use > > cases. > > Agreed; I though to have understood we were there already. AFAIK, not quite (added Giovanni as he has been paying more attention). Schedutil has improved since it was merged but not to the extent where it is a drop-in replacement. The standard it needs to meet is that it is at least equivalent to powersave (in intel_pstate language) or ondemand (acpi_cpufreq) and within a reasonable percentage of the performance governor. Defaulting to performance is a) giving up and b) the performance governor is not a universal win. There are some questions currently on whether schedutil is good enough when HWP is not available. There was some evidence (I don't have the data, Giovanni was looking into it) that HWP was a requirement to make schedutil work well. 
That is a hazard in itself because someone could test on the latest gen Intel CPU and conclude everything is fine and miss that Intel-specific technology is needed to make it work well while throwing everyone else under a bus. Giovanni knows a lot more than I do about this, I could be wrong or forgetting things. For distros, switching to schedutil by default would be nice because frequency selection state would follow the task instead of being per-cpu and we could stop worrying about different HWP implementations but it's not at the point where the switch is advisable. I would expect hard data before switching the default and still would strongly advise having a period of time where we can fall back when someone inevitably finds a new corner case or exception. For reference, SLUB had the same problem for years. It was switched on by default in the kernel config but it was a long time before SLUB was generally equivalent to SLAB in terms of performance. Block multiqueue also had vaguely similar issues before the default changes and a period of time before it was removed (example whinging mail https://lore.kernel.org/lkml/20170803085115.r2jfz2lofy5spfdb@techsingularity.net/) It's schedutil's turn :P -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 14:52 ` Mel Gorman @ 2020-10-22 14:58 ` Colin Ian King 2020-10-22 15:12 ` Phil Auld 2020-10-22 15:25 ` Peter Zijlstra 1 sibling, 1 reply; 68+ messages in thread From: Colin Ian King @ 2020-10-22 14:58 UTC (permalink / raw) To: Mel Gorman, Peter Zijlstra Cc: Rafael J. Wysocki, Giovanni Gherdovich, Viresh Kumar, Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown On 22/10/2020 15:52, Mel Gorman wrote: > On Thu, Oct 22, 2020 at 02:29:49PM +0200, Peter Zijlstra wrote: >> On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote: >>>> However I do want to retire ondemand, conservative and also very much >>>> intel_pstate/active mode. >>> >>> I agree in general, but IMO it would not be prudent to do that without making >>> schedutil provide the same level of performance in all of the relevant use >>> cases. >> >> Agreed; I though to have understood we were there already. > > AFAIK, not quite (added Giovanni as he has been paying more attention). > Schedutil has improved since it was merged but not to the extent where > it is a drop-in replacement. The standard it needs to meet is that > it is at least equivalent to powersave (in intel_pstate language) > or ondemand (acpi_cpufreq) and within a reasonable percentage of the > performance governor. Defaulting to performance is a) giving up and b) > the performance governor is not a universal win. There are some questions > currently on whether schedutil is good enough when HWP is not available. > There was some evidence (I don't have the data, Giovanni was looking into > it) that HWP was a requirement to make schedutil work well. 
That is a > hazard in itself because someone could test on the latest gen Intel CPU > and conclude everything is fine and miss that Intel-specific technology > is needed to make it work well while throwing everyone else under a bus. > Giovanni knows a lot more than I do about this, I could be wrong or > forgetting things. > > For distros, switching to schedutil by default would be nice because > frequency selection state would follow the task instead of being per-cpu > and we could stop worrying about different HWP implementations but it's > not at the point where the switch is advisable. I would expect hard data > before switching the default and still would strongly advise having a > period of time where we can fall back when someone inevitably finds a > new corner case or exception. ..and it would be really useful for distros to know when the hard data is available so that they can make an informed decision when to move to schedutil. > > For reference, SLUB had the same problem for years. It was switched > on by default in the kernel config but it was a long time before > SLUB was generally equivalent to SLAB in terms of performance. Block > multiqueue also had vaguely similar issues before the default changes > and a period of time before it was removed removed (example whinging mail > https://lore.kernel.org/lkml/20170803085115.r2jfz2lofy5spfdb@techsingularity.net/) > It's schedutil's turn :P > ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 14:58 ` Colin Ian King @ 2020-10-22 15:12 ` Phil Auld 2020-10-22 16:35 ` Mel Gorman 0 siblings, 1 reply; 68+ messages in thread From: Phil Auld @ 2020-10-22 15:12 UTC (permalink / raw) To: Colin Ian King Cc: Mel Gorman, Peter Zijlstra, Rafael J. Wysocki, Giovanni Gherdovich, Viresh Kumar, Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown On Thu, Oct 22, 2020 at 03:58:13PM +0100 Colin Ian King wrote: > On 22/10/2020 15:52, Mel Gorman wrote: > > On Thu, Oct 22, 2020 at 02:29:49PM +0200, Peter Zijlstra wrote: > >> On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote: > >>>> However I do want to retire ondemand, conservative and also very much > >>>> intel_pstate/active mode. > >>> > >>> I agree in general, but IMO it would not be prudent to do that without making > >>> schedutil provide the same level of performance in all of the relevant use > >>> cases. > >> > >> Agreed; I though to have understood we were there already. > > > > AFAIK, not quite (added Giovanni as he has been paying more attention). > > Schedutil has improved since it was merged but not to the extent where > > it is a drop-in replacement. The standard it needs to meet is that > > it is at least equivalent to powersave (in intel_pstate language) > > or ondemand (acpi_cpufreq) and within a reasonable percentage of the > > performance governor. Defaulting to performance is a) giving up and b) > > the performance governor is not a universal win. There are some questions > > currently on whether schedutil is good enough when HWP is not available. > > There was some evidence (I don't have the data, Giovanni was looking into > > it) that HWP was a requirement to make schedutil work well. 
That is a > > hazard in itself because someone could test on the latest gen Intel CPU > > and conclude everything is fine and miss that Intel-specific technology > > is needed to make it work well while throwing everyone else under a bus. > > Giovanni knows a lot more than I do about this, I could be wrong or > > forgetting things. > > > > For distros, switching to schedutil by default would be nice because > > frequency selection state would follow the task instead of being per-cpu > > and we could stop worrying about different HWP implementations but it's > > not at the point where the switch is advisable. I would expect hard data > > before switching the default and still would strongly advise having a > > period of time where we can fall back when someone inevitably finds a > > new corner case or exception. > > ..and it would be really useful for distros to know when the hard data > is available so that they can make an informed decision when to move to > schedutil. > I think distros are on the hook to generate that hard data themselves with which to make such a decision. I don't expect it to be done by someone else. > > > > For reference, SLUB had the same problem for years. It was switched > > on by default in the kernel config but it was a long time before > > SLUB was generally equivalent to SLAB in terms of performance. Block > > multiqueue also had vaguely similar issues before the default changes > > and a period of time before it was removed removed (example whinging mail > > https://lore.kernel.org/lkml/20170803085115.r2jfz2lofy5spfdb@techsingularity.net/) > > It's schedutil's turn :P > > > Agreed. I'd like the option to switch back if we make the default change. It's on the table and I'd like to be able to go that way. Cheers, Phil -- ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 15:12 ` Phil Auld @ 2020-10-22 16:35 ` Mel Gorman 2020-10-22 17:59 ` Rafael J. Wysocki 0 siblings, 1 reply; 68+ messages in thread From: Mel Gorman @ 2020-10-22 16:35 UTC (permalink / raw) To: Phil Auld Cc: Colin Ian King, Peter Zijlstra, Rafael J. Wysocki, Giovanni Gherdovich, Viresh Kumar, Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown On Thu, Oct 22, 2020 at 11:12:00AM -0400, Phil Auld wrote: > > > AFAIK, not quite (added Giovanni as he has been paying more attention). > > > Schedutil has improved since it was merged but not to the extent where > > > it is a drop-in replacement. The standard it needs to meet is that > > > it is at least equivalent to powersave (in intel_pstate language) > > > or ondemand (acpi_cpufreq) and within a reasonable percentage of the > > > performance governor. Defaulting to performance is a) giving up and b) > > > the performance governor is not a universal win. There are some questions > > > currently on whether schedutil is good enough when HWP is not available. > > > There was some evidence (I don't have the data, Giovanni was looking into > > > it) that HWP was a requirement to make schedutil work well. That is a > > > hazard in itself because someone could test on the latest gen Intel CPU > > > and conclude everything is fine and miss that Intel-specific technology > > > is needed to make it work well while throwing everyone else under a bus. > > > Giovanni knows a lot more than I do about this, I could be wrong or > > > forgetting things. 
> > > > > > For distros, switching to schedutil by default would be nice because > > > frequency selection state would follow the task instead of being per-cpu > > > and we could stop worrying about different HWP implementations but it's > > > not at the point where the switch is advisable. I would expect hard data > > > before switching the default and still would strongly advise having a > > > period of time where we can fall back when someone inevitably finds a > > > new corner case or exception. > > > > ..and it would be really useful for distros to know when the hard data > > is available so that they can make an informed decision when to move to > > schedutil. > > > > I think distros are on the hook to generate that hard data themselves > with which to make such a decision. I don't expect it to be done by > someone else. > Yep, distros are on the hook. When I said "I would expect hard data", it was in the knowledge that for openSUSE/SLE, we (as in SUSE) would be generating said data and making a call based on it. I'd be surprised if Phil was not thinking along the same lines. > > > For reference, SLUB had the same problem for years. It was switched > > > on by default in the kernel config but it was a long time before > > > SLUB was generally equivalent to SLAB in terms of performance. Block > > > multiqueue also had vaguely similar issues before the default changes > > > and a period of time before it was removed removed (example whinging mail > > > https://lore.kernel.org/lkml/20170803085115.r2jfz2lofy5spfdb@techsingularity.net/) > > > It's schedutil's turn :P > > > > > > > Agreed. I'd like the option to switch back if we make the default change. > It's on the table and I'd like to be able to go that way. > Yep. It sounds chicken, but it's a useful safety net and a reasonable way to deprecate a feature. It's also useful for bug creation -- User X running whatever found that schedutil is worse than the old governor and had to temporarily switch back. 
Repeat until complaining stops and then tear out the old stuff. When/if there is a patch setting schedutil as the default, cc suitable distro people (Giovanni and myself for openSUSE). Other distros assuming they're watching can nominate their own victim. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 16:35 ` Mel Gorman @ 2020-10-22 17:59 ` Rafael J. Wysocki 2020-10-22 20:32 ` Mel Gorman 0 siblings, 1 reply; 68+ messages in thread From: Rafael J. Wysocki @ 2020-10-22 17:59 UTC (permalink / raw) To: Mel Gorman Cc: Phil Auld, Colin Ian King, Peter Zijlstra, Rafael J. Wysocki, Giovanni Gherdovich, Viresh Kumar, Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, Linux Kernel Mailing List, Valentin Schneider, Gilles Muller, Srinivas Pandruvada, Linux PM, Len Brown On Thu, Oct 22, 2020 at 6:35 PM Mel Gorman <mgorman@suse.de> wrote: > > On Thu, Oct 22, 2020 at 11:12:00AM -0400, Phil Auld wrote: > > > > AFAIK, not quite (added Giovanni as he has been paying more attention). > > > > Schedutil has improved since it was merged but not to the extent where > > > > it is a drop-in replacement. The standard it needs to meet is that > > > > it is at least equivalent to powersave (in intel_pstate language) > > > > or ondemand (acpi_cpufreq) and within a reasonable percentage of the > > > > performance governor. Defaulting to performance is a) giving up and b) > > > > the performance governor is not a universal win. There are some questions > > > > currently on whether schedutil is good enough when HWP is not available. > > > > There was some evidence (I don't have the data, Giovanni was looking into > > > > it) that HWP was a requirement to make schedutil work well. That is a > > > > hazard in itself because someone could test on the latest gen Intel CPU > > > > and conclude everything is fine and miss that Intel-specific technology > > > > is needed to make it work well while throwing everyone else under a bus. > > > > Giovanni knows a lot more than I do about this, I could be wrong or > > > > forgetting things. 
> > > > > > > > For distros, switching to schedutil by default would be nice because > > > > frequency selection state would follow the task instead of being per-cpu > > > > and we could stop worrying about different HWP implementations but it's > > > > not at the point where the switch is advisable. I would expect hard data > > > > before switching the default and still would strongly advise having a > > > > period of time where we can fall back when someone inevitably finds a > > > > new corner case or exception. > > > > > > ..and it would be really useful for distros to know when the hard data > > > is available so that they can make an informed decision when to move to > > > schedutil. > > > > > > > I think distros are on the hook to generate that hard data themselves > > with which to make such a decision. I don't expect it to be done by > > someone else. > > > > Yep, distros are on the hook. When I said "I would expect hard data", > it was in the knowledge that for openSUSE/SLE, we (as in SUSE) would be > generating said data and making a call based on it. I'd be surprised if > Phil was not thinking along the same lines. > > > > > For reference, SLUB had the same problem for years. It was switched > > > > on by default in the kernel config but it was a long time before > > > > SLUB was generally equivalent to SLAB in terms of performance. Block > > > > multiqueue also had vaguely similar issues before the default changes > > > > and a period of time before it was removed removed (example whinging mail > > > > https://lore.kernel.org/lkml/20170803085115.r2jfz2lofy5spfdb@techsingularity.net/) > > > > It's schedutil's turn :P > > > > > > > > > > > Agreed. I'd like the option to switch back if we make the default change. > > It's on the table and I'd like to be able to go that way. > > > > Yep. It sounds chicken, but it's a useful safety net and a reasonable > way to deprecate a feature. 
It's also useful for bug creation -- User X > running whatever found that schedutil is worse than the old governor and > had to temporarily switch back. Repeat until complaining stops and then > tear out the old stuff. > > When/if there is a patch setting schedutil as the default, cc suitable > distro people (Giovanni and myself for openSUSE). So for the record, Giovanni was on the CC list of the "cpufreq: intel_pstate: Use passive mode by default without HWP" patch that this discussion resulted from (and which kind of belongs to the above category). > Other distros assuming they're watching can nominate their own victim. But no other victims had been nominated at that time. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 17:59 ` Rafael J. Wysocki @ 2020-10-22 20:32 ` Mel Gorman 2020-10-22 20:39 ` Phil Auld 0 siblings, 1 reply; 68+ messages in thread From: Mel Gorman @ 2020-10-22 20:32 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Phil Auld, Colin Ian King, Peter Zijlstra, Rafael J. Wysocki, Giovanni Gherdovich, Viresh Kumar, Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, Linux Kernel Mailing List, Valentin Schneider, Gilles Muller, Srinivas Pandruvada, Linux PM, Len Brown On Thu, Oct 22, 2020 at 07:59:43PM +0200, Rafael J. Wysocki wrote: > > > Agreed. I'd like the option to switch back if we make the default change. > > > It's on the table and I'd like to be able to go that way. > > > > > > > Yep. It sounds chicken, but it's a useful safety net and a reasonable > > way to deprecate a feature. It's also useful for bug creation -- User X > > running whatever found that schedutil is worse than the old governor and > > had to temporarily switch back. Repeat until complaining stops and then > > tear out the old stuff. > > > > When/if there is a patch setting schedutil as the default, cc suitable > > distro people (Giovanni and myself for openSUSE). > > So for the record, Giovanni was on the CC list of the "cpufreq: > intel_pstate: Use passive mode by default without HWP" patch that this > discussion resulted from (and which kind of belongs to the above > category). > Oh I know, I did not mean to suggest that you did not. He made people aware that this was going to be coming down the line and has been looking into the "what if schedutil was the default" question. AFAIK, it's still a work-in-progress and I don't know all the specifics but he knows more than I do on the topic. 
I only know enough that if we flipped the switch
tomorrow we could be plagued with google searches suggesting it be
turned off again just like there is still broken advice out there about
disabling intel_pstate for usually the wrong reasons.

The passive patch was a clear flag that the intent is that schedutil will
be the default at some unknown point in the future. That point is now a
bit closer and this thread could have encouraged a premature change of
the default resulting in unfair finger pointing at one company's test
team. If at least two distros check it out and it still goes wrong, at
least there will be shared blame :/

>
> > Other distros assuming they're watching can nominate their own victim.
>
> But no other victims had been nominated at that time.

We have one, possibly two if Phil agrees. That's better than zero or
unfairly placing the full responsibility on the Intel guys that have been
testing it out.

--
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 20:32 ` Mel Gorman @ 2020-10-22 20:39 ` Phil Auld 0 siblings, 0 replies; 68+ messages in thread From: Phil Auld @ 2020-10-22 20:39 UTC (permalink / raw) To: Mel Gorman Cc: Rafael J. Wysocki, Colin Ian King, Peter Zijlstra, Rafael J. Wysocki, Giovanni Gherdovich, Viresh Kumar, Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, Linux Kernel Mailing List, Valentin Schneider, Gilles Muller, Srinivas Pandruvada, Linux PM, Len Brown On Thu, Oct 22, 2020 at 09:32:55PM +0100 Mel Gorman wrote: > On Thu, Oct 22, 2020 at 07:59:43PM +0200, Rafael J. Wysocki wrote: > > > > Agreed. I'd like the option to switch back if we make the default change. > > > > It's on the table and I'd like to be able to go that way. > > > > > > > > > > Yep. It sounds chicken, but it's a useful safety net and a reasonable > > > way to deprecate a feature. It's also useful for bug creation -- User X > > > running whatever found that schedutil is worse than the old governor and > > > had to temporarily switch back. Repeat until complaining stops and then > > > tear out the old stuff. > > > > > > When/if there is a patch setting schedutil as the default, cc suitable > > > distro people (Giovanni and myself for openSUSE). > > > > So for the record, Giovanni was on the CC list of the "cpufreq: > > intel_pstate: Use passive mode by default without HWP" patch that this > > discussion resulted from (and which kind of belongs to the above > > category). > > > > Oh I know, I did not mean to suggest that you did not. He made people > aware that this was going to be coming down the line and has been looking > into the "what if schedutil was the default" question. AFAIK, it's still > a work-in-progress and I don't know all the specifics but he knows more > than I do on the topic. 
I only know enough that if we flipped the switch
> tomorrow we could be plagued with google searches suggesting it be
> turned off again just like there is still broken advice out there about
> disabling intel_pstate for usually the wrong reasons.
>
> The passive patch was a clear flag that the intent is that schedutil will
> be the default at some unknown point in the future. That point is now a
> bit closer and this thread could have encouraged a premature change of
> the default resulting in unfair finger pointing at one company's test
> team. If at least two distros check it out and it still goes wrong, at
> least there will be shared blame :/
>
> > > Other distros assuming they're watching can nominate their own victim.
> >
> > But no other victims had been nominated at that time.
>
> We have one, possibly two if Phil agrees. That's better than zero or
> unfairly placing the full responsibility on the Intel guys that have been
> testing it out.
>

Yes. I agree and we (RHEL) are planning to test this soon. I'll try to
get to it. You can certainly CC me, please, although I also try to watch
for this sort of thing on list.

Cheers,
Phil

> --
> Mel Gorman
> SUSE Labs
>

--

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 14:52 ` Mel Gorman 2020-10-22 14:58 ` Colin Ian King @ 2020-10-22 15:25 ` Peter Zijlstra 2020-10-22 15:55 ` Rafael J. Wysocki ` (2 more replies) 1 sibling, 3 replies; 68+ messages in thread From: Peter Zijlstra @ 2020-10-22 15:25 UTC (permalink / raw) To: Mel Gorman Cc: Rafael J. Wysocki, Giovanni Gherdovich, Viresh Kumar, Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown On Thu, Oct 22, 2020 at 03:52:50PM +0100, Mel Gorman wrote: > There are some questions > currently on whether schedutil is good enough when HWP is not available. Srinivas and Rafael will know better, but Intel does run a lot of tests and IIRC it was found that schedutil was on-par for !HWP. That was the basis for commit: 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP") But now it turns out that commit results in running intel_pstate-passive on ondemand, which is quite horrible. > There was some evidence (I don't have the data, Giovanni was looking into > it) that HWP was a requirement to make schedutil work well. That seems to be the question; Rafael just said the opposite. > For distros, switching to schedutil by default would be nice because > frequency selection state would follow the task instead of being per-cpu > and we could stop worrying about different HWP implementations but it's s/HWP/cpufreq-governors/ ? But yes. > not at the point where the switch is advisable. I would expect hard data > before switching the default and still would strongly advise having a > period of time where we can fall back when someone inevitably finds a > new corner case or exception. Which is why I advocated to make it 'difficult' to use the old ones and only later remove them. 
> For reference, SLUB had the same problem for years. It was switched > on by default in the kernel config but it was a long time before > SLUB was generally equivalent to SLAB in terms of performance. I remember :-) ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 15:25 ` Peter Zijlstra @ 2020-10-22 15:55 ` Rafael J. Wysocki 2020-10-22 16:29 ` Mel Gorman 2020-10-22 20:10 ` Giovanni Gherdovich 2 siblings, 0 replies; 68+ messages in thread From: Rafael J. Wysocki @ 2020-10-22 15:55 UTC (permalink / raw) To: Peter Zijlstra Cc: Mel Gorman, Rafael J. Wysocki, Giovanni Gherdovich, Viresh Kumar, Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, Linux Kernel Mailing List, Valentin Schneider, Gilles Muller, Srinivas Pandruvada, Linux PM, Len Brown On Thu, Oct 22, 2020 at 5:25 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, Oct 22, 2020 at 03:52:50PM +0100, Mel Gorman wrote: > > > There are some questions > > currently on whether schedutil is good enough when HWP is not available. > > Srinivas and Rafael will know better, but Intel does run a lot of tests > and IIRC it was found that schedutil was on-par for !HWP. That was the > basis for commit: > > 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP") > > But now it turns out that commit results in running intel_pstate-passive > on ondemand, which is quite horrible. It doesn't in general. AFAICS this happens only if "ondemand" was selected as the default governor in the old kernel config, which should not be the common case. But I do agree that this needs to be avoided. > > There was some evidence (I don't have the data, Giovanni was looking into > > it) that HWP was a requirement to make schedutil work well. > > That seems to be the question; Rafael just said the opposite. I'm not aware of any data like that. HWP should not be required and it should always be possible to make an HWP system run without HWP (except for those with exotic BIOS configs). 
However, schedutil should work without HWP as well as (or better than) the "ondemand" and "conservative" governors on top of the same driver (whatever it is) and it should work as well as (or better than) "raw" HWP (so to speak) on top of intel_pstate in the passive mode with HWP enabled (before 5.9 it couldn't work in that configuration at all and now it can do that, which I guess may be regarded as an improvement). > > For distros, switching to schedutil by default would be nice because > > frequency selection state would follow the task instead of being per-cpu > > and we could stop worrying about different HWP implementations but it's > > s/HWP/cpufreq-governors/ ? But yes. Well, different HWP implementations in different processor generations may be a concern as well in general. > > not at the point where the switch is advisable. I would expect hard data > > before switching the default and still would strongly advise having a > > period of time where we can fall back when someone inevitably finds a > > new corner case or exception. > > Which is why I advocated to make it 'difficult' to use the old ones and > only later remove them. Slightly less convenient may be sufficient IMV. > > For reference, SLUB had the same problem for years. It was switched > > on by default in the kernel config but it was a long time before > > SLUB was generally equivalent to SLAB in terms of performance. > > I remember :-) ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 15:25 ` Peter Zijlstra 2020-10-22 15:55 ` Rafael J. Wysocki @ 2020-10-22 16:29 ` Mel Gorman 2020-10-22 20:10 ` Giovanni Gherdovich 2 siblings, 0 replies; 68+ messages in thread From: Mel Gorman @ 2020-10-22 16:29 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Giovanni Gherdovich, Viresh Kumar, Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown

On Thu, Oct 22, 2020 at 05:25:14PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 03:52:50PM +0100, Mel Gorman wrote:
>
> > There are some questions
> > currently on whether schedutil is good enough when HWP is not available.
>
> Srinivas and Rafael will know better, but Intel does run a lot of tests
> and IIRC it was found that schedutil was on-par for !HWP. That was the
> basis for commit:
>
> 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP")
>
> But now it turns out that commit results in running intel_pstate-passive
> on ondemand, which is quite horrible.
>

I know Intel ran a lot of tests, no question about it and no fingers are
being pointed. I know I've had enough bug patches tested with a battery
of tests on various machines and still ended up with bug reports :)

> > There was some evidence (I don't have the data, Giovanni was looking into
> > it) that HWP was a requirement to make schedutil work well.
>
> That seems to be the question; Rafael just said the opposite.
>
> > For distros, switching to schedutil by default would be nice because
> > frequency selection state would follow the task instead of being per-cpu
> > and we could stop worrying about different HWP implementations but it's
>
> s/HWP/cpufreq-governors/ ? But yes.
>

I've seen cases where HWP had variable behaviour between CPU generations.
It was hard to quantify and/or figure out because HWP is a black box.

> > not at the point where the switch is advisable. I would expect hard data
> > before switching the default and still would strongly advise having a
> > period of time where we can fall back when someone inevitably finds a
> > new corner case or exception.
>
> Which is why I advocated to make it 'difficult' to use the old ones and
> only later remove them.
>

That's fair.

--
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 15:25 ` Peter Zijlstra 2020-10-22 15:55 ` Rafael J. Wysocki 2020-10-22 16:29 ` Mel Gorman @ 2020-10-22 20:10 ` Giovanni Gherdovich 2020-10-22 20:16 ` Giovanni Gherdovich 2020-10-23 7:03 ` Peter Zijlstra 2 siblings, 2 replies; 68+ messages in thread From: Giovanni Gherdovich @ 2020-10-22 20:10 UTC (permalink / raw) To: Peter Zijlstra, Rafael J. Wysocki Cc: Mel Gorman, Viresh Kumar, Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown

Hello Peter, Rafael,

back in August I tested a v5.8 kernel adding Rafael's patches from v5.9
that make schedutil and HWP work together, i.e. f6ebbcf08f37 ("cpufreq:
intel_pstate: Implement passive mode with HWP enabled").

The main point I took from the exercise is that tbench (network benchmark
in localhost) is problematic for schedutil, and only with HWP (thanks to
Rafael's patch above) does it reach the throughput of the other governors.
When HWP isn't available, the penalty is 5-10% and I need to understand if
the cause is something that can affect other applications too (or just a
quirk of this test).

I ran this campaign this summer when Rafael CC'ed me to f6ebbcf08f37
("cpufreq: intel_pstate: Implement passive mode with HWP enabled"); I
didn't reply as the patch was a win anyway (my bad, I should have posted
the positive results). The regression of tbench with schedutil w/o HWP,
which went unnoticed for a long time, got the best of my attention.

Other remarks:

* on gitsource (running the git unit test suite, measures elapsed time)
  schedutil is a lot better than Intel's powersave but not as good as the
  performance governor.
* for the AMD EPYC machines we haven't yet implemented frequency invariant
  accounting, which might explain why schedutil loses to ondemand on all
  the benchmarks.

* on dbench (filesystem, measures latency) and kernbench (kernel
  compilation), sugov is as good as the Intel performance governor. You
  can add or remove HWP (to either sugov or perfgov), it doesn't make a
  difference. Intel's powersave in general trails behind.

* generally my main concern is performance, not power efficiency, but I
  was a little disappointed to see schedutil being just as efficient as
  perfgov (the performance-per-watt ratios): there are even a few cases
  where (on tbench) the performance governor is both faster and more
  efficient. From previous conversations with Rafael I recall that
  switching frequency has an energy cost, so it could be that schedutil
  switches too often to amortize it. I haven't checked.

To read the tables:

Tilde (~) means the result is the same as baseline (or, the ratio is close
to 1). The double asterisk (**) is a visual aid and means the result is
worse than baseline (higher or lower depending on the case).
For an overview of the possible configurations (intel_pstate passive,
active, HWP on/off etc) I made the diagram at
https://beta.suse.com/private/ggherdovich/cpufreq/x86-cpufreq.png

1) INTEL, HWP-CAPABLE MACHINES
2) INTEL, NON-HWP-CAPABLE MACHINES
3) AMD EPYC


1) INTEL, HWP-CAPABLE MACHINES:

64x_SKYLAKE_NUMA: Intel Skylake SP, 32 cores / 64 threads, NUMA, SATA SSD storage
------------------------------------------------------------------------------
           sugov-HWP   sugov-no-HWP   powersave-HWP   perfgov-HWP   better if
------------------------------------------------------------------------------
                             PERFORMANCE RATIOS
tbench     1.00        0.68           ~               1.03**        higher
dbench     1.00        ~              1.03            ~             lower
kernbench  1.00        ~              1.11            ~             lower
gitsource  1.00        1.03           2.26            0.82**        lower
------------------------------------------------------------------------------
                         PERFORMANCE-PER-WATT RATIOS
tbench     1.00        0.74           ~               ~             higher
dbench     1.00        ~              ~               ~             higher
kernbench  1.00        ~              0.96            ~             higher
gitsource  1.00        0.96           0.45            1.15**        higher

8x_SKYLAKE_UMA: Intel Skylake (client), 4 cores / 8 threads, UMA, SATA SSD storage
------------------------------------------------------------------------------
           sugov-HWP   sugov-no-HWP   powersave-HWP   perfgov-HWP   better if
------------------------------------------------------------------------------
                             PERFORMANCE RATIOS
tbench     1.00        0.91           ~               ~             higher
dbench     1.00        ~              ~               ~             lower
kernbench  1.00        ~              ~               ~             lower
gitsource  1.00        1.04           1.77            ~             lower
------------------------------------------------------------------------------
                         PERFORMANCE-PER-WATT RATIOS
tbench     1.00        0.95           ~               ~             higher
dbench     1.00        ~              ~               ~             higher
kernbench  1.00        ~              ~               ~             higher
gitsource  1.00        ~              0.74            ~             higher

8x_COFFEELAKE_UMA: Intel Coffee Lake, 4 cores / 8 threads, UMA, NVMe SSD storage
---------------------------------------------------------------
           sugov-HWP   powersave-HWP   perfgov-HWP   better if
---------------------------------------------------------------
                      PERFORMANCE RATIOS
tbench     1.00        ~               ~             higher
dbench     1.00        1.12            ~             lower
kernbench  1.00        ~               ~             lower
gitsource  1.00        2.05            ~             lower
---------------------------------------------------------------
                  PERFORMANCE-PER-WATT RATIOS
tbench     1.00        ~               ~             higher
dbench     1.00        1.80**          ~             higher
kernbench  1.00        ~               ~             higher
gitsource  1.00        1.52**          ~             higher


2) INTEL, NON-HWP-CAPABLE MACHINES:

80x_BROADWELL_NUMA: Intel Broadwell EP, 40 cores / 80 threads, NUMA, SATA SSD storage
---------------------------------------------------------------
           sugov       powersave       perfgov       better if
---------------------------------------------------------------
                      PERFORMANCE RATIOS
tbench     1.00        1.11**          1.10**        higher
dbench     1.00        1.10            ~             lower
kernbench  1.00        1.10            ~             lower
gitsource  1.00        2.27            0.95**        lower
---------------------------------------------------------------
                  PERFORMANCE-PER-WATT RATIOS
tbench     1.00        1.05**          1.04**        higher
dbench     1.00        1.24**          0.95          higher
kernbench  1.00        ~               ~             higher
gitsource  1.00        0.86            1.04**        higher

48x_HASWELL_NUMA: Intel Haswell EP, 24 cores / 48 threads, NUMA, HDD storage
---------------------------------------------------------------
           sugov       powersave       perfgov       better if
---------------------------------------------------------------
                      PERFORMANCE RATIOS
tbench     1.00        1.25**          1.27**        higher
dbench     1.00        1.17            ~             lower
kernbench  1.00        1.04            ~             lower
gitsource  1.00        1.54            0.79**        lower
---------------------------------------------------------------
                  PERFORMANCE-PER-WATT RATIOS
tbench     1.00        1.18**          1.11**        higher
dbench     1.00        1.25**          ~             higher
kernbench  1.00        1.04**          0.97          higher
gitsource  1.00        0.77            ~             higher


3) AMD EPYC:

256x_ROME_NUMA: AMD Rome, 128 cores / 256 threads, NUMA, SATA SSD storage
---------------------------------------------------------------
           sugov       ondemand        perfgov       better if
---------------------------------------------------------------
                      PERFORMANCE RATIOS
tbench     1.00        1.11**          1.58**        higher
dbench     1.00        0.44**          0.40**        lower
kernbench  1.00        ~               0.91**        lower
gitsource  1.00        0.96**          0.65**        lower

128x_NAPLES_NUMA: AMD Naples, 64 cores / 128 threads, NUMA, SATA SSD storage
---------------------------------------------------------------
           sugov       ondemand        perfgov       better if
---------------------------------------------------------------
                      PERFORMANCE RATIOS
tbench     1.00        1.10**          1.19**        higher
dbench     1.00        1.05            0.95**        lower
kernbench  1.00        ~               0.95**        lower
gitsource  1.00        0.93**          0.55**        lower


Giovanni

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 20:10 ` Giovanni Gherdovich @ 2020-10-22 20:16 ` Giovanni Gherdovich 2020-10-23 7:03 ` Peter Zijlstra 1 sibling, 0 replies; 68+ messages in thread From: Giovanni Gherdovich @ 2020-10-22 20:16 UTC (permalink / raw) To: Peter Zijlstra, Rafael J. Wysocki Cc: Mel Gorman, Viresh Kumar, Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown On Thu, 2020-10-22 at 22:10 +0200, Giovanni Gherdovich wrote: > [...] > To read the tables: > > Tilde (~) means the result is the same as baseline (or, the ratio is close > to 1). The double asterisk (**) is a visual aid and means the result is > worse than baseline (higher or lower depending on the case). Ouch, the opposite. Double asterisk (**) is where the result is better than baseline, and schedutil needs improvement. Giovanni ^ permalink raw reply [flat|nested] 68+ messages in thread
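[Given the correction above -- a double asterisk marks a result *better* than the sugov baseline -- the direction logic for reading the tables can be sketched as follows. This is a hypothetical helper for illustration only, not part of Giovanni's actual tooling.]

```c
#include <stdbool.h>

/*
 * Results in the tables are ratios against the sugov baseline (1.00).
 * Whether a ratio beats the baseline depends on the benchmark's
 * direction: throughput (tbench) is better if higher, while latency or
 * elapsed time (dbench, kernbench, gitsource) is better if lower.
 * Hypothetical helper, for illustration of how to read the tables.
 */
static bool beats_baseline(double ratio, bool higher_is_better)
{
    return higher_is_better ? (ratio > 1.0) : (ratio < 1.0);
}
```

For example, perfgov's 0.82 on gitsource (better if lower) beats the sugov baseline, which is why it carries the ** mark in the 64x_SKYLAKE_NUMA table.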
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 20:10 ` Giovanni Gherdovich 2020-10-22 20:16 ` Giovanni Gherdovich @ 2020-10-23 7:03 ` Peter Zijlstra 2020-10-23 17:46 ` Tom Lendacky 1 sibling, 1 reply; 68+ messages in thread From: Peter Zijlstra @ 2020-10-23 7:03 UTC (permalink / raw) To: Giovanni Gherdovich Cc: Rafael J. Wysocki, Mel Gorman, Viresh Kumar, Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown, thomas.lendacky, puwen, yazen.ghannam, kim.phillips, suravee.suthikulpanit On Thu, Oct 22, 2020 at 10:10:35PM +0200, Giovanni Gherdovich wrote: > * for the AMD EPYC machines we haven't yet implemented frequency invariant > accounting, which might explain why schedutil looses to ondemand on all > the benchmarks. Right, I poked the AMD people on that a few times, but nothing seems to be forthcoming :/ Tom, any way you could perhaps expedite the matter? In particular we're looking for some X86_VENDOR_AMD/HYGON code to run in arch/x86/kernel/smpboot.c:init_freq_invariance() The main issue is finding a 'max' frequency that is not the absolute max turbo boost (this could result in not reaching it very often) but also not too low such that we're always clipping. And while we're here, IIUC AMD is still using acpi_cpufreq, but AFAIK the chips have a CPPC interface which could be used instead. Is there any progress on that? ^ permalink raw reply [flat|nested] 68+ messages in thread
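[As background for the exchange above: on x86, frequency-invariant accounting scales per-tick utilization by the ratio of current to "max" frequency, derived from APERF/MPERF counter deltas. The sketch below is a simplified user-space illustration of that arithmetic -- not the actual kernel code in arch/x86/kernel/smpboot.c -- showing why the choice of "max" matters.]

```c
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1u << SCHED_CAPACITY_SHIFT)

/*
 * Simplified sketch of the x86 frequency-invariance scale factor.
 * aperf/mperf are counter deltas since the last tick; max_freq_ratio
 * encodes the chosen "max" frequency relative to the base frequency as
 * a fixed-point value (max/base << SCHED_CAPACITY_SHIFT). Illustrative
 * only; the kernel computes this in arch_scale_freq_tick().
 */
static uint64_t freq_scale(uint64_t aperf, uint64_t mperf,
                           uint64_t max_freq_ratio)
{
    /* current-to-max frequency ratio in SCHED_CAPACITY_SCALE units */
    uint64_t scale = (aperf << (2 * SCHED_CAPACITY_SHIFT)) /
                     (mperf * max_freq_ratio);

    /*
     * Peter's point about picking "max": choose max_freq_ratio too low
     * and the result exceeds full capacity, so it gets clipped here all
     * the time; choose the absolute turbo maximum and the scale rarely
     * reaches SCHED_CAPACITY_SCALE at all.
     */
    if (scale > SCHED_CAPACITY_SCALE)
        scale = SCHED_CAPACITY_SCALE;

    return scale;
}
```

For instance, with max/base = 2 (max_freq_ratio = 2048), running at base frequency (aperf == mperf) yields a scale of 512, i.e. half capacity.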
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-23 7:03 ` Peter Zijlstra @ 2020-10-23 17:46 ` Tom Lendacky 2020-10-26 19:52 ` Fontenot, Nathan 0 siblings, 1 reply; 68+ messages in thread From: Tom Lendacky @ 2020-10-23 17:46 UTC (permalink / raw) To: Peter Zijlstra, Giovanni Gherdovich Cc: Rafael J. Wysocki, Mel Gorman, Viresh Kumar, Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown, puwen, yazen.ghannam, kim.phillips, suravee.suthikulpanit, Fontenot, Nathan On 10/23/20 2:03 AM, Peter Zijlstra wrote: > On Thu, Oct 22, 2020 at 10:10:35PM +0200, Giovanni Gherdovich wrote: >> * for the AMD EPYC machines we haven't yet implemented frequency invariant >> accounting, which might explain why schedutil looses to ondemand on all >> the benchmarks. > > Right, I poked the AMD people on that a few times, but nothing seems to > be forthcoming :/ Tom, any way you could perhaps expedite the matter? Adding Nathan to the thread to help out here. Thanks, Tom > > In particular we're looking for some X86_VENDOR_AMD/HYGON code to run in > > arch/x86/kernel/smpboot.c:init_freq_invariance() > > The main issue is finding a 'max' frequency that is not the absolute max > turbo boost (this could result in not reaching it very often) but also > not too low such that we're always clipping. > > And while we're here, IIUC AMD is still using acpi_cpufreq, but AFAIK > the chips have a CPPC interface which could be used instead. Is there > any progress on that? > ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-23 17:46 ` Tom Lendacky @ 2020-10-26 19:52 ` Fontenot, Nathan 0 siblings, 0 replies; 68+ messages in thread From: Fontenot, Nathan @ 2020-10-26 19:52 UTC (permalink / raw) To: Tom Lendacky, Peter Zijlstra, Giovanni Gherdovich Cc: Rafael J. Wysocki, Mel Gorman, Viresh Kumar, Julia Lawall, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown, puwen, yazen.ghannam, kim.phillips, suravee.suthikulpanit On 10/23/2020 12:46 PM, Tom Lendacky wrote: > On 10/23/20 2:03 AM, Peter Zijlstra wrote: >> On Thu, Oct 22, 2020 at 10:10:35PM +0200, Giovanni Gherdovich wrote: >>> * for the AMD EPYC machines we haven't yet implemented frequency invariant >>> accounting, which might explain why schedutil looses to ondemand on all >>> the benchmarks. >> >> Right, I poked the AMD people on that a few times, but nothing seems to >> be forthcoming :/ Tom, any way you could perhaps expedite the matter? > > Adding Nathan to the thread to help out here. > > Thanks, > Tom Thanks Tom, diving in... > >> >> In particular we're looking for some X86_VENDOR_AMD/HYGON code to run in >> >> arch/x86/kernel/smpboot.c:init_freq_invariance() >> >> The main issue is finding a 'max' frequency that is not the absolute max >> turbo boost (this could result in not reaching it very often) but also >> not too low such that we're always clipping. I've started looking into this and have a lead but need to confirm that the frequency value I'm getting is not an absolute max. >> >> And while we're here, IIUC AMD is still using acpi_cpufreq, but AFAIK >> the chips have a CPPC interface which could be used instead. Is there >> any progress on that? >> Correct, AMD uses acpi_cpufreq. The newer AMD chips do have a CPPC interface (not sure how far back 'newer' covers). 
I'll take a look at schedutil and cppc_cpufreq and the possibility of transitioning to them for AMD. -Nathan ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 12:29 ` Peter Zijlstra 2020-10-22 14:52 ` Mel Gorman @ 2020-10-22 15:45 ` A L 2020-10-22 15:55 ` Vincent Guittot 1 sibling, 1 reply; 68+ messages in thread From: A L @ 2020-10-22 15:45 UTC (permalink / raw) To: Peter Zijlstra, Rafael J. Wysocki Cc: Viresh Kumar, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown

---- From: Peter Zijlstra <peterz@infradead.org> -- Sent: 2020-10-22 - 14:29 ----

> On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
>> > However I do want to retire ondemand, conservative and also very much
>> > intel_pstate/active mode.
>>
>> I agree in general, but IMO it would not be prudent to do that without making
>> schedutil provide the same level of performance in all of the relevant use
>> cases.
>
> Agreed; I thought to have understood we were there already.

Hi,

Currently schedutil does not populate all stats like ondemand does, which
can be a problem for some monitoring software.

On my AMD 3000G CPU with kernel-5.9.1:

grep . /sys/devices/system/cpu/cpufreq/policy0/stats/*

With ondemand:
time_in_state:3900000 145179
time_in_state:1600000 9588482
total_trans:177565
trans_table:    From  :    To
trans_table:          :   3900000   1600000
trans_table:   3900000:         0     88783
trans_table:   1600000:     88782         0

With schedutil only two files exist:
reset:<empty>
total_trans:216609

I'd really like to have these stats populated with schedutil, if that's
possible.

Thanks.

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 15:45 ` A L @ 2020-10-22 15:55 ` Vincent Guittot 2020-10-23 5:23 ` Viresh Kumar 0 siblings, 1 reply; 68+ messages in thread From: Vincent Guittot @ 2020-10-22 15:55 UTC (permalink / raw) To: A L Cc: Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, Srinivas Pandruvada, Linux PM, Len Brown On Thu, 22 Oct 2020 at 17:45, A L <mail@lechevalier.se> wrote: > > > > ---- From: Peter Zijlstra <peterz@infradead.org> -- Sent: 2020-10-22 - 14:29 ---- > > > On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote: > >> > However I do want to retire ondemand, conservative and also very much > >> > intel_pstate/active mode. > >> > >> I agree in general, but IMO it would not be prudent to do that without making > >> schedutil provide the same level of performance in all of the relevant use > >> cases. > > > > Agreed; I though to have understood we were there already. > > Hi, > > > Currently schedutil does not populate all stats like ondemand does, which can be a problem for some monitoring software. > > On my AMD 3000G CPU with kernel-5.9.1: > > > grep. /sys/devices/system/cpu/cpufreq/policy0/stats/* > > With ondemand: > time_in_state:3900000 145179 > time_in_state:1600000 9588482 > total_trans:177565 > trans_table: From : To > trans_table: : 3900000 1600000 > trans_table: 3900000: 0 88783 > trans_table: 1600000: 88782 0 > > With schedutil only two file exists: > reset:<empty> > total_trans:216609 > > > I'd really like to have these stats populated with schedutil, if that's possible. Your problem might have been fixed with commit 96f60cddf7a1 ("cpufreq: stats: Enable stats for fast-switch as well") > > Thanks. > ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 15:55 ` Vincent Guittot @ 2020-10-23 5:23 ` Viresh Kumar 0 siblings, 0 replies; 68+ messages in thread From: Viresh Kumar @ 2020-10-23 5:23 UTC (permalink / raw) To: Vincent Guittot Cc: A L, Peter Zijlstra, Rafael J. Wysocki, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, Srinivas Pandruvada, Linux PM, Len Brown On 22-10-20, 17:55, Vincent Guittot wrote: > On Thu, 22 Oct 2020 at 17:45, A L <mail@lechevalier.se> wrote: > > > > > > > > ---- From: Peter Zijlstra <peterz@infradead.org> -- Sent: 2020-10-22 - 14:29 ---- > > > > > On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote: > > >> > However I do want to retire ondemand, conservative and also very much > > >> > intel_pstate/active mode. > > >> > > >> I agree in general, but IMO it would not be prudent to do that without making > > >> schedutil provide the same level of performance in all of the relevant use > > >> cases. > > > > > > Agreed; I though to have understood we were there already. > > > > Hi, > > > > > > Currently schedutil does not populate all stats like ondemand does, which can be a problem for some monitoring software. > > > > On my AMD 3000G CPU with kernel-5.9.1: > > > > > > grep. /sys/devices/system/cpu/cpufreq/policy0/stats/* > > > > With ondemand: > > time_in_state:3900000 145179 > > time_in_state:1600000 9588482 > > total_trans:177565 > > trans_table: From : To > > trans_table: : 3900000 1600000 > > trans_table: 3900000: 0 88783 > > trans_table: 1600000: 88782 0 > > > > With schedutil only two file exists: > > reset:<empty> > > total_trans:216609 > > > > > > I'd really like to have these stats populated with schedutil, if that's possible. 
> > Your problem might have been fixed with > commit 96f60cddf7a1 ("cpufreq: stats: Enable stats for fast-switch as well") Thanks Vincent. Right, I have already fixed that for everyone. -- viresh ^ permalink raw reply [flat|nested] 68+ messages in thread
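As an aside, the stats shown above are easy to consume programmatically: each `time_in_state` line pairs a frequency in kHz with a residency count (documented as units of 10 ms). A minimal userspace sketch (plain C, not kernel code; the helper names here are invented for illustration) that parses such lines:

```c
#include <assert.h>
#include <stdio.h>

/* Parse one "frequency residency" pair as found in a time_in_state
 * line, e.g. "3900000 145179".  Returns 0 on success, -1 on malformed
 * input. */
static int parse_time_in_state(const char *line,
                               unsigned long *freq_khz,
                               unsigned long *residency)
{
    if (sscanf(line, "%lu %lu", freq_khz, residency) != 2)
        return -1;
    return 0;
}

/* Sum the residency column over an array of time_in_state lines,
 * silently skipping malformed entries. */
static unsigned long total_residency(const char *const *lines, int n)
{
    unsigned long freq, res, total = 0;

    for (int i = 0; i < n; i++)
        if (parse_time_in_state(lines[i], &freq, &res) == 0)
            total += res;
    return total;
}
```

Monitoring software of the kind A L mentions typically does exactly this kind of parse over `policy*/stats/time_in_state`, which is why the files being absent under schedutil was noticeable.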
* [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate 2020-10-22 12:02 ` default cpufreq gov, was: " Peter Zijlstra 2020-10-22 12:19 ` Rafael J. Wysocki @ 2020-10-22 16:23 ` Rafael J. Wysocki 2020-10-23 6:29 ` Viresh Kumar 2020-10-23 15:15 ` [PATCH v2] " Rafael J. Wysocki 2020-10-27 11:11 ` default cpufreq gov, was: [PATCH] sched/fair: check for idle core Qais Yousef 2 siblings, 2 replies; 68+ messages in thread From: Rafael J. Wysocki @ 2020-10-22 16:23 UTC (permalink / raw) To: Peter Zijlstra, Viresh Kumar, Julia Lawall Cc: Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Subject: [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate Commit 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP") was meant to cause intel_pstate without HWP to be used in the passive mode with the schedutil governor on top of it by default, but it missed the case in which either "ondemand" or "conservative" was selected as the default governor in the existing kernel config, in which case the previous old governor configuration would be used, causing the default legacy governor to be used on top of intel_pstate instead of schedutil. Address this by preventing "ondemand" and "conservative" from being configured as the default cpufreq governor in the case when schedutil is the default choice for the default governor setting. [Note that the default cpufreq governor can still be set via the kernel command line if need be and that choice is not limited, so if anyone really wants to use one of the legacy governors by default, it can be achieved this way.] 
Fixes: 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP") Cc: 5.8+ <stable@vger.kernel.org> # 5.8+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- drivers/cpufreq/Kconfig | 2 ++ 1 file changed, 2 insertions(+) Index: linux-pm/drivers/cpufreq/Kconfig =================================--- linux-pm.orig/drivers/cpufreq/Kconfig +++ linux-pm/drivers/cpufreq/Kconfig @@ -71,6 +71,7 @@ config CPU_FREQ_DEFAULT_GOV_USERSPACE config CPU_FREQ_DEFAULT_GOV_ONDEMAND bool "ondemand" + depends on !SMP || !X86_INTEL_PSTATE select CPU_FREQ_GOV_ONDEMAND select CPU_FREQ_GOV_PERFORMANCE help @@ -83,6 +84,7 @@ config CPU_FREQ_DEFAULT_GOV_ONDEMAND config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE bool "conservative" + depends on !SMP || !X86_INTEL_PSTATE select CPU_FREQ_GOV_CONSERVATIVE select CPU_FREQ_GOV_PERFORMANCE help ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate 2020-10-22 16:23 ` [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate Rafael J. Wysocki @ 2020-10-23 6:29 ` Viresh Kumar 2020-10-23 11:59 ` Rafael J. Wysocki 2020-10-23 15:15 ` [PATCH v2] " Rafael J. Wysocki 1 sibling, 1 reply; 68+ messages in thread From: Viresh Kumar @ 2020-10-23 6:29 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Peter Zijlstra, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown On 22-10-20, 18:23, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > Subject: [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate > > Commit 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by > default without HWP") was meant to cause intel_pstate without HWP > to be used in the passive mode with the schedutil governor on top of > it by default, but it missed the case in which either "ondemand" or > "conservative" was selected as the default governor in the existing > kernel config, in which case the previous old governor configuration > would be used, causing the default legacy governor to be used on top > of intel_pstate instead of schedutil. > > Address this by preventing "ondemand" and "conservative" from being > configured as the default cpufreq governor in the case when schedutil > is the default choice for the default governor setting. > > [Note that the default cpufreq governor can still be set via the > kernel command line if need be and that choice is not limited, > so if anyone really wants to use one of the legacy governors by > default, it can be achieved this way.] 
> > Fixes: 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP") > Cc: 5.8+ <stable@vger.kernel.org> # 5.8+ > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > drivers/cpufreq/Kconfig | 2 ++ > 1 file changed, 2 insertions(+) > > Index: linux-pm/drivers/cpufreq/Kconfig > =================================> --- linux-pm.orig/drivers/cpufreq/Kconfig > +++ linux-pm/drivers/cpufreq/Kconfig > @@ -71,6 +71,7 @@ config CPU_FREQ_DEFAULT_GOV_USERSPACE > > config CPU_FREQ_DEFAULT_GOV_ONDEMAND > bool "ondemand" > + depends on !SMP || !X86_INTEL_PSTATE > select CPU_FREQ_GOV_ONDEMAND > select CPU_FREQ_GOV_PERFORMANCE > help > @@ -83,6 +84,7 @@ config CPU_FREQ_DEFAULT_GOV_ONDEMAND > > config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE > bool "conservative" > + depends on !SMP || !X86_INTEL_PSTATE While reading this first it felt like a SMP platforms related problem (which I was surprised about), and then I understood what you are doing. I wonder if rewriting it this way makes it more readable with same result eventually. depends on !(X86_INTEL_PSTATE && SMP) -- viresh ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate 2020-10-23 6:29 ` Viresh Kumar @ 2020-10-23 11:59 ` Rafael J. Wysocki 0 siblings, 0 replies; 68+ messages in thread From: Rafael J. Wysocki @ 2020-10-23 11:59 UTC (permalink / raw) To: Viresh Kumar Cc: Rafael J. Wysocki, Peter Zijlstra, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, Linux Kernel Mailing List, Valentin Schneider, Gilles Muller, Srinivas Pandruvada, Linux PM, Len Brown On Fri, Oct 23, 2020 at 8:17 AM Viresh Kumar <viresh.kumar@linaro.org> wrote: > > On 22-10-20, 18:23, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Subject: [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate > > > > Commit 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by > > default without HWP") was meant to cause intel_pstate without HWP > > to be used in the passive mode with the schedutil governor on top of > > it by default, but it missed the case in which either "ondemand" or > > "conservative" was selected as the default governor in the existing > > kernel config, in which case the previous old governor configuration > > would be used, causing the default legacy governor to be used on top > > of intel_pstate instead of schedutil. > > > > Address this by preventing "ondemand" and "conservative" from being > > configured as the default cpufreq governor in the case when schedutil > > is the default choice for the default governor setting. > > > > [Note that the default cpufreq governor can still be set via the > > kernel command line if need be and that choice is not limited, > > so if anyone really wants to use one of the legacy governors by > > default, it can be achieved this way.] 
> > > > Fixes: 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP") > > Cc: 5.8+ <stable@vger.kernel.org> # 5.8+ > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > --- > > drivers/cpufreq/Kconfig | 2 ++ > > 1 file changed, 2 insertions(+) > > > > Index: linux-pm/drivers/cpufreq/Kconfig > > =================================> > --- linux-pm.orig/drivers/cpufreq/Kconfig > > +++ linux-pm/drivers/cpufreq/Kconfig > > @@ -71,6 +71,7 @@ config CPU_FREQ_DEFAULT_GOV_USERSPACE > > > > config CPU_FREQ_DEFAULT_GOV_ONDEMAND > > bool "ondemand" > > + depends on !SMP || !X86_INTEL_PSTATE > > select CPU_FREQ_GOV_ONDEMAND > > select CPU_FREQ_GOV_PERFORMANCE > > help > > @@ -83,6 +84,7 @@ config CPU_FREQ_DEFAULT_GOV_ONDEMAND > > > > config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE > > bool "conservative" > > + depends on !SMP || !X86_INTEL_PSTATE > > While reading this first it felt like a SMP platforms related problem > (which I was surprised about), and then I understood what you are > doing. > > I wonder if rewriting it this way makes it more readable with same > result eventually. > > depends on !(X86_INTEL_PSTATE && SMP) Agreed, will update. Thanks! ^ permalink raw reply [flat|nested] 68+ messages in thread
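Viresh's suggested spelling and the original are logically identical: `!SMP || !X86_INTEL_PSTATE` is `!(X86_INTEL_PSTATE && SMP)` by De Morgan's law, so only readability changes. Since Kconfig `depends on` expressions are plain boolean logic, the equivalence can be checked exhaustively; a quick sketch (the function names are mine, just labeling the two forms):

```c
#include <assert.h>
#include <stdbool.h>

/* The v1 form: depends on !SMP || !X86_INTEL_PSTATE */
static bool depends_v1(bool smp, bool intel_pstate)
{
    return !smp || !intel_pstate;
}

/* The v2 form: depends on !(X86_INTEL_PSTATE && SMP) */
static bool depends_v2(bool smp, bool intel_pstate)
{
    return !(intel_pstate && smp);
}

/* Exhaust all four input combinations; returns true iff the two
 * expressions agree everywhere. */
static bool forms_agree(void)
{
    for (int smp = 0; smp <= 1; smp++)
        for (int ip = 0; ip <= 1; ip++)
            if (depends_v1(smp, ip) != depends_v2(smp, ip))
                return false;
    return true;
}
```

Both forms only exclude the single case where intel_pstate can actually be the driver (X86_INTEL_PSTATE on an SMP kernel), which is exactly the case where schedutil is the intended default.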
* [PATCH v2] cpufreq: Avoid configuring old governors as default with intel_pstate 2020-10-22 16:23 ` [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate Rafael J. Wysocki 2020-10-23 6:29 ` Viresh Kumar @ 2020-10-23 15:15 ` Rafael J. Wysocki 2020-10-27 3:13 ` Viresh Kumar 1 sibling, 1 reply; 68+ messages in thread From: Rafael J. Wysocki @ 2020-10-23 15:15 UTC (permalink / raw) To: Peter Zijlstra, Viresh Kumar, Julia Lawall Cc: Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Commit 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP") was meant to cause intel_pstate to be used in the passive mode with the schedutil governor on top of it, but it missed the case in which either "ondemand" or "conservative" was selected as the default governor in the existing kernel config, in which case the previous old governor configuration would be used, causing the default legacy governor to be used on top of intel_pstate instead of schedutil. Address this by preventing "ondemand" and "conservative" from being configured as the default cpufreq governor in the case when schedutil is the default choice for the default governor setting. [Note that the default cpufreq governor can still be set via the kernel command line if need be and that choice is not limited, so if anyone really wants to use one of the legacy governors by default, it can be achieved this way.] Fixes: 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP") Reported-by: Julia Lawall <julia.lawall@inria.fr> Cc: 5.8+ <stable@vger.kernel.org> # 5.8+ Signed-off-by: Rafael J. 
Wysocki <rafael.j.wysocki@intel.com> --- The v2 addresses a review comment from Viresh regarding the expression format and adds a missing Reported-by for Julia. --- drivers/cpufreq/Kconfig | 2 ++ 1 file changed, 2 insertions(+) Index: linux-pm/drivers/cpufreq/Kconfig =================================--- linux-pm.orig/drivers/cpufreq/Kconfig +++ linux-pm/drivers/cpufreq/Kconfig @@ -71,6 +71,7 @@ config CPU_FREQ_DEFAULT_GOV_USERSPACE config CPU_FREQ_DEFAULT_GOV_ONDEMAND bool "ondemand" + depends on !(X86_INTEL_PSTATE && SMP) select CPU_FREQ_GOV_ONDEMAND select CPU_FREQ_GOV_PERFORMANCE help @@ -83,6 +84,7 @@ config CPU_FREQ_DEFAULT_GOV_ONDEMAND config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE bool "conservative" + depends on !(X86_INTEL_PSTATE && SMP) select CPU_FREQ_GOV_CONSERVATIVE select CPU_FREQ_GOV_PERFORMANCE help ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2] cpufreq: Avoid configuring old governors as default with intel_pstate 2020-10-23 15:15 ` [PATCH v2] " Rafael J. Wysocki @ 2020-10-27 3:13 ` Viresh Kumar 0 siblings, 0 replies; 68+ messages in thread From: Viresh Kumar @ 2020-10-27 3:13 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Peter Zijlstra, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, Linux PM, Len Brown On 23-10-20, 17:15, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Commit 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by > default without HWP") was meant to cause intel_pstate to be used > in the passive mode with the schedutil governor on top of it, but > it missed the case in which either "ondemand" or "conservative" > was selected as the default governor in the existing kernel config, > in which case the previous old governor configuration would be used, > causing the default legacy governor to be used on top of intel_pstate > instead of schedutil. > > Address this by preventing "ondemand" and "conservative" from being > configured as the default cpufreq governor in the case when schedutil > is the default choice for the default governor setting. > > [Note that the default cpufreq governor can still be set via the > kernel command line if need be and that choice is not limited, > so if anyone really wants to use one of the legacy governors by > default, it can be achieved this way.] > > Fixes: 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP") > Reported-by: Julia Lawall <julia.lawall@inria.fr> > Cc: 5.8+ <stable@vger.kernel.org> # 5.8+ > Signed-off-by: Rafael J. 
Wysocki <rafael.j.wysocki@intel.com> > --- > > The v2 addresses a review comment from Viresh regarding the expression format > and adds a missing Reported-by for Julia. > > --- > drivers/cpufreq/Kconfig | 2 ++ > 1 file changed, 2 insertions(+) Acked-by: Viresh Kumar <viresh.kumar@linaro.org> -- viresh ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-22 12:02 ` default cpufreq gov, was: " Peter Zijlstra 2020-10-22 12:19 ` Rafael J. Wysocki 2020-10-22 16:23 ` [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate Rafael J. Wysocki @ 2020-10-27 11:11 ` Qais Yousef 2020-10-27 11:26 ` Valentin Schneider 2 siblings, 1 reply; 68+ messages in thread From: Qais Yousef @ 2020-10-27 11:11 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Viresh Kumar, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada On 10/22/20 14:02, Peter Zijlstra wrote: > On Thu, Oct 22, 2020 at 01:45:25PM +0200, Rafael J. Wysocki wrote: > > On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote: > > > On 22-10-20, 09:11, Peter Zijlstra wrote: > > > > Well, but we need to do something to force people onto schedutil, > > > > otherwise we'll get more crap like this thread. > > > > > > > > Can we take the choice away? Only let Kconfig select which governors are > > > > available and then set the default ourselves? I mean, the end goal being > > > > to not have selectable governors at all, this seems like a good step > > > > anyway. > > > > > > Just to clarify and complete the point a bit here, the users can still > > > pass the default governor from cmdline using > > > cpufreq.default_governor=, which will take precedence over the one the > > > below code is playing with. And later once the kernel is up, they can > > > still choose a different governor from userspace. > > > > Right. > > > > Also some people simply set "performance" as the default governor and then > > don't touch cpufreq otherwise (the idea is to get everything to the max > > freq right away and stay in that mode forever). This still needs to be > > possible IMO. 
> > Performance/powersave make sense to keep. > > However I do want to retire ondemand, conservative and also very much > intel_pstate/active mode. I also have very little sympathy for > userspace. Userspace is useful for testing and sanity checking. Not sure if people use it to measure voltage/current at each frequency to generate dynamic-power-coefficient for their platform. Lukasz, Dietmar? Thanks -- Qais Yousef > > We should start by making it hard to use them and eventually just delete > them outright. > ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-27 11:11 ` default cpufreq gov, was: [PATCH] sched/fair: check for idle core Qais Yousef @ 2020-10-27 11:26 ` Valentin Schneider 2020-10-27 11:42 ` Qais Yousef 0 siblings, 1 reply; 68+ messages in thread From: Valentin Schneider @ 2020-10-27 11:26 UTC (permalink / raw) To: Qais Yousef Cc: Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Gilles Muller, srinivas.pandruvada On 27/10/20 11:11, Qais Yousef wrote: > On 10/22/20 14:02, Peter Zijlstra wrote: >> However I do want to retire ondemand, conservative and also very much >> intel_pstate/active mode. I also have very little sympathy for >> userspace. > > Userspace is useful for testing and sanity checking. Not sure if people use it > to measure voltage/current at each frequency to generate > dynamic-power-coefficient for their platform. Lukasz, Dietmar? > It's valuable even just for cpufreq sanity checking - we have that test that goes through increasing frequencies and asserts the work done is monotonically increasing. This has been quite useful in the past to detect broken bits. That *should* still be totally doable with any other governor by using the scaling_{min, max}_freq sysfs interface. > Thanks ^ permalink raw reply [flat|nested] 68+ messages in thread
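The sanity check Valentin describes reduces to a simple invariant: pin the CPU to each supported frequency in ascending order (via the userspace governor or `scaling_{min,max}_freq`), run a fixed workload, and assert the work completed never decreases. A sketch of just that core assertion (userspace C; the measurement side is assumed to exist elsewhere):

```c
#include <assert.h>
#include <stddef.h>

/* Given work_done[i] measured at the i-th frequency (frequencies
 * assumed already sorted ascending), return the index of the first
 * entry that does *less* work than its predecessor, or -1 if the
 * series is monotonically non-decreasing, i.e. the hardware behaved. */
static int first_nonmonotonic(const double *work_done, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (work_done[i] < work_done[i - 1])
            return (int)i;
    return -1;
}
```

A non-negative return value flags a "broken bit": either the driver did not actually change the frequency, or a higher OPP is somehow slower, both of which this style of test has caught in the past.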
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-27 11:26 ` Valentin Schneider @ 2020-10-27 11:42 ` Qais Yousef 2020-10-27 11:48 ` Viresh Kumar 0 siblings, 1 reply; 68+ messages in thread From: Qais Yousef @ 2020-10-27 11:42 UTC (permalink / raw) To: Valentin Schneider Cc: Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Gilles Muller, srinivas.pandruvada On 10/27/20 11:26, Valentin Schneider wrote: > > On 27/10/20 11:11, Qais Yousef wrote: > > On 10/22/20 14:02, Peter Zijlstra wrote: > >> However I do want to retire ondemand, conservative and also very much > >> intel_pstate/active mode. I also have very little sympathy for > >> userspace. > > > > Userspace is useful for testing and sanity checking. Not sure if people use it > > to measure voltage/current at each frequency to generate > > dynamic-power-coefficient for their platform. Lukasz, Dietmar? > > > > It's valuable even just for cpufreq sanity checking - we have that test > that goes through increasing frequencies and asserts the work done is > monotonically increasing. This has been quite useful in the past to detect > broken bits. > > That *should* still be totally doable with any other governor by using the > scaling_{min, max}_freq sysfs interface. True. This effectively makes every governor a potential user space governor. /me not sure to be happy or grumpy about it Thanks -- Qais Yousef ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core 2020-10-27 11:42 ` Qais Yousef @ 2020-10-27 11:48 ` Viresh Kumar 0 siblings, 0 replies; 68+ messages in thread From: Viresh Kumar @ 2020-10-27 11:48 UTC (permalink / raw) To: Qais Yousef Cc: Valentin Schneider, Peter Zijlstra, Rafael J. Wysocki, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Gilles Muller, srinivas.pandruvada On 27-10-20, 11:42, Qais Yousef wrote: > On 10/27/20 11:26, Valentin Schneider wrote: > > > > On 27/10/20 11:11, Qais Yousef wrote: > > > On 10/22/20 14:02, Peter Zijlstra wrote: > > >> However I do want to retire ondemand, conservative and also very much > > >> intel_pstate/active mode. I also have very little sympathy for > > >> userspace. > > > > > > Userspace is useful for testing and sanity checking. Not sure if people use it > > > to measure voltage/current at each frequency to generate > > > dynamic-power-coefficient for their platform. Lukasz, Dietmar? > > > > > > > It's valuable even just for cpufreq sanity checking - we have that test > > that goes through increasing frequencies and asserts the work done is > > monotonically increasing. This has been quite useful in the past to detect > > broken bits. > > > > That *should* still be totally doable with any other governor by using the > > scaling_{min, max}_freq sysfs interface. > > True. This effectively makes every governor a potential user space governor. > > /me not sure to be happy or grumpy about it Userspace governor should be kept as is, it is very effective to get unnecessary governor code out of the path when testing basic functioning of the hardware/driver. It is quite useful when things don't work as expected. -- viresh ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-22 11:45 ` Rafael J. Wysocki 2020-10-22 12:02 ` default cpufreq gov, was: " Peter Zijlstra @ 2020-10-23 6:24 ` Viresh Kumar 2020-10-23 15:06 ` Rafael J. Wysocki 1 sibling, 1 reply; 68+ messages in thread From: Viresh Kumar @ 2020-10-23 6:24 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Peter Zijlstra, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada On 22-10-20, 13:45, Rafael J. Wysocki wrote: > On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote: > > And I am not really sure why we always wanted this backup performance > > governor to be there unless the said governors are built as module. > > Apparently, some old drivers had problems with switching frequencies fast enough > for ondemand to be used with them and the fallback was for those cases. AFAICS. Do we still need this ? Or better ask those platforms to individually enable both of them. -- viresh ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-23 6:24 ` Viresh Kumar @ 2020-10-23 15:06 ` Rafael J. Wysocki 2020-10-27 3:13 ` Viresh Kumar 0 siblings, 1 reply; 68+ messages in thread From: Rafael J. Wysocki @ 2020-10-23 15:06 UTC (permalink / raw) To: Viresh Kumar Cc: Peter Zijlstra, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, linux-pm On Friday, October 23, 2020 8:12:46 AM CEST Viresh Kumar wrote: > On 22-10-20, 13:45, Rafael J. Wysocki wrote: > > On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote: > > > And I am not really sure why we always wanted this backup performance > > > governor to be there unless the said governors are built as module. > > > > Apparently, some old drivers had problems with switching frequencies fast enough > > for ondemand to be used with them and the fallback was for those cases. AFAICS. > > Do we still need this ? For reasonably modern hardware, I don't think so. > Or better ask those platforms to individually > enable both of them. But who knows what they are? :-) ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-23 15:06 ` Rafael J. Wysocki @ 2020-10-27 3:13 ` Viresh Kumar 0 siblings, 0 replies; 68+ messages in thread From: Viresh Kumar @ 2020-10-27 3:13 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Peter Zijlstra, Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, srinivas.pandruvada, linux-pm On 23-10-20, 17:06, Rafael J. Wysocki wrote: > On Friday, October 23, 2020 8:12:46 AM CEST Viresh Kumar wrote: > > On 22-10-20, 13:45, Rafael J. Wysocki wrote: > > > On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote: > > > > And I am not really sure why we always wanted this backup performance > > > > governor to be there unless the said governors are built as module. > > > > > > Apparently, some old drivers had problems with switching frequencies fast enough > > > for ondemand to be used with them and the fallback was for those cases. AFAICS. > > > > Do we still need this ? > > For the reasonably modern hardware, I don't think so. > > > Or better ask those platforms to individually > > enable both of them. > > Bu who knows what they are? :-) I was planning to break them and let them complain :) -- viresh ^ permalink raw reply [flat|nested] 68+ messages in thread
* AW: [PATCH] sched/fair: check for idle core 2020-10-22 7:11 ` Peter Zijlstra 2020-10-22 10:59 ` Viresh Kumar @ 2020-10-22 11:21 ` Walter Harms 1 sibling, 0 replies; 68+ messages in thread From: Walter Harms @ 2020-10-22 11:21 UTC (permalink / raw) To: Peter Zijlstra, Rafael J. Wysocki Cc: Julia Lawall, Mel Gorman, Ingo Molnar, kernel-janitors, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller, viresh.kumar, srinivas.pandruvada To avoid this #ifdef jungle .. I guess it would be safe to search for a governor, even ones that are not enabled. And a second thing: can we change the subject, please? jm2c wh ________________________________________ From: Peter Zijlstra [peterz@infradead.org] Sent: Thursday, 22 October 2020 09:11 To: Rafael J. Wysocki Cc: Julia Lawall; Mel Gorman; Ingo Molnar; kernel-janitors@vger.kernel.org; Juri Lelli; Vincent Guittot; Dietmar Eggemann; Steven Rostedt; Ben Segall; Daniel Bristot de Oliveira; linux-kernel@vger.kernel.org; Valentin Schneider; Gilles Muller; viresh.kumar@linaro.org; srinivas.pandruvada@linux.intel.com Subject: Re: [PATCH] sched/fair: check for idle core On Wed, Oct 21, 2020 at 08:11:59PM +0200, Rafael J. Wysocki wrote: > > @@ -144,6 +145,7 @@ config CPU_FREQ_GOV_USERSPACE > > > > config CPU_FREQ_GOV_ONDEMAND > > tristate "'ondemand' cpufreq policy governor" > > + depends on !SMP > > But I don't think that we can do this and the one below. Well, but we need to do something to force people onto schedutil, otherwise we'll get more crap like this thread. Can we take the choice away? Only let Kconfig select which governors are available and then set the default ourselves? I mean, the end goal being to not have selectable governors at all, this seems like a good step anyway. 
--- diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index 2c7171e0b001..3a9f54b382c0 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -34,77 +34,6 @@ config CPU_FREQ_STAT If in doubt, say N. -choice - prompt "Default CPUFreq governor" - default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM - default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP - default CPU_FREQ_DEFAULT_GOV_PERFORMANCE - help - This option sets which CPUFreq governor shall be loaded at - startup. If in doubt, use the default setting. - -config CPU_FREQ_DEFAULT_GOV_PERFORMANCE - bool "performance" - select CPU_FREQ_GOV_PERFORMANCE - help - Use the CPUFreq governor 'performance' as default. This sets - the frequency statically to the highest frequency supported by - the CPU. - -config CPU_FREQ_DEFAULT_GOV_POWERSAVE - bool "powersave" - select CPU_FREQ_GOV_POWERSAVE - help - Use the CPUFreq governor 'powersave' as default. This sets - the frequency statically to the lowest frequency supported by - the CPU. - -config CPU_FREQ_DEFAULT_GOV_USERSPACE - bool "userspace" - select CPU_FREQ_GOV_USERSPACE - help - Use the CPUFreq governor 'userspace' as default. This allows - you to set the CPU frequency manually or when a userspace - program shall be able to set the CPU dynamically without having - to enable the userspace governor manually. - -config CPU_FREQ_DEFAULT_GOV_ONDEMAND - bool "ondemand" - select CPU_FREQ_GOV_ONDEMAND - select CPU_FREQ_GOV_PERFORMANCE - help - Use the CPUFreq governor 'ondemand' as default. This allows - you to get a full dynamic frequency capable system by simply - loading your cpufreq low-level hardware driver. - Be aware that not all cpufreq drivers support the ondemand - governor. If unsure have a look at the help section of the - driver. Fallback governor will be the performance governor. 
- -config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE - bool "conservative" - select CPU_FREQ_GOV_CONSERVATIVE - select CPU_FREQ_GOV_PERFORMANCE - help - Use the CPUFreq governor 'conservative' as default. This allows - you to get a full dynamic frequency capable system by simply - loading your cpufreq low-level hardware driver. - Be aware that not all cpufreq drivers support the conservative - governor. If unsure have a look at the help section of the - driver. Fallback governor will be the performance governor. - -config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL - bool "schedutil" - depends on SMP - select CPU_FREQ_GOV_SCHEDUTIL - select CPU_FREQ_GOV_PERFORMANCE - help - Use the 'schedutil' CPUFreq governor by default. If unsure, - have a look at the help section of that governor. The fallback - governor will be 'performance'. - -endchoice - config CPU_FREQ_GOV_PERFORMANCE tristate "'performance' governor" help @@ -145,6 +74,7 @@ config CPU_FREQ_GOV_USERSPACE config CPU_FREQ_GOV_ONDEMAND tristate "'ondemand' cpufreq policy governor" select CPU_FREQ_GOV_COMMON + select CPU_FREQ_GOV_PERFORMANCE help 'ondemand' - This driver adds a dynamic cpufreq policy governor. 
The governor does a periodic polling and @@ -164,6 +94,7 @@ config CPU_FREQ_GOV_CONSERVATIVE tristate "'conservative' cpufreq governor" depends on CPU_FREQ select CPU_FREQ_GOV_COMMON + select CPU_FREQ_GOV_PERFORMANCE help 'conservative' - this driver is rather similar to the 'ondemand' governor both in its source code and its purpose, the difference is @@ -188,6 +119,7 @@ config CPU_FREQ_GOV_SCHEDUTIL bool "'schedutil' cpufreq policy governor" depends on CPU_FREQ && SMP select CPU_FREQ_GOV_ATTR_SET + select CPU_FREQ_GOV_PERFORMANCE select IRQ_WORK help This governor makes decisions based on the utilization data provided diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 1877f5e2e5b0..6848e3337b65 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -626,6 +626,49 @@ static struct cpufreq_governor *find_governor(const char *str_governor) return NULL; } +static struct cpufreq_governor *cpufreq_default_governor(void) +{ + struct cpufreq_governor *gov = NULL; + +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL + gov = find_governor("schedutil"); + if (gov) + return gov; +#endif + +#ifdef CONFIG_CPU_FREQ_GOV_ONDEMAND + gov = find_governor("ondemand"); + if (gov) + return gov; +#endif + +#ifdef CONFIG_CPU_FREQ_GOV_CONSERVATIVE + gov = find_governor("conservative"); + if (gov) + return gov; +#endif + +#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE + gov = find_governor("userspace"); + if (gov) + return gov; +#endif + +#ifdef CONFIG_CPU_FREQ_GOV_POWERSAVE + gov = find_governor("powersave"); + if (gov) + return gov; +#endif + +#ifdef CONFIG_CPU_FREQ_GOV_PERFORMANCE + gov = find_governor("performance"); + if (gov) + return gov; +#endif + + return gov; +} + static struct cpufreq_governor *get_governor(const char *str_governor) { struct cpufreq_governor *t; diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c index aa39ff31ec9f..25449a1ee954 100644 --- a/drivers/cpufreq/cpufreq_conservative.c +++ 
b/drivers/cpufreq/cpufreq_conservative.c @@ -330,12 +330,5 @@ MODULE_DESCRIPTION("'cpufreq_conservative' - A dynamic cpufreq governor for " "optimised for use in a battery environment"); MODULE_LICENSE("GPL"); -#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE -struct cpufreq_governor *cpufreq_default_governor(void) -{ - return &CPU_FREQ_GOV_CONSERVATIVE; -} -#endif - cpufreq_governor_init(CPU_FREQ_GOV_CONSERVATIVE); cpufreq_governor_exit(CPU_FREQ_GOV_CONSERVATIVE); diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c index ac361a8b1d3b..2a7fd645e1fb 100644 --- a/drivers/cpufreq/cpufreq_ondemand.c +++ b/drivers/cpufreq/cpufreq_ondemand.c @@ -467,12 +467,5 @@ MODULE_DESCRIPTION("'cpufreq_ondemand' - A dynamic cpufreq governor for " "Low Latency Frequency Transition capable processors"); MODULE_LICENSE("GPL"); -#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND -struct cpufreq_governor *cpufreq_default_governor(void) -{ - return &CPU_FREQ_GOV_ONDEMAND; -} -#endif - cpufreq_governor_init(CPU_FREQ_GOV_ONDEMAND); cpufreq_governor_exit(CPU_FREQ_GOV_ONDEMAND); diff --git a/drivers/cpufreq/cpufreq_performance.c b/drivers/cpufreq/cpufreq_performance.c index 71c1d9aba772..c6c7473a3a73 100644 --- a/drivers/cpufreq/cpufreq_performance.c +++ b/drivers/cpufreq/cpufreq_performance.c @@ -23,12 +23,6 @@ static struct cpufreq_governor cpufreq_gov_performance = { .limits = cpufreq_gov_performance_limits, }; -#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE -struct cpufreq_governor *cpufreq_default_governor(void) -{ - return &cpufreq_gov_performance; -} -#endif #ifndef CONFIG_CPU_FREQ_GOV_PERFORMANCE_MODULE struct cpufreq_governor *cpufreq_fallback_governor(void) { diff --git a/drivers/cpufreq/cpufreq_powersave.c b/drivers/cpufreq/cpufreq_powersave.c index 7749522355b5..8b9555f70e9d 100644 --- a/drivers/cpufreq/cpufreq_powersave.c +++ b/drivers/cpufreq/cpufreq_powersave.c @@ -27,12 +27,5 @@ MODULE_AUTHOR("Dominik Brodowski <linux@brodo.de>"); 
MODULE_DESCRIPTION("CPUfreq policy governor 'powersave'"); MODULE_LICENSE("GPL"); -#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE -struct cpufreq_governor *cpufreq_default_governor(void) -{ - return &cpufreq_gov_powersave; -} -#endif - cpufreq_governor_init(cpufreq_gov_powersave); cpufreq_governor_exit(cpufreq_gov_powersave); diff --git a/drivers/cpufreq/cpufreq_userspace.c b/drivers/cpufreq/cpufreq_userspace.c index 50a4d7846580..48ce53038ac0 100644 --- a/drivers/cpufreq/cpufreq_userspace.c +++ b/drivers/cpufreq/cpufreq_userspace.c @@ -131,12 +131,5 @@ MODULE_AUTHOR("Dominik Brodowski <linux@brodo.de>, " MODULE_DESCRIPTION("CPUfreq policy governor 'userspace'"); MODULE_LICENSE("GPL"); -#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE -struct cpufreq_governor *cpufreq_default_governor(void) -{ - return &cpufreq_gov_userspace; -} -#endif - cpufreq_governor_init(cpufreq_gov_userspace); cpufreq_governor_exit(cpufreq_gov_userspace); diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 5ae7b4e6e8d6..3b77e408377a 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -888,13 +888,6 @@ struct cpufreq_governor schedutil_gov = { .limits = sugov_limits, }; -#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL -struct cpufreq_governor *cpufreq_default_governor(void) -{ - return &schedutil_gov; -} -#endif - cpufreq_governor_init(schedutil_gov); #ifdef CONFIG_ENERGY_MODEL ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 11:56 ` Julia Lawall 2020-10-21 12:19 ` Peter Zijlstra @ 2020-10-21 12:28 ` Mel Gorman 1 sibling, 0 replies; 68+ messages in thread From: Mel Gorman @ 2020-10-21 12:28 UTC (permalink / raw) To: Julia Lawall Cc: Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote: > > I suspect that the benefit of this patch is due to avoiding the overhead > > of wake_affine_weight() check because the following check exists in > > select_idle_sibling > > I'm running 160 threads on 160 cores (80 physical cores). All of the > threads are thus best off to just stay where they are. If one thread > moves to the socket of prev, then they will be displacing other threads > that also expect to return to where they were previously located. > > You can see this in the traces shown here: > > https://pages.lip6.fr/Julia.Lawall/uas.pdf > This is an important point -- almost all CPUs are commonly active and the search for an idle CPU is going to be difficult given that it is inherently race-prone. In this case, you're right, it may indeed be better off to just leave the task where it is if the prev CPU is idle. Not only does it avoid the wake_affine_weight cost, it avoids a useless search of idle siblings. However, there is also the important case where a machine is only partially or lightly utilised. In that case, a cross-node migration may be beneficial as the wakee may need to access data in the waker's cache or memory local to the waker (no guarantee, the waker/wakee could be completely independent but the scheduler cannot tell). In the low utilisation case, select_idle_sibling is also likely to be less expensive. 
Leaving a task on a remote node because the prev CPU was idle is an important semantic change in the behaviour of the scheduler when there is spare idle capacity in either domain. It's non-obvious from the changelog and the patch itself that this change of behaviour happens. > Prior to 5.8, my machine was using intel_pstate and had few background > tasks. Thus the problem wasn't visible in practice. Starting with 5.8 > the kernel decided that intel_cpufreq would be more appropriate, which > introduced kworkers every 0.004 seconds on all cores. In the graphs for > early versions, sometimes the whole benchmark runs with the threads just > staying on their cores, or a few migrations. Starting with 5.8, after 5 > seconds where there are a number of synchronizations, all of the threads > move around between all of the cores. Typically, one bad placement leads > to 10-15 threads moving around, until one ends up on the idle core where > the original thread was intended to be. > And this is an issue but I don't think the fix is avoiding cross-node migrations entirely if the prev CPU happened to be idle because cache lines will have to be bounced and/or remote data be accessed. At minimum, the low utilisation case should be considered and the changelog justify why avoiding cross-node wakeups is still appropriate when the waker CPU has plenty of idle CPUs. > > > > /* > > * If the previous CPU is cache affine and idle, don't be stupid: > > */ > > if (prev != target && cpus_share_cache(prev, target) && > > (available_idle_cpu(prev) || sched_idle_cpu(prev))) > > return prev; > > This isn't triggered in the problematic case, because the problematic case > is where the prev core and the waker core are on different sockets. > I simply wanted to highlight the fact it checks whether caches are shared or not as that is also taken into account in select_idle_sibling for example. 
> To my understanding, when the runnable load was used and prev was idle, > wake_affine_weight would fail, and then wake_affine would return prev. > With the load average, in the case where there is a thread on the waker > core and there has recently been a daemon on the prev core, the comparison > between the cores is a bit random. The patch thus tries to restore the > previous behavior. > I think wake_affine_weight() is a bit random anyway simply because it picks a CPU that select_idle_sibling ignores in a lot of cases and in others simply stacks the wakee on top of the waker. Every so often I think that it should simply be ripped out because the scheduler avoids stacking on the wakeup paths unless there is no other choice in which case the task might as well remain on prev anyway. Maybe it made more sense when there was an effort to detect when task stacking is appropriate but right now, wake_affine_weight() is basically a random migration generator -- leave it to the load balancer to figure out placement decisions based on load. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
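For readers following along, the effective-load comparison being criticised here can be sketched in miniature. The following is a hypothetical, heavily simplified model of the kind of test wake_affine_weight() performs; the real function also handles the sync case and scales by CPU capacity, and the function name and parameters below are invented for illustration:

```c
#include <assert.h>

/*
 * Simplified sketch of wake_affine_weight()'s decision: would the wakee's
 * load fit better on the waker's CPU than on its previous CPU?  Returns
 * non-zero to suggest pulling the task to the waker.  All values are
 * illustrative load-average units, not real kernel quantities.
 */
static int pulls_to_waker(unsigned long this_load, unsigned long prev_load,
			  unsigned long task_load, unsigned int imbalance_pct)
{
	/* Pretend the task migrated: its load lands on the waker's CPU... */
	unsigned long this_eff_load = (this_load + task_load) * 100;
	/* ...and leaves prev; the load remaining on prev is inflated by
	 * half the imbalance threshold, biasing toward the waker. */
	unsigned long remaining = prev_load > task_load ? prev_load - task_load : 0;
	unsigned long prev_eff_load = remaining * (100 + (imbalance_pct - 100) / 2);

	return this_eff_load < prev_eff_load;
}
```

Note how a busy prev and a lightly loaded waker favour migration, while a prev whose load metric is exactly zero can never lose the strict comparison — which is the behaviour the old runnable-load metric gave for an idle prev.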
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 11:20 ` Mel Gorman 2020-10-21 11:56 ` Julia Lawall @ 2020-10-21 12:25 ` Vincent Guittot 2020-10-21 12:47 ` Mel Gorman 1 sibling, 1 reply; 68+ messages in thread From: Vincent Guittot @ 2020-10-21 12:25 UTC (permalink / raw) To: Mel Gorman Cc: Julia Lawall, Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles.Muller Le mercredi 21 oct. 2020 à 12:20:38 (+0100), Mel Gorman a écrit : > On Tue, Oct 20, 2020 at 06:37:59PM +0200, Julia Lawall wrote: > > On a thread wakeup, the change [1] from runnable load average to load > > average for comparing candidate cores means that recent short-running > > daemons on the core where a thread ran previously can be considered to > > have a higher load than the core performing the wakeup, even when the > > core where the thread ran previously is currently idle. This can > > cause a thread to migrate, taking the place of some other thread that > > is about to wake up, and so on. To avoid unnecessary migrations, > > extend wake_affine_idle to check whether the core where the thread > > previously ran is currently idle, and if so return that core as the > > target. > > > > [1] commit 11f10e5420f6ce ("sched/fair: Use load instead of runnable > > load in wakeup path") > > > > This particularly has an impact when using passive (intel_cpufreq) > > power management, where kworkers run every 0.004 seconds on all cores, > > increasing the likelihood that an idle core will be considered to have > > a load. > > > > The following numbers were obtained with the benchmarking tool > > hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel > > benchmarks (https://www.nas.nasa.gov/publications/npb.html). The > > tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @ > > 2.10GHz. 
Active (intel_pstate) and passive (intel_cpufreq) power > > management were used. Times are in seconds. All experiments use all > > 160 hardware threads. > > > > v5.9/active v5.9+patch/active > > bt.C.c 24.725724+-0.962340 23.349608+-1.607214 > > lu.C.x 29.105952+-4.804203 25.249052+-5.561617 > > sp.C.x 31.220696+-1.831335 30.227760+-2.429792 > > ua.C.x 26.606118+-1.767384 25.778367+-1.263850 > > > > v5.9/passive v5.9+patch/passive > > bt.C.c 25.330360+-1.028316 23.544036+-1.020189 > > lu.C.x 35.872659+-4.872090 23.719295+-3.883848 > > sp.C.x 32.141310+-2.289541 29.125363+-0.872300 > > ua.C.x 29.024597+-1.667049 25.728888+-1.539772 > > > > On the smaller data sets (A and B) and on the other NAS benchmarks > > there is no impact on performance. > > > > Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> > > I suspect that the benefit of this patch is due to avoiding the overhead > of wake_affine_weight() check because the following check exists in > select_idle_sibling > > /* > * If the previous CPU is cache affine and idle, don't be stupid: > */ > if (prev != target && cpus_share_cache(prev, target) && > (available_idle_cpu(prev) || sched_idle_cpu(prev))) > return prev; > > Still, the concept makes some sense to avoid wake_affine_weight but look > at the earlier part of wake_affine_idle() > > if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu)) > return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu; > > This thing is almost completely useless because this_cpu is only going to > be idle if it's a wakeup from interrupt context when the CPU was otherwise > idle *but* it takes care to only use the CPU if this and prev share LLC. > > The patch as it stands may leave a task on a remote node when it should > have been pulled local to the waker because prev happened to be idle. This > is not guaranteed because a node could have multiple LLCs and prev is > still appropriate but that's a different problem entirely and requires > much deeper surgery. 
Still, not pulling a task from a remote node is > a change in expected behaviour. While it's possible that NUMA domains > will not even reach this path, it depends on the NUMA distance as can > be seen in sd_init() for the setting of SD_WAKE_AFFINE so I think the > cpus_share_cache check is necessary. > > I think it would be more appropriate to rework that block that checks > this_cpu to instead check if the CPUs share cache first and then return one > of them (preference to prev based on the comment above it about avoiding > a migration) if either one is idle. > > I see Vincent already agreed with the patch so I could be wrong. Vincent, > did I miss something stupid? This patch fixes the problem that we don't favor anymore the prev_cpu when it is idle since commit 11f10e5420f6ce because load is not null when cpu is idle whereas runnable_load was. And this is important because this will then decide in which LLC we will look for a cpu > > -- > Mel Gorman > SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
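The decision order under discussion can be illustrated with a small standalone simulation. The struct and helper below are hypothetical stand-ins: the real wake_affine_idle() operates on runqueues and cpumasks, but the ordering of the checks mirrors the one being debated, with the proposed check appended last:

```c
#include <stdbool.h>

#define NR_CPUMASK_BITS 8192 /* stand-in sentinel meaning "no decision here" */

struct cpu_sketch {
	bool idle;         /* would available_idle_cpu() return true? */
	int nr_running;
	bool shares_cache; /* does this CPU share its LLC with prev? */
};

/* Sketch of wake_affine_idle() with the proposed check appended at the end. */
static int wake_affine_idle_sketch(struct cpu_sketch waker, struct cpu_sketch prev,
				   int this_cpu, int prev_cpu, bool sync)
{
	/* Interrupt-wakeup case: waker CPU idle and sharing cache with prev. */
	if (waker.idle && waker.shares_cache)
		return prev.idle ? prev_cpu : this_cpu;

	/* Sync wakeup: the waker is about to sleep, freeing its CPU. */
	if (sync && waker.nr_running == 1)
		return this_cpu;

	/* Proposed addition: keep the task on an idle prev CPU, even one in
	 * another LLC, instead of falling through to wake_affine_weight(). */
	if (prev.idle)
		return prev_cpu;

	return NR_CPUMASK_BITS; /* defer to wake_affine_weight() */
}
```

With the added check, an idle prev short-circuits the fallthrough to wake_affine_weight() — which is exactly the cross-node semantic change being questioned above.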
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 12:25 ` Vincent Guittot @ 2020-10-21 12:47 ` Mel Gorman 2020-10-21 12:56 ` Julia Lawall 0 siblings, 1 reply; 68+ messages in thread From: Mel Gorman @ 2020-10-21 12:47 UTC (permalink / raw) To: Vincent Guittot Cc: Julia Lawall, Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles.Muller On Wed, Oct 21, 2020 at 02:25:32PM +0200, Vincent Guittot wrote: > > I see Vincent already agreed with the patch so I could be wrong. Vincent, > > did I miss something stupid? > > This patch fixes the problem that we don't favor anymore the prev_cpu when it is idle since > commit 11f10e5420f6ce because load is not null when cpu is idle whereas runnable_load was > And this is important because this will then decide in which LLC we will look for a cpu > Ok, that is understandable but I'm still concerned that the fix simply trades one problem for another by leaving related tasks remote to each other and increasing cache misses and remote data accesses. wake_affine_weight is a giant pain because really we don't care about the load on the waker CPU or its available capacity, we care about whether it has idle siblings that can be found quickly. As tempting as ripping it out is, it never happened because sometimes it makes the right decision. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 12:47 ` Mel Gorman @ 2020-10-21 12:56 ` Julia Lawall 2020-10-21 13:18 ` Mel Gorman 0 siblings, 1 reply; 68+ messages in thread From: Julia Lawall @ 2020-10-21 12:56 UTC (permalink / raw) To: Mel Gorman Cc: Vincent Guittot, Julia Lawall, Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles.Muller [-- Attachment #1: Type: text/plain, Size: 1478 bytes --] On Wed, 21 Oct 2020, Mel Gorman wrote: > On Wed, Oct 21, 2020 at 02:25:32PM +0200, Vincent Guittot wrote: > > > I see Vincent already agreed with the patch so I could be wrong. Vincent, > > > did I miss something stupid? > > > > This patch fixes the problem that we don't favor anymore the prev_cpu when it is idle since > > commit 11f10e5420f6ce because load is not null when cpu is idle whereas runnable_load was > > And this is important because this will then decide in which LLC we will looks for a cpu > > > > Ok, that is understandable but I'm still concerned that the fix simply > trades one problem for another by leaving related tasks remote to each > other and increasing cache misses and remote data accesses. > > wake_affine_weight is a giant pain because really we don't care about the > load on the waker CPU or its available, we care about whether it has idle > siblings that can be found quickly. As tempting as ripping it out is, > it never happened because sometimes it makes the right decision. My goal was to restore the previous behavior, when runnable load was used. The patch removing the use of runnable load (11f10e5420f6) presented it basically as that load balancing was using it, so wakeup should use it too, and any way it didn't matter because idle CPUS were checked for anyway. Is your point of view that the proposed change is overkill? Or is it that the original behavior was not desirable? 
thanks, julia ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 12:56 ` Julia Lawall @ 2020-10-21 13:18 ` Mel Gorman 2020-10-21 13:24 ` Julia Lawall 2020-10-21 13:48 ` Julia Lawall 0 siblings, 2 replies; 68+ messages in thread From: Mel Gorman @ 2020-10-21 13:18 UTC (permalink / raw) To: Julia Lawall Cc: Vincent Guittot, Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles.Muller On Wed, Oct 21, 2020 at 02:56:06PM +0200, Julia Lawall wrote: > > > On Wed, 21 Oct 2020, Mel Gorman wrote: > > > On Wed, Oct 21, 2020 at 02:25:32PM +0200, Vincent Guittot wrote: > > > > I see Vincent already agreed with the patch so I could be wrong. Vincent, > > > > did I miss something stupid? > > > > > > This patch fixes the problem that we don't favor anymore the prev_cpu when it is idle since > > > commit 11f10e5420f6ce because load is not null when cpu is idle whereas runnable_load was > > > And this is important because this will then decide in which LLC we will looks for a cpu > > > > > > > Ok, that is understandable but I'm still concerned that the fix simply > > trades one problem for another by leaving related tasks remote to each > > other and increasing cache misses and remote data accesses. > > > > wake_affine_weight is a giant pain because really we don't care about the > > load on the waker CPU or its available, we care about whether it has idle > > siblings that can be found quickly. As tempting as ripping it out is, > > it never happened because sometimes it makes the right decision. > > My goal was to restore the previous behavior, when runnable load was used. > The patch removing the use of runnable load (11f10e5420f6) presented it > basically as that load balancing was using it, so wakeup should use it > too, and any way it didn't matter because idle CPUS were checked for > anyway. > Which is fair. 
> Is your point of view that the proposed change is overkill? Or is it that > the original behavior was not desirable? > I worry it's overkill because prev is always used if it is idle even if it is on a node remote to the waker. It cuts off the option of a wakee moving to a CPU local to the waker which is not equivalent to the original behaviour. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 13:18 ` Mel Gorman @ 2020-10-21 13:24 ` Julia Lawall 2020-10-21 15:08 ` Mel Gorman 2020-10-21 13:48 ` Julia Lawall 1 sibling, 1 reply; 68+ messages in thread From: Julia Lawall @ 2020-10-21 13:24 UTC (permalink / raw) To: Mel Gorman Cc: Julia Lawall, Vincent Guittot, Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller [-- Attachment #1: Type: text/plain, Size: 2269 bytes --] On Wed, 21 Oct 2020, Mel Gorman wrote: > On Wed, Oct 21, 2020 at 02:56:06PM +0200, Julia Lawall wrote: > > > > > > On Wed, 21 Oct 2020, Mel Gorman wrote: > > > > > On Wed, Oct 21, 2020 at 02:25:32PM +0200, Vincent Guittot wrote: > > > > > I see Vincent already agreed with the patch so I could be wrong. Vincent, > > > > > did I miss something stupid? > > > > > > > > This patch fixes the problem that we don't favor anymore the prev_cpu when it is idle since > > > > commit 11f10e5420f6ce because load is not null when cpu is idle whereas runnable_load was > > > > And this is important because this will then decide in which LLC we will looks for a cpu > > > > > > > > > > Ok, that is understandable but I'm still concerned that the fix simply > > > trades one problem for another by leaving related tasks remote to each > > > other and increasing cache misses and remote data accesses. > > > > > > wake_affine_weight is a giant pain because really we don't care about the > > > load on the waker CPU or its available, we care about whether it has idle > > > siblings that can be found quickly. As tempting as ripping it out is, > > > it never happened because sometimes it makes the right decision. > > > > My goal was to restore the previous behavior, when runnable load was used. 
> > The patch removing the use of runnable load (11f10e5420f6) presented it > > basically as that load balancing was using it, so wakeup should use it > > too, and any way it didn't matter because idle CPUS were checked for > > anyway. > > > > Which is fair. > > > Is your point of view that the proposed change is overkill? Or is it that > > the original behavior was not desirable? > > > > I worry it's overkill because prev is always used if it is idle even > if it is on a node remote to the waker. It cuts off the option of a > wakee moving to a CPU local to the waker which is not equivalent to the > original behaviour. But it is equal to the original behavior in the idle prev case if you go back to the runnable load average days... The problem seems impossible to solve, because there is no way to know by looking only at prev and this whether the thread would prefer to stay where it was or go to the waker. julia ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 13:24 ` Julia Lawall @ 2020-10-21 15:08 ` Mel Gorman 2020-10-21 15:18 ` Julia Lawall 2020-10-21 15:19 ` Vincent Guittot 0 siblings, 2 replies; 68+ messages in thread From: Mel Gorman @ 2020-10-21 15:08 UTC (permalink / raw) To: Julia Lawall Cc: Vincent Guittot, Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller On Wed, Oct 21, 2020 at 03:24:48PM +0200, Julia Lawall wrote: > > I worry it's overkill because prev is always used if it is idle even > > if it is on a node remote to the waker. It cuts off the option of a > > wakee moving to a CPU local to the waker which is not equivalent to the > > original behaviour. > > But it is equal to the original behavior in the idle prev case if you go > back to the runnable load average days... > It is similar but it misses the sync treatment and sd->imbalance_pct part of wake_affine_weight which has unpredictable consequences. The data available is only on the fully utilised case. > The problem seems impossible to solve, because there is no way to know by > looking only at prev and this whether the thread would prefer to stay > where it was or go to the waker. > Yes, this is definitely true. Looking at prev_cpu and this_cpu is a crude approximation and the path is heavily limited in terms of how clever it can be. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 15:08 ` Mel Gorman @ 2020-10-21 15:18 ` Julia Lawall 2020-10-21 15:23 ` Vincent Guittot 2020-10-21 15:19 ` Vincent Guittot 1 sibling, 1 reply; 68+ messages in thread From: Julia Lawall @ 2020-10-21 15:18 UTC (permalink / raw) To: Mel Gorman Cc: Vincent Guittot, Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller On Wed, 21 Oct 2020, Mel Gorman wrote: > On Wed, Oct 21, 2020 at 03:24:48PM +0200, Julia Lawall wrote: > > > I worry it's overkill because prev is always used if it is idle even > > > if it is on a node remote to the waker. It cuts off the option of a > > > wakee moving to a CPU local to the waker which is not equivalent to the > > > original behaviour. > > > > But it is equal to the original behavior in the idle prev case if you go > > back to the runnable load average days... > > > > It is similar but it misses the sync treatment and sd->imbalance_pct part of > wake_affine_weight which has unpredictable consequences. The data > available is only on the fully utilised case. OK, what if my patch were: @@ -5800,6 +5800,9 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync) if (sync && cpu_rq(this_cpu)->nr_running == 1) return this_cpu; + if (!sync && available_idle_cpu(prev_cpu)) + return prev_cpu; + return nr_cpumask_bits; } The sd->imbalance_pct part would have previously been a multiplication by 0, so it doesn't need to be taken into account. julia > > > The problem seems impossible to solve, because there is no way to know by > > looking only at prev and this whether the thread would prefer to stay > > where it was or go to the waker. > > > > Yes, this is definitely true. Looking at prev_cpu and this_cpu is a > crude approximation and the path is heavily limited in terms of how > clever it can be. 
> > -- > Mel Gorman > SUSE Labs > ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 15:18 ` Julia Lawall @ 2020-10-21 15:23 ` Vincent Guittot 2020-10-21 15:33 ` Julia Lawall 0 siblings, 1 reply; 68+ messages in thread From: Vincent Guittot @ 2020-10-21 15:23 UTC (permalink / raw) To: Julia Lawall Cc: Mel Gorman, Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller On Wed, 21 Oct 2020 at 17:18, Julia Lawall <julia.lawall@inria.fr> wrote: > > > > On Wed, 21 Oct 2020, Mel Gorman wrote: > > > On Wed, Oct 21, 2020 at 03:24:48PM +0200, Julia Lawall wrote: > > > > I worry it's overkill because prev is always used if it is idle even > > > > if it is on a node remote to the waker. It cuts off the option of a > > > > wakee moving to a CPU local to the waker which is not equivalent to the > > > > original behaviour. > > > > > > But it is equal to the original behavior in the idle prev case if you go > > > back to the runnable load average days... > > > > > > > It is similar but it misses the sync treatment and sd->imbalance_pct part of > > wake_affine_weight which has unpredictable consequences. The data > > available is only on the fully utilised case. > > OK, what if my patch were: > > @@ -5800,6 +5800,9 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync) > if (sync && cpu_rq(this_cpu)->nr_running == 1) > return this_cpu; > > + if (!sync && available_idle_cpu(prev_cpu)) > + return prev_cpu; > + this is not useful because when prev_cpu is idle, its runnable_avg was null, so the only way for this_cpu to be selected by wake_affine_weight is to be null too, which is not really possible when sync is set, because sync is used to say the running task on this cpu is about to sleep > return nr_cpumask_bits; > } > > The sd->imbalance_pct part would have previously been a multiplication by > 0, so it doesn't need to be taken into account. 
> > julia > > > > > > The problem seems impossible to solve, because there is no way to know by > > > looking only at prev and this whether the thread would prefer to stay > > > where it was or go to the waker. > > > > > > > Yes, this is definitely true. Looking at prev_cpu and this_cpu is a > > crude approximation and the path is heavily limited in terms of how > > clever it can be. > > > > -- > > Mel Gorman > > SUSE Labs > > ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 15:23 ` Vincent Guittot @ 2020-10-21 15:33 ` Julia Lawall 0 siblings, 0 replies; 68+ messages in thread From: Julia Lawall @ 2020-10-21 15:33 UTC (permalink / raw) To: Vincent Guittot Cc: Mel Gorman, Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller On Wed, 21 Oct 2020, Vincent Guittot wrote: > On Wed, 21 Oct 2020 at 17:18, Julia Lawall <julia.lawall@inria.fr> wrote: > > > > > > > > On Wed, 21 Oct 2020, Mel Gorman wrote: > > > > > On Wed, Oct 21, 2020 at 03:24:48PM +0200, Julia Lawall wrote: > > > > > I worry it's overkill because prev is always used if it is idle even > > > > > if it is on a node remote to the waker. It cuts off the option of a > > > > > wakee moving to a CPU local to the waker which is not equivalent to the > > > > > original behaviour. > > > > > > > > But it is equal to the original behavior in the idle prev case if you go > > > > back to the runnable load average days... > > > > > > > > > > It is similar but it misses the sync treatment and sd->imbalance_pct part of > > > wake_affine_weight which has unpredictable consequences. The data > > > available is only on the fully utilised case. > > > > OK, what if my patch were: > > > > @@ -5800,6 +5800,9 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync) > > if (sync && cpu_rq(this_cpu)->nr_running = 1) > > return this_cpu; > > > > + if (!sync && available_idle_cpu(prev_cpu)) > > + return prev_cpu; > > + > > this is not useful because when prev_cpu is idle, its runnable_avg was > null so the only > way for this_cpu to be selected by wake_affine_weight is to be null > too which is not really > possible when sync is set because sync is used to say, the running > task on this cpu > is about to sleep OK, I agree. 
Previously prev_eff_load was 0 when prev was idle, and whether the sync code is executed in wake_affine_weight or not, it will not be the case that this_eff_load < prev_eff_load, so this will not be selected. julia > > > return nr_cpumask_bits; > > } > > > > The sd->imbalance_pct part would have previously been a multiplication by > > 0, so it doesn't need to be taken into account. > > > > julia > > > > > > > > > The problem seems impossible to solve, because there is no way to know by > > > > looking only at prev and this whether the thread would prefer to stay > > > > where it was or go to the waker. > > > > > > > > > > Yes, this is definitely true. Looking at prev_cpu and this_cpu is a > > > crude approximation and the path is heavily limited in terms of how > > > clever it can be. > > > > > > -- > > > Mel Gorman > > > SUSE Labs > > > > ^ permalink raw reply [flat|nested] 68+ messages in thread
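The closing argument above is purely arithmetic and can be checked in a few lines: under the old runnable-load metric an idle prev yields prev_eff_load == 0, and since effective loads are unsigned, the strict comparison this_eff_load < prev_eff_load can never succeed, with or without the sync adjustment. A minimal illustration with invented names, not kernel code:

```c
/*
 * Minimal check of the claim: with prev_eff_load == 0 (idle prev under the
 * runnable-load metric), wake_affine_weight()'s "this_eff_load <
 * prev_eff_load" test is false for every unsigned this_eff_load, sync or not.
 */
static int waker_cpu_wins(unsigned long this_load, unsigned long curr_load,
			  unsigned long task_load, int sync)
{
	unsigned long this_eff_load = this_load;
	unsigned long prev_eff_load = 0; /* idle prev, runnable-load metric */

	/* Sync adjustment: discount the waker, which is about to sleep. */
	if (sync && this_eff_load >= curr_load)
		this_eff_load -= curr_load;
	this_eff_load += task_load;

	return this_eff_load < prev_eff_load; /* unsigned x < 0: always false */
}
```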
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 15:08 ` Mel Gorman 2020-10-21 15:18 ` Julia Lawall @ 2020-10-21 15:19 ` Vincent Guittot 2020-10-21 17:00 ` Mel Gorman 1 sibling, 1 reply; 68+ messages in thread From: Vincent Guittot @ 2020-10-21 15:19 UTC (permalink / raw) To: Mel Gorman Cc: Julia Lawall, Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller On Wed, 21 Oct 2020 at 17:08, Mel Gorman <mgorman@suse.de> wrote: > > On Wed, Oct 21, 2020 at 03:24:48PM +0200, Julia Lawall wrote: > > > I worry it's overkill because prev is always used if it is idle even > > > if it is on a node remote to the waker. It cuts off the option of a > > > wakee moving to a CPU local to the waker which is not equivalent to the > > > original behaviour. > > > > But it is equal to the original behavior in the idle prev case if you go > > back to the runnable load average days... > > > > It is similar but it misses the sync treatment and sd->imbalance_pct part of > wake_affine_weight which has unpredictable consequences. The data > available is only on the fully utilised case. In fact, it's the same because runnable_load_avg was null when cpu is idle, so if prev_cpu was idle, we were selecting prev_cpu > > > The problem seems impossible to solve, because there is no way to know by > > looking only at prev and this whether the thread would prefer to stay > > where it was or go to the waker. > > > > Yes, this is definitely true. Looking at prev_cpu and this_cpu is a > crude approximation and the path is heavily limited in terms of how > clever it can be. > > -- > Mel Gorman > SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 15:19 ` Vincent Guittot @ 2020-10-21 17:00 ` Mel Gorman 2020-10-21 17:39 ` Julia Lawall 0 siblings, 1 reply; 68+ messages in thread From: Mel Gorman @ 2020-10-21 17:00 UTC (permalink / raw) To: Vincent Guittot Cc: Julia Lawall, Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller On Wed, Oct 21, 2020 at 05:19:53PM +0200, Vincent Guittot wrote: > On Wed, 21 Oct 2020 at 17:08, Mel Gorman <mgorman@suse.de> wrote: > > > > On Wed, Oct 21, 2020 at 03:24:48PM +0200, Julia Lawall wrote: > > > > I worry it's overkill because prev is always used if it is idle even > > > > if it is on a node remote to the waker. It cuts off the option of a > > > > wakee moving to a CPU local to the waker which is not equivalent to the > > > > original behaviour. > > > > > > But it is equal to the original behavior in the idle prev case if you go > > > back to the runnable load average days... > > > > > > > It is similar but it misses the sync treatment and sd->imbalance_pct part of > > wake_affine_weight which has unpredictable consequences. The data > > available is only on the fully utilised case. > > In fact It's the same because runnable_load_avg was null when cpu is idle, so > if prev_cpu was idle, we were selecting prev_idle > Sync wakeups may only consider this_cpu and the load of the waker but in that case, it was probably selected already by the sync check in wake_affine_idle which will pass except when the domain is overloaded. Fair enough, I'll withdraw any concerns. It could have done with a comment :/ -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 17:00 ` Mel Gorman @ 2020-10-21 17:39 ` Julia Lawall 0 siblings, 0 replies; 68+ messages in thread From: Julia Lawall @ 2020-10-21 17:39 UTC (permalink / raw) To: Mel Gorman Cc: Vincent Guittot, Julia Lawall, Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles Muller On Wed, 21 Oct 2020, Mel Gorman wrote: > On Wed, Oct 21, 2020 at 05:19:53PM +0200, Vincent Guittot wrote: > > On Wed, 21 Oct 2020 at 17:08, Mel Gorman <mgorman@suse.de> wrote: > > > > > > On Wed, Oct 21, 2020 at 03:24:48PM +0200, Julia Lawall wrote: > > > > > I worry it's overkill because prev is always used if it is idle even > > > > > if it is on a node remote to the waker. It cuts off the option of a > > > > > wakee moving to a CPU local to the waker which is not equivalent to the > > > > > original behaviour. > > > > > > > > But it is equal to the original behavior in the idle prev case if you go > > > > back to the runnable load average days... > > > > > > > > > > It is similar but it misses the sync treatment and sd->imbalance_pct part of > > > wake_affine_weight which has unpredictable consequences. The data > > > available is only on the fully utilised case. > > > > In fact It's the same because runnable_load_avg was null when cpu is idle, so > > if prev_cpu was idle, we were selecting prev_idle > > > > Sync wakeups may only consider this_cpu and the load of the waker but > in that case, it was probably selected already by the sync check in > wake_affine_idle which will pass except when the domain is overloaded. > Fair enough, I'll withdraw any concerns. It could have done with a > comment :/ Sure, I'll resend the patch and extend the log message with this issue. Otherwise, I was wondering: are there any particular kinds of applications where gathering the threads back with the waker is a good idea?
I've been looking more at applications with N threads on N cores, where it would be best for the threads to remain where they are. thanks, julia ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 13:18 ` Mel Gorman 2020-10-21 13:24 ` Julia Lawall @ 2020-10-21 13:48 ` Julia Lawall 2020-10-21 15:26 ` Mel Gorman 1 sibling, 1 reply; 68+ messages in thread From: Julia Lawall @ 2020-10-21 13:48 UTC (permalink / raw) To: Mel Gorman Cc: Julia Lawall, Vincent Guittot, Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles.Muller [-- Attachment #1: Type: text/plain, Size: 2183 bytes --] On Wed, 21 Oct 2020, Mel Gorman wrote: > On Wed, Oct 21, 2020 at 02:56:06PM +0200, Julia Lawall wrote: > > > > > > On Wed, 21 Oct 2020, Mel Gorman wrote: > > > > > On Wed, Oct 21, 2020 at 02:25:32PM +0200, Vincent Guittot wrote: > > > > > I see Vincent already agreed with the patch so I could be wrong. Vincent, > > > > > did I miss something stupid? > > > > > > > > This patch fixes the problem that we don't favor anymore the prev_cpu when it is idle since > > > > commit 11f10e5420f6ce because load is not null when cpu is idle whereas runnable_load was > > > > And this is important because this will then decide in which LLC we will looks for a cpu > > > > > > > > > > Ok, that is understandable but I'm still concerned that the fix simply > > > trades one problem for another by leaving related tasks remote to each > > > other and increasing cache misses and remote data accesses. > > > > > > wake_affine_weight is a giant pain because really we don't care about the > > > load on the waker CPU or its available, we care about whether it has idle > > > siblings that can be found quickly. As tempting as ripping it out is, > > > it never happened because sometimes it makes the right decision. > > > > My goal was to restore the previous behavior, when runnable load was used. 
> > The patch removing the use of runnable load (11f10e5420f6) presented it > > basically as that load balancing was using it, so wakeup should use it > > too, and any way it didn't matter because idle CPUS were checked for > > anyway. > > > > Which is fair. > > > Is your point of view that the proposed change is overkill? Or is it that > > the original behavior was not desirable? > > > > I worry it's overkill because prev is always used if it is idle even > if it is on a node remote to the waker. It cuts off the option of a > wakee moving to a CPU local to the waker which is not equivalent to the > original behaviour. Could it be possible to check p->recent_used_cpu? If that is prev (or on the same socket?), then prev could be a good choice. If that is on the same socket as the waker, then maybe the waker would be better. julia ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] sched/fair: check for idle core 2020-10-21 13:48 ` Julia Lawall @ 2020-10-21 15:26 ` Mel Gorman 0 siblings, 0 replies; 68+ messages in thread From: Mel Gorman @ 2020-10-21 15:26 UTC (permalink / raw) To: Julia Lawall Cc: Vincent Guittot, Ingo Molnar, kernel-janitors, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, linux-kernel, Valentin Schneider, Gilles.Muller On Wed, Oct 21, 2020 at 03:48:59PM +0200, Julia Lawall wrote: > > I worry it's overkill because prev is always used if it is idle even > > if it is on a node remote to the waker. It cuts off the option of a > > wakee moving to a CPU local to the waker which is not equivalent to the > > original behaviour. > > Could it be possible to check p->recent_used_cpu? If that is prev (or on > the same socket?), then prev could be a good choice. If that is on the > same socket as the waker, then maybe the waker would be better. > It is an interesting idea but the treatment of p->recent_used_cpu is a bit strange though. At one point, I looked at reconciling all the select CPU paths into one single pass (prototype worked back in 5.6-era but was not a universal win). As part of that, I remembered that the setting of recent_used_cpu is a problem because it can be set to the wakers recent_used_cpu by this chunk if (want_affine) current->recent_used_cpu = cpu; Fixing that on its own causes other problems. From an unposted series that "fixed it" as the last part of 14 patch series; After select_idle_sibling, the *wakers* recent_used_cpu is set to the CPU used for the wakee. This was necessary at the time as without it, the miss rate was unacceptably high. It still works, but it is less efficient than it can be. This patch leaves the waker state alone and uses either the previous CPU on a hit or the target CPU on a miss when the recently used CPU is examined for idleness. 
The rest of the series is important as without it, this patch improves the number of hits but the miss rate gets ridiculously high. After the rest of the series, the hit and miss rates are both higher but the miss rate is acceptable. As the hunk means that recent_used_cpu could have been set based on a previous wakeup, it makes it unreliable for making cross-node decisions. p->recent_used_cpu's primary purpose at the moment is to avoid select_idle_sibling searches. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread