* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-03-26 19:19 ` [PATCH v3] " Rik van Riel
@ 2021-03-28 15:36 ` Mel Gorman
2021-04-06 15:10 ` Vincent Guittot
` (2 subsequent siblings)
3 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2021-03-28 15:36 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, kernel-team, Peter Zijlstra (Intel),
Ingo Molnar, Vincent Guittot, Valentin Schneider
On Fri, Mar 26, 2021 at 03:19:32PM -0400, Rik van Riel wrote:
> ---8<---
> sched,fair: bring back select_idle_smt, but differently
>
> Mel Gorman did some nice work in 9fe1f127b913
> ("sched/fair: Merge select_idle_core/cpu()"), resulting in the kernel
> being more efficient at finding an idle CPU, and in tasks spending less
> time waiting to be run, both according to the schedstats run_delay
> numbers, and according to measured application latencies. Yay.
>
> The flip side of this is that we see more task migrations (about
> 30% more), higher cache misses, higher memory bandwidth utilization,
> and higher CPU use, for the same number of requests/second.
>
> This is most pronounced on a memcache type workload, which saw
> a consistent 1-3% increase in total CPU use on the system, due
> to those increased task migrations leading to higher L2 cache
> miss numbers, and higher memory utilization. The exclusive L3
> cache on Skylake does us no favors there.
>
> On our web serving workload, that effect is usually negligible.
>
> It appears that the increased number of CPU migrations is generally
> a good thing, since it leads to lower cpu_delay numbers, reflecting
> the fact that tasks get to run faster. However, the reduced locality
> and the corresponding increase in L2 cache misses hurts a little.
>
> The patch below appears to fix the regression, while keeping the
> benefit of the lower cpu_delay numbers, by reintroducing select_idle_smt
> with a twist: when a socket has no idle cores, check to see if the
> sibling of "prev" is idle, before searching all the other CPUs.
>
> This fixes both the occasional 9% regression on the web serving
> workload, and the continuous 2% CPU use regression on the memcache
> type workload.
>
> With Mel's patches and this patch together, task migrations are still
> high, but L2 cache misses, memory bandwidth, and CPU time used are back
> down to what they were before. The p95 and p99 response times for the
> memcache type application improve by about 10% over what they were
> before Mel's patches got merged.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
FWIW, v3 appears to have performed faster than v2 on the few tests I ran
and the patch looks fine.
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-03-26 19:19 ` [PATCH v3] " Rik van Riel
2021-03-28 15:36 ` Mel Gorman
@ 2021-04-06 15:10 ` Vincent Guittot
2021-04-06 15:26 ` Rik van Riel
2021-04-09 11:24 ` [tip: sched/core] sched/fair: Bring back select_idle_smt(), " tip-bot2 for Rik van Riel
2021-04-09 16:14 ` tip-bot2 for Rik van Riel
3 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2021-04-06 15:10 UTC (permalink / raw)
To: Rik van Riel
Cc: Mel Gorman, linux-kernel, Kernel Team, Peter Zijlstra (Intel),
Ingo Molnar, Valentin Schneider
On Fri, 26 Mar 2021 at 20:19, Rik van Riel <riel@surriel.com> wrote:
>
> On Mon, 22 Mar 2021 11:03:06 +0000
> Mel Gorman <mgorman@techsingularity.net> wrote:
>
>
> > Second, select_idle_smt() does not use the cpus mask so consider moving
> > the cpus initialisation after select_idle_smt() has been called.
> > Specifically this initialisation
> >
> > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >
> > Alternatively, clear the bits in the SMT sibling scan to avoid checking
> > the siblings twice. It's a tradeoff because initialising and clearing
> > bits is not free and the cost is wasted if a sibling is free.
>
> I tried a number of different variations on moving the CPU mask
> initialization, and clearing CPUs from the mask, and failed to
> get any clear results from those in testing, even in workloads
> with lots of context switches.
>
> Below is a simple version that seems to perform identically to
> more complicated versions :)
>
> ---8<---
> sched,fair: bring back select_idle_smt, but differently
>
> Mel Gorman did some nice work in 9fe1f127b913
> ("sched/fair: Merge select_idle_core/cpu()"), resulting in the kernel
> being more efficient at finding an idle CPU, and in tasks spending less
> time waiting to be run, both according to the schedstats run_delay
> numbers, and according to measured application latencies. Yay.
>
> The flip side of this is that we see more task migrations (about
> 30% more), higher cache misses, higher memory bandwidth utilization,
> and higher CPU use, for the same number of requests/second.
>
> This is most pronounced on a memcache type workload, which saw
> a consistent 1-3% increase in total CPU use on the system, due
> to those increased task migrations leading to higher L2 cache
> miss numbers, and higher memory utilization. The exclusive L3
> cache on Skylake does us no favors there.
>
> On our web serving workload, that effect is usually negligible.
>
> It appears that the increased number of CPU migrations is generally
> a good thing, since it leads to lower cpu_delay numbers, reflecting
> the fact that tasks get to run faster. However, the reduced locality
> and the corresponding increase in L2 cache misses hurts a little.
>
> The patch below appears to fix the regression, while keeping the
> benefit of the lower cpu_delay numbers, by reintroducing select_idle_smt
> with a twist: when a socket has no idle cores, check to see if the
> sibling of "prev" is idle, before searching all the other CPUs.
>
> This fixes both the occasional 9% regression on the web serving
> workload, and the continuous 2% CPU use regression on the memcache
> type workload.
>
> With Mel's patches and this patch together, task migrations are still
> high, but L2 cache misses, memory bandwidth, and CPU time used are back
> down to what they were before. The p95 and p99 response times for the
> memcache type application improve by about 10% over what they were
> before Mel's patches got merged.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> kernel/sched/fair.c | 68 ++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 52 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 794c2cb945f8..69680158963f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6098,6 +6098,28 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
> return -1;
> }
>
> +/*
> + * Scan the local SMT mask for idle CPUs.
> + */
> +static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int
> +target)
> +{
> + int cpu;
> +
> + if (!static_branch_likely(&sched_smt_present))
> + return -1;
> +
> + for_each_cpu(cpu, cpu_smt_mask(target)) {
> + if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
> + !cpumask_test_cpu(cpu, sched_domain_span(sd)))
> + continue;
> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> + return cpu;
> + }
> +
> + return -1;
> +}
> +
> #else /* CONFIG_SCHED_SMT */
>
> static inline void set_idle_cores(int cpu, int val)
> @@ -6114,6 +6136,11 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
> return __select_idle_cpu(core);
> }
>
> +static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + return -1;
> +}
> +
> #endif /* CONFIG_SCHED_SMT */
>
> /*
> @@ -6121,7 +6148,7 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> * average idle time for this rq (as found in rq->avg_idle).
> */
> -static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
> +static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int prev, int target)
> {
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> int i, cpu, idle_cpu = -1, nr = INT_MAX;
> @@ -6136,23 +6163,32 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>
> - if (sched_feat(SIS_PROP) && !smt) {
> - u64 avg_cost, avg_idle, span_avg;
> + if (!smt) {
> + if (cpus_share_cache(prev, target)) {
Have you checked the impact on non-SMT systems? It would be worth a static branch.
Also, this doesn't need to be in select_idle_cpu(), which aims to loop
over the sched_domain, because you only compare target and prev. So you
can move this call to select_idle_smt() into select_idle_sibling(), as
part of all the other tests done on prev/target/recent_used_cpu before
calling select_idle_cpu(), and even save a useless sd =
rcu_dereference(per_cpu(sd_llc, target));
> + /* No idle core. Check if prev has an idle sibling. */
> + i = select_idle_smt(p, sd, prev);
> + if ((unsigned int)i < nr_cpumask_bits)
> + return i;
> + }
>
> - /*
> - * Due to large variance we need a large fuzz factor;
> - * hackbench in particularly is sensitive here.
> - */
> - avg_idle = this_rq()->avg_idle / 512;
> - avg_cost = this_sd->avg_scan_cost + 1;
> + if (sched_feat(SIS_PROP)) {
> + u64 avg_cost, avg_idle, span_avg;
>
> - span_avg = sd->span_weight * avg_idle;
> - if (span_avg > 4*avg_cost)
> - nr = div_u64(span_avg, avg_cost);
> - else
> - nr = 4;
> + /*
> + * Due to large variance we need a large fuzz factor;
> + * hackbench in particularly is sensitive here.
> + */
> + avg_idle = this_rq()->avg_idle / 512;
> + avg_cost = this_sd->avg_scan_cost + 1;
>
> - time = cpu_clock(this);
> + span_avg = sd->span_weight * avg_idle;
> + if (span_avg > 4*avg_cost)
> + nr = div_u64(span_avg, avg_cost);
> + else
> + nr = 4;
> +
> + time = cpu_clock(this);
> + }
> }
>
> for_each_cpu_wrap(cpu, cpus, target) {
> @@ -6307,7 +6343,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> if (!sd)
> return target;
>
> - i = select_idle_cpu(p, sd, target);
> + i = select_idle_cpu(p, sd, prev, target);
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> --
> 2.25.4
>
>
* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-04-06 15:10 ` Vincent Guittot
@ 2021-04-06 15:26 ` Rik van Riel
2021-04-06 15:31 ` Vincent Guittot
2021-04-07 7:17 ` Peter Zijlstra
0 siblings, 2 replies; 26+ messages in thread
From: Rik van Riel @ 2021-04-06 15:26 UTC (permalink / raw)
To: Vincent Guittot
Cc: Mel Gorman, linux-kernel, Kernel Team, Peter Zijlstra (Intel),
Ingo Molnar, Valentin Schneider
On Tue, 2021-04-06 at 17:10 +0200, Vincent Guittot wrote:
> On Fri, 26 Mar 2021 at 20:19, Rik van Riel <riel@surriel.com> wrote:
>
> > -static int select_idle_cpu(struct task_struct *p, struct
> > sched_domain *sd, int target)
> > +static int select_idle_cpu(struct task_struct *p, struct
> > sched_domain *sd, int prev, int target)
> > {
> > struct cpumask *cpus =
> > this_cpu_cpumask_var_ptr(select_idle_mask);
> > int i, cpu, idle_cpu = -1, nr = INT_MAX;
> > @@ -6136,23 +6163,32 @@ static int select_idle_cpu(struct
> > task_struct *p, struct sched_domain *sd, int t
> >
> > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >
> > - if (sched_feat(SIS_PROP) && !smt) {
> > - u64 avg_cost, avg_idle, span_avg;
> > + if (!smt) {
> > + if (cpus_share_cache(prev, target)) {
>
> Have you checked the impact on non-SMT systems? It would be worth a
> static branch.
>
> Also, this doesn't need to be in select_idle_cpu(), which aims to loop
> over the sched_domain, because you only compare target and prev. So you
> can move this call to select_idle_smt() into select_idle_sibling()
After Mel's rewrite, there no longer are calls to
select_idle_core() or select_idle_smt() in select_idle_sibling().
Everything got folded into one single loop in select_idle_cpu()
I would be happy to pull the static branch out of select_idle_smt()
and place it into this if condition, though. You are right that
would save some overhead on non-smt systems.
Peter, would you prefer a follow-up patch for that or a version 4
of the patch?
--
All Rights Reversed.
* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-04-06 15:26 ` Rik van Riel
@ 2021-04-06 15:31 ` Vincent Guittot
2021-04-06 15:33 ` Vincent Guittot
2021-04-06 15:55 ` Rik van Riel
2021-04-07 7:17 ` Peter Zijlstra
1 sibling, 2 replies; 26+ messages in thread
From: Vincent Guittot @ 2021-04-06 15:31 UTC (permalink / raw)
To: Rik van Riel
Cc: Mel Gorman, linux-kernel, Kernel Team, Peter Zijlstra (Intel),
Ingo Molnar, Valentin Schneider
On Tue, 6 Apr 2021 at 17:26, Rik van Riel <riel@surriel.com> wrote:
>
> On Tue, 2021-04-06 at 17:10 +0200, Vincent Guittot wrote:
> > On Fri, 26 Mar 2021 at 20:19, Rik van Riel <riel@surriel.com> wrote:
> >
> > > -static int select_idle_cpu(struct task_struct *p, struct
> > > sched_domain *sd, int target)
> > > +static int select_idle_cpu(struct task_struct *p, struct
> > > sched_domain *sd, int prev, int target)
> > > {
> > > struct cpumask *cpus =
> > > this_cpu_cpumask_var_ptr(select_idle_mask);
> > > int i, cpu, idle_cpu = -1, nr = INT_MAX;
> > > @@ -6136,23 +6163,32 @@ static int select_idle_cpu(struct
> > > task_struct *p, struct sched_domain *sd, int t
> > >
> > > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > >
> > > - if (sched_feat(SIS_PROP) && !smt) {
> > > - u64 avg_cost, avg_idle, span_avg;
> > > + if (!smt) {
> > > + if (cpus_share_cache(prev, target)) {
> >
> > Have you checked the impact on non-SMT systems? It would be worth a
> > static branch.
> >
> > Also, this doesn't need to be in select_idle_cpu(), which aims to loop
> > over the sched_domain, because you only compare target and prev. So you
> > can move this call to select_idle_smt() into select_idle_sibling()
>
> After Mel's rewrite, there no longer are calls to
> select_idle_core() or select_idle_smt() in select_idle_sibling().
select_idle_smt() had even disappeared; that's why it was not in
select_idle_sibling().
>
> Everything got folded into one single loop in select_idle_cpu()
but this is done completely out of the loop so we don't need to
complify the function with unrelated stuff
>
> I would be happy to pull the static branch out of select_idle_smt()
> and place it into this if condition, though. You are right that
> would save some overhead on non-smt systems.
>
> Peter, would you prefer a follow-up patch for that or a version 4
> of the patch?
>
> --
> All Rights Reversed.
* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-04-06 15:31 ` Vincent Guittot
@ 2021-04-06 15:33 ` Vincent Guittot
2021-04-06 15:55 ` Rik van Riel
1 sibling, 0 replies; 26+ messages in thread
From: Vincent Guittot @ 2021-04-06 15:33 UTC (permalink / raw)
To: Rik van Riel
Cc: Mel Gorman, linux-kernel, Kernel Team, Peter Zijlstra (Intel),
Ingo Molnar, Valentin Schneider
On Tue, 6 Apr 2021 at 17:31, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Tue, 6 Apr 2021 at 17:26, Rik van Riel <riel@surriel.com> wrote:
> >
> > On Tue, 2021-04-06 at 17:10 +0200, Vincent Guittot wrote:
> > > On Fri, 26 Mar 2021 at 20:19, Rik van Riel <riel@surriel.com> wrote:
> > >
> > > > -static int select_idle_cpu(struct task_struct *p, struct
> > > > sched_domain *sd, int target)
> > > > +static int select_idle_cpu(struct task_struct *p, struct
> > > > sched_domain *sd, int prev, int target)
> > > > {
> > > > struct cpumask *cpus =
> > > > this_cpu_cpumask_var_ptr(select_idle_mask);
> > > > int i, cpu, idle_cpu = -1, nr = INT_MAX;
> > > > @@ -6136,23 +6163,32 @@ static int select_idle_cpu(struct
> > > > task_struct *p, struct sched_domain *sd, int t
> > > >
> > > > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > > >
> > > > - if (sched_feat(SIS_PROP) && !smt) {
> > > > - u64 avg_cost, avg_idle, span_avg;
> > > > + if (!smt) {
> > > > + if (cpus_share_cache(prev, target)) {
> > >
> > > Have you checked the impact on non-SMT systems? It would be worth
> > > a static branch.
> > >
> > > Also, this doesn't need to be in select_idle_cpu(), which aims to
> > > loop over the sched_domain, because you only compare target and
> > > prev. So you can move this call to select_idle_smt() into
> > > select_idle_sibling()
> >
> > After Mel's rewrite, there no longer are calls to
> > select_idle_core() or select_idle_smt() in select_idle_sibling().
>
> select_idle_smt() had even disappeared; that's why it was not in
> select_idle_sibling().
>
> >
> > Everything got folded into one single loop in select_idle_cpu()
>
> but this is done completely out of the loop so we don't need to
> complify the function with unrelated stuff
s/complify/complexify/
>
> >
> > I would be happy to pull the static branch out of select_idle_smt()
> > and place it into this if condition, though. You are right that
> > would save some overhead on non-smt systems.
> >
> > Peter, would you prefer a follow-up patch for that or a version 4
> > of the patch?
> >
> > --
> > All Rights Reversed.
* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-04-06 15:31 ` Vincent Guittot
2021-04-06 15:33 ` Vincent Guittot
@ 2021-04-06 15:55 ` Rik van Riel
2021-04-06 16:13 ` Vincent Guittot
1 sibling, 1 reply; 26+ messages in thread
From: Rik van Riel @ 2021-04-06 15:55 UTC (permalink / raw)
To: Vincent Guittot
Cc: Mel Gorman, linux-kernel, Kernel Team, Peter Zijlstra (Intel),
Ingo Molnar, Valentin Schneider
On Tue, 2021-04-06 at 17:31 +0200, Vincent Guittot wrote:
> On Tue, 6 Apr 2021 at 17:26, Rik van Riel <riel@surriel.com> wrote:
> > On Tue, 2021-04-06 at 17:10 +0200, Vincent Guittot wrote:
> > > On Fri, 26 Mar 2021 at 20:19, Rik van Riel <riel@surriel.com>
> > > wrote:
> > >
> > > > -static int select_idle_cpu(struct task_struct *p, struct
> > > > sched_domain *sd, int target)
> > > > +static int select_idle_cpu(struct task_struct *p, struct
> > > > sched_domain *sd, int prev, int target)
> > > > {
> > > > struct cpumask *cpus =
> > > > this_cpu_cpumask_var_ptr(select_idle_mask);
> > > > int i, cpu, idle_cpu = -1, nr = INT_MAX;
> > > > @@ -6136,23 +6163,32 @@ static int select_idle_cpu(struct
> > > > task_struct *p, struct sched_domain *sd, int t
> > > >
> > > > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > > >
> > > > - if (sched_feat(SIS_PROP) && !smt) {
> > > > - u64 avg_cost, avg_idle, span_avg;
> > > > + if (!smt) {
> > > > + if (cpus_share_cache(prev, target)) {
> > >
> > > Have you checked the impact on non-SMT systems? It would be worth
> > > a static branch.
> > >
> > > Also, this doesn't need to be in select_idle_cpu(), which aims to
> > > loop over the sched_domain, because you only compare target and
> > > prev. So you can move this call to select_idle_smt() into
> > > select_idle_sibling()
> >
> > After Mel's rewrite, there no longer are calls to
> > select_idle_core() or select_idle_smt() in select_idle_sibling().
>
> select_idle_smt() had even disappeared; that's why it was not in
> select_idle_sibling().
>
> > Everything got folded into one single loop in select_idle_cpu()
>
> but this is done completely out of the loop so we don't need to
> complify the function with unrelated stuff
Not entirely. The call to select_idle_smt() is still
conditional on test_idle_cores() returning false.

We only look for the other sibling if there is no idle
core in the LLC. If there is an idle core, we prefer that.

Pulling the select_idle_smt() call out of select_idle_cpu()
would mean having to call test_idle_cores() twice.
--
All Rights Reversed.
* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-04-06 15:55 ` Rik van Riel
@ 2021-04-06 16:13 ` Vincent Guittot
0 siblings, 0 replies; 26+ messages in thread
From: Vincent Guittot @ 2021-04-06 16:13 UTC (permalink / raw)
To: Rik van Riel
Cc: Mel Gorman, linux-kernel, Kernel Team, Peter Zijlstra (Intel),
Ingo Molnar, Valentin Schneider
On Tue, 6 Apr 2021 at 17:55, Rik van Riel <riel@surriel.com> wrote:
>
> On Tue, 2021-04-06 at 17:31 +0200, Vincent Guittot wrote:
> > On Tue, 6 Apr 2021 at 17:26, Rik van Riel <riel@surriel.com> wrote:
> > > On Tue, 2021-04-06 at 17:10 +0200, Vincent Guittot wrote:
> > > > On Fri, 26 Mar 2021 at 20:19, Rik van Riel <riel@surriel.com>
> > > > wrote:
> > > >
> > > > > -static int select_idle_cpu(struct task_struct *p, struct
> > > > > sched_domain *sd, int target)
> > > > > +static int select_idle_cpu(struct task_struct *p, struct
> > > > > sched_domain *sd, int prev, int target)
> > > > > {
> > > > > struct cpumask *cpus =
> > > > > this_cpu_cpumask_var_ptr(select_idle_mask);
> > > > > int i, cpu, idle_cpu = -1, nr = INT_MAX;
> > > > > @@ -6136,23 +6163,32 @@ static int select_idle_cpu(struct
> > > > > task_struct *p, struct sched_domain *sd, int t
> > > > >
> > > > > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > > > >
> > > > > - if (sched_feat(SIS_PROP) && !smt) {
> > > > > - u64 avg_cost, avg_idle, span_avg;
> > > > > + if (!smt) {
> > > > > + if (cpus_share_cache(prev, target)) {
> > > >
> > > > Have you checked the impact on non-SMT systems? It would be
> > > > worth a static branch.
> > > >
> > > > Also, this doesn't need to be in select_idle_cpu(), which aims
> > > > to loop over the sched_domain, because you only compare target
> > > > and prev. So you can move this call to select_idle_smt() into
> > > > select_idle_sibling()
> > >
> > > After Mel's rewrite, there no longer are calls to
> > > select_idle_core() or select_idle_smt() in select_idle_sibling().
> >
> > select_idle_smt() had even disappeared; that's why it was not in
> > select_idle_sibling().
> >
> > > Everything got folded into one single loop in select_idle_cpu()
> >
> > but this is done completely out of the loop so we don't need to
> > complify the function with unrelated stuff
>
> Not entirely. The call to select_idle_smt() is still
> conditional on test_idle_cores() returning false.
>
> We only look for the other sibling if there is no idle
> core in the LLC. If there is an idle core, we prefer that.
>
> Pulling the select_idle_smt() call out of select_idle_cpu()
> would mean having to call test_idle_cores() twice.
In that case, pass the result of test_idle_cores() as a parameter instead of prev.
>
> --
> All Rights Reversed.
* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-04-06 15:26 ` Rik van Riel
2021-04-06 15:31 ` Vincent Guittot
@ 2021-04-07 7:17 ` Peter Zijlstra
2021-04-07 9:41 ` Mel Gorman
2021-04-07 9:42 ` Vincent Guittot
1 sibling, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2021-04-07 7:17 UTC (permalink / raw)
To: Rik van Riel
Cc: Vincent Guittot, Mel Gorman, linux-kernel, Kernel Team,
Ingo Molnar, Valentin Schneider
On Tue, Apr 06, 2021 at 11:26:37AM -0400, Rik van Riel wrote:
> I would be happy to pull the static branch out of select_idle_smt()
> and place it into this if condition, though. You are right that
> would save some overhead on non-smt systems.
>
> Peter, would you prefer a follow-up patch for that or a version 4
> of the patch?
Sorry, I was side-tracked with that core scheduling crap... Something
like the below then?
(Also fixed that stray line-wrap)
---
Subject: sched/fair: Bring back select_idle_smt(), but differently
From: Rik van Riel <riel@surriel.com>
Date: Fri, 26 Mar 2021 15:19:32 -0400
From: Rik van Riel <riel@surriel.com>
Mel Gorman did some nice work in 9fe1f127b913 ("sched/fair: Merge
select_idle_core/cpu()"), resulting in the kernel being more efficient
at finding an idle CPU, and in tasks spending less time waiting to be
run, both according to the schedstats run_delay numbers, and according
to measured application latencies. Yay.
The flip side of this is that we see more task migrations (about 30%
more), higher cache misses, higher memory bandwidth utilization, and
higher CPU use, for the same number of requests/second.
This is most pronounced on a memcache type workload, which saw a
consistent 1-3% increase in total CPU use on the system, due to those
increased task migrations leading to higher L2 cache miss numbers, and
higher memory utilization. The exclusive L3 cache on Skylake does us
no favors there.
On our web serving workload, that effect is usually negligible.
It appears that the increased number of CPU migrations is generally a
good thing, since it leads to lower cpu_delay numbers, reflecting the
fact that tasks get to run faster. However, the reduced locality and
the corresponding increase in L2 cache misses hurts a little.
The patch below appears to fix the regression, while keeping the
benefit of the lower cpu_delay numbers, by reintroducing
select_idle_smt with a twist: when a socket has no idle cores, check
to see if the sibling of "prev" is idle, before searching all the
other CPUs.
This fixes both the occasional 9% regression on the web serving
workload, and the continuous 2% CPU use regression on the memcache
type workload.
With Mel's patches and this patch together, task migrations are still
high, but L2 cache misses, memory bandwidth, and CPU time used are
back down to what they were before. The p95 and p99 response times for
the memcache type application improve by about 10% over what they were
before Mel's patches got merged.
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210326151932.2c187840@imladris.surriel.com
---
kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++--
1 file changed, 37 insertions(+), 2 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6112,6 +6112,27 @@ static int select_idle_core(struct task_
return -1;
}
+/*
+ * Scan the local SMT mask for idle CPUs.
+ */
+static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ int cpu;
+
+ if (!static_branch_likely(&sched_smt_present))
+ return -1;
+
+ for_each_cpu(cpu, cpu_smt_mask(target)) {
+ if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
+ !cpumask_test_cpu(cpu, sched_domain_span(sd)))
+ continue;
+ if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+ return cpu;
+ }
+
+ return -1;
+}
+
#else /* CONFIG_SCHED_SMT */
static inline void set_idle_cores(int cpu, int val)
@@ -6128,6 +6149,11 @@ static inline int select_idle_core(struc
return __select_idle_cpu(core);
}
+static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ return -1;
+}
+
#endif /* CONFIG_SCHED_SMT */
/*
@@ -6135,7 +6161,7 @@ static inline int select_idle_core(struc
* comparing the average scan cost (tracked in sd->avg_scan_cost) against the
* average idle time for this rq (as found in rq->avg_idle).
*/
-static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int prev, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
int i, cpu, idle_cpu = -1, nr = INT_MAX;
@@ -6148,6 +6174,15 @@ static int select_idle_cpu(struct task_s
if (!this_sd)
return -1;
+ /* If we have SMT but there are no idle cores */
+ if (static_branch_likely(&sched_smt_present) && !smt) {
+ if (cpus_share_cache(prev, target)) {
+ i = select_idle_smt(p, sd, prev);
+ if ((unsigned int)i < nr_cpumask_bits)
+ return i;
+ }
+ }
+
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
if (sched_feat(SIS_PROP) && !smt) {
@@ -6321,7 +6356,7 @@ static int select_idle_sibling(struct ta
if (!sd)
return target;
- i = select_idle_cpu(p, sd, target);
+ i = select_idle_cpu(p, sd, prev, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-04-07 7:17 ` Peter Zijlstra
@ 2021-04-07 9:41 ` Mel Gorman
2021-04-07 10:15 ` Peter Zijlstra
2021-04-07 9:42 ` Vincent Guittot
1 sibling, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2021-04-07 9:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rik van Riel, Vincent Guittot, linux-kernel, Kernel Team,
Ingo Molnar, Valentin Schneider
On Wed, Apr 07, 2021 at 09:17:18AM +0200, Peter Zijlstra wrote:
> Subject: sched/fair: Bring back select_idle_smt(), but differently
> From: Rik van Riel <riel@surriel.com>
> Date: Fri, 26 Mar 2021 15:19:32 -0400
>
> From: Rik van Riel <riel@surriel.com>
>
> Mel Gorman did some nice work in 9fe1f127b913 ("sched/fair: Merge
> select_idle_core/cpu()"), resulting in the kernel being more efficient
> at finding an idle CPU, and in tasks spending less time waiting to be
> run, both according to the schedstats run_delay numbers, and according
> to measured application latencies. Yay.
>
> The flip side of this is that we see more task migrations (about 30%
> more), higher cache misses, higher memory bandwidth utilization, and
> higher CPU use, for the same number of requests/second.
>
> This is most pronounced on a memcache type workload, which saw a
> consistent 1-3% increase in total CPU use on the system, due to those
> increased task migrations leading to higher L2 cache miss numbers, and
> higher memory utilization. The exclusive L3 cache on Skylake does us
> no favors there.
>
> On our web serving workload, that effect is usually negligible.
>
> It appears that the increased number of CPU migrations is generally a
> good thing, since it leads to lower cpu_delay numbers, reflecting the
> fact that tasks get to run faster. However, the reduced locality and
> the corresponding increase in L2 cache misses hurts a little.
>
> The patch below appears to fix the regression, while keeping the
> benefit of the lower cpu_delay numbers, by reintroducing
> select_idle_smt with a twist: when a socket has no idle cores, check
> to see if the sibling of "prev" is idle, before searching all the
> other CPUs.
>
> This fixes both the occasional 9% regression on the web serving
> workload, and the continuous 2% CPU use regression on the memcache
> type workload.
>
> With Mel's patches and this patch together, task migrations are still
> high, but L2 cache misses, memory bandwidth, and CPU time used are
> back down to what they were before. The p95 and p99 response times for
> the memcache type application improve by about 10% over what they were
> before Mel's patches got merged.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20210326151932.2c187840@imladris.surriel.com
I think this is still ok and should not invalidate the previous tests on
v3. While test_idle_cores() was checked on target, as long as target/prev
share cache, the test should be equivalent other than there is a minor
race so
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
One minor question below though which previously confused me.
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6112,6 +6112,27 @@ static int select_idle_core(struct task_
> return -1;
> }
>
> +/*
> + * Scan the local SMT mask for idle CPUs.
> + */
> +static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + int cpu;
> +
> + if (!static_branch_likely(&sched_smt_present))
> + return -1;
> +
> + for_each_cpu(cpu, cpu_smt_mask(target)) {
> + if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
> + !cpumask_test_cpu(cpu, sched_domain_span(sd)))
> + continue;
While I know that !cpumask_test_cpu(cpu, sched_domain_span(sd)) was
done previously, I found it hard to believe that the test matters. If
target/prev share the LLC domain, why would the SMT siblings *not*
share an LLC?
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-04-07 9:41 ` Mel Gorman
@ 2021-04-07 10:15 ` Peter Zijlstra
2021-04-07 10:47 ` Mel Gorman
0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2021-04-07 10:15 UTC (permalink / raw)
To: Mel Gorman
Cc: Rik van Riel, Vincent Guittot, linux-kernel, Kernel Team,
Ingo Molnar, Valentin Schneider
On Wed, Apr 07, 2021 at 10:41:06AM +0100, Mel Gorman wrote:
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6112,6 +6112,27 @@ static int select_idle_core(struct task_
> > return -1;
> > }
> >
> > +/*
> > + * Scan the local SMT mask for idle CPUs.
> > + */
> > +static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > + int cpu;
> > +
> > + if (!static_branch_likely(&sched_smt_present))
> > + return -1;
> > +
> > + for_each_cpu(cpu, cpu_smt_mask(target)) {
> > + if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
> > + !cpumask_test_cpu(cpu, sched_domain_span(sd)))
> > + continue;
>
> While I know that !cpumask_test_cpu(cpu, sched_domain_span(sd)) was
> done previously, I found it hard to believe that the test matters. If
> target/prev share the LLC domain, why would the SMT siblings *not*
> share an LLC?
I think the reason for it is that a cpuset might have split the siblings
apart and disabled load-balancing across them or something.
Then the affinity mask can still cross the partition, but we shouldn't
ever move into it through balancing.
* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-04-07 10:15 ` Peter Zijlstra
@ 2021-04-07 10:47 ` Mel Gorman
2021-04-07 12:10 ` Peter Zijlstra
0 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2021-04-07 10:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rik van Riel, Vincent Guittot, linux-kernel, Kernel Team,
Ingo Molnar, Valentin Schneider
On Wed, Apr 07, 2021 at 12:15:13PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 07, 2021 at 10:41:06AM +0100, Mel Gorman wrote:
>
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6112,6 +6112,27 @@ static int select_idle_core(struct task_
> > > return -1;
> > > }
> > >
> > > +/*
> > > + * Scan the local SMT mask for idle CPUs.
> > > + */
> > > +static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> > > +{
> > > + int cpu;
> > > +
> > > + if (!static_branch_likely(&sched_smt_present))
> > > + return -1;
> > > +
> > > + for_each_cpu(cpu, cpu_smt_mask(target)) {
> > > + if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
> > > + !cpumask_test_cpu(cpu, sched_domain_span(sd)))
> > > + continue;
> >
> > While I know that !cpumask_test_cpu(cpu, sched_domain_span(sd)) was
> > done previously, I found it hard to believe that the test matters. If
> > target/prev share the LLC domain, why would the SMT siblings *not*
> > share an LLC?
>
> I think the reason for it is that a cpuset might have split the siblings
> apart and disabled load-balancing across them or something.
>
> Then the affinity mask can still cross the partition, but we shouldn't
> ever move into it through balancing.
Ok, cpusets do split domains. I can't imagine the logic of splitting SMT
siblings across cpusets, but if it's possible, it has to be checked, and
protecting that with cpusets_enabled() would be a little overkill and
possibly miss some other corner case :(
Thanks.
--
Mel Gorman
SUSE Labs
* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-04-07 10:47 ` Mel Gorman
@ 2021-04-07 12:10 ` Peter Zijlstra
0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2021-04-07 12:10 UTC (permalink / raw)
To: Mel Gorman
Cc: Rik van Riel, Vincent Guittot, linux-kernel, Kernel Team,
Ingo Molnar, Valentin Schneider
On Wed, Apr 07, 2021 at 11:47:17AM +0100, Mel Gorman wrote:
> Ok, cpusets do split domains. I can't imagine the logic of splitting SMT
> siblings across cpusets, but if it's possible, it has to be checked, and
> protecting that with cpusets_enabled() would be a little overkill and
> possibly miss some other corner case :(
Quite. The logic is that some people can't be arsed to look at how the
topology is enumerated and simply split logical CPUs by a random number.
Imagine we have 4 cores, with 2 threads each, for 8 logical CPUs.
Suppose you want a partition of 6 'cpus' and 2 spares. Hoping for a 3:1
core split.
If things are enumerated like: core0{smt0, smt1}, core1{smt0, smt1} ...
then 0-5 will get you 3 whole cores.
If OTOH things are enumerated like: smt0{core0, core1, core2, core3},
smt1{...} then 0-5 will get you the loonie side of the moon.
Funny thing is that x86 can iterate either way depending on how the BIOS
monkey filled out the tables.
* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-04-07 7:17 ` Peter Zijlstra
2021-04-07 9:41 ` Mel Gorman
@ 2021-04-07 9:42 ` Vincent Guittot
2021-04-07 9:54 ` Peter Zijlstra
1 sibling, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2021-04-07 9:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rik van Riel, Mel Gorman, linux-kernel, Kernel Team, Ingo Molnar,
Valentin Schneider
On Wed, Apr 07, 2021 at 09:17:18 +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 11:26:37AM -0400, Rik van Riel wrote:
> > I would be happy to pull the static branch out of select_idle_smt()
> > and place it into this if condition, though. You are right that
> > would save some overhead on non-smt systems.
> >
> > Peter, would you prefer a follow-up patch for that or a version 4
> > of the patch?
>
> Sorry, I was side-tracked with that core scheduling crap.. Something
> like the below then?
>
> (Also fixed that stray line-wrap)
>
> ---
> Subject: sched/fair: Bring back select_idle_smt(), but differently
> From: Rik van Riel <riel@surriel.com>
> Date: Fri, 26 Mar 2021 15:19:32 -0400
>
> From: Rik van Riel <riel@surriel.com>
>
> Mel Gorman did some nice work in 9fe1f127b913 ("sched/fair: Merge
> select_idle_core/cpu()"), resulting in the kernel being more efficient
> at finding an idle CPU, and in tasks spending less time waiting to be
> run, both according to the schedstats run_delay numbers, and according
> to measured application latencies. Yay.
>
> The flip side of this is that we see more task migrations (about 30%
> more), higher cache misses, higher memory bandwidth utilization, and
> higher CPU use, for the same number of requests/second.
>
> This is most pronounced on a memcache type workload, which saw a
> consistent 1-3% increase in total CPU use on the system, due to those
> increased task migrations leading to higher L2 cache miss numbers, and
> higher memory utilization. The exclusive L3 cache on Skylake does us
> no favors there.
>
> On our web serving workload, that effect is usually negligible.
>
> It appears that the increased number of CPU migrations is generally a
> good thing, since it leads to lower cpu_delay numbers, reflecting the
> fact that tasks get to run faster. However, the reduced locality and
> the corresponding increase in L2 cache misses hurts a little.
>
> The patch below appears to fix the regression, while keeping the
> benefit of the lower cpu_delay numbers, by reintroducing
> select_idle_smt with a twist: when a socket has no idle cores, check
> to see if the sibling of "prev" is idle, before searching all the
> other CPUs.
>
> This fixes both the occasional 9% regression on the web serving
> workload, and the continuous 2% CPU use regression on the memcache
> type workload.
>
> With Mel's patches and this patch together, task migrations are still
> high, but L2 cache misses, memory bandwidth, and CPU time used are
> back down to what they were before. The p95 and p99 response times for
> the memcache type application improve by about 10% over what they were
> before Mel's patches got merged.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20210326151932.2c187840@imladris.surriel.com
> ---
> kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++--
> 1 file changed, 37 insertions(+), 2 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6112,6 +6112,27 @@ static int select_idle_core(struct task_
> return -1;
> }
>
> +/*
> + * Scan the local SMT mask for idle CPUs.
> + */
> +static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + int cpu;
> +
> + if (!static_branch_likely(&sched_smt_present))
> + return -1;
> +
> + for_each_cpu(cpu, cpu_smt_mask(target)) {
> + if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
> + !cpumask_test_cpu(cpu, sched_domain_span(sd)))
> + continue;
> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> + return cpu;
> + }
> +
> + return -1;
> +}
> +
> #else /* CONFIG_SCHED_SMT */
>
> static inline void set_idle_cores(int cpu, int val)
> @@ -6128,6 +6149,11 @@ static inline int select_idle_core(struc
> return __select_idle_cpu(core);
> }
>
> +static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + return -1;
> +}
> +
> #endif /* CONFIG_SCHED_SMT */
>
> /*
> @@ -6135,7 +6161,7 @@ static inline int select_idle_core(struc
> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> * average idle time for this rq (as found in rq->avg_idle).
> */
> -static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
> +static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int prev, int target)
> {
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> int i, cpu, idle_cpu = -1, nr = INT_MAX;
> @@ -6148,6 +6174,15 @@ static int select_idle_cpu(struct task_s
> if (!this_sd)
> return -1;
>
> + /* If we have SMT but there are no idle cores */
> + if (static_branch_likely(&sched_smt_present) && !smt) {
> + if (cpus_share_cache(prev, target)) {
> + i = select_idle_smt(p, sd, prev);
> + if ((unsigned int)i < nr_cpumask_bits)
> + return i;
> + }
> + }
> +
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>
> if (sched_feat(SIS_PROP) && !smt) {
> @@ -6321,7 +6356,7 @@ static int select_idle_sibling(struct ta
> if (!sd)
> return target;
>
> - i = select_idle_cpu(p, sd, target);
> + i = select_idle_cpu(p, sd, prev, target);
> if ((unsigned)i < nr_cpumask_bits)
> return i;
I would really prefer to keep that out of select_idle_cpu(), which aims to merge
the walk through sd_llc into one single loop. In the case of select_idle_smt(),
this is done outside the loop.
I would prefer the below, which also removes a number of useless and duplicated
static_branch_likely(&sched_smt_present) checks in test_idle_cores() and
select_idle_smt():
---
kernel/sched/fair.c | 48 +++++++++++++++++++++++++++++++++++++--------
1 file changed, 40 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d73bdbb2d40..01d0bacedc8d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6038,11 +6038,9 @@ static inline bool test_idle_cores(int cpu, bool def)
{
struct sched_domain_shared *sds;
- if (static_branch_likely(&sched_smt_present)) {
- sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
- if (sds)
- return READ_ONCE(sds->has_idle_cores);
- }
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (sds)
+ return READ_ONCE(sds->has_idle_cores);
return def;
}
@@ -6112,6 +6110,25 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
return -1;
}
+/*
+ * Scan the local SMT mask for idle CPUs.
+ */
+static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpu_smt_mask(target)) {
+ if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
+ !cpumask_test_cpu(cpu, sched_domain_span(sd)))
+ continue;
+ if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+ return cpu;
+ }
+
+ return -1;
+}
+
#else /* CONFIG_SCHED_SMT */
static inline void set_idle_cores(int cpu, int val)
@@ -6128,6 +6145,11 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
return __select_idle_cpu(core);
}
+static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ return -1;
+}
+
#endif /* CONFIG_SCHED_SMT */
/*
@@ -6135,11 +6157,10 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
* comparing the average scan cost (tracked in sd->avg_scan_cost) against the
* average idle time for this rq (as found in rq->avg_idle).
*/
-static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool smt, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
int i, cpu, idle_cpu = -1, nr = INT_MAX;
- bool smt = test_idle_cores(target, false);
int this = smp_processor_id();
struct sched_domain *this_sd;
u64 time;
@@ -6242,6 +6263,7 @@ static inline bool asym_fits_capacity(int task_util, int cpu)
*/
static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
+ bool smt = false;
struct sched_domain *sd;
unsigned long task_util;
int i, recent_used_cpu;
@@ -6317,11 +6339,21 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
}
}
+ if (static_branch_likely(&sched_smt_present)) {
+ smt = test_idle_cores(target, false);
+ if (!smt && cpus_share_cache(prev, target)) {
+ /* No idle core. Check if prev has an idle sibling. */
+ i = select_idle_smt(p, sd, prev);
+ if ((unsigned int)i < nr_cpumask_bits)
+ return i;
+ }
+ }
+
sd = rcu_dereference(per_cpu(sd_llc, target));
if (!sd)
return target;
- i = select_idle_cpu(p, sd, target);
+ i = select_idle_cpu(p, sd, smt, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
--
2.17.1
>
* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-04-07 9:42 ` Vincent Guittot
@ 2021-04-07 9:54 ` Peter Zijlstra
2021-04-07 9:57 ` Vincent Guittot
2021-04-07 10:19 ` Peter Zijlstra
0 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2021-04-07 9:54 UTC (permalink / raw)
To: Vincent Guittot
Cc: Rik van Riel, Mel Gorman, linux-kernel, Kernel Team, Ingo Molnar,
Valentin Schneider
On Wed, Apr 07, 2021 at 11:42:17AM +0200, Vincent Guittot wrote:
> I would really prefer to keep that out of select_idle_cpu which aims to merge in one
> single loop the walk through sd_llc. In the case of select_idle_smt, this is done outside
> the loop:
Fair enough.
> @@ -6317,11 +6339,21 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> }
> }
>
> + if (static_branch_likely(&sched_smt_present)) {
> + smt = test_idle_cores(target, false);
> + if (!smt && cpus_share_cache(prev, target)) {
> + /* No idle core. Check if prev has an idle sibling. */
> + i = select_idle_smt(p, sd, prev);
> + if ((unsigned int)i < nr_cpumask_bits)
> + return i;
> + }
> + }
> +
> sd = rcu_dereference(per_cpu(sd_llc, target));
> if (!sd)
> return target;
It needs to be here, otherwise you're using @sd uninitialized.
> - i = select_idle_cpu(p, sd, target);
> + i = select_idle_cpu(p, sd, smt, target);
> if ((unsigned)i < nr_cpumask_bits)
> return i;
Let me have another poke at it.
* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-04-07 9:54 ` Peter Zijlstra
@ 2021-04-07 9:57 ` Vincent Guittot
2021-04-07 10:19 ` Peter Zijlstra
1 sibling, 0 replies; 26+ messages in thread
From: Vincent Guittot @ 2021-04-07 9:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rik van Riel, Mel Gorman, linux-kernel, Kernel Team, Ingo Molnar,
Valentin Schneider
On Wed, 7 Apr 2021 at 11:55, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Apr 07, 2021 at 11:42:17AM +0200, Vincent Guittot wrote:
> > I would really prefer to keep that out of select_idle_cpu which aims to merge in one
> > single loop the walk through sd_llc. In the case of select_idle_smt, this is done outside
> > the loop:
>
> Fair enough.
>
> > @@ -6317,11 +6339,21 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > }
> > }
> >
> > + if (static_branch_likely(&sched_smt_present)) {
> > + smt = test_idle_cores(target, false);
> > + if (!smt && cpus_share_cache(prev, target)) {
> > + /* No idle core. Check if prev has an idle sibling. */
> > + i = select_idle_smt(p, sd, prev);
> > + if ((unsigned int)i < nr_cpumask_bits)
> > + return i;
> > + }
> > + }
> > +
> > sd = rcu_dereference(per_cpu(sd_llc, target));
> > if (!sd)
> > return target;
>
> It needs to be here, otherwise you're using @sd uninitialized.
argh yes...
>
> > - i = select_idle_cpu(p, sd, target);
> > + i = select_idle_cpu(p, sd, smt, target);
> > if ((unsigned)i < nr_cpumask_bits)
> > return i;
>
> Let me have another poke at it.
* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-04-07 9:54 ` Peter Zijlstra
2021-04-07 9:57 ` Vincent Guittot
@ 2021-04-07 10:19 ` Peter Zijlstra
2021-04-07 10:24 ` Vincent Guittot
2021-04-08 15:52 ` Rik van Riel
1 sibling, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2021-04-07 10:19 UTC (permalink / raw)
To: Vincent Guittot
Cc: Rik van Riel, Mel Gorman, linux-kernel, Kernel Team, Ingo Molnar,
Valentin Schneider
On Wed, Apr 07, 2021 at 11:54:37AM +0200, Peter Zijlstra wrote:
> Let me have another poke at it.
Pretty much what you did, except I also did s/smt/has_idle_core/ and
fixed that @sd thing.
Like so then?
---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6038,11 +6038,9 @@ static inline bool test_idle_cores(int c
{
struct sched_domain_shared *sds;
- if (static_branch_likely(&sched_smt_present)) {
- sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
- if (sds)
- return READ_ONCE(sds->has_idle_cores);
- }
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (sds)
+ return READ_ONCE(sds->has_idle_cores);
return def;
}
@@ -6112,6 +6110,24 @@ static int select_idle_core(struct task_
return -1;
}
+/*
+ * Scan the local SMT mask for idle CPUs.
+ */
+static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpu_smt_mask(target)) {
+ if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
+ !cpumask_test_cpu(cpu, sched_domain_span(sd)))
+ continue;
+ if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+ return cpu;
+ }
+
+ return -1;
+}
+
#else /* CONFIG_SCHED_SMT */
static inline void set_idle_cores(int cpu, int val)
@@ -6128,6 +6144,11 @@ static inline int select_idle_core(struc
return __select_idle_cpu(core);
}
+static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ return -1;
+}
+
#endif /* CONFIG_SCHED_SMT */
/*
@@ -6135,11 +6156,10 @@ static inline int select_idle_core(struc
* comparing the average scan cost (tracked in sd->avg_scan_cost) against the
* average idle time for this rq (as found in rq->avg_idle).
*/
-static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
int i, cpu, idle_cpu = -1, nr = INT_MAX;
- bool smt = test_idle_cores(target, false);
int this = smp_processor_id();
struct sched_domain *this_sd;
u64 time;
@@ -6150,7 +6170,7 @@ static int select_idle_cpu(struct task_s
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
- if (sched_feat(SIS_PROP) && !smt) {
+ if (sched_feat(SIS_PROP) && !has_idle_core) {
u64 avg_cost, avg_idle, span_avg;
/*
@@ -6170,7 +6190,7 @@ static int select_idle_cpu(struct task_s
}
for_each_cpu_wrap(cpu, cpus, target) {
- if (smt) {
+ if (has_idle_core) {
i = select_idle_core(p, cpu, cpus, &idle_cpu);
if ((unsigned int)i < nr_cpumask_bits)
return i;
@@ -6184,10 +6204,10 @@ static int select_idle_cpu(struct task_s
}
}
- if (smt)
+ if (has_idle_core)
set_idle_cores(this, false);
- if (sched_feat(SIS_PROP) && !smt) {
+ if (sched_feat(SIS_PROP) && !has_idle_core) {
time = cpu_clock(this) - time;
update_avg(&this_sd->avg_scan_cost, time);
}
@@ -6242,6 +6262,7 @@ static inline bool asym_fits_capacity(in
*/
static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
+ bool has_idle_core = false;
struct sched_domain *sd;
unsigned long task_util;
int i, recent_used_cpu;
@@ -6321,7 +6342,17 @@ static int select_idle_sibling(struct ta
if (!sd)
return target;
- i = select_idle_cpu(p, sd, target);
+ if (static_branch_likely(&sched_smt_present)) {
+ has_idle_core = test_idle_cores(target, false);
+
+ if (!has_idle_core && cpus_share_cache(prev, target)) {
+ i = select_idle_smt(p, sd, prev);
+ if ((unsigned int)i < nr_cpumask_bits)
+ return i;
+ }
+ }
+
+ i = select_idle_cpu(p, sd, has_idle_core, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-04-07 10:19 ` Peter Zijlstra
@ 2021-04-07 10:24 ` Vincent Guittot
2021-04-08 15:52 ` Rik van Riel
1 sibling, 0 replies; 26+ messages in thread
From: Vincent Guittot @ 2021-04-07 10:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rik van Riel, Mel Gorman, linux-kernel, Kernel Team, Ingo Molnar,
Valentin Schneider
On Wed, 7 Apr 2021 at 12:19, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Apr 07, 2021 at 11:54:37AM +0200, Peter Zijlstra wrote:
>
> > Let me have another poke at it.
>
> Pretty much what you did, except I also did s/smt/has_idle_core/ and
> fixed that @sd thing.
>
> Like so then?
Yes. Looks good to me
>
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6038,11 +6038,9 @@ static inline bool test_idle_cores(int c
> {
> struct sched_domain_shared *sds;
>
> - if (static_branch_likely(&sched_smt_present)) {
> - sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> - if (sds)
> - return READ_ONCE(sds->has_idle_cores);
> - }
> + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> + if (sds)
> + return READ_ONCE(sds->has_idle_cores);
>
> return def;
> }
> @@ -6112,6 +6110,24 @@ static int select_idle_core(struct task_
> return -1;
> }
>
> +/*
> + * Scan the local SMT mask for idle CPUs.
> + */
> +static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + int cpu;
> +
> + for_each_cpu(cpu, cpu_smt_mask(target)) {
> + if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
> + !cpumask_test_cpu(cpu, sched_domain_span(sd)))
> + continue;
> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> + return cpu;
> + }
> +
> + return -1;
> +}
> +
> #else /* CONFIG_SCHED_SMT */
>
> static inline void set_idle_cores(int cpu, int val)
> @@ -6128,6 +6144,11 @@ static inline int select_idle_core(struc
> return __select_idle_cpu(core);
> }
>
> +static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + return -1;
> +}
> +
> #endif /* CONFIG_SCHED_SMT */
>
> /*
> @@ -6135,11 +6156,10 @@ static inline int select_idle_core(struc
> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> * average idle time for this rq (as found in rq->avg_idle).
> */
> -static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
> +static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
> {
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> int i, cpu, idle_cpu = -1, nr = INT_MAX;
> - bool smt = test_idle_cores(target, false);
> int this = smp_processor_id();
> struct sched_domain *this_sd;
> u64 time;
> @@ -6150,7 +6170,7 @@ static int select_idle_cpu(struct task_s
>
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>
> - if (sched_feat(SIS_PROP) && !smt) {
> + if (sched_feat(SIS_PROP) && !has_idle_core) {
> u64 avg_cost, avg_idle, span_avg;
>
> /*
> @@ -6170,7 +6190,7 @@ static int select_idle_cpu(struct task_s
> }
>
> for_each_cpu_wrap(cpu, cpus, target) {
> - if (smt) {
> + if (has_idle_core) {
> i = select_idle_core(p, cpu, cpus, &idle_cpu);
> if ((unsigned int)i < nr_cpumask_bits)
> return i;
> @@ -6184,10 +6204,10 @@ static int select_idle_cpu(struct task_s
> }
> }
>
> - if (smt)
> + if (has_idle_core)
> set_idle_cores(this, false);
>
> - if (sched_feat(SIS_PROP) && !smt) {
> + if (sched_feat(SIS_PROP) && !has_idle_core) {
> time = cpu_clock(this) - time;
> update_avg(&this_sd->avg_scan_cost, time);
> }
> @@ -6242,6 +6262,7 @@ static inline bool asym_fits_capacity(in
> */
> static int select_idle_sibling(struct task_struct *p, int prev, int target)
> {
> + bool has_idle_core = false;
> struct sched_domain *sd;
> unsigned long task_util;
> int i, recent_used_cpu;
> @@ -6321,7 +6342,17 @@ static int select_idle_sibling(struct ta
> if (!sd)
> return target;
>
> - i = select_idle_cpu(p, sd, target);
> + if (static_branch_likely(&sched_smt_present)) {
> + has_idle_core = test_idle_cores(target, false);
> +
> + if (!has_idle_core && cpus_share_cache(prev, target)) {
> + i = select_idle_smt(p, sd, prev);
> + if ((unsigned int)i < nr_cpumask_bits)
> + return i;
> + }
> + }
> +
> + i = select_idle_cpu(p, sd, has_idle_core, target);
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
* Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
2021-04-07 10:19 ` Peter Zijlstra
2021-04-07 10:24 ` Vincent Guittot
@ 2021-04-08 15:52 ` Rik van Riel
1 sibling, 0 replies; 26+ messages in thread
From: Rik van Riel @ 2021-04-08 15:52 UTC (permalink / raw)
To: Peter Zijlstra, Vincent Guittot
Cc: Mel Gorman, linux-kernel, Kernel Team, Ingo Molnar, Valentin Schneider
On Wed, 2021-04-07 at 12:19 +0200, Peter Zijlstra wrote:
> On Wed, Apr 07, 2021 at 11:54:37AM +0200, Peter Zijlstra wrote:
>
> > Let me have another poke at it.
>
> Pretty much what you did, except I also did s/smt/has_idle_core/ and
> fixed that @sd thing.
>
> Like so then?
Looks good to me. Thank you.
--
All Rights Reversed.
* [tip: sched/core] sched/fair: Bring back select_idle_smt(), but differently
2021-03-26 19:19 ` [PATCH v3] " Rik van Riel
2021-03-28 15:36 ` Mel Gorman
2021-04-06 15:10 ` Vincent Guittot
@ 2021-04-09 11:24 ` tip-bot2 for Rik van Riel
2021-04-09 16:14 ` tip-bot2 for Rik van Riel
3 siblings, 0 replies; 26+ messages in thread
From: tip-bot2 for Rik van Riel @ 2021-04-09 11:24 UTC (permalink / raw)
To: linux-tip-commits
Cc: Rik van Riel, Peter Zijlstra (Intel),
Mel Gorman, Vincent Guittot, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 6bcd3e21ba278098920d26d4888f5e6f4087c61d
Gitweb: https://git.kernel.org/tip/6bcd3e21ba278098920d26d4888f5e6f4087c61d
Author: Rik van Riel <riel@surriel.com>
AuthorDate: Fri, 26 Mar 2021 15:19:32 -04:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 08 Apr 2021 23:09:44 +02:00
sched/fair: Bring back select_idle_smt(), but differently
Mel Gorman did some nice work in 9fe1f127b913 ("sched/fair: Merge
select_idle_core/cpu()"), resulting in the kernel being more efficient
at finding an idle CPU, and in tasks spending less time waiting to be
run, both according to the schedstats run_delay numbers, and according
to measured application latencies. Yay.
The flip side of this is that we see more task migrations (about 30%
more), higher cache misses, higher memory bandwidth utilization, and
higher CPU use, for the same number of requests/second.
This is most pronounced on a memcache type workload, which saw a
consistent 1-3% increase in total CPU use on the system, due to those
increased task migrations leading to higher L2 cache miss numbers, and
higher memory utilization. The exclusive L3 cache on Skylake does us
no favors there.
On our web serving workload, that effect is usually negligible.
It appears that the increased number of CPU migrations is generally a
good thing, since it leads to lower cpu_delay numbers, reflecting the
fact that tasks get to run faster. However, the reduced locality and
the corresponding increase in L2 cache misses hurts a little.
The patch below appears to fix the regression, while keeping the
benefit of the lower cpu_delay numbers, by reintroducing
select_idle_smt with a twist: when a socket has no idle cores, check
to see if the sibling of "prev" is idle, before searching all the
other CPUs.
This fixes both the occasional 9% regression on the web serving
workload, and the continuous 2% CPU use regression on the memcache
type workload.
With Mel's patches and this patch together, task migrations are still
high, but L2 cache misses, memory bandwidth, and CPU time used are
back down to what they were before. The p95 and p99 response times for
the memcache type application improve by about 10% over what they were
before Mel's patches got merged.
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210326151932.2c187840@imladris.surriel.com
---
kernel/sched/fair.c | 55 ++++++++++++++++++++++++++++++++++----------
1 file changed, 43 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d73bdb..d0bd861 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6038,11 +6038,9 @@ static inline bool test_idle_cores(int cpu, bool def)
{
struct sched_domain_shared *sds;
- if (static_branch_likely(&sched_smt_present)) {
- sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
- if (sds)
- return READ_ONCE(sds->has_idle_cores);
- }
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (sds)
+ return READ_ONCE(sds->has_idle_cores);
return def;
}
@@ -6112,6 +6110,24 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
return -1;
}
+/*
+ * Scan the local SMT mask for idle CPUs.
+ */
+static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpu_smt_mask(target)) {
+ if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
+ !cpumask_test_cpu(cpu, sched_domain_span(sd)))
+ continue;
+ if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+ return cpu;
+ }
+
+ return -1;
+}
+
#else /* CONFIG_SCHED_SMT */
static inline void set_idle_cores(int cpu, int val)
@@ -6128,6 +6144,11 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
return __select_idle_cpu(core);
}
+static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ return -1;
+}
+
#endif /* CONFIG_SCHED_SMT */
/*
@@ -6135,11 +6156,10 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
* comparing the average scan cost (tracked in sd->avg_scan_cost) against the
* average idle time for this rq (as found in rq->avg_idle).
*/
-static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
int i, cpu, idle_cpu = -1, nr = INT_MAX;
- bool smt = test_idle_cores(target, false);
int this = smp_processor_id();
struct sched_domain *this_sd;
u64 time;
@@ -6150,7 +6170,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
- if (sched_feat(SIS_PROP) && !smt) {
+ if (sched_feat(SIS_PROP) && !has_idle_core) {
u64 avg_cost, avg_idle, span_avg;
/*
@@ -6170,7 +6190,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
}
for_each_cpu_wrap(cpu, cpus, target) {
- if (smt) {
+ if (has_idle_core) {
i = select_idle_core(p, cpu, cpus, &idle_cpu);
if ((unsigned int)i < nr_cpumask_bits)
return i;
@@ -6184,10 +6204,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
}
}
- if (smt)
+ if (has_idle_core)
set_idle_cores(this, false);
- if (sched_feat(SIS_PROP) && !smt) {
+ if (sched_feat(SIS_PROP) && !has_idle_core) {
time = cpu_clock(this) - time;
update_avg(&this_sd->avg_scan_cost, time);
}
@@ -6242,6 +6262,7 @@ static inline bool asym_fits_capacity(int task_util, int cpu)
*/
static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
+ bool has_idle_core = false;
struct sched_domain *sd;
unsigned long task_util;
int i, recent_used_cpu;
@@ -6321,7 +6342,17 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (!sd)
return target;
- i = select_idle_cpu(p, sd, target);
+ if (static_branch_likely(&sched_smt_present)) {
+ has_idle_core = test_idle_cores(target, false);
+
+ if (!has_idle_core && cpus_share_cache(prev, target)) {
+ i = select_idle_smt(p, sd, prev);
+ if ((unsigned int)i < nr_cpumask_bits)
+ return i;
+ }
+ }
+
+ i = select_idle_cpu(p, sd, has_idle_core, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [tip: sched/core] sched/fair: Bring back select_idle_smt(), but differently
2021-03-26 19:19 ` [PATCH v3] " Rik van Riel
` (2 preceding siblings ...)
2021-04-09 11:24 ` [tip: sched/core] sched/fair: Bring back select_idle_smt(), " tip-bot2 for Rik van Riel
@ 2021-04-09 16:14 ` tip-bot2 for Rik van Riel
3 siblings, 0 replies; 26+ messages in thread
From: tip-bot2 for Rik van Riel @ 2021-04-09 16:14 UTC (permalink / raw)
To: linux-tip-commits
Cc: Rik van Riel, Peter Zijlstra (Intel),
Mel Gorman, Vincent Guittot, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: c722f35b513f807629603bbf24640b1a48be21b5
Gitweb: https://git.kernel.org/tip/c722f35b513f807629603bbf24640b1a48be21b5
Author: Rik van Riel <riel@surriel.com>
AuthorDate: Fri, 26 Mar 2021 15:19:32 -04:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 09 Apr 2021 18:01:39 +02:00
sched/fair: Bring back select_idle_smt(), but differently
Mel Gorman did some nice work in 9fe1f127b913 ("sched/fair: Merge
select_idle_core/cpu()"), resulting in the kernel being more efficient
at finding an idle CPU, and in tasks spending less time waiting to be
run, both according to the schedstats run_delay numbers, and according
to measured application latencies. Yay.
The flip side of this is that we see more task migrations (about 30%
more), higher cache misses, higher memory bandwidth utilization, and
higher CPU use, for the same number of requests/second.
This is most pronounced on a memcache type workload, which saw a
consistent 1-3% increase in total CPU use on the system, due to those
increased task migrations leading to higher L2 cache miss numbers, and
higher memory utilization. The exclusive L3 cache on Skylake does us
no favors there.
On our web serving workload, that effect is usually negligible.
It appears that the increased number of CPU migrations is generally a
good thing, since it leads to lower cpu_delay numbers, reflecting the
fact that tasks get to run faster. However, the reduced locality and
the corresponding increase in L2 cache misses hurts a little.
The patch below appears to fix the regression, while keeping the
benefit of the lower cpu_delay numbers, by reintroducing
select_idle_smt with a twist: when a socket has no idle cores, check
to see if the sibling of "prev" is idle, before searching all the
other CPUs.
This fixes both the occasional 9% regression on the web serving
workload, and the continuous 2% CPU use regression on the memcache
type workload.
With Mel's patches and this patch together, task migrations are still
high, but L2 cache misses, memory bandwidth, and CPU time used are
back down to what they were before. The p95 and p99 response times for
the memcache type application improve by about 10% over what they were
before Mel's patches got merged.
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210326151932.2c187840@imladris.surriel.com
---
kernel/sched/fair.c | 55 ++++++++++++++++++++++++++++++++++----------
1 file changed, 43 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d73bdb..bc34e35 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6038,11 +6038,9 @@ static inline bool test_idle_cores(int cpu, bool def)
{
struct sched_domain_shared *sds;
- if (static_branch_likely(&sched_smt_present)) {
- sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
- if (sds)
- return READ_ONCE(sds->has_idle_cores);
- }
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (sds)
+ return READ_ONCE(sds->has_idle_cores);
return def;
}
@@ -6112,6 +6110,24 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
return -1;
}
+/*
+ * Scan the local SMT mask for idle CPUs.
+ */
+static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpu_smt_mask(target)) {
+ if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
+ !cpumask_test_cpu(cpu, sched_domain_span(sd)))
+ continue;
+ if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+ return cpu;
+ }
+
+ return -1;
+}
+
#else /* CONFIG_SCHED_SMT */
static inline void set_idle_cores(int cpu, int val)
@@ -6128,6 +6144,11 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
return __select_idle_cpu(core);
}
+static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ return -1;
+}
+
#endif /* CONFIG_SCHED_SMT */
/*
@@ -6135,11 +6156,10 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
* comparing the average scan cost (tracked in sd->avg_scan_cost) against the
* average idle time for this rq (as found in rq->avg_idle).
*/
-static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
int i, cpu, idle_cpu = -1, nr = INT_MAX;
- bool smt = test_idle_cores(target, false);
int this = smp_processor_id();
struct sched_domain *this_sd;
u64 time;
@@ -6150,7 +6170,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
- if (sched_feat(SIS_PROP) && !smt) {
+ if (sched_feat(SIS_PROP) && !has_idle_core) {
u64 avg_cost, avg_idle, span_avg;
/*
@@ -6170,7 +6190,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
}
for_each_cpu_wrap(cpu, cpus, target) {
- if (smt) {
+ if (has_idle_core) {
i = select_idle_core(p, cpu, cpus, &idle_cpu);
if ((unsigned int)i < nr_cpumask_bits)
return i;
@@ -6184,10 +6204,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
}
}
- if (smt)
+ if (has_idle_core)
set_idle_cores(this, false);
- if (sched_feat(SIS_PROP) && !smt) {
+ if (sched_feat(SIS_PROP) && !has_idle_core) {
time = cpu_clock(this) - time;
update_avg(&this_sd->avg_scan_cost, time);
}
@@ -6242,6 +6262,7 @@ static inline bool asym_fits_capacity(int task_util, int cpu)
*/
static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
+ bool has_idle_core = false;
struct sched_domain *sd;
unsigned long task_util;
int i, recent_used_cpu;
@@ -6321,7 +6342,17 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (!sd)
return target;
- i = select_idle_cpu(p, sd, target);
+ if (sched_smt_active()) {
+ has_idle_core = test_idle_cores(target, false);
+
+ if (!has_idle_core && cpus_share_cache(prev, target)) {
+ i = select_idle_smt(p, sd, prev);
+ if ((unsigned int)i < nr_cpumask_bits)
+ return i;
+ }
+ }
+
+ i = select_idle_cpu(p, sd, has_idle_core, target);
if ((unsigned)i < nr_cpumask_bits)
return i;