* [RFC PATCH] sched/fair: first try to fix the scheduling impact of NUMA diameter > 2
@ 2021-01-15 20:36 Barry Song
  2021-01-18 11:13 ` Vincent Guittot
  0 siblings, 1 reply; 8+ messages in thread
From: Barry Song @ 2021-01-15 20:36 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, valentin.schneider, linux-kernel, mgorman
  Cc: linuxarm, Barry Song

This patch is a follow-up to the 3-hops issue reported by Valentin Schneider:
[1] https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com/
[2] https://lore.kernel.org/lkml/20201110184300.15673-1-valentin.schneider@arm.com/

Here is a brief summary of the background:
On a NUMA system requiring 3 hops, a sched_group at the 2-hop NUMA level
may not be a subset of its sched_domain.
For example, take a system with the topology below (two CPUs in each NUMA
node):
node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10

For CPU0, domain-2 spans CPUs 0-5, but its groups span 0-3 and 4-7;
4-7 isn't a subset of 0-5.

CPU0 attaching sched-domain(s):
 domain-0: span=0-1 level=MC
  groups: 0:{ span=0 cap=989 }, 1:{ span=1 cap=1016 }
  domain-1: span=0-3 level=NUMA
   groups: 0:{ span=0-1 cap=2005 }, 2:{ span=2-3 cap=2028 }
   domain-2: span=0-5 level=NUMA
    groups: 0:{ span=0-3 cap=4033 }, 4:{ span=4-7 cap=3909 }
 ERROR: groups don't span domain->span
    domain-3: span=0-7 level=NUMA
     groups: 0:{ span=0-5 mask=0-1 cap=6062 }, 6:{ span=4-7 mask=6-7 cap=3928 }

All other CPUs have the same issue: a sched_group may not be a subset
of its sched_domain.
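
To make the group construction concrete, here is a small standalone sketch
(plain Python, not kernel code; the distance cutoffs and helper names are just
for illustration) that mimics how the NUMA domain spans follow from the
distance table above, and how a group built from a sibling's child domain
leaks outside the domain span:

dist = [
    [10, 12, 20, 22],
    [12, 10, 22, 24],
    [20, 22, 10, 12],
    [22, 24, 12, 10],
]
cpus_of_node = {0: {0, 1}, 1: {2, 3}, 2: {4, 5}, 3: {6, 7}}

def span(node, max_dist):
    # CPUs of all nodes reachable from 'node' within 'max_dist'
    return set().union(*(cpus_of_node[n]
                         for n in range(4) if dist[node][n] <= max_dist))

# CPU0 (node 0): domain-2 covers nodes within distance 20 -> CPUs 0-5
domain2_span = span(0, 20)                 # {0, 1, 2, 3, 4, 5}

# its groups are the child-domain (distance <= 12) spans of the covered nodes
group_node0 = span(0, 12)                  # {0, 1, 2, 3}
group_node2 = span(2, 12)                  # {4, 5, 6, 7}  <- pulls in node 3

print(sorted(group_node2 - domain2_span))  # [6, 7]: the group leaks out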

Here I am trying to figure out the scheduling impact of this issue from
two aspects:
1. find busiest cpu in load_balance
2. find idlest cpu in fork/exec/wake balance

For case 1, load_balance() seems to handle this correctly, as it only puts
the CPUs of the sched_domain into lb_env's cpumask. find_busiest_group() and
find_busiest_queue() therefore end up scanning CPUs within env.cpus only:

static int load_balance(int this_cpu, struct rq *this_rq,
                        struct sched_domain *sd, enum cpu_idle_type idle,
                        int *continue_balancing)
{
	...

 	struct lb_env env = {
		...
		.cpus           = cpus,
		.fbq_type       = all,
		.tasks          = LIST_HEAD_INIT(env.tasks),
	};

	/* added by barry: only cpus in sched_domain are put in lb_env */
	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
	...
	/*
	 * added by barry: the below functions are only scanning cpus
	 * in env.cpus
	 */
	group = find_busiest_group(&env);
	...

	busiest = find_busiest_queue(&env, group);
	...
}

One thing that does look wrong is that update_sg_lb_stats() only sums the
load of CPUs inside the sched_domain, while sgs->group_capacity and
sgs->group_weight count every CPU in the sched_group. update_sg_lb_stats()
then derives group_type and avg_load from that mismatch, so both can be
seriously underestimated. The comments I added in the code below explain
this in detail:

static inline void update_sg_lb_stats()
{
	int i, nr_running, local_group;

	/* added by barry: here it only counts cpu in the sched_domain */
        for_each_cpu_and(i, sched_group_span(group), env->cpus) {
		...
		sgs->group_load += cpu_load(rq);
		sgs->group_util += cpu_util(i);
		sgs->group_runnable += cpu_runnable(rq);
		sgs->sum_h_nr_running += rq->cfs.h_nr_running;
		nr_running = rq->nr_running;
		sgs->sum_nr_running += nr_running;
		...
	}

	...
	/* added by barry: here it counts all cpus, some of which might not be in the domain */
	sgs->group_capacity = group->sgc->capacity;

	sgs->group_weight = group->group_weight;

	/* added by barry: finally the group_type and avg_load could be wrong */

	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);

	if (sgs->group_type == group_overloaded)
		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
		sgs->group_capacity;
	...
}
For example, if only 2 CPUs of a 4-CPU sched_group are inside the sched_domain,
the code uses the load of 2 CPUs to calculate the group_type and avg_load of 4
CPUs, so the sched_group appears much less loaded than it really is.
This patch fixes that by counting only the CPUs within the sched_domain for
group_capacity and group_weight.
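
As a rough back-of-the-envelope illustration (plain Python, numbers invented:
two fully busy CPUs inside the domain, four CPUs in the group, each CPU with
capacity and load 1024), the dilution looks like this:

SCHED_CAPACITY_SCALE = 1024

group_load      = 2 * 1024   # only the 2 CPUs inside the domain get summed
cap_whole_group = 4 * 1024   # but capacity today counts all 4 CPUs of the group
cap_domain_part = 2 * 1024   # capacity of the 2 in-domain CPUs, as in this patch

print(group_load * SCHED_CAPACITY_SCALE // cap_whole_group)  # 512: looks half idle
print(group_load * SCHED_CAPACITY_SCALE // cap_domain_part)  # 1024: fully loaded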

For case 2, find_idlest_group() and find_idlest_group_cpu() don't use the
sched_domain for scanning at all. They scan all CPUs in the sched_group even
though the sched_group isn't a subset of the sched_domain, so they can end up
picking an idle CPU outside the sched_domain but inside the sched_group.
This patch changes them to scan only CPUs within the sched_domain, similar to
what load_balance() does.

For the moment, this is pretty much PoC code to gather feedback.

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 kernel/sched/fair.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..f183dba4961e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5901,7 +5901,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu);
  * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
  */
 static int
-find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
+find_idlest_group_cpu(struct sched_domain *sd, struct sched_group *group, struct task_struct *p, int this_cpu)
 {
 	unsigned long load, min_load = ULONG_MAX;
 	unsigned int min_exit_latency = UINT_MAX;
@@ -5916,6 +5916,10 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+		/* when sched_group isn't a subset of sched_domain */
+		if (!cpumask_test_cpu(i, sched_domain_span(sd)))
+			continue;
+
 		if (sched_idle_cpu(i))
 			return i;
 
@@ -5984,7 +5988,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 			continue;
 		}
 
-		new_cpu = find_idlest_group_cpu(group, p, cpu);
+		new_cpu = find_idlest_group_cpu(sd, group, p, cpu);
 		if (new_cpu == cpu) {
 			/* Now try balancing at a lower domain level of 'cpu': */
 			sd = sd->child;
@@ -8416,6 +8420,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
 			env->flags |= LBF_NOHZ_AGAIN;
 
+		sgs->group_capacity += capacity_of(i);
+		sgs->group_weight++;
 		sgs->group_load += cpu_load(rq);
 		sgs->group_util += cpu_util(i);
 		sgs->group_runnable += cpu_runnable(rq);
@@ -8462,10 +8468,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_asym_packing = 1;
 	}
 
-	sgs->group_capacity = group->sgc->capacity;
-
-	sgs->group_weight = group->group_weight;
-
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
 	/* Computing avg_load makes sense only when group is overloaded */
@@ -8688,10 +8690,12 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 
 	memset(sgs, 0, sizeof(*sgs));
 
-	for_each_cpu(i, sched_group_span(group)) {
+	for_each_cpu_and(i, sched_group_span(group), sched_domain_span(sd)) {
 		struct rq *rq = cpu_rq(i);
 		unsigned int local;
 
+		sgs->group_capacity += capacity_of(i);
+		sgs->group_weight++;
 		sgs->group_load += cpu_load_without(rq, p);
 		sgs->group_util += cpu_util_without(i, p);
 		sgs->group_runnable += cpu_runnable_without(rq, p);
@@ -8715,10 +8719,6 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 		sgs->group_misfit_task_load = 1;
 	}
 
-	sgs->group_capacity = group->sgc->capacity;
-
-	sgs->group_weight = group->group_weight;
-
 	sgs->group_type = group_classify(sd->imbalance_pct, group, sgs);
 
 	/*
-- 
2.25.1



* Re: [RFC PATCH] sched/fair: first try to fix the scheduling impact of NUMA diameter > 2
  2021-01-15 20:36 [RFC PATCH] sched/fair: first try to fix the scheduling impact of NUMA diameter > 2 Barry Song
@ 2021-01-18 11:13 ` Vincent Guittot
  2021-01-18 11:25   ` Song Bao Hua (Barry Song)
  0 siblings, 1 reply; 8+ messages in thread
From: Vincent Guittot @ 2021-01-18 11:13 UTC (permalink / raw)
  To: Barry Song
  Cc: Ingo Molnar, Peter Zijlstra, Dietmar Eggemann, Morten Rasmussen,
	Valentin Schneider, linux-kernel, Mel Gorman, linuxarm

On Fri, 15 Jan 2021 at 21:42, Barry Song <song.bao.hua@hisilicon.com> wrote:
>
> This patch is a follow-up of the 3-hops issue reported by Valentin Schneider:
> [1] https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com/
> [2] https://lore.kernel.org/lkml/20201110184300.15673-1-valentin.schneider@arm.com/
>
> Here is a brief summary of the background:
> For a NUMA system with 3-hops, sched_group for NUMA 2-hops could be not a
> subset of sched_domain.
> For example, for a system with the below topology(two cpus in each NUMA
> node):
> node   0   1   2   3
>   0:  10  12  20  22
>   1:  12  10  22  24
>   2:  20  22  10  12
>   3:  22  24  12  10
>
> For CPU0, domain-2 will span 0-5, but its group will span 0-3, 4-7.
> 4-7 isn't a subset of 0-5.
>
> CPU0 attaching sched-domain(s):
>  domain-0: span=0-1 level=MC
>   groups: 0:{ span=0 cap=989 }, 1:{ span=1 cap=1016 }
>   domain-1: span=0-3 level=NUMA
>    groups: 0:{ span=0-1 cap=2005 }, 2:{ span=2-3 cap=2028 }
>    domain-2: span=0-5 level=NUMA
>     groups: 0:{ span=0-3 cap=4033 }, 4:{ span=4-7 cap=3909 }
>  ERROR: groups don't span domain->span
>     domain-3: span=0-7 level=NUMA
>      groups: 0:{ span=0-5 mask=0-1 cap=6062 }, 6:{ span=4-7 mask=6-7 cap=3928 }
>
> All other cpus also have the same issue: sched_group could be not a subset
> of sched_domain.
>
> Here I am trying to figure out the scheduling impact of this issue from
> two aspects:
> 1. find busiest cpu in load_balance
> 2. find idlest cpu in fork/exec/wake balance

It would be better to fix the error in the sched domain topology instead
of hacking the load balancer to compensate for the topology problem

>
> For case 1, load_balance() seems to be handling this issue correctly as it only
> fills cpus in sched_domain to the cpus of lb_env. Also, find_busiest_group()
> and find_busiest_queue() will result in scanning cpus within env.cpus only:
>
> static int load_balance(int this_cpu, struct rq *this_rq,
>                         struct sched_domain *sd, enum cpu_idle_type idle,
>                         int *continue_balancing)
> {`
>         ...
>
>         struct lb_env env = {
>                 ...
>                 .cpus           = cpus,
>                 .fbq_type       = all,
>                 .tasks          = LIST_HEAD_INIT(env.tasks),
>         };
>
>         /* added by barry: only cpus in sched_domain are put in lb_env */
>         cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>         ...
>         /*
>          * added by barry: the below functions are only scanning cpus
>          * in env.cpus
>          */
>         group = find_busiest_group(&env);
>         ...
>
>         busiest = find_busiest_queue(&env, group);
>         ...
> }
>
> But one thing which looks wrong is that update_sg_lb_stats() is only counting
> tasks in sched_domain, but sgs->group_capacity and sgs->group_weight are
> counting all cpus in the sched_group. Then finally, update_sg_lb_stats()
> uses the load of cpus which are in the sched_domain to calculate group_type
> and avg_load which can be seriously underestimated. This is explained in
> detail as the comments added by me in the code:
>
> static inline void update_sg_lb_stats()
> {
>         int i, nr_running, local_group;
>
>         /* added by barry: here it only counts cpu in the sched_domain */
>         for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>                 ...
>                 sgs->group_load += cpu_load(rq);
>                 sgs->group_util += cpu_util(i);
>                 sgs->group_runnable += cpu_runnable(rq);
>                 sgs->sum_h_nr_running += rq->cfs.h_nr_running;
>                 nr_running = rq->nr_running;
>                 sgs->sum_nr_running += nr_running;
>                 ...
>         }
>
>         ...
>         /* added by barry: here it count all cpus which might not be in the domain */
>         sgs->group_capacity = group->sgc->capacity;
>
>         sgs->group_weight = group->group_weight;
>
>         /* added by barry: finally the group_type and avg_load could be wrong */
>
>         sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>
>         if (sgs->group_type == group_overloaded)
>                 sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
>                 sgs->group_capacity;
>         ...
> }
> For example, if we have 2 cpus in sched_domain and  4 cpus in sched_group, the
> code is using the load of 2 cpus to calculate the group_type and avg_load of 4
> cpus, the sched_group is likely to get much lower load than the real case.
> This patch fixed it by only counting cpus within sched_domain for group_capacity
> and group_weight.
>
> For case 2, find_idlest_group() and find_idlest_group_cpu() don't use sched_domain
> for scanning at all. They are scanning all cpus in the sched_group though sched_group
> isn't a subset of sched_domain. So they can result in picking an idle cpu outside
> the sched_domain but inside the sched_group.
> This patch moved to only scan cpus within the sched_domain, which would be similar
> with load_balance().
>
> For this moment, this is pretty much PoC code to get feedback.
>
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> ---
>  kernel/sched/fair.c | 22 +++++++++++-----------
>  1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04a3ce20da67..f183dba4961e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5901,7 +5901,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu);
>   * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
>   */
>  static int
> -find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> +find_idlest_group_cpu(struct sched_domain *sd, struct sched_group *group, struct task_struct *p, int this_cpu)
>  {
>         unsigned long load, min_load = ULONG_MAX;
>         unsigned int min_exit_latency = UINT_MAX;
> @@ -5916,6 +5916,10 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>
>         /* Traverse only the allowed CPUs */
>         for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
> +               /* when sched_group isn't a subset of sched_domain */
> +               if (!cpumask_test_cpu(i, sched_domain_span(sd)))
> +                       continue;
> +
>                 if (sched_idle_cpu(i))
>                         return i;
>
> @@ -5984,7 +5988,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>                         continue;
>                 }
>
> -               new_cpu = find_idlest_group_cpu(group, p, cpu);
> +               new_cpu = find_idlest_group_cpu(sd, group, p, cpu);
>                 if (new_cpu == cpu) {
>                         /* Now try balancing at a lower domain level of 'cpu': */
>                         sd = sd->child;
> @@ -8416,6 +8420,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>                 if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
>                         env->flags |= LBF_NOHZ_AGAIN;
>
> +               sgs->group_capacity += capacity_of(i);
> +               sgs->group_weight++;
>                 sgs->group_load += cpu_load(rq);
>                 sgs->group_util += cpu_util(i);
>                 sgs->group_runnable += cpu_runnable(rq);
> @@ -8462,10 +8468,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>                 sgs->group_asym_packing = 1;
>         }
>
> -       sgs->group_capacity = group->sgc->capacity;
> -
> -       sgs->group_weight = group->group_weight;
> -
>         sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>
>         /* Computing avg_load makes sense only when group is overloaded */
> @@ -8688,10 +8690,12 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
>
>         memset(sgs, 0, sizeof(*sgs));
>
> -       for_each_cpu(i, sched_group_span(group)) {
> +       for_each_cpu_and(i, sched_group_span(group), sched_domain_span(sd)) {
>                 struct rq *rq = cpu_rq(i);
>                 unsigned int local;
>
> +               sgs->group_capacity += capacity_of(i);
> +               sgs->group_weight++;
>                 sgs->group_load += cpu_load_without(rq, p);
>                 sgs->group_util += cpu_util_without(i, p);
>                 sgs->group_runnable += cpu_runnable_without(rq, p);
> @@ -8715,10 +8719,6 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
>                 sgs->group_misfit_task_load = 1;
>         }
>
> -       sgs->group_capacity = group->sgc->capacity;
> -
> -       sgs->group_weight = group->group_weight;
> -
>         sgs->group_type = group_classify(sd->imbalance_pct, group, sgs);
>
>         /*
> --
> 2.25.1
>


* RE: [RFC PATCH] sched/fair: first try to fix the scheduling impact of NUMA diameter > 2
  2021-01-18 11:13 ` Vincent Guittot
@ 2021-01-18 11:25   ` Song Bao Hua (Barry Song)
  2021-01-21 18:14     ` Valentin Schneider
  0 siblings, 1 reply; 8+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-01-18 11:25 UTC (permalink / raw)
  To: Vincent Guittot, Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Dietmar Eggemann, Morten Rasmussen,
	linux-kernel, Mel Gorman, linuxarm



> -----Original Message-----
> From: Vincent Guittot [mailto:vincent.guittot@linaro.org]
> Sent: Tuesday, January 19, 2021 12:14 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: Ingo Molnar <mingo@kernel.org>; Peter Zijlstra <peterz@infradead.org>;
> Dietmar Eggemann <dietmar.eggemann@arm.com>; Morten Rasmussen
> <morten.rasmussen@arm.com>; Valentin Schneider <valentin.schneider@arm.com>;
> linux-kernel <linux-kernel@vger.kernel.org>; Mel Gorman <mgorman@suse.de>;
> linuxarm@openeuler.org
> Subject: Re: [RFC PATCH] sched/fair: first try to fix the scheduling impact
> of NUMA diameter > 2
> 
> On Fri, 15 Jan 2021 at 21:42, Barry Song <song.bao.hua@hisilicon.com> wrote:
> >
> > This patch is a follow-up of the 3-hops issue reported by Valentin Schneider:
> > [1] https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com/
> > [2]
> https://lore.kernel.org/lkml/20201110184300.15673-1-valentin.schneider@arm
> .com/
> >
> > Here is a brief summary of the background:
> > For a NUMA system with 3-hops, sched_group for NUMA 2-hops could be not a
> > subset of sched_domain.
> > For example, for a system with the below topology(two cpus in each NUMA
> > node):
> > node   0   1   2   3
> >   0:  10  12  20  22
> >   1:  12  10  22  24
> >   2:  20  22  10  12
> >   3:  22  24  12  10
> >
> > For CPU0, domain-2 will span 0-5, but its group will span 0-3, 4-7.
> > 4-7 isn't a subset of 0-5.
> >
> > CPU0 attaching sched-domain(s):
> >  domain-0: span=0-1 level=MC
> >   groups: 0:{ span=0 cap=989 }, 1:{ span=1 cap=1016 }
> >   domain-1: span=0-3 level=NUMA
> >    groups: 0:{ span=0-1 cap=2005 }, 2:{ span=2-3 cap=2028 }
> >    domain-2: span=0-5 level=NUMA
> >     groups: 0:{ span=0-3 cap=4033 }, 4:{ span=4-7 cap=3909 }
> >  ERROR: groups don't span domain->span
> >     domain-3: span=0-7 level=NUMA
> >      groups: 0:{ span=0-5 mask=0-1 cap=6062 }, 6:{ span=4-7 mask=6-7 cap=3928 }
> >
> > All other cpus also have the same issue: sched_group could be not a subset
> > of sched_domain.
> >
> > Here I am trying to figure out the scheduling impact of this issue from
> > two aspects:
> > 1. find busiest cpu in load_balance
> > 2. find idlest cpu in fork/exec/wake balance
> 
> Would be better to fix the error in the sched domain topology instead
> of hacking the load balance to compensate the topology problem

I think Valentin Schneider tried to do that before, but it didn't work out. It
would add some new groups which wouldn't be managed by the current
update_group_capacity()? @Valentin, would you like to share more details?

On the other hand, another purpose of this RFC is to dig into more detail
about how the 3-hops issue affects the behaviour of the scheduler. In
Valentin's original thread, I don't think we ever figured out how the issue
really impacts scheduling.

> 
> >
> > For case 1, load_balance() seems to be handling this issue correctly as it
> only
> > fills cpus in sched_domain to the cpus of lb_env. Also, find_busiest_group()
> > and find_busiest_queue() will result in scanning cpus within env.cpus only:
> >
> > static int load_balance(int this_cpu, struct rq *this_rq,
> >                         struct sched_domain *sd, enum cpu_idle_type idle,
> >                         int *continue_balancing)
> > {`
> >         ...
> >
> >         struct lb_env env = {
> >                 ...
> >                 .cpus           = cpus,
> >                 .fbq_type       = all,
> >                 .tasks          = LIST_HEAD_INIT(env.tasks),
> >         };
> >
> >         /* added by barry: only cpus in sched_domain are put in lb_env */
> >         cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
> >         ...
> >         /*
> >          * added by barry: the below functions are only scanning cpus
> >          * in env.cpus
> >          */
> >         group = find_busiest_group(&env);
> >         ...
> >
> >         busiest = find_busiest_queue(&env, group);
> >         ...
> > }
> >
> > But one thing which looks wrong is that update_sg_lb_stats() is only counting
> > tasks in sched_domain, but sgs->group_capacity and sgs->group_weight are
> > counting all cpus in the sched_group. Then finally, update_sg_lb_stats()
> > uses the load of cpus which are in the sched_domain to calculate group_type
> > and avg_load which can be seriously underestimated. This is explained in
> > detail as the comments added by me in the code:
> >
> > static inline void update_sg_lb_stats()
> > {
> >         int i, nr_running, local_group;
> >
> >         /* added by barry: here it only counts cpu in the sched_domain */
> >         for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> >                 ...
> >                 sgs->group_load += cpu_load(rq);
> >                 sgs->group_util += cpu_util(i);
> >                 sgs->group_runnable += cpu_runnable(rq);
> >                 sgs->sum_h_nr_running += rq->cfs.h_nr_running;
> >                 nr_running = rq->nr_running;
> >                 sgs->sum_nr_running += nr_running;
> >                 ...
> >         }
> >
> >         ...
> >         /* added by barry: here it count all cpus which might not be in the
> domain */
> >         sgs->group_capacity = group->sgc->capacity;
> >
> >         sgs->group_weight = group->group_weight;
> >
> >         /* added by barry: finally the group_type and avg_load could be wrong
> */
> >
> >         sgs->group_type = group_classify(env->sd->imbalance_pct, group,
> sgs);
> >
> >         if (sgs->group_type == group_overloaded)
> >                 sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
> >                 sgs->group_capacity;
> >         ...
> > }
> > For example, if we have 2 cpus in sched_domain and  4 cpus in sched_group,
> the
> > code is using the load of 2 cpus to calculate the group_type and avg_load
> of 4
> > cpus, the sched_group is likely to get much lower load than the real case.
> > This patch fixed it by only counting cpus within sched_domain for
> group_capacity
> > and group_weight.
> >
> > For case 2, find_idlest_group() and find_idlest_group_cpu() don't use
> sched_domain
> > for scanning at all. They are scanning all cpus in the sched_group though
> sched_group
> > isn't a subset of sched_domain. So they can result in picking an idle cpu
> outside
> > the sched_domain but inside the sched_group.
> > This patch moved to only scan cpus within the sched_domain, which would be
> similar
> > with load_balance().
> >
> > For this moment, this is pretty much PoC code to get feedback.
> >
> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > ---
> >  kernel/sched/fair.c | 22 +++++++++++-----------
> >  1 file changed, 11 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04a3ce20da67..f183dba4961e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5901,7 +5901,7 @@ find_idlest_group(struct sched_domain *sd, struct
> task_struct *p, int this_cpu);
> >   * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
> >   */
> >  static int
> > -find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int
> this_cpu)
> > +find_idlest_group_cpu(struct sched_domain *sd, struct sched_group *group,
> struct task_struct *p, int this_cpu)
> >  {
> >         unsigned long load, min_load = ULONG_MAX;
> >         unsigned int min_exit_latency = UINT_MAX;
> > @@ -5916,6 +5916,10 @@ find_idlest_group_cpu(struct sched_group *group,
> struct task_struct *p, int this
> >
> >         /* Traverse only the allowed CPUs */
> >         for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
> > +               /* when sched_group isn't a subset of sched_domain */
> > +               if (!cpumask_test_cpu(i, sched_domain_span(sd)))
> > +                       continue;
> > +
> >                 if (sched_idle_cpu(i))
> >                         return i;
> >
> > @@ -5984,7 +5988,7 @@ static inline int find_idlest_cpu(struct sched_domain
> *sd, struct task_struct *p
> >                         continue;
> >                 }
> >
> > -               new_cpu = find_idlest_group_cpu(group, p, cpu);
> > +               new_cpu = find_idlest_group_cpu(sd, group, p, cpu);
> >                 if (new_cpu == cpu) {
> >                         /* Now try balancing at a lower domain level of 'cpu':
> */
> >                         sd = sd->child;
> > @@ -8416,6 +8420,8 @@ static inline void update_sg_lb_stats(struct lb_env
> *env,
> >                 if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq,
> false))
> >                         env->flags |= LBF_NOHZ_AGAIN;
> >
> > +               sgs->group_capacity += capacity_of(i);
> > +               sgs->group_weight++;
> >                 sgs->group_load += cpu_load(rq);
> >                 sgs->group_util += cpu_util(i);
> >                 sgs->group_runnable += cpu_runnable(rq);
> > @@ -8462,10 +8468,6 @@ static inline void update_sg_lb_stats(struct lb_env
> *env,
> >                 sgs->group_asym_packing = 1;
> >         }
> >
> > -       sgs->group_capacity = group->sgc->capacity;
> > -
> > -       sgs->group_weight = group->group_weight;
> > -
> >         sgs->group_type = group_classify(env->sd->imbalance_pct, group,
> sgs);
> >
> >         /* Computing avg_load makes sense only when group is overloaded */
> > @@ -8688,10 +8690,12 @@ static inline void update_sg_wakeup_stats(struct
> sched_domain *sd,
> >
> >         memset(sgs, 0, sizeof(*sgs));
> >
> > -       for_each_cpu(i, sched_group_span(group)) {
> > +       for_each_cpu_and(i, sched_group_span(group), sched_domain_span(sd))
> {
> >                 struct rq *rq = cpu_rq(i);
> >                 unsigned int local;
> >
> > +               sgs->group_capacity += capacity_of(i);
> > +               sgs->group_weight++;
> >                 sgs->group_load += cpu_load_without(rq, p);
> >                 sgs->group_util += cpu_util_without(i, p);
> >                 sgs->group_runnable += cpu_runnable_without(rq, p);
> > @@ -8715,10 +8719,6 @@ static inline void update_sg_wakeup_stats(struct
> sched_domain *sd,
> >                 sgs->group_misfit_task_load = 1;
> >         }
> >
> > -       sgs->group_capacity = group->sgc->capacity;
> > -
> > -       sgs->group_weight = group->group_weight;
> > -
> >         sgs->group_type = group_classify(sd->imbalance_pct, group, sgs);
> >
> >         /*
> > --
> > 2.25.1
> >

Thanks
Barry



* RE: [RFC PATCH] sched/fair: first try to fix the scheduling impact of NUMA diameter > 2
  2021-01-18 11:25   ` Song Bao Hua (Barry Song)
@ 2021-01-21 18:14     ` Valentin Schneider
  2021-01-22  2:53       ` Song Bao Hua (Barry Song)
  2021-01-25  3:13       ` Song Bao Hua (Barry Song)
  0 siblings, 2 replies; 8+ messages in thread
From: Valentin Schneider @ 2021-01-21 18:14 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song), Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Dietmar Eggemann, Morten Rasmussen,
	linux-kernel, Mel Gorman, linuxarm


Hi,

On 18/01/21 11:25, Song Bao Hua wrote:
>> -----Original Message-----
>> From: Vincent Guittot [mailto:vincent.guittot@linaro.org]
>> Sent: Tuesday, January 19, 2021 12:14 AM
>> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
>> Cc: Ingo Molnar <mingo@kernel.org>; Peter Zijlstra <peterz@infradead.org>;
>> Dietmar Eggemann <dietmar.eggemann@arm.com>; Morten Rasmussen
>> <morten.rasmussen@arm.com>; Valentin Schneider <valentin.schneider@arm.com>;
>> linux-kernel <linux-kernel@vger.kernel.org>; Mel Gorman <mgorman@suse.de>;
>> linuxarm@openeuler.org
>> Subject: Re: [RFC PATCH] sched/fair: first try to fix the scheduling impact
>> of NUMA diameter > 2
>>
>> On Fri, 15 Jan 2021 at 21:42, Barry Song <song.bao.hua@hisilicon.com> wrote:
>> >
>> > This patch is a follow-up of the 3-hops issue reported by Valentin Schneider:
>> > [1] https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com/
>> > [2]
>> https://lore.kernel.org/lkml/20201110184300.15673-1-valentin.schneider@arm
>> .com/
>> >
>> > Here is a brief summary of the background:
>> > For a NUMA system with 3-hops, sched_group for NUMA 2-hops could be not a
>> > subset of sched_domain.
>> > For example, for a system with the below topology(two cpus in each NUMA
>> > node):
>> > node   0   1   2   3
>> >   0:  10  12  20  22
>> >   1:  12  10  22  24
>> >   2:  20  22  10  12
>> >   3:  22  24  12  10
>> >
>> > For CPU0, domain-2 will span 0-5, but its group will span 0-3, 4-7.
>> > 4-7 isn't a subset of 0-5.
>> >
>> > CPU0 attaching sched-domain(s):
>> >  domain-0: span=0-1 level=MC
>> >   groups: 0:{ span=0 cap=989 }, 1:{ span=1 cap=1016 }
>> >   domain-1: span=0-3 level=NUMA
>> >    groups: 0:{ span=0-1 cap=2005 }, 2:{ span=2-3 cap=2028 }
>> >    domain-2: span=0-5 level=NUMA
>> >     groups: 0:{ span=0-3 cap=4033 }, 4:{ span=4-7 cap=3909 }
>> >  ERROR: groups don't span domain->span
>> >     domain-3: span=0-7 level=NUMA
>> >      groups: 0:{ span=0-5 mask=0-1 cap=6062 }, 6:{ span=4-7 mask=6-7 cap=3928 }
>> >
>> > All other cpus also have the same issue: sched_group could be not a subset
>> > of sched_domain.
>> >
>> > Here I am trying to figure out the scheduling impact of this issue from
>> > two aspects:
>> > 1. find busiest cpu in load_balance
>> > 2. find idlest cpu in fork/exec/wake balance
>>
>> Would be better to fix the error in the sched domain topology instead
>> of hacking the load balance to compensate the topology problem
>
> I think Valentin Schneider has tried to do that before, but failed. This will
> add some new groups which won't be managed by current update_group_capacity()?
> @Valentine, would you like to share more details?
>

Sorry for being late to the party, this is gnarly stuff and I can't dive
back into it without spending some time staring at my notes and diagrams...
I did indeed try to fix the group construction, thinking it would "just" be
a matter of changing one mask into another, but it turned out to be quite
trickier.

Let's go back to https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com/

Right now, for that #Case study w/ QEMU platform, we get:

  CPU0-domain1: span=0-2
    group0: span=0-2, mask=0
    group2: span=1-3, mask=2 # CPU3 shouldn't be included
  CPU1-domain1: span=0-3
    group1: span=0-2, mask=1
    group3: span=2-3, mask=3
  CPU2-domain1: span=0-3
    group2: span=1-3, mask=2
    group0: span=0-1, mask=0
  CPU3-domain1: span=0-2
    group3: span=2-3, mask=3
    group1: span=0-2, mask=1 # CPU0 shouldn't be included

We would want to "fix" this into:

  CPU0-domain1
    group0: span=0-2, mask=0
  - group2: span=1-3, mask=2
  + group?: span=1-2, mask=??
  CPU1-domain1
    group1: span=0-2, mask=1
    group3: span=2-3, mask=3
  CPU2-domain1
    group2: span=1-3, mask=2
    group0: span=0-1, mask=0
  CPU3-domain1
    group3: span=2-3, mask=3
  - group1: span=0-2, mask=1
  + group?: span=1-2, mask=??

Note the '?' for the group ID and for the group balance mask. What I tried
to hint at when writing this is that, right now, there is no sane ID nor
balance mask to give to those "new" groups.

The group ID is supposed to be the CPU owning the sched_group_capacity
structure for the unique group span, which right now is the first CPU in
the balance mask - I recommend reading the comments atop
group_balance_cpu(), build_balance_mask() and get_group().

Here, we would end up with 5 unique group spans despite only having 4 CPUs:
our allocation scheme doesn't hold up anymore. This stems from the way we
allocate our topology data: we have a percpu slot per topology level.

Furthermore, these "new" groups won't be the local group of any CPU,
which means update_group_capacity() will never visit them - their capacity
will never be updated.


Here are some possible ways forward:
- Have extra storage in struct sd_data for sched_group_capacity of those
  new, non-local groups. There might be topologies where you'll need to
  store more than one such group per CPU in there.
- During load balance stats update, update the local group *and* all of
  those new, non-local ones.

Conceptually I think this is what would be required, but it does feel very
duct-tape-y...

> On the other hand, another purpose of this RFC is that I also want to dig into
> more details about how the 3-hops issue could affect the behavior of scheduler.
> In Valentine's original thread, I think we haven't figured out how the issue
> will really impact scheduling.
>

I think the long story short was that since you can end up with groups
spanning CPUs farther away than what the domain represents, the load
balance stats computation (to figure out which busiest group to pull from)
can and will be skewered.

There are safeguards to prevent pulling from outside the domain span, but
that doesn't protect the stats.


* RE: [RFC PATCH] sched/fair: first try to fix the scheduling impact of NUMA diameter > 2
  2021-01-21 18:14     ` Valentin Schneider
@ 2021-01-22  2:53       ` Song Bao Hua (Barry Song)
  2021-01-25  3:13       ` Song Bao Hua (Barry Song)
  1 sibling, 0 replies; 8+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-01-22  2:53 UTC (permalink / raw)
  To: Valentin Schneider, Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Dietmar Eggemann, Morten Rasmussen,
	linux-kernel, Mel Gorman, linuxarm



> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> Sent: Friday, January 22, 2021 7:15 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; Vincent Guittot
> <vincent.guittot@linaro.org>
> Cc: Ingo Molnar <mingo@kernel.org>; Peter Zijlstra <peterz@infradead.org>;
> Dietmar Eggemann <dietmar.eggemann@arm.com>; Morten Rasmussen
> <morten.rasmussen@arm.com>; linux-kernel <linux-kernel@vger.kernel.org>; Mel
> Gorman <mgorman@suse.de>; linuxarm@openeuler.org
> Subject: RE: [RFC PATCH] sched/fair: first try to fix the scheduling impact
> of NUMA diameter > 2
> 
> 
> Hi,
> 
> On 18/01/21 11:25, Song Bao Hua wrote:
> >> -----Original Message-----
> >> From: Vincent Guittot [mailto:vincent.guittot@linaro.org]
> >> Sent: Tuesday, January 19, 2021 12:14 AM
> >> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> >> Cc: Ingo Molnar <mingo@kernel.org>; Peter Zijlstra <peterz@infradead.org>;
> >> Dietmar Eggemann <dietmar.eggemann@arm.com>; Morten Rasmussen
> >> <morten.rasmussen@arm.com>; Valentin Schneider
> <valentin.schneider@arm.com>;
> >> linux-kernel <linux-kernel@vger.kernel.org>; Mel Gorman <mgorman@suse.de>;
> >> linuxarm@openeuler.org
> >> Subject: Re: [RFC PATCH] sched/fair: first try to fix the scheduling impact
> >> of NUMA diameter > 2
> >>
> >> On Fri, 15 Jan 2021 at 21:42, Barry Song <song.bao.hua@hisilicon.com> wrote:
> >> >
> >> > This patch is a follow-up of the 3-hops issue reported by Valentin Schneider:
> >> > [1] https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com/
> >> > [2]
> >>
> https://lore.kernel.org/lkml/20201110184300.15673-1-valentin.schneider@arm
> >> .com/
> >> >
> >> > Here is a brief summary of the background:
> >> > For a NUMA system with 3-hops, sched_group for NUMA 2-hops could be not
> a
> >> > subset of sched_domain.
> >> > For example, for a system with the below topology(two cpus in each NUMA
> >> > node):
> >> > node   0   1   2   3
> >> >   0:  10  12  20  22
> >> >   1:  12  10  22  24
> >> >   2:  20  22  10  12
> >> >   3:  22  24  12  10
> >> >
> >> > For CPU0, domain-2 will span 0-5, but its group will span 0-3, 4-7.
> >> > 4-7 isn't a subset of 0-5.
> >> >
> >> > CPU0 attaching sched-domain(s):
> >> >  domain-0: span=0-1 level=MC
> >> >   groups: 0:{ span=0 cap=989 }, 1:{ span=1 cap=1016 }
> >> >   domain-1: span=0-3 level=NUMA
> >> >    groups: 0:{ span=0-1 cap=2005 }, 2:{ span=2-3 cap=2028 }
> >> >    domain-2: span=0-5 level=NUMA
> >> >     groups: 0:{ span=0-3 cap=4033 }, 4:{ span=4-7 cap=3909 }
> >> >  ERROR: groups don't span domain->span
> >> >     domain-3: span=0-7 level=NUMA
> >> >      groups: 0:{ span=0-5 mask=0-1 cap=6062 }, 6:{ span=4-7 mask=6-7
> cap=3928 }
> >> >
> >> > All other cpus also have the same issue: sched_group could be not a subset
> >> > of sched_domain.
> >> >
> >> > Here I am trying to figure out the scheduling impact of this issue from
> >> > two aspects:
> >> > 1. find busiest cpu in load_balance
> >> > 2. find idlest cpu in fork/exec/wake balance
> >>
> >> Would be better to fix the error in the sched domain topology instead
> >> of hacking the load balance to compensate the topology problem
> >
> > I think Valentin Schneider has tried to do that before, but failed. This will
> > add some new groups which won't be managed by current update_group_capacity()?
> > @Valentine, would you like to share more details?
> >
> 
> Sorry for being late to the party, this is gnarly stuff and I can't dive
> back into it without spending some time staring at my notes and diagrams...
> I did indeed try to fix the group construction, thinking it would "just" be
> a matter of changing one mask into another, but it turned out to be quite
> trickier.
> 
> Let's go back to https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com/
> 
> Right now, for that #Case study w/ QEMU platform, we get:
> 
>   CPU0-domain1: span=0-2
>     group0: span=0-2, mask=0
>     group2: span=1-3, mask=2 # CPU3 shouldn't be included
>   CPU1-domain1: span=0-3
>     group1: span=0-2, mask=1
>     group3: span=2-3, mask=3
>   CPU2-domain1: span=0-3
>     group2: span=1-3, mask=2
>     group0: span=0-1, mask=0
>   CPU3-domain1: span=0-2
>     group3: span=2-3, mask=3
>     group1: span=0-2, mask=1 # CPU0 shouldn't be included
> 
> We would want to "fix" this into:
> 
>   CPU0-domain1
>     group0: span=0-2, mask=0
>   - group2: span=1-3, mask=2
>   + group?: span=1-2, mask=??
>   CPU1-domain1
>     group1: span=0-2, mask=1
>     group3: span=2-3, mask=3
>   CPU2-domain1
>     group2: span=1-3, mask=2
>     group0: span=0-1, mask=0
>   CPU3-domain1
>     group3: span=2-3, mask=3
>   - group1: span=0-2, mask=1
>   + group?: span=1-2, mask=??
> 
> Note the '?' for the group ID and for the group balance mask. What I tried
> to hint at when writing this is that, right now, there is no sane ID nor
> balance mask to give to those "new" groups.
> 
> The group ID is supposed to be the CPU owning the sched_group_capacity
> structure for the unique group span, which right now is the first CPU in
> the balance mask - I recommend reading the comments atop
> group_balance_cpu(), build_balance_mask() and get_group().
> 
> Here, we would end up with 5 unique group spans despite only having 4 CPUs:
> our allocation scheme doesn't hold up anymore. This stems from the way we
> allocate our topology data: we have a percpu slot per topology level.
> 
> Furthermore, these "new" groups won't be the local group of any CPU,
> which means update_group_capacity() will never visit them - their capacity
> will never be updated.
> 
> 
> Here are some possible ways forward:
> - Have extra storage in struct sd_data for sched_group_capacity of those
>   new, non-local groups. There might be topologies where you'll need to
>   store more than one such group per CPU in there.
> - During load balance stats update, update the local group *and* all of
>   those new, non-local ones.
> 
> Conceptually I think this is what would be required, but it does feel very
> duct-tape-y...

Yep. The kernel builds the sched_groups of domain[n] from the child domains
(domain[n-1]) of the cpus in the span of domain[n], so the new groups you
add don't have the same span as any child domain[n-1]. Such groups would be
quite odd and would need to be maintained separately.

> 
> > On the other hand, another purpose of this RFC is that I also want to dig
> into
> > more details about how the 3-hops issue could affect the behavior of scheduler.
> > In Valentine's original thread, I think we haven't figured out how the issue
> > will really impact scheduling.
> >
> 
> I think the long story short was that since you can end up with groups
> spanning CPUs farther away than what the domain represents, the load
> balance stats computation (to figure out which busiest group to pull from)
> can and will be skewered.

Yes. My patch points at two places where the load balance stats get skewered.

1. find_busiest_group() in load_balance()
Say the domain span is 0-3 and one of its groups has span 2-5;
CPUs 4 and 5 don't belong to the domain 0-3.

While calculating the load of the group, update_sg_lb_stats() does
(the sum of cpu2 and cpu3) / (capacity of the whole group, cpu2-5).

So the load of group 2-5 gets underestimated. My patch changes this to
(the sum of cpu2 and cpu3) / (the sum of the capacity of cpu2-3)

which gives a reasonably correct load for cpu2-3, the part of group 2-5
that is inside the domain. If this "half" group has high load, another
group still has a chance to pull tasks from cpu2 and cpu3.

2. find_idlest_group() in select_task_rq_fair()
This is mainly for placing a newly forked/exec-ed task on an idle cpu.

In this path, there is no safeguard at all to prevent pushing a task
outside the domain span, and the load calculation counts all cpus in the
group even when some of them are outside the domain:
(the sum of cpu2,3,4,5) / (capacity of the whole group, cpu2-5)

What I have done here is to make update_sg_wakeup_stats() compute
(the sum of cpu2 and cpu3) / (the sum of the capacity of cpu2-3)

and to add a safeguard to prevent pushing a task to cpu 4-5.

> 
> There are safeguards to prevent pulling from outside the domain span, but
> that doesn't protect the stats.

I did see that task pulling won't go outside the domain in find_busiest_queue().
But since the load calculation is wrong, tasks could still be pulled in the
wrong direction.

Thanks
Barry




* RE: [RFC PATCH] sched/fair: first try to fix the scheduling impact of NUMA diameter > 2
  2021-01-21 18:14     ` Valentin Schneider
  2021-01-22  2:53       ` Song Bao Hua (Barry Song)
@ 2021-01-25  3:13       ` Song Bao Hua (Barry Song)
  2021-01-25 12:10         ` Valentin Schneider
  1 sibling, 1 reply; 8+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-01-25  3:13 UTC (permalink / raw)
  To: Valentin Schneider, Vincent Guittot, Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Dietmar Eggemann, Morten Rasmussen,
	linux-kernel, linuxarm

> >
> >
> > Hi,
> >
> > On 18/01/21 11:25, Song Bao Hua wrote:
> > >> -----Original Message-----
> > >> From: Vincent Guittot [mailto:vincent.guittot@linaro.org]
> > >> Sent: Tuesday, January 19, 2021 12:14 AM
> > >> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> > >> Cc: Ingo Molnar <mingo@kernel.org>; Peter Zijlstra <peterz@infradead.org>;
> > >> Dietmar Eggemann <dietmar.eggemann@arm.com>; Morten Rasmussen
> > >> <morten.rasmussen@arm.com>; Valentin Schneider
> > <valentin.schneider@arm.com>;
> > >> linux-kernel <linux-kernel@vger.kernel.org>; Mel Gorman
> <mgorman@suse.de>;
> > >> linuxarm@openeuler.org
> > >> Subject: Re: [RFC PATCH] sched/fair: first try to fix the scheduling impact
> > >> of NUMA diameter > 2
> > >>
> > >> On Fri, 15 Jan 2021 at 21:42, Barry Song <song.bao.hua@hisilicon.com> wrote:
> > >> >
> > >> > This patch is a follow-up of the 3-hops issue reported by Valentin
> Schneider:
> > >> > [1] https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com/
> > >> > [2]
> > >>
> >
> https://lore.kernel.org/lkml/20201110184300.15673-1-valentin.schneider@arm
> > >> .com/
> > >> >
> > >> > Here is a brief summary of the background:
> > >> > For a NUMA system with 3-hops, sched_group for NUMA 2-hops could be not
> > a
> > >> > subset of sched_domain.
> > >> > For example, for a system with the below topology(two cpus in each NUMA
> > >> > node):
> > >> > node   0   1   2   3
> > >> >   0:  10  12  20  22
> > >> >   1:  12  10  22  24
> > >> >   2:  20  22  10  12
> > >> >   3:  22  24  12  10
> > >> >
> > >> > For CPU0, domain-2 will span 0-5, but its group will span 0-3, 4-7.
> > >> > 4-7 isn't a subset of 0-5.
> > >> >
> > >> > CPU0 attaching sched-domain(s):
> > >> >  domain-0: span=0-1 level=MC
> > >> >   groups: 0:{ span=0 cap=989 }, 1:{ span=1 cap=1016 }
> > >> >   domain-1: span=0-3 level=NUMA
> > >> >    groups: 0:{ span=0-1 cap=2005 }, 2:{ span=2-3 cap=2028 }
> > >> >    domain-2: span=0-5 level=NUMA
> > >> >     groups: 0:{ span=0-3 cap=4033 }, 4:{ span=4-7 cap=3909 }
> > >> >  ERROR: groups don't span domain->span
> > >> >     domain-3: span=0-7 level=NUMA
> > >> >      groups: 0:{ span=0-5 mask=0-1 cap=6062 }, 6:{ span=4-7 mask=6-7
> > cap=3928 }
> > >> >
> > >> > All other cpus also have the same issue: sched_group could be not a subset
> > >> > of sched_domain.
> > >> >
> > >> > Here I am trying to figure out the scheduling impact of this issue from
> > >> > two aspects:
> > >> > 1. find busiest cpu in load_balance
> > >> > 2. find idlest cpu in fork/exec/wake balance
> > >>
> > >> Would be better to fix the error in the sched domain topology instead
> > >> of hacking the load balance to compensate the topology problem
> > >
> > > I think Valentin Schneider has tried to do that before, but failed. This
> will
> > > add some new groups which won't be managed by current
> update_group_capacity()?
> > > @Valentine, would you like to share more details?
> > >
> >
> > Sorry for being late to the party, this is gnarly stuff and I can't dive
> > back into it without spending some time staring at my notes and diagrams...
> > I did indeed try to fix the group construction, thinking it would "just" be
> > a matter of changing one mask into another, but it turned out to be quite
> > trickier.
> >
> > Let's go back to https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com/
> >
> > Right now, for that #Case study w/ QEMU platform, we get:
> >
> >   CPU0-domain1: span=0-2
> >     group0: span=0-2, mask=0
> >     group2: span=1-3, mask=2 # CPU3 shouldn't be included
> >   CPU1-domain1: span=0-3
> >     group1: span=0-2, mask=1
> >     group3: span=2-3, mask=3
> >   CPU2-domain1: span=0-3
> >     group2: span=1-3, mask=2
> >     group0: span=0-1, mask=0
> >   CPU3-domain1: span=0-2
> >     group3: span=2-3, mask=3
> >     group1: span=0-2, mask=1 # CPU0 shouldn't be included
> >
> > We would want to "fix" this into:
> >
> >   CPU0-domain1
> >     group0: span=0-2, mask=0
> >   - group2: span=1-3, mask=2
> >   + group?: span=1-2, mask=??
> >   CPU1-domain1
> >     group1: span=0-2, mask=1
> >     group3: span=2-3, mask=3
> >   CPU2-domain1
> >     group2: span=1-3, mask=2
> >     group0: span=0-1, mask=0
> >   CPU3-domain1
> >     group3: span=2-3, mask=3
> >   - group1: span=0-2, mask=1
> >   + group?: span=1-2, mask=??
> >
> > Note the '?' for the group ID and for the group balance mask. What I tried
> > to hint at when writing this is that, right now, there is no sane ID nor
> > balance mask to give to those "new" groups.
> >
> > The group ID is supposed to be the CPU owning the sched_group_capacity
> > structure for the unique group span, which right now is the first CPU in
> > the balance mask - I recommend reading the comments atop
> > group_balance_cpu(), build_balance_mask() and get_group().
> >
> > Here, we would end up with 5 unique group spans despite only having 4 CPUs:
> > our allocation scheme doesn't hold up anymore. This stems from the way we
> > allocate our topology data: we have a percpu slot per topology level.
> >
> > Furthermore, these "new" groups won't be the local group of any CPU,
> > which means update_group_capacity() will never visit them - their capacity
> > will never be updated.
> >
> >
> > Here are some possible ways forward:
> > - Have extra storage in struct sd_data for sched_group_capacity of those
> >   new, non-local groups. There might be topologies where you'll need to
> >   store more than one such group per CPU in there.
> > - During load balance stats update, update the local group *and* all of
> >   those new, non-local ones.
> >
> > Conceptually I think this is what would be required, but it does feel very
> > duct-tape-y...
> 
> Yep. kernel is building sched_groups of domain[n] by using the child domains
> domain[n-1] of those cpus in the span of domain[n].
> so the new groups added by you don't have same span with the child domain
> domain[n-1]. This kind of groups will be quite weird and need be maintained
> separately.

As long as the NUMA diameter is > 2, building sched_groups from a sibling's
child domain will inevitably create a sched_domain with a sched_group that
spans outside the sched_domain:
               +------+         +------+        +-------+       +------+
               | node |  12     |node  | 20     | node  |  12   |node  |
               |  0   +---------+1     +--------+ 2     +-------+3     |
               +------+         +------+        +-------+       +------+

domain0        node0            node1            node2          node3

domain1        node0+1          node0+1          node2+3        node2+3
                                                 +
domain2        node0+1+2                         |
             group: node0+1                      |
               group:node2+3 <-------------------+

When node2 is added to node0's domain2, the kernel uses the child domain
of node2's domain2, which is domain1 (node2+3). Node 3 is outside the
span of node0+1+2.

Could we move to using the *child* domain of the *child* domain of node2's
domain2 to build the sched_group?

I mean:
               +------+         +------+        +-------+       +------+
               | node |  12     |node  | 20     | node  |  12   |node  |
               |  0   +---------+1     +--------+ 2     +-------+3     |
               +------+         +------+        +-------+       +------+

domain0        node0            node1          +- node2          node3
                                               |
domain1        node0+1          node0+1        | node2+3        node2+3
                                               |
domain2        node0+1+2                       |
             group: node0+1                    |
               group:node2 <-------------------+

That way, it seems we wouldn't have to create a new group at all, since we
would just be reusing an existing group?
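
A quick sketch of the idea (plain Python, not kernel code, reusing the
distance-table example from my first mail; the explicit walk over distance
cutoffs is only for illustration):

dist = [[10, 12, 20, 22],
        [12, 10, 22, 24],
        [20, 22, 10, 12],
        [22, 24, 12, 10]]

def span(node, max_dist):
    # nodes reachable from 'node' within 'max_dist'
    return {n for n in range(4) if dist[node][n] <= max_dist}

domain2 = span(0, 20)          # node0's domain2: nodes {0, 1, 2}

# today: the group for the first uncovered node (node2) is its child
# domain span, which drags node 3 outside domain2
print(span(2, 12))             # {2, 3}

# proposal: keep walking down node2's domain hierarchy until it fits
fitting = next(s for s in (span(2, d) for d in (12, 10)) if s <= domain2)
print(fitting)                 # {2}: reuses node2's existing domain0 span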

> 
> >
> > > On the other hand, another purpose of this RFC is that I also want to dig
> > into
> > > more details about how the 3-hops issue could affect the behavior of scheduler.
> > > In Valentine's original thread, I think we haven't figured out how the issue
> > > will really impact scheduling.
> > >
> >
> > I think the long story short was that since you can end up with groups
> > spanning CPUs farther away than what the domain represents, the load
> > balance stats computation (to figure out which busiest group to pull from)
> > can and will be skewered.
> 
> Yes. My patch mentioned two places where load balance stats are skewered.
> 
> 1. find_busiest_group() in load_balance()
> Just in case domain span is 0-3, one of its groups has span 2-5.
> 4 and 5 don't belong to the domain 0-3.
> 
> While calculating the load of group, update_sg_lb_stats() will do
> (the sum of cpu2 and cpu3)/(capacity of the whole group cpu2-5).
> 
> So the load of group2-5 will be underestimated. my patch moved to
> do this:
> (the sum of cpu2 and cpu3)/(the sum of capacity of cpu2-3)
> 
> It actually gets a relatively correct load of cpu2-3 which is a
> part of group 2-5. If this "half" group has high load, another
> group still have chance to pull task from cpu2 and cpu3.
> 
> 2. find_idlest_group() in select_task_rq_fair()
> This is mainly for placing a new forked/exec-ed task on an idle cpu.
> 
> In this path, I found there is totally no safeguard to prevent pushing
> task to outside the domain span. And the load calculation will count all
> cpus in the group which has cpu outside the domain.
> (the sum of cpu2,3,4,5)/(capacity of the whole group cpu2-5)
> 
> What I have done here is moving to do load stats update in
> update_sg_wakeup_stats() by:
> (the sum of cpu2 and cpu3)/(the sum of capacity of cpu2-3)
> 
> and add a safeguard to prevent pushing task to cpu 4-5.
> 
> >
> > There are safeguards to prevent pulling from outside the domain span, but
> > that doesn't protect the stats.
> 
> I did see pulling task won't go outside the domain in find_busiest_queue().
> But since the load calculation is wrong, task pulling could happen in the
> wrong direction.
> 

Thanks
Barry



* RE: [RFC PATCH] sched/fair: first try to fix the scheduling impact of NUMA diameter > 2
  2021-01-25  3:13       ` Song Bao Hua (Barry Song)
@ 2021-01-25 12:10         ` Valentin Schneider
  2021-01-25 21:55           ` Song Bao Hua (Barry Song)
  0 siblings, 1 reply; 8+ messages in thread
From: Valentin Schneider @ 2021-01-25 12:10 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song), Vincent Guittot, Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Dietmar Eggemann, Morten Rasmussen,
	linux-kernel, linuxarm

On 25/01/21 03:13, Song Bao Hua (Barry Song) wrote:
> As long as NUMA diameter > 2, building sched_domain by sibling's child domain
> will definitely create a sched_domain with sched_group which will span
> out of the sched_domain
>                +------+         +------+        +-------+       +------+
>                | node |  12     |node  | 20     | node  |  12   |node  |
>                |  0   +---------+1     +--------+ 2     +-------+3     |
>                +------+         +------+        +-------+       +------+
>
> domain0        node0            node1            node2          node3
>
> domain1        node0+1          node0+1          node2+3        node2+3
>                                                  +
> domain2        node0+1+2                         |
>              group: node0+1                      |
>                group:node2+3 <-------------------+
>
> when node2 is added into the domain2 of node0, kernel is using the child
> domain of node2's domain2, which is domain1(node2+3). Node 3 is outside
> the span of node0+1+2.
>
> Will we move to use the *child* domain of the *child* domain of node2's
> domain2 to build the sched_group?
>
> I mean:
>                +------+         +------+        +-------+       +------+
>                | node |  12     |node  | 20     | node  |  12   |node  |
>                |  0   +---------+1     +--------+ 2     +-------+3     |
>                +------+         +------+        +-------+       +------+
>
> domain0        node0            node1          +- node2          node3
>                                                |
> domain1        node0+1          node0+1        | node2+3        node2+3
>                                                |
> domain2        node0+1+2                       |
>              group: node0+1                    |
>                group:node2 <-------------------+
>
> In this way, it seems we don't have to create a new group as we are just
> reusing the existing group?
>

One thing I've been musing over is pretty much this; that is to say we
would make all non-local NUMA sched_groups span a single node. This would
let us reuse an existing span+sched_group_capacity: the local group of that
node at its first NUMA topology level.

Essentially this means getting rid of the overlapping groups, and the
balance mask is handled the same way as for !NUMA, i.e. it's the local
group span. I've not gone far enough through the thought experiment to see
where it miserably falls apart... It is at the very least violating the
expectation that a group span is a child domain's span - here it can be a
grand^x child domain's span.


If we take your topology, we currently have:

| tl\node | 0            | 1             | 2             | 3            |
|---------+--------------+---------------+---------------+--------------|
| NUMA0   | (0)->(1)     | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2)     |
| NUMA1   | (0-1)->(1-3) | (0-2)->(2-3)  | (1-3)->(0-1)  | (2-3)->(0-2) |
| NUMA2   | (0-2)->(1-3) | N/A           | N/A           | (1-3)->(0-2) |

With the current overlapping group scheme, we would need to make it look
like so:

| tl\node | 0             | 1             | 2             | 3             |
|---------+---------------+---------------+---------------+---------------|
| NUMA0   | (0)->(1)      | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2)      |
| NUMA1   | (0-1)->(1-2)* | (0-2)->(2-3)  | (1-3)->(0-1)  | (2-3)->(1-2)* |
| NUMA2   | (0-2)->(1-3)  | N/A           | N/A           | (1-3)->(0-2)  |

But as already discussed, that's tricky to make work. With the node-span
groups thing, we would turn this into:

| tl\node | 0          | 1             | 2             | 3          |
|---------+------------+---------------+---------------+------------|
| NUMA0   | (0)->(1)   | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2)   |
| NUMA1   | (0-1)->(2) | (0-2)->(3)    | (1-3)->(0)    | (2-3)->(1) |
| NUMA2   | (0-2)->(3) | N/A           | N/A           | (1-3)->(0) |
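
Purely as an editorial illustration of the subset property at stake (bitmasks
stand in for cpumasks, nodes numbered as in the diagram above; this is a toy,
not the topology code itself), the sketch below checks node0's NUMA domain
spanning nodes 0-2 against the remote group built from node2's child domain
(node2+3) versus a single-node group (node2):

#include <stdio.h>

#define NODE(n)		(1u << (n))

/* a group stays usable for this domain only if it is a subset of it */
static int is_subset(unsigned int sub, unsigned int super)
{
	return (sub & ~super) == 0;
}

int main(void)
{
	unsigned int domain_span      = NODE(0) | NODE(1) | NODE(2);
	unsigned int child_span_group = NODE(2) | NODE(3);	/* current scheme */
	unsigned int node_span_group  = NODE(2);		/* node-span scheme */

	printf("child-span group inside domain: %s\n",
	       is_subset(child_span_group, domain_span) ? "yes" : "no");
	printf("node-span group inside domain:  %s\n",
	       is_subset(node_span_group, domain_span) ? "yes" : "no");

	return 0;
}

It prints "no" for the child-span group and "yes" for the single-node group,
which is the attraction of reusing the node-local group here.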



* RE: [RFC PATCH] sched/fair: first try to fix the scheduling impact of NUMA diameter > 2
  2021-01-25 12:10         ` Valentin Schneider
@ 2021-01-25 21:55           ` Song Bao Hua (Barry Song)
  0 siblings, 0 replies; 8+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-01-25 21:55 UTC (permalink / raw)
  To: Valentin Schneider, Vincent Guittot, Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Dietmar Eggemann, Morten Rasmussen,
	linux-kernel, linuxarm



> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> Sent: Tuesday, January 26, 2021 1:11 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; Vincent Guittot
> <vincent.guittot@linaro.org>; Mel Gorman <mgorman@suse.de>
> Cc: Ingo Molnar <mingo@kernel.org>; Peter Zijlstra <peterz@infradead.org>;
> Dietmar Eggemann <dietmar.eggemann@arm.com>; Morten Rasmussen
> <morten.rasmussen@arm.com>; linux-kernel <linux-kernel@vger.kernel.org>;
> linuxarm@openeuler.org
> Subject: RE: [RFC PATCH] sched/fair: first try to fix the scheduling impact
> of NUMA diameter > 2
> 
> On 25/01/21 03:13, Song Bao Hua (Barry Song) wrote:
> > [...]
> 
> One thing I've been musing over is pretty much this; that is to say we
> would make all non-local NUMA sched_groups span a single node. This would
> let us reuse an existing span+sched_group_capacity: the local group of that
> node at its first NUMA topology level.
> 
> Essentially this means getting rid of the overlapping groups, and the
> balance mask is handled the same way as for !NUMA, i.e. it's the local
> group span. I've not gone far enough through the thought experiment to see
> where it miserably falls apart... It is at the very least violating the
> expectation that a group span is a child domain's span - here it can be a
> grand^x child domain's span.
> 
> 
> If we take your topology, we currently have:
> 
> | tl\node | 0            | 1             | 2             | 3            |
> |---------+--------------+---------------+---------------+--------------|
> | NUMA0   | (0)->(1)     | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2)     |
> | NUMA1   | (0-1)->(1-3) | (0-2)->(2-3)  | (1-3)->(0-1)  | (2-3)->(0-2) |
> | NUMA2   | (0-2)->(1-3) | N/A           | N/A           | (1-3)->(0-2) |
> 
> With the current overlapping group scheme, we would need to make it look
> like so:
> 
> | tl\node | 0             | 1             | 2             | 3             |
> |---------+---------------+---------------+---------------+---------------|
> | NUMA0   | (0)->(1)      | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2)      |
> | NUMA1   | (0-1)->(1-2)* | (0-2)->(2-3)  | (1-3)->(0-1)  | (2-3)->(1-2)* |
> | NUMA2   | (0-2)->(1-3)  | N/A           | N/A           | (1-3)->(0-2)  |
> 
> But as already discussed, that's tricky to make work. With the node-span
> groups thing, we would turn this into:
> 
> | tl\node | 0          | 1             | 2             | 3          |
> |---------+------------+---------------+---------------+------------|
> | NUMA0   | (0)->(1)   | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2)   |
> | NUMA1   | (0-1)->(2) | (0-2)->(3)    | (1-3)->(0)    | (2-3)->(1) |
> | NUMA2   | (0-2)->(3) | N/A           | N/A           | (1-3)->(0) |

Actually I didn't mean going that far. What I was thinking is that
we only fix a sched_domain when its sched_group isn't a subset of the
sched_domain; sched_domains which don't have the group-span issue are
left untouched. For NUMA1, we change it as in your diagram, but NUMA2
won't be changed. The concept is like:

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1040,6 +1040,19 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
                }

                sg_span = sched_group_span(sg);
+#if 1
+               /*
+                * If the group built from the sibling's child domain leaks
+                * outside this domain's span, rebuild it one level further
+                * down, from the child of the sibling's child domain, so
+                * that the group stays inside the domain.
+                */
+               if (sibling->child && !cpumask_subset(sg_span, span)) {
+                       sg = build_group_from_child_sched_domain(sibling->child, cpu);
+                       ...
+                       sg_span = sched_group_span(sg);
+               }
+#endif
                cpumask_or(covered, covered, sg_span);

Thanks
Barry
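
Purely as an illustration of that "only fix it when it leaks" idea (bitmasks
stand in for cpumasks, spans taken from the diagrams and tables above;
fixup_group() is just a toy helper, not a kernel function, and the fallback
span in the second call is never used because that group already fits):

#include <stdio.h>

#define NODE(n)		(1u << (n))

/* keep the candidate group if it fits the domain, otherwise fall back
 * to the span built one level further down */
static unsigned int fixup_group(unsigned int domain_span,
				unsigned int group_span,
				unsigned int deeper_span)
{
	if ((group_span & ~domain_span) == 0)
		return group_span;	/* subset: left untouched */
	return deeper_span;		/* leaks out: rebuilt */
}

int main(void)
{
	/* node0's domain2 (nodes 0-2): group node2+3 leaks node3,
	 * so it gets rebuilt as node2 only */
	printf("domain 0-2: group %#x\n",
	       fixup_group(NODE(0) | NODE(1) | NODE(2),
			   NODE(2) | NODE(3),
			   NODE(2)));

	/* node0's top domain (nodes 0-3): group 1-3 already fits,
	 * so it is kept as-is */
	printf("domain 0-3: group %#x\n",
	       fixup_group(NODE(0) | NODE(1) | NODE(2) | NODE(3),
			   NODE(1) | NODE(2) | NODE(3),
			   NODE(2) | NODE(3)));

	return 0;
}

The first call returns 0x4 (node2 alone) and the second returns 0xe
(nodes 1-3 unchanged), matching the "fix NUMA1, leave NUMA2 alone" intent.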


Thread overview: 8+ messages
2021-01-15 20:36 [RFC PATCH] sched/fair: first try to fix the scheduling impact of NUMA diameter > 2 Barry Song
2021-01-18 11:13 ` Vincent Guittot
2021-01-18 11:25   ` Song Bao Hua (Barry Song)
2021-01-21 18:14     ` Valentin Schneider
2021-01-22  2:53       ` Song Bao Hua (Barry Song)
2021-01-25  3:13       ` Song Bao Hua (Barry Song)
2021-01-25 12:10         ` Valentin Schneider
2021-01-25 21:55           ` Song Bao Hua (Barry Song)
