* [PATCH] sched/topology: Set SD_PREFER_SIBLING consistently on non-NUMA levels
@ 2017-11-27 17:29 Morten Rasmussen
  2017-11-29  9:24 ` Vincent Guittot
  0 siblings, 1 reply; 4+ messages in thread
From: Morten Rasmussen @ 2017-11-27 17:29 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, vincent.guittot, mgalbraith, linux-kernel,
	Morten Rasmussen

SD_PREFER_SIBLING adds an additional bias towards spreading tasks on the
_parent_ sched_domain even if a sched_group isn't overloaded. It is
currently set on:

   1. SMT level to promote spreading to sibling cores rather than using
      sibling HW-threads (caff37ef96eac7fe96a).

   2. Non-NUMA levels which don't have the SD_SHARE_PKG_RESOURCES flag
      set (= DIE level in the default topology) as it was found to
      improve benchmarks on certain NUMA systems (6956dc568f34107f1d02b).

   3. Any non-NUMA level that inherits the flag due to elimination of
      its parent sched_domain level in the de-generate step of the
      sched_domain hierarchy set up (= MC level in the default
      topology).

Preferring siblings seems to be a useful tweak for all non-NUMA levels,
so we should enable it on all of them. As it is, it is possible to have
it set on SMT and DIE, but not on MC in between, when using the default
topology.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/topology.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6798276d29af..7f70806bfa0f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1122,7 +1122,7 @@ sd_init(struct sched_domain_topology_level *tl,
 					| 0*SD_SHARE_CPUCAPACITY
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 0*SD_SERIALIZE
-					| 0*SD_PREFER_SIBLING
+					| 1*SD_PREFER_SIBLING
 					| 0*SD_NUMA
 					| sd_flags
 					,
@@ -1153,7 +1153,6 @@ sd_init(struct sched_domain_topology_level *tl,
 	}
 
 	if (sd->flags & SD_SHARE_CPUCAPACITY) {
-		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */
 
@@ -1168,6 +1167,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->busy_idx = 3;
 		sd->idle_idx = 2;
 
+		sd->flags &= ~SD_PREFER_SIBLING;
 		sd->flags |= SD_SERIALIZE;
 		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
@@ -1177,7 +1177,6 @@ sd_init(struct sched_domain_topology_level *tl,
 
 #endif
 	} else {
-		sd->flags |= SD_PREFER_SIBLING;
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
 		sd->idle_idx = 1;
-- 
2.7.4


* Re: [PATCH] sched/topology: Set SD_PREFER_SIBLING consistently on non-NUMA levels
  2017-11-27 17:29 [PATCH] sched/topology: Set SD_PREFER_SIBLING consistently on non-NUMA levels Morten Rasmussen
@ 2017-11-29  9:24 ` Vincent Guittot
  2017-11-29 11:55   ` Morten Rasmussen
  2017-11-29 11:59   ` Morten Rasmussen
  0 siblings, 2 replies; 4+ messages in thread
From: Vincent Guittot @ 2017-11-29  9:24 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, mgalbraith, linux-kernel

Hi Morten,

On 27 November 2017 at 18:29, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> SD_PREFER_SIBLING adds an additional bias towards spreading tasks on the
> _parent_ sched_domain even if a sched_group isn't overloaded. It is
> currently set on:
>
>    1. SMT level to promote spreading to sibling cores rather than using
>       sibling HW-threads (caff37ef96eac7fe96a).
>
>    2. Non-NUMA levels which don't have the SD_SHARE_PKG_RESOURCES flag
>       set (= DIE level in the default topology) as it was found to
>       improve benchmarks on certain NUMA systems (6956dc568f34107f1d02b).

So the goal is to have the flag set on the last non-NUMA level so that
we can spread between the "DIE" groups

>
>    3. Any non-NUMA level that inherits the flag due to elimination of
>       its parent sched_domain level in the de-generate step of the
>       sched_domain hierarchy set up (= MC level in the default
>       topology).

This is to ensure that the last non-NUMA level has the flag when the
DIE one disappears, but the goal is the same as in 2.

>
> Preferring siblings seems to be a useful tweak for all non-NUMA levels,
> so we should enable it on all of them. As it is, it is possible to have
> it set on SMT and DIE, but not on MC in between, when using the default
> topology.

So you want to extend it to all non-NUMA levels, and in particular you
want to spread tasks within each MC group when we have both DIE and MC
levels. Have you got benchmark results showing an improvement, or is
this just to align the topology configuration?
The fact that this flag improves benchmarks at the SMT and NUMA levels
doesn't mean that it will improve things at the MC level as well. We
have the wake_wide/wake_affine machinery that tries to do a similar
thing dynamically, and it regularly improves/regresses benchmarks like
sysbench or hackbench.



>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> ---
>  kernel/sched/topology.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 6798276d29af..7f70806bfa0f 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1122,7 +1122,7 @@ sd_init(struct sched_domain_topology_level *tl,
>                                         | 0*SD_SHARE_CPUCAPACITY
>                                         | 0*SD_SHARE_PKG_RESOURCES
>                                         | 0*SD_SERIALIZE
> -                                       | 0*SD_PREFER_SIBLING
> +                                       | 1*SD_PREFER_SIBLING
>                                         | 0*SD_NUMA
>                                         | sd_flags
>                                         ,
> @@ -1153,7 +1153,6 @@ sd_init(struct sched_domain_topology_level *tl,
>         }
>
>         if (sd->flags & SD_SHARE_CPUCAPACITY) {
> -               sd->flags |= SD_PREFER_SIBLING;
>                 sd->imbalance_pct = 110;
>                 sd->smt_gain = 1178; /* ~15% */
>
> @@ -1168,6 +1167,7 @@ sd_init(struct sched_domain_topology_level *tl,
>                 sd->busy_idx = 3;
>                 sd->idle_idx = 2;
>
> +               sd->flags &= ~SD_PREFER_SIBLING;
>                 sd->flags |= SD_SERIALIZE;
>                 if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
>                         sd->flags &= ~(SD_BALANCE_EXEC |
> @@ -1177,7 +1177,6 @@ sd_init(struct sched_domain_topology_level *tl,
>
>  #endif
>         } else {
> -               sd->flags |= SD_PREFER_SIBLING;
>                 sd->cache_nice_tries = 1;
>                 sd->busy_idx = 2;
>                 sd->idle_idx = 1;
> --
> 2.7.4
>


* Re: [PATCH] sched/topology: Set SD_PREFER_SIBLING consistently on non-NUMA levels
  2017-11-29  9:24 ` Vincent Guittot
@ 2017-11-29 11:55   ` Morten Rasmussen
  2017-11-29 11:59   ` Morten Rasmussen
  1 sibling, 0 replies; 4+ messages in thread
From: Morten Rasmussen @ 2017-11-29 11:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, mgalbraith, linux-kernel

On Wed, Nov 29, 2017 at 10:24:57AM +0100, Vincent Guittot wrote:
> Hi Morten,
>
> On 27 November 2017 at 18:29, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > SD_PREFER_SIBLING adds an additional bias towards spreading tasks on the
> > _parent_ sched_domain even if a sched_group isn't overloaded. It is
> > currently set on:
> >
> >    1. SMT level to promote spreading to sibling cores rather than using
> >       sibling HW-threads (caff37ef96eac7fe96a).
> >
> >    2. Non-NUMA levels which don't have the SD_SHARE_PKG_RESOURCES flag
> >       set (= DIE level in the default topology) as it was found to
> >       improve benchmarks on certain NUMA systems (6956dc568f34107f1d02b).
>
> So the goal is to have the flag set on the last non-NUMA level so that
> we can spread between the "DIE" groups

The flag is already set on "DIE"; the goal is to set it consistently.
As it is, it is set on SMT and DIE, but not on MC:

SMT: SD_PREFER_SIBLING
MC:  !SD_PREFER_SIBLING
DIE: SD_PREFER_SIBLING

So the goal is to spread consistently at all non-NUMA levels regardless
of which levels are present, as it is currently not obvious whether
spreading happens at the MC level, as the list above illustrates.

>
> >
> >    3. Any non-NUMA level that inherits the flag due to elimination of
> >       its parent sched_domain level in the de-generate step of the
> >       sched_domain hierarchy set up (= MC level in the default
> >       topology).
>
> This is to ensure that the last non-NUMA level has the flag when the
> DIE one disappears, but the goal is the same as in 2.

My point was just to summarize the conditions for enabling spreading.
It doesn't guarantee that the last non-NUMA level is spreading. For
example, the x86 numa-in-package topology doesn't have DIE at all, so
there is nothing to inherit. Following the argument used to introduce
the flag on the DIE level, it should be set on MC for those systems.

The whole thing really boils down to: why don't we always want to
spread at all levels inside a single node? It seems that we almost
always do that already, so why not make it truly always?

> > Preferring siblings seems to be a useful tweak for all non-NUMA levels,
> > so we should enable it on all of them. As it is, it is possible to have
> > it set on SMT and DIE, but not on MC in between, when using the default
> > topology.
>
> So you want to extend it to all non-NUMA levels, and in particular you
> want to spread tasks within each MC group when we have both DIE and MC
> levels. Have you got benchmark results showing an improvement, or is
> this just to align the topology configuration?

It is purely to have consistent scheduling behaviour and to make it
clearer how the system topology affects scheduling. SD_PREFER_SIBLING
is currently a behavioural flag, which means that an architecture
cannot modify it; it is set by a set of rules based on the number of
levels used by the architecture and the topology flags that are set.
So should an architecture have a platform with two or more non-SMT
non-NUMA levels, it is currently not possible to make the scheduler
spread on both levels.

> The fact that this flag improves benchmarks at the SMT and NUMA levels
> doesn't mean that it will improve things at the MC level as well. We
> have the wake_wide/wake_affine machinery that tries to do a similar
> thing dynamically, and it regularly improves/regresses benchmarks like
> sysbench or hackbench

True. wake_wide/wake_affine in the wake-up path has some of the same
behaviour as SD_PREFER_SIBLING has for periodic/idle/nohz balancing.
However, that is already consistently enabled for all levels apart
from the NUMA levels where the reclaim distance is too high. So we do
currently allow some spreading, but not consistently.

I can't say that this patch won't cause any regressions. I was hoping
someone would either say that it is definitely a bad idea, with a good
reason why, or agree that we should try it for the benefit of
consistent flag setting. I'm happy to drop the patch if there is a
good explanation for why the flag is set the way it is, but then I
think we have to make it controllable by each architecture.


* Re: [PATCH] sched/topology: Set SD_PREFER_SIBLING consistently on non-NUMA levels
  2017-11-29  9:24 ` Vincent Guittot
  2017-11-29 11:55   ` Morten Rasmussen
@ 2017-11-29 11:59   ` Morten Rasmussen
  1 sibling, 0 replies; 4+ messages in thread
From: Morten Rasmussen @ 2017-11-29 11:59 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, mgalbraith, linux-kernel

[Resent without the disclaimer at the bottom. Sorry!]

On Wed, Nov 29, 2017 at 10:24:57AM +0100, Vincent Guittot wrote:
> Hi Morten,
>
> On 27 November 2017 at 18:29, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > SD_PREFER_SIBLING adds an additional bias towards spreading tasks on the
> > _parent_ sched_domain even if a sched_group isn't overloaded. It is
> > currently set on:
> >
> >    1. SMT level to promote spreading to sibling cores rather than using
> >       sibling HW-threads (caff37ef96eac7fe96a).
> >
> >    2. Non-NUMA levels which don't have the SD_SHARE_PKG_RESOURCES flag
> >       set (= DIE level in the default topology) as it was found to
> >       improve benchmarks on certain NUMA systems (6956dc568f34107f1d02b).
>
> So the goal is to have the flag set on the last non-NUMA level so that
> we can spread between the "DIE" groups

The flag is already set on "DIE"; the goal is to set it consistently.
As it is, it is set on SMT and DIE, but not on MC:

SMT: SD_PREFER_SIBLING
MC:  !SD_PREFER_SIBLING
DIE: SD_PREFER_SIBLING

So the goal is to spread consistently at all non-NUMA levels regardless
of which levels are present, as it is currently not obvious whether
spreading happens at the MC level, as the list above illustrates.

>
> >
> >    3. Any non-NUMA level that inherits the flag due to elimination of
> >       its parent sched_domain level in the de-generate step of the
> >       sched_domain hierarchy set up (= MC level in the default
> >       topology).
>
> This is to ensure that the last non-NUMA level has the flag when the
> DIE one disappears, but the goal is the same as in 2.

My point was just to summarize the conditions for enabling spreading.
It doesn't guarantee that the last non-NUMA level is spreading. For
example, the x86 numa-in-package topology doesn't have DIE at all, so
there is nothing to inherit. Following the argument used to introduce
the flag on the DIE level, it should be set on MC for those systems.

The whole thing really boils down to: why don't we always want to
spread at all levels inside a single node? It seems that we almost
always do that already, so why not make it truly always?

> > Preferring siblings seems to be a useful tweak for all non-NUMA levels,
> > so we should enable it on all of them. As it is, it is possible to have
> > it set on SMT and DIE, but not on MC in between, when using the default
> > topology.
>
> So you want to extend it to all non-NUMA levels, and in particular you
> want to spread tasks within each MC group when we have both DIE and MC
> levels. Have you got benchmark results showing an improvement, or is
> this just to align the topology configuration?

It is purely to have consistent scheduling behaviour and to make it
clearer how the system topology affects scheduling. SD_PREFER_SIBLING
is currently a behavioural flag, which means that an architecture
cannot modify it; it is set by a set of rules based on the number of
levels used by the architecture and the topology flags that are set.
So should an architecture have a platform with two or more non-SMT
non-NUMA levels, it is currently not possible to make the scheduler
spread on both levels.

> The fact that this flag improves benchmarks at the SMT and NUMA levels
> doesn't mean that it will improve things at the MC level as well. We
> have the wake_wide/wake_affine machinery that tries to do a similar
> thing dynamically, and it regularly improves/regresses benchmarks like
> sysbench or hackbench

True. wake_wide/wake_affine in the wake-up path has some of the same
behaviour as SD_PREFER_SIBLING has for periodic/idle/nohz balancing.
However, that is already consistently enabled for all levels apart
from the NUMA levels where the reclaim distance is too high. So we do
currently allow some spreading, but not consistently.

I can't say that this patch won't cause any regressions. I was hoping
someone would either say that it is definitely a bad idea, with a good
reason why, or agree that we should try it for the benefit of
consistent flag setting. I'm happy to drop the patch if there is a
good explanation for why the flag is set the way it is, but then I
think we have to make it controllable by each architecture.


end of thread [~2017-11-29 12:00 UTC]

Thread overview: 4 messages
2017-11-27 17:29 [PATCH] sched/topology: Set SD_PREFER_SIBLING consistently on non-NUMA levels Morten Rasmussen
2017-11-29  9:24 ` Vincent Guittot
2017-11-29 11:55   ` Morten Rasmussen
2017-11-29 11:59   ` Morten Rasmussen
