* SD_LOAD_BALANCE
@ 2020-09-03 14:09 Julia Lawall
  2020-09-03 16:59 ` SD_LOAD_BALANCE Valentin Schneider
  0 siblings, 1 reply; 11+ messages in thread
From: Julia Lawall @ 2020-09-03 14:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Valentin Schneider
  Cc: linux-kernel, Gilles Muller

Uses of SD_LOAD_BALANCE were removed in commit e669ac8ab952 (first
released in v5.8), with the comment:

The SD_LOAD_BALANCE flag is set unconditionally for all domains in
sd_init().

I have the impression that this was not quite true.  The NUMA domain was
not initialized with sd_init(), and didn't have the SD_LOAD_BALANCE flag
set.  The effect is that in v5.8, the for_each_domain loop in
select_task_rq_fair can always end up at the global NUMA domain, and thus
consider any pair of waking cpu (cpu) and previous cpu of the wakee
(prev_cpu) as arguments to wake_affine.  Up to v5.7, this was only
possible if cpu and prev_cpu were together in some lower-level domain,
i.e., sharing the LLC.  As a result, in v5.8 wake_affine can often end up
choosing as a target a core that does not share the LLC with the core
where the thread ran previously.  Threads then move around a lot between
the different sockets.

Was this intentional?

The effect can be seen in the traces of the parsec vips benchmark at the
following URL:

https://pages.lip6.fr/Julia.Lawall/vips.pdf

The first two graphs (complete run and small fragment) are Linux v5.7 and
the next two are Linux v5.8.  The machine has 160 hardware threads
organized in 4 sockets and the colors are by socket.  In the small
fragment for v5.7 (second graph), one can see that a given pid pretty much
stays on the same socket, while in the corresponding fragment for v5.8
(fourth graph), the pids move around between the sockets.  The x's
mark the unblocks that result in a migration.  A pink x means that the
migration stays within the same socket, while a blue x means that the
migration goes to another socket.  It's not apparent from the graphs,
but by adding some tracing, it seems that the new socket is always that
of the core that handles the wakeup.

I haven't yet studied the early part of the execution of vips in detail,
but I suspect that the same issue causes all of the threads to be
initially on the same socket in v5.7, while in v5.8 they are more quickly
dispersed to other sockets.

My impression from the parsec and NAS benchmarks is that v5.8
performance is a bit better than v5.7, probably because threads are
dispersed to different sockets earlier, but other benchmarks may rely
more on locality and react less well to threads moving around so much
in this way.

julia

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SD_LOAD_BALANCE
  2020-09-03 14:09 SD_LOAD_BALANCE Julia Lawall
@ 2020-09-03 16:59 ` Valentin Schneider
  2020-09-03 17:10   ` SD_LOAD_BALANCE Julia Lawall
                     ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Valentin Schneider @ 2020-09-03 16:59 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel, Gilles Muller


Hi Julia,

On 03/09/20 15:09, Julia Lawall wrote:
> Uses of SD_LOAD_BALANCE were removed in commit e669ac8ab952 (first
> released in v5.8), with the comment:
>
> The SD_LOAD_BALANCE flag is set unconditionally for all domains in
> sd_init().
>
> I have the impression that this was not quite true.  The NUMA domain was
> not initialized with sd_init, and didn't have the SD_LOAD_BALANCE flag
> set.

Did you check the contents of

  /proc/sys/kernel/sched_domain/cpu*/domain*/flags

(requires CONFIG_SCHED_DEBUG)? If the LSB is set, it would mean
SD_LOAD_BALANCE is set.

The sched_domain construction flow isn't the easiest thing to follow, but
NUMA domains *have* to go through sd_init().

What happens is we first go through sched_init_numa(), and there we add
some more topology levels on top of the default ones (or the arch-defined
ones if using an arch-defined topology hierarchy) by using the NUMA
distance table.

We then build the actual domains in sched_init_domains(), and that goes
through a loop that looks like

  for_each_cpu() {
      for_each_sd_topology() {
          build_sched_domain() -> sd_init()
      }
  }

where the SD topology loop is going to iterate over the newly-added
NUMA-specific topology levels. Since that used to unconditionally set
SD_LOAD_BALANCE, NUMA domains really ought to have it.

If that wasn't the case, we would have fired the (now removed) warning in
sched_domain_debug_one() that would do:

       if (!(sd->flags & SD_LOAD_BALANCE)) {
               printk("does not load-balance\n");
               if (sd->parent)
                       printk(KERN_ERR "ERROR: !SD_LOAD_BALANCE domain has parent");
               return -1;
       }

> The effect is that in v5.8, the for_each_domain loop in
> select_task_rq_fair can always end up at the global NUMA domain, and thus
> consider any pair of waking cpu (cpu) and previous cpus of the wakee
> (prev_cpu) as arguments to wake_affine.  Up to v5.7, this was only
> possible if cpu and prev_cpu were together in some lower level domain, ie
> sharing the LLC.  The effect is that in v5.8 wake_affine can often end up
> choosing as a target a core that does not share the LLC with the core
> where the thread ran previously.  Threads then move around a lot between
> the different sockets.
>
> Was this intentional?
>

AFAICT it isn't forbidden for the logic here to peek outside of the
previous LLC. The NUMA reclaim distance thing says we allow affine wakeups
and fork / exec balancing to move a task to a CPU at most RECLAIM_DISTANCE
away (in NUMA distance values). However, I don't remember any patch
changing this between v5.7 and v5.8.

Briefly glancing over the kernel/sched log between v5.7 and v5.8, I don't
see any obvious culprits. Did you try to bisect this? If it indeed ends on
the SD_LOAD_BALANCE thing, well, I'll be off eating my keyboard.

> The effect can be seen in the traces of the parsec vips benchmark at the
> following URL:
>
> https://pages.lip6.fr/Julia.Lawall/vips.pdf
>
> The first two graphs (complete run and small fragment) are Linux v5.7 and
> the next two are Linux v5.8.  The machine has 160 hardware threads
> organized in 4 sockets and the colors are by socket.  In the small
> fragment for v5.7 (second graph), one can see that a given pid pretty much
> stays on the same socket, while in the corresponding fragment for v5.8
> (fourth graph), the pids move around between the sockets.  The x's
> describe the unblocks that result in a migration.  A pink x means that the
> migration is in the same socket, while a blue x means that the migration
> is to another socket. It's not apparent from the graphs, but by adding
> some tracing, it seems that the new socket is always the one of the core
> that handles the wakeup.
>

Interesting graphs, thanks for sharing!

> I haven't yet studied the early part of the execution of vips in detail,
> but I suspect that the same issue causes all of the threads to be
> initially on the same socket in v5.7, while in v5.8 they are more quickly
> dispersed to other sockets.
>
> My impression from the parsec and the NAS benchmarks is that the v5.8
> performance is a bit better than v5.7, probably because of getting more
> threads to more different sockets earlier, but other benchmarks might
> rely more on locality and might react less well to threads moving around
> so much in this way.
>
> julia


* Re: SD_LOAD_BALANCE
  2020-09-03 16:59 ` SD_LOAD_BALANCE Valentin Schneider
@ 2020-09-03 17:10   ` Julia Lawall
  2020-09-03 18:17   ` SD_LOAD_BALANCE Julia Lawall
  2020-10-10 16:14   ` SD_LOAD_BALANCE Julia Lawall
  2 siblings, 0 replies; 11+ messages in thread
From: Julia Lawall @ 2020-09-03 17:10 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel, Gilles Muller



On Thu, 3 Sep 2020, Valentin Schneider wrote:

>
> Hi Julia,
>
> On 03/09/20 15:09, Julia Lawall wrote:
> > Uses of SD_LOAD_BALANCE were removed in commit e669ac8ab952 (first
> > released in v5.8), with the comment:
> >
> > The SD_LOAD_BALANCE flag is set unconditionally for all domains in
> > sd_init().
> >
> > I have the impression that this was not quite true.  The NUMA domain was
> > not initialized with sd_init, and didn't have the SD_LOAD_BALANCE flag
> > set.
>
> Did you check the contents of
>
>   /proc/sys/kernel/sched_domain/cpu*/domain*/flags
>
> (requires CONFIG_SCHED_DEBUG)? If the LSB is set, it would mean
> SD_LOAD_BALANCE is set.

No, I didn't check that.  Thanks for the pointer.

>
> The sched_domain construction flow isn't the easiest thing to follow, but
> NUMA domains *have* to go through sd_init().
>
> What happens is we first go through sched_init_numa(), and there we add
> some more topology levels on top of the default ones (or the arch-defined
> ones if using an arch-defined topology hierarchy) by using the NUMA
> distance table.
>
> We then build the actual domains in sched_init_domains(), and that goes
> through a loop that looks like
>
>   for_each_cpu() {
>       for_each_sd_topology() {
>           build_sched_domain() -> sd_init()
>       }
>   }
>
> where the SD topology loop is going to iterate over the newly-added
> NUMA-specific topology levels. Since that used to unconditionally set
> SD_LOAD_BALANCE, NUMA domains really ought to have it.
>
> If that wasn't the case, we would have fired the (now removed) warning in
> sched_domain_debug_one() that would do:
>
>        if (!(sd->flags & SD_LOAD_BALANCE)) {
>                printk("does not load-balance\n");
>                if (sd->parent)
>                        printk(KERN_ERR "ERROR: !SD_LOAD_BALANCE domain has parent");
>                return -1;
>        }

OK, thanks for all of this information.  I will study the code and
execution some more, and let you know what I find out.

julia

>
> > The effect is that in v5.8, the for_each_domain loop in
> > select_task_rq_fair can always end up at the global NUMA domain, and thus
> > consider any pair of waking cpu (cpu) and previous cpus of the wakee
> > (prev_cpu) as arguments to wake_affine.  Up to v5.7, this was only
> > possible if cpu and prev_cpu were together in some lower level domain, ie
> > sharing the LLC.  The effect is that in v5.8 wake_affine can often end up
> > choosing as a target a core that does not share the LLC with the core
> > where the thread ran previously.  Threads then move around a lot between
> > the different sockets.
> >
> > Was this intentional?
> >
>
> AFAICT it isn't forbidden for the logic here to peek outside of the
> previous LLC. The NUMA reclaim distance thing says we allow affine wakeups
> and fork / exec balancing to move a task to a CPU at most RECLAIM_DISTANCE
> away (in NUMA distance values). However, I don't remember any patch
> changing this between v5.7 and v5.8.
>
> Briefly glancing over the kernel/sched log between v5.7 and v5.8, I don't
> see any obvious culprits. Did you try to bisect this? If it indeed ends on
> the SD_LOAD_BALANCE thing, well, I'll be off eating my keyboard.
>
> > The effect can be seen in the traces of the parsec vips benchmark at the
> > following URL:
> >
> > https://pages.lip6.fr/Julia.Lawall/vips.pdf
> >
> > The first two graphs (complete run and small fragment) are Linux v5.7 and
> > the next two are Linux v5.8.  The machine has 160 hardware threads
> > organized in 4 sockets and the colors are by socket.  In the small
> > fragment for v5.7 (second graph), one can see that a given pid pretty much
> > stays on the same socket, while in the corresponding fragment for v5.8
> > (fourth graph), the pids move around between the sockets.  The x's
> > describe the unblocks that result in a migration.  A pink x means that the
> > migration is in the same socket, while a blue x means that the migration
> > is to another socket. It's not apparent from the graphs, but by adding
> > some tracing, it seems that the new socket is always the one of the core
> > that handles the wakeup.
> >
>
> Interesting graphs, thanks for sharing!
>
> > I haven't yet studied the early part of the execution of vips in detail,
> > but I suspect that the same issue causes all of the threads to be
> > initially on the same socket in v5.7, while in v5.8 they are more quickly
> > dispersed to other sockets.
> >
> > My impression from the parsec and the NAS benchmarks is that the v5.8
> > performance is a bit better than v5.7, probably because of getting more
> > threads to more different sockets earlier, but other benchmarks might
> > rely more on locality and might react less well to threads moving around
> > so much in this way.
> >
> > julia
>


* Re: SD_LOAD_BALANCE
  2020-09-03 16:59 ` SD_LOAD_BALANCE Valentin Schneider
  2020-09-03 17:10   ` SD_LOAD_BALANCE Julia Lawall
@ 2020-09-03 18:17   ` Julia Lawall
  2020-10-10 16:14   ` SD_LOAD_BALANCE Julia Lawall
  2 siblings, 0 replies; 11+ messages in thread
From: Julia Lawall @ 2020-09-03 18:17 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel, Gilles Muller



On Thu, 3 Sep 2020, Valentin Schneider wrote:

>
> Hi Julia,
>
> On 03/09/20 15:09, Julia Lawall wrote:
> > Uses of SD_LOAD_BALANCE were removed in commit e669ac8ab952 (first
> > released in v5.8), with the comment:
> >
> > The SD_LOAD_BALANCE flag is set unconditionally for all domains in
> > sd_init().
> >
> > I have the impression that this was not quite true.  The NUMA domain was
> > not initialized with sd_init, and didn't have the SD_LOAD_BALANCE flag
> > set.
>
> Did you check the contents of
>
>   /proc/sys/kernel/sched_domain/cpu*/domain*/flags
>
> (requires CONFIG_SCHED_DEBUG)? If the LSB is set, it would mean
> SD_LOAD_BALANCE is set.
>
> The sched_domain construction flow isn't the easiest thing to follow, but
> NUMA domains *have* to go through sd_init().
>
> What happens is we first go through sched_init_numa(), and there we add
> some more topology levels on top of the default ones (or the arch-defined
> ones if using an arch-defined topology hierarchy) by using the NUMA
> distance table.
>
> We then build the actual domains in sched_init_domains(), and that goes
> through a loop that looks like
>
>   for_each_cpu() {
>       for_each_sd_topology() {
>           build_sched_domain() -> sd_init()
>       }
>   }
>
> where the SD topology loop is going to iterate over the newly-added
> NUMA-specific topology levels. Since that used to unconditionally set
> SD_LOAD_BALANCE, NUMA domains really ought to have it.
>
> If that wasn't the case, we would have fired the (now removed) warning in
> sched_domain_debug_one() that would do:
>
>        if (!(sd->flags & SD_LOAD_BALANCE)) {
>                printk("does not load-balance\n");
>                if (sd->parent)
>                        printk(KERN_ERR "ERROR: !SD_LOAD_BALANCE domain has parent");
>                return -1;
>        }

OK, I see also that in v5.7 tmp->flags does have the SD_LOAD_BALANCE bit
set, so I will have to look further to see what the difference is.  Thanks
for the pointers.

julia


>
> > The effect is that in v5.8, the for_each_domain loop in
> > select_task_rq_fair can always end up at the global NUMA domain, and thus
> > consider any pair of waking cpu (cpu) and previous cpus of the wakee
> > (prev_cpu) as arguments to wake_affine.  Up to v5.7, this was only
> > possible if cpu and prev_cpu were together in some lower level domain, ie
> > sharing the LLC.  The effect is that in v5.8 wake_affine can often end up
> > choosing as a target a core that does not share the LLC with the core
> > where the thread ran previously.  Threads then move around a lot between
> > the different sockets.
> >
> > Was this intentional?
> >
>
> AFAICT it isn't forbidden for the logic here to peek outside of the
> previous LLC. The NUMA reclaim distance thing says we allow affine wakeups
> and fork / exec balancing to move a task to a CPU at most RECLAIM_DISTANCE
> away (in NUMA distance values). However, I don't remember any patch
> changing this between v5.7 and v5.8.
>
> Briefly glancing over the kernel/sched log between v5.7 and v5.8, I don't
> see any obvious culprits. Did you try to bisect this? If it indeed ends on
> the SD_LOAD_BALANCE thing, well, I'll be off eating my keyboard.
>
> > The effect can be seen in the traces of the parsec vips benchmark at the
> > following URL:
> >
> > https://pages.lip6.fr/Julia.Lawall/vips.pdf
> >
> > The first two graphs (complete run and small fragment) are Linux v5.7 and
> > the next two are Linux v5.8.  The machine has 160 hardware threads
> > organized in 4 sockets and the colors are by socket.  In the small
> > fragment for v5.7 (second graph), one can see that a given pid pretty much
> > stays on the same socket, while in the corresponding fragment for v5.8
> > (fourth graph), the pids move around between the sockets.  The x's
> > describe the unblocks that result in a migration.  A pink x means that the
> > migration is in the same socket, while a blue x means that the migration
> > is to another socket. It's not apparent from the graphs, but by adding
> > some tracing, it seems that the new socket is always the one of the core
> > that handles the wakeup.
> >
>
> Interesting graphs, thanks for sharing!
>
> > I haven't yet studied the early part of the execution of vips in detail,
> > but I suspect that the same issue causes all of the threads to be
> > initially on the same socket in v5.7, while in v5.8 they are more quickly
> > dispersed to other sockets.
> >
> > My impression from the parsec and the NAS benchmarks is that the v5.8
> > performance is a bit better than v5.7, probably because of getting more
> > threads to more different sockets earlier, but other benchmarks might
> > rely more on locality and might react less well to threads moving around
> > so much in this way.
> >
> > julia
>


* Re: SD_LOAD_BALANCE
  2020-09-03 16:59 ` SD_LOAD_BALANCE Valentin Schneider
  2020-09-03 17:10   ` SD_LOAD_BALANCE Julia Lawall
  2020-09-03 18:17   ` SD_LOAD_BALANCE Julia Lawall
@ 2020-10-10 16:14   ` Julia Lawall
  2020-10-12 10:16     ` SD_LOAD_BALANCE Vincent Guittot
  2020-10-12 11:21     ` SD_LOAD_BALANCE Peter Zijlstra
  2 siblings, 2 replies; 11+ messages in thread
From: Julia Lawall @ 2020-10-10 16:14 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel, Gilles Muller

Hello,

Previously, I was wondering why, starting in Linux v5.8, my unblocking
threads were moving to different sockets more often than in previous
releases.  Now I think that I have found the reason.

The first issue is the change from runnable load average to load average
in computing wake_affine_weight:

---

commit 11f10e5420f6cecac7d4823638bff040c257aba9
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Fri Oct 18 15:26:36 2019 +0200

    sched/fair: Use load instead of runnable load in wakeup path

    Runnable load was originally introduced to take into account the case where
    blocked load biases the wake up path, which may end up selecting an overloaded
    CPU with a large number of runnable tasks instead of an underutilized
    CPU with a huge blocked load.

    The wake up path now starts looking for idle CPUs before comparing
    runnable load, and it's worth aligning the wake up path with the
    load_balance() logic.

---

The unfortunate case is illustrated by the following trace (*** on the
important lines):

       trace-cmd-8006  [118]   451.444751: sched_migrate_task:   comm=containerd pid=2481 prio=120 orig_cpu=114 dest_cpu=118
          ua.B.x-8007  [105]   451.444752: sched_switch:         ua.B.x:8007 [120] S ==> swapper/105:0 [120]
       trace-cmd-8006  [118]   451.444769: sched_switch:         ua.B.x:8006 [120] S ==> containerd:2481 [120] ***
      containerd-2481  [118]   451.444859: sched_switch:         containerd:2481 [120] S ==> swapper/118:0 [120] ***
          ua.B.x-8148  [016]   451.444910: sched_switch:         ua.B.x:8148 [120] S ==> swapper/16:0 [120]
          ua.B.x-8154  [127]   451.445186: sched_switch:         ua.B.x:8154 [120] S ==> swapper/127:0 [120]
          ua.B.x-8145  [047]   451.445199: sched_switch:         ua.B.x:8145 [120] S ==> swapper/47:0 [120]
          ua.B.x-8138  [147]   451.445200: sched_switch:         ua.B.x:8138 [120] S ==> swapper/147:0 [120]
          ua.B.x-8152  [032]   451.445210: sched_switch:         ua.B.x:8152 [120] S ==> swapper/32:0 [120]
          ua.B.x-8144  [067]   451.445215: sched_switch:         ua.B.x:8144 [120] S ==> swapper/67:0 [120]
          ua.B.x-8137  [000]   451.445219: sched_switch:         ua.B.x:8137 [120] S ==> swapper/0:0 [120]
          ua.B.x-8140  [075]   451.445225: sched_switch:         ua.B.x:8140 [120] S ==> swapper/75:0 [120]
          ua.B.x-8155  [084]   451.445229: sched_switch:         ua.B.x:8155 [120] S ==> swapper/84:0 [120]
          ua.B.x-8161  [155]   451.445232: sched_switch:         ua.B.x:8161 [120] S ==> swapper/155:0 [120]
          ua.B.x-8156  [095]   451.445261: sched_switch:         ua.B.x:8156 [120] S ==> swapper/95:0 [120]
          ua.B.x-8153  [134]   451.445268: sched_switch:         ua.B.x:8153 [120] S ==> swapper/134:0 [120]
          ua.B.x-8151  [154]   451.445268: sched_switch:         ua.B.x:8151 [120] S ==> swapper/154:0 [120]
          ua.B.x-8141  [107]   451.445273: sched_switch:         ua.B.x:8141 [120] S ==> swapper/107:0 [120]
          ua.B.x-8146  [131]   451.445275: sched_switch:         ua.B.x:8146 [120] S ==> swapper/131:0 [120]
          ua.B.x-8160  [035]   451.445286: sched_switch:         ua.B.x:8160 [120] S ==> swapper/35:0 [120]
          ua.B.x-8136  [088]   451.445286: sched_switch:         ua.B.x:8136 [120] S ==> swapper/88:0 [120]
          ua.B.x-8159  [056]   451.445290: sched_switch:         ua.B.x:8159 [120] S ==> swapper/56:0 [120]
          ua.B.x-8147  [036]   451.445294: sched_switch:         ua.B.x:8147 [120] S ==> swapper/36:0 [120]
          ua.B.x-8150  [150]   451.445298: sched_switch:         ua.B.x:8150 [120] S ==> swapper/150:0 [120]
          ua.B.x-8142  [159]   451.445300: sched_switch:         ua.B.x:8142 [120] S ==> swapper/159:0 [120]
          ua.B.x-8157  [058]   451.445309: sched_switch:         ua.B.x:8157 [120] S ==> swapper/58:0 [120]
          ua.B.x-8149  [123]   451.445311: sched_switch:         ua.B.x:8149 [120] S ==> swapper/123:0 [120]
          ua.B.x-8162  [156]   451.445313: sched_switch:         ua.B.x:8162 [120] S ==> swapper/156:0 [120]
          ua.B.x-8164  [019]   451.445317: sched_switch:         ua.B.x:8164 [120] S ==> swapper/19:0 [120]
          ua.B.x-8139  [068]   451.445319: sched_switch:         ua.B.x:8139 [120] S ==> swapper/68:0 [120]
          ua.B.x-8143  [126]   451.445335: sched_switch:         ua.B.x:8143 [120] S ==> swapper/126:0 [120]
          ua.B.x-8163  [062]   451.445361: sched_switch:         ua.B.x:8163 [120] S ==> swapper/62:0 [120]
          ua.B.x-8158  [129]   451.445371: sched_switch:         ua.B.x:8158 [120] S ==> swapper/129:0 [120]
          ua.B.x-8040  [043]   451.445413: sched_wake_idle_without_ipi: cpu=0
          ua.B.x-8165  [098]   451.445451: sched_switch:         ua.B.x:8165 [120] S ==> swapper/98:0 [120]
          ua.B.x-8069  [009]   451.445622: sched_waking:         comm=ua.B.x pid=8007 prio=120 target_cpu=105
          ua.B.x-8069  [009]   451.445635: sched_wake_idle_without_ipi: cpu=105
          ua.B.x-8069  [009]   451.445638: sched_wakeup:         ua.B.x:8007 [120] success=1 CPU:105
          ua.B.x-8069  [009]   451.445639: sched_waking:         comm=ua.B.x pid=8006 prio=120 target_cpu=118
          <idle>-0     [105]   451.445641: sched_switch:         swapper/105:0 [120] R ==> ua.B.x:8007 [120]
          ua.B.x-8069  [009]   451.445645: bprint:               select_task_rq_fair: wake_affine_weight2 returning this_cpu: 614400 < 2981888
          ua.B.x-8069  [009]   451.445650: sched_migrate_task:   comm=ua.B.x pid=8006 prio=120 orig_cpu=118 dest_cpu=129 ***

Basically, core 118 has run both a thread of the NAS UA benchmark and a
containerd process, and so it has a higher load average than this_cpu
(here cpu 9) at the time of the wakeup.  The commit says "The wake up
path now starts looking for idle CPUs", but this is not always the case.
Here the target and prev are not on the same socket, so
cpus_share_cache(prev, target) fails and there is no check of whether
prev is idle.  The result is that an idle core is left idle and the
thread is migrated to another socket, perhaps hurting locality.

Prior to v5.8 this was a rare event on my machine, because there were few
such background processes.  But in v5.8, the default cpufreq driver for
Intel machines without the HWP feature changed from intel_pstate to
intel_cpufreq.  The use of intel_cpufreq triggers very frequent kworkers
on all cores, which makes it much more likely that a core that is
currently idle, and overall not at all overloaded, will have a higher
load average (even with the waking thread's contribution deducted) than
the core managing the wakeup of the threads.

Would it be useful to always check whether prev is idle, perhaps in
wake_affine_idle or perhaps in select_idle_sibling?

Traces from various versions are available at
https://pages.lip6.fr/Julia.Lawall/uas.pdf.  5.8 and 5.9-rc7 are toward the
end of the file.  In these versions, all the threads systematically move
around at synchronization points in the program.

thanks,
julia


* Re: SD_LOAD_BALANCE
  2020-10-10 16:14   ` SD_LOAD_BALANCE Julia Lawall
@ 2020-10-12 10:16     ` Vincent Guittot
  2020-10-12 10:34       ` SD_LOAD_BALANCE Julia Lawall
  2020-10-12 11:21     ` SD_LOAD_BALANCE Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2020-10-12 10:16 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Valentin Schneider, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	linux-kernel, Gilles Muller

Hi Julia,

On Sat, 10 Oct 2020 at 18:14, Julia Lawall <julia.lawall@inria.fr> wrote:
>
> Hello,
>
> Previously, I was wondering about why starting in Linux v5.8 my unblocking
> threads were moving to different sockets more frequently than in previous
> releases.  Now, I think that I have found the reason.
>
> The first issue is the change from runnable load average to load average
> in computing wake_affine_weight:
>
> ---
>
> commit 11f10e5420f6cecac7d4823638bff040c257aba9
> Author: Vincent Guittot <vincent.guittot@linaro.org>
> Date:   Fri Oct 18 15:26:36 2019 +0200
>
>     sched/fair: Use load instead of runnable load in wakeup path
>
>     Runnable load was originally introduced to take into account the case where
>     blocked load biases the wake up path which may end to select an overloaded
>     CPU with a large number of runnable tasks instead of an underutilized
>     CPU with a huge blocked load.
>
>     Tha wake up path now starts looking for idle CPUs before comparing
>     runnable load and it's worth aligning the wake up path with the
>     load_balance() logic.
>
> ---
>
> The unfortunate case is illustrated by the following trace (*** on the
> important lines):
>
>        trace-cmd-8006  [118]   451.444751: sched_migrate_task:   comm=containerd pid=2481 prio=120 orig_cpu=114 dest_cpu=118
>           ua.B.x-8007  [105]   451.444752: sched_switch:         ua.B.x:8007 [120] S ==> swapper/105:0 [120]
>        trace-cmd-8006  [118]   451.444769: sched_switch:         ua.B.x:8006 [120] S ==> containerd:2481 [120] ***
>       containerd-2481  [118]   451.444859: sched_switch:         containerd:2481 [120] S ==> swapper/118:0 [120] ***
>           ua.B.x-8148  [016]   451.444910: sched_switch:         ua.B.x:8148 [120] S ==> swapper/16:0 [120]
>           ua.B.x-8154  [127]   451.445186: sched_switch:         ua.B.x:8154 [120] S ==> swapper/127:0 [120]
>           ua.B.x-8145  [047]   451.445199: sched_switch:         ua.B.x:8145 [120] S ==> swapper/47:0 [120]
>           ua.B.x-8138  [147]   451.445200: sched_switch:         ua.B.x:8138 [120] S ==> swapper/147:0 [120]
>           ua.B.x-8152  [032]   451.445210: sched_switch:         ua.B.x:8152 [120] S ==> swapper/32:0 [120]
>           ua.B.x-8144  [067]   451.445215: sched_switch:         ua.B.x:8144 [120] S ==> swapper/67:0 [120]
>           ua.B.x-8137  [000]   451.445219: sched_switch:         ua.B.x:8137 [120] S ==> swapper/0:0 [120]
>           ua.B.x-8140  [075]   451.445225: sched_switch:         ua.B.x:8140 [120] S ==> swapper/75:0 [120]
>           ua.B.x-8155  [084]   451.445229: sched_switch:         ua.B.x:8155 [120] S ==> swapper/84:0 [120]
>           ua.B.x-8161  [155]   451.445232: sched_switch:         ua.B.x:8161 [120] S ==> swapper/155:0 [120]
>           ua.B.x-8156  [095]   451.445261: sched_switch:         ua.B.x:8156 [120] S ==> swapper/95:0 [120]
>           ua.B.x-8153  [134]   451.445268: sched_switch:         ua.B.x:8153 [120] S ==> swapper/134:0 [120]
>           ua.B.x-8151  [154]   451.445268: sched_switch:         ua.B.x:8151 [120] S ==> swapper/154:0 [120]
>           ua.B.x-8141  [107]   451.445273: sched_switch:         ua.B.x:8141 [120] S ==> swapper/107:0 [120]
>           ua.B.x-8146  [131]   451.445275: sched_switch:         ua.B.x:8146 [120] S ==> swapper/131:0 [120]
>           ua.B.x-8160  [035]   451.445286: sched_switch:         ua.B.x:8160 [120] S ==> swapper/35:0 [120]
>           ua.B.x-8136  [088]   451.445286: sched_switch:         ua.B.x:8136 [120] S ==> swapper/88:0 [120]
>           ua.B.x-8159  [056]   451.445290: sched_switch:         ua.B.x:8159 [120] S ==> swapper/56:0 [120]
>           ua.B.x-8147  [036]   451.445294: sched_switch:         ua.B.x:8147 [120] S ==> swapper/36:0 [120]
>           ua.B.x-8150  [150]   451.445298: sched_switch:         ua.B.x:8150 [120] S ==> swapper/150:0 [120]
>           ua.B.x-8142  [159]   451.445300: sched_switch:         ua.B.x:8142 [120] S ==> swapper/159:0 [120]
>           ua.B.x-8157  [058]   451.445309: sched_switch:         ua.B.x:8157 [120] S ==> swapper/58:0 [120]
>           ua.B.x-8149  [123]   451.445311: sched_switch:         ua.B.x:8149 [120] S ==> swapper/123:0 [120]
>           ua.B.x-8162  [156]   451.445313: sched_switch:         ua.B.x:8162 [120] S ==> swapper/156:0 [120]
>           ua.B.x-8164  [019]   451.445317: sched_switch:         ua.B.x:8164 [120] S ==> swapper/19:0 [120]
>           ua.B.x-8139  [068]   451.445319: sched_switch:         ua.B.x:8139 [120] S ==> swapper/68:0 [120]
>           ua.B.x-8143  [126]   451.445335: sched_switch:         ua.B.x:8143 [120] S ==> swapper/126:0 [120]
>           ua.B.x-8163  [062]   451.445361: sched_switch:         ua.B.x:8163 [120] S ==> swapper/62:0 [120]
>           ua.B.x-8158  [129]   451.445371: sched_switch:         ua.B.x:8158 [120] S ==> swapper/129:0 [120]
>           ua.B.x-8040  [043]   451.445413: sched_wake_idle_without_ipi: cpu=0
>           ua.B.x-8165  [098]   451.445451: sched_switch:         ua.B.x:8165 [120] S ==> swapper/98:0 [120]
>           ua.B.x-8069  [009]   451.445622: sched_waking:         comm=ua.B.x pid=8007 prio=120 target_cpu=105
>           ua.B.x-8069  [009]   451.445635: sched_wake_idle_without_ipi: cpu=105
>           ua.B.x-8069  [009]   451.445638: sched_wakeup:         ua.B.x:8007 [120] success=1 CPU:105
>           ua.B.x-8069  [009]   451.445639: sched_waking:         comm=ua.B.x pid=8006 prio=120 target_cpu=118
>           <idle>-0     [105]   451.445641: sched_switch:         swapper/105:0 [120] R ==> ua.B.x:8007 [120]
>           ua.B.x-8069  [009]   451.445645: bprint:               select_task_rq_fair: wake_affine_weight2 returning this_cpu: 614400 < 2981888
>           ua.B.x-8069  [009]   451.445650: sched_migrate_task:   comm=ua.B.x pid=8006 prio=120 orig_cpu=118 dest_cpu=129 ***
>
> Basically, core 118 has run both a thread of the NAS UA benchmark and a
> containerd worker, and so it seems to have a higher load average than
> this_cpu, i.e. CPU 9, when the wakeup occurs.  The commit says "The wake
> up path now starts looking for idle CPUs", but this is not always the
> case.  Here the target and prev

Yes, cross-node use cases are not considered, because in that case it's
not only about load but also about NUMA placement

> are not on the same socket, and in that case cpus_share_cache(prev, target)
> fails and there is no check as to whether prev is idle.  The result is that
> an idle core is left idle and the thread is migrated to another socket,
> perhaps impacting locality.
>
> Prior to v5.8 on my machine this was a rare event, because there were not
> many of these background processes.  But in v5.8, the default governor for
> Intel machines without the HWP feature was changed from intel_pstate to
> intel_cpufreq.  The use of intel_cpufreq triggers very frequent kworkers on
> all cores, which makes it much more likely that cores that are currently
> idle, and are overall not at all overloaded, will have a higher load
> average, even with the waking thread deducted, than the core managing
> the wakeup of the threads.
>
> Would it be useful to always check whether prev is idle, perhaps in
> wake_affine_idle or perhaps in select_idle_sibling?

Yes, it would make sense to add a condition in wake_affine_idle to
return prev if this cpu is not idle (or about to become idle)
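To make the cross-socket failure mode concrete, here is a toy model of the v5.8 wake_affine_idle() decision. This is a sketch, not the kernel code: the boolean idle flags stand in for available_idle_cpu(), share_cache() is a made-up LLC test (consecutive CPU ids per cache), and -1 replaces the nr_cpumask_bits "no decision" return that sends the choice on to wake_affine_weight().

```c
#include <stdbool.h>

/* Toy LLC topology: cpus_per_llc consecutive CPU ids share a cache.
 * The real kernel answers this from per-CPU topology data. */
static bool share_cache(int a, int b, int cpus_per_llc)
{
	return a / cpus_per_llc == b / cpus_per_llc;
}

/* Simplified shape of v5.8's wake_affine_idle(): returns the chosen
 * CPU, or -1 for "no decision", in which case wake_affine_weight()
 * compares load averages instead. */
static int wake_affine_idle_v58(int this_cpu, int prev_cpu,
				bool this_idle, bool prev_idle,
				bool sync, int this_nr_running,
				int cpus_per_llc)
{
	if (this_idle && share_cache(this_cpu, prev_cpu, cpus_per_llc))
		return prev_idle ? prev_cpu : this_cpu;

	/* Sync wakeup with only the waker running: the waker is about
	 * to sleep, so its CPU is effectively free. */
	if (sync && this_nr_running == 1)
		return this_cpu;

	/* prev_idle was only consulted behind the share_cache() test
	 * above; a cross-LLC idle prev_cpu falls through here. */
	return -1;
}
```

With this_cpu = 9 busy and prev_cpu = 118 idle on another socket, the function makes no decision, so the load comparison seen in the trace (614400 < 2981888) decides instead, and prev's idleness is never consulted.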

>
> Traces from various versions are available at
> https://pages.lip6.fr/Julia.Lawall/uas.pdf.  5.8 and 5.9-rc7 are toward the
> end of the file.  In these versions, all the threads systematically move
> around at synchronization points in the program.
>
> thanks,
> julia

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SD_LOAD_BALANCE
  2020-10-12 10:16     ` SD_LOAD_BALANCE Vincent Guittot
@ 2020-10-12 10:34       ` Julia Lawall
  2020-10-12 11:09         ` SD_LOAD_BALANCE Vincent Guittot
  0 siblings, 1 reply; 11+ messages in thread
From: Julia Lawall @ 2020-10-12 10:34 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Valentin Schneider, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	linux-kernel, Gilles Muller

> > Would it be useful to always check whether prev is idle, perhaps in
> > wake_affine_idle or perhaps in select_idle_sibling?
>
> Yes, it would make sense to add a condition in wake_affine_idle to
> return prev if this cpu is not idle (or about to become idle)

The case where this cpu is idle would arise when the wakeup comes from
an interrupt.  If both prev cpu and this cpu are idle, is it more
desirable to move the thread to this cpu or to leave it where it was?

julia

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SD_LOAD_BALANCE
  2020-10-12 10:34       ` SD_LOAD_BALANCE Julia Lawall
@ 2020-10-12 11:09         ` Vincent Guittot
  2020-10-12 11:18           ` SD_LOAD_BALANCE Julia Lawall
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2020-10-12 11:09 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Valentin Schneider, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	linux-kernel, Gilles Muller

On Mon, 12 Oct 2020 at 12:34, Julia Lawall <julia.lawall@inria.fr> wrote:
>
> > > Would it be useful to always check whether prev is idle, perhaps in
> > > wake_affine_idle or perhaps in select_idle_sibling?
> >
> > Yes, it would make sense to add a condition in wake_affine_idle to
> > return prev if this cpu is not idle (or about to become idle)
>
> The case where this cpu is idle would arise when the wakeup comes from
> an interrupt.  If both prev cpu and this cpu are idle, is it more
> desirable to move the thread to this cpu or to leave it where it was?

I think that we should keep the current behavior regarding this cpu
and the shared cache case, and add one more test before the last return
of the function.

Right now, we select this_cpu:
- if this cpu is idle, shares cache with prev, and prev is not idle;
- if it's a sync wakeup, because we expect the task to use local data
of the task currently running on this cpu.

Then we add a new case that returns prev cpu if it is idle, which is
your case
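A sketch of the decision order just described, with the proposed extra test placed before the final return. This is a simplified model rather than the kernel function: the boolean idle flags stand in for available_idle_cpu(), share_cache() is a toy per-LLC grouping of consecutive CPU ids, and -1 stands for "no decision, fall back to wake_affine_weight()".

```c
#include <stdbool.h>

/* Toy LLC topology: cpus_per_llc consecutive CPU ids share a cache. */
static bool share_cache(int a, int b, int cpus_per_llc)
{
	return a / cpus_per_llc == b / cpus_per_llc;
}

static int wake_affine_idle_proposed(int this_cpu, int prev_cpu,
				     bool this_idle, bool prev_idle,
				     bool sync, int this_nr_running,
				     int cpus_per_llc)
{
	/* Case 1: this cpu is idle and shares cache with prev:
	 * prefer an idle prev, otherwise take this cpu. */
	if (this_idle && share_cache(this_cpu, prev_cpu, cpus_per_llc))
		return prev_idle ? prev_cpu : this_cpu;

	/* Case 2: sync wakeup with only the waker running: the wakee
	 * is expected to use the waker's local data. */
	if (sync && this_nr_running == 1)
		return this_cpu;

	/* New case: keep an idle prev cpu even across LLCs, instead
	 * of falling through to the load comparison. */
	if (prev_idle)
		return prev_cpu;

	return -1; /* no decision: wake_affine_weight() compares loads */
}
```

With the extra case, an idle prev cpu on another socket keeps the task, avoiding the cross-socket migration, while the shared-cache and sync cases still prefer this_cpu as before.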


>
> julia

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SD_LOAD_BALANCE
  2020-10-12 11:09         ` SD_LOAD_BALANCE Vincent Guittot
@ 2020-10-12 11:18           ` Julia Lawall
  0 siblings, 0 replies; 11+ messages in thread
From: Julia Lawall @ 2020-10-12 11:18 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Valentin Schneider, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	linux-kernel, Gilles Muller



On Mon, 12 Oct 2020, Vincent Guittot wrote:

> On Mon, 12 Oct 2020 at 12:34, Julia Lawall <julia.lawall@inria.fr> wrote:
> >
> > > > Would it be useful to always check whether prev is idle, perhaps in
> > > > wake_affine_idle or perhaps in select_idle_sibling?
> > >
> > > Yes, it would make sense to add a condition in wake_affine_idle to
> > > return prev if this cpu is not idle (or about to become idle)
> >
> > The case where this cpu is idle would arise when the wakeup comes from
> > an interrupt.  If both prev cpu and this cpu are idle, is it more
> > desirable to move the thread to this cpu or to leave it where it was?
>
> I think that we should keep the current behavior regarding this cpu
> and the shared cache case, and add one more test before the last return
> of the function.
>
> Right now, we select this_cpu:
> - if this cpu is idle, shares cache with prev, and prev is not idle;
> - if it's a sync wakeup, because we expect the task to use local data
> of the task currently running on this cpu.
>
> Then we add a new case that returns prev cpu if it is idle, which is
> your case

OK, sounds good, thanks.

julia

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SD_LOAD_BALANCE
  2020-10-10 16:14   ` SD_LOAD_BALANCE Julia Lawall
  2020-10-12 10:16     ` SD_LOAD_BALANCE Vincent Guittot
@ 2020-10-12 11:21     ` Peter Zijlstra
  2020-10-12 11:31       ` SD_LOAD_BALANCE Julia Lawall
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2020-10-12 11:21 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Valentin Schneider, Ingo Molnar, Juri Lelli, Vincent Guittot,
	linux-kernel, Gilles Muller, Rafael J. Wysocki

On Sat, Oct 10, 2020 at 06:14:23PM +0200, Julia Lawall wrote:
> Prior to v5.8 on my machine this was a rare event, because there were not
> many of these background processes.  But in v5.8, the default governor for
> Intel machines without the HWP feature was changed from intel_pstate to
> intel_cpufreq.  The use of intel_cpufreq triggers very frequent kworkers on
> all cores, which makes it much more likely that cores that are currently
> idle, and are overall not at all overloaded, will have a higher load
> average, even with the waking thread deducted, than the core managing
> the wakeup of the threads.

Rafael, any idea what those kworkers are for, and can we get rid of
them?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SD_LOAD_BALANCE
  2020-10-12 11:21     ` SD_LOAD_BALANCE Peter Zijlstra
@ 2020-10-12 11:31       ` Julia Lawall
  0 siblings, 0 replies; 11+ messages in thread
From: Julia Lawall @ 2020-10-12 11:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Julia Lawall, Valentin Schneider, Ingo Molnar, Juri Lelli,
	Vincent Guittot, linux-kernel, Gilles Muller, Rafael J. Wysocki



On Mon, 12 Oct 2020, Peter Zijlstra wrote:

> On Sat, Oct 10, 2020 at 06:14:23PM +0200, Julia Lawall wrote:
> > Prior to v5.8 on my machine this was a rare event, because there were not
> > many of these background processes.  But in v5.8, the default governor for
> > Intel machines without the HWP feature was changed from intel_pstate to
> > intel_cpufreq.  The use of intel_cpufreq triggers very frequent kworkers on
> > all cores, which makes it much more likely that cores that are currently
> > idle, and are overall not at all overloaded, will have a higher load
> > average, even with the waking thread deducted, than the core managing
> > the wakeup of the threads.
>
> Rafael, any idea what those kworkers are for, and can we get rid of
> them?

They execute the function intel_cpufreq_target defined in
drivers/cpufreq/intel_pstate.c

julia

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-10-12 11:31 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-03 14:09 SD_LOAD_BALANCE Julia Lawall
2020-09-03 16:59 ` SD_LOAD_BALANCE Valentin Schneider
2020-09-03 17:10   ` SD_LOAD_BALANCE Julia Lawall
2020-09-03 18:17   ` SD_LOAD_BALANCE Julia Lawall
2020-10-10 16:14   ` SD_LOAD_BALANCE Julia Lawall
2020-10-12 10:16     ` SD_LOAD_BALANCE Vincent Guittot
2020-10-12 10:34       ` SD_LOAD_BALANCE Julia Lawall
2020-10-12 11:09         ` SD_LOAD_BALANCE Vincent Guittot
2020-10-12 11:18           ` SD_LOAD_BALANCE Julia Lawall
2020-10-12 11:21     ` SD_LOAD_BALANCE Peter Zijlstra
2020-10-12 11:31       ` SD_LOAD_BALANCE Julia Lawall
