linux-kernel.vger.kernel.org archive mirror
* [RFC] sched: unused cpu in affine workload
@ 2016-04-04  8:23 Jiri Olsa
  2016-04-04  8:44 ` Peter Zijlstra
  2016-04-04  8:59 ` Ingo Molnar
  0 siblings, 2 replies; 9+ messages in thread
From: Jiri Olsa @ 2016-04-04  8:23 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: James Hartsock, Rik van Riel, Srivatsa Vaddagiri, Kirill Tkhai,
	linux-kernel

hi,
we've noticed the following issue in one of our workloads.

I have a 24-CPU server with the following sched domains:
  domain 0: (pairs)
  domain 1: 0-5,12-17 (group1)  6-11,18-23 (group2)
  domain 2: 0-23 level NUMA

I run a CPU-hogging workload on the following CPUs:
  4,6,14,18,19,20,23

that is:
  4,14          CPUs from group1
  6,18,19,20,23 CPUs from group2

the workload process gets its affinity set up via 'taskset -c ${CPUs workload ...'
and forks a child for every CPU
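
for illustration, a minimal reproducer shaped like ours (hypothetical,
simplified sketch; the real workload does more than burn cycles):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
          /* the affine CPU set from above */
          int cpus[] = { 4, 6, 14, 18, 19, 20, 23 };
          int i, n = sizeof(cpus) / sizeof(cpus[0]);
          cpu_set_t set;

          /* what 'taskset -c ...' does for the parent */
          CPU_ZERO(&set);
          for (i = 0; i < n; i++)
                  CPU_SET(cpus[i], &set);
          if (sched_setaffinity(0, sizeof(set), &set))
                  perror("sched_setaffinity");

          /* one hog per CPU in the mask; children inherit the mask,
           * so spreading them over the 7 CPUs is left to the balancer
           */
          for (i = 0; i < n; i++)
                  if (fork() == 0)
                          for (;;)
                                  ;       /* burn CPU */
          for (i = 0; i < n; i++)
                  wait(NULL);
          return 0;
  }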

very often we notice CPUs 4 and 14 running 3 processes of the workload
while CPUs 6,18,19,20,23 run just 4 processes, leaving one of the
CPUs from group2 idle

AFAICS from the code, the reason for this is that load balancing
follows the sched domain setup (topology) and does not take affinity
setups like this into account. The code in find_busiest_group() running
on the idle CPU in group2 will find group1 as busiest, but group1's
average load will be smaller than the local group's, so no tasks get
pulled.
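
roughly, the relevant decision looks like this (simplified pseudocode of
the find_busiest_group() logic, not the actual kernel code):

  /* per-group load is averaged over the whole group, including
   * CPUs the workload is not even allowed to run on
   */
  local_avg   = local_group_load   / local_group_capacity;
  busiest_avg = busiest_group_load / busiest_group_capacity;

  if (busiest_avg <= local_avg)
          goto out_balanced;      /* group1 looks less loaded -> no pull */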

It's obvious that the load balancer follows the sched domain topology.
However, is there some sched feature I'm missing that could help
with this? Or do we need to follow the sched domain topology when
we select CPUs for the workload to get even balancing?

thanks,
jirka


* Re: [RFC] sched: unused cpu in affine workload
  2016-04-04  8:23 [RFC] sched: unused cpu in affine workload Jiri Olsa
@ 2016-04-04  8:44 ` Peter Zijlstra
  2016-04-04  8:59 ` Ingo Molnar
  1 sibling, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2016-04-04  8:44 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Ingo Molnar, James Hartsock, Rik van Riel, Srivatsa Vaddagiri,
	Kirill Tkhai, linux-kernel

On Mon, Apr 04, 2016 at 10:23:02AM +0200, Jiri Olsa wrote:
> hi,
> we've noticed the following issue in one of our workloads.
> 
> I have a 24-CPU server with the following sched domains:
>   domain 0: (pairs)
>   domain 1: 0-5,12-17 (group1)  6-11,18-23 (group2)
>   domain 2: 0-23 level NUMA
> 
> I run a CPU-hogging workload on the following CPUs:
>   4,6,14,18,19,20,23
> 
> that is:
>   4,14          CPUs from group1
>   6,18,19,20,23 CPUs from group2
> 
> the workload process gets its affinity set up via 'taskset -c ${CPUs workload ...'
> and forks a child for every CPU
> 
> very often we notice CPUs 4 and 14 running 3 processes of the workload
> while CPUs 6,18,19,20,23 run just 4 processes, leaving one of the
> CPUs from group2 idle
> 
> AFAICS from the code, the reason for this is that load balancing
> follows the sched domain setup (topology) and does not take affinity
> setups like this into account. The code in find_busiest_group() running
> on the idle CPU in group2 will find group1 as busiest, but group1's
> average load will be smaller than the local group's, so no tasks get
> pulled.
> 
> It's obvious that the load balancer follows the sched domain topology.
> However, is there some sched feature I'm missing that could help
> with this? Or do we need to follow the sched domain topology when
> we select CPUs for the workload to get even balancing?

Yeah, this is 'hard'; there is some code that tries not to totally blow
up on this, but it's all a bit of a mess. See
kernel/sched/fair.c:sg_imbalanced().
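
For reference, that helper is just a flag check (roughly, in current
kernels); the flag gets set when pinned tasks defeat a balance pass:

  static inline int sg_imbalanced(struct sched_group *group)
  {
          return group->sgc->imbalance;
  }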

The easiest solution is to simply not do this and stick with the topo
like you suggest.

So far I've not come up with a sane/stable solution for this problem.


* Re: [RFC] sched: unused cpu in affine workload
  2016-04-04  8:23 [RFC] sched: unused cpu in affine workload Jiri Olsa
  2016-04-04  8:44 ` Peter Zijlstra
@ 2016-04-04  8:59 ` Ingo Molnar
  2016-04-04  9:19   ` Ingo Molnar
  1 sibling, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2016-04-04  8:59 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, James Hartsock, Rik van Riel, Srivatsa Vaddagiri,
	Kirill Tkhai, linux-kernel


* Jiri Olsa <jolsa@redhat.com> wrote:

> hi,
> we've noticed the following issue in one of our workloads.
> 
> I have a 24-CPU server with the following sched domains:
>   domain 0: (pairs)
>   domain 1: 0-5,12-17 (group1)  6-11,18-23 (group2)
>   domain 2: 0-23 level NUMA
> 
> I run a CPU-hogging workload on the following CPUs:
>   4,6,14,18,19,20,23
> 
> that is:
>   4,14          CPUs from group1
>   6,18,19,20,23 CPUs from group2
> 
> the workload process gets its affinity set up via 'taskset -c ${CPUs workload ...'
> and forks a child for every CPU
> 
> very often we notice CPUs 4 and 14 running 3 processes of the workload
> while CPUs 6,18,19,20,23 run just 4 processes, leaving one of the
> CPUs from group2 idle
> 
> AFAICS from the code, the reason for this is that load balancing
> follows the sched domain setup (topology) and does not take affinity
> setups like this into account. The code in find_busiest_group() running
> on the idle CPU in group2 will find group1 as busiest, but group1's
> average load will be smaller than the local group's, so no tasks get
> pulled.
> 
> It's obvious that the load balancer follows the sched domain topology.
> However, is there some sched feature I'm missing that could help
> with this? Or do we need to follow the sched domain topology when
> we select CPUs for the workload to get even balancing?

Yeah, so the principle with user-pinning of tasks to CPUs was always:

 - pinning a task to a single CPU should obviously work fine; it's the primary
   use case for isolation.

 - pinning a task to an arbitrary subset of CPUs is a 'hard' problem
   mathematically that the scheduler never truly wanted to solve in a frontal
   fashion.

... but that principle was set into place well before we did the NUMA scheduling 
work, which in itself is a highly non-trivial load optimization problem to begin 
with, so we might want to reconsider.

So there are two directions I can suggest:

 - if you can come up with workable small-scale solutions to scratch an itch
   that comes up in practice then that's obviously good, as long as it does not
   regress anything else.

 - if you want to come up with a 'complete' solution then please don't put it into
   hot paths such as wakeup or context switching, or any of the hardirq methods,
   but try to integrate it with the NUMA scheduling slow path.

The NUMA balancing slow path is softirq driven and runs at a reasonably low 
frequency, so it should not cause many performance problems.

The two problems (NUMA affinity and user affinity) are also loosely related on a 
conceptual level: the NUMA affinity optimization problem can be considered as a 
workload-determined, arbitrary 'NUMA mask' being optimized from first principles.

There's one ABI detail: this is true only as long as SMP affinity masks follow 
node boundaries - the current NUMA balancing code is very much node granular, so 
the two can only be merged if the ->cpus_allowed mask follows node boundaries as 
well.
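
To illustrate the constraint, a hypothetical helper - not existing kernel 
code, assuming the usual <linux/cpumask.h> / <linux/nodemask.h> APIs - that 
checks whether a mask follows node boundaries:

  /* every node must lie fully inside or fully outside the mask
   * for node-granular balancing to apply cleanly
   */
  static bool mask_follows_node_boundaries(const struct cpumask *mask)
  {
          int nid;

          for_each_online_node(nid) {
                  const struct cpumask *node_mask = cpumask_of_node(nid);

                  if (cpumask_intersects(mask, node_mask) &&
                      !cpumask_subset(node_mask, mask))
                          return false;
          }
          return true;
  }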

A third approach would be to extend the NUMA balancing code to be CPU granular 
(without changing any task placement behavior of the current NUMA balancing code of 
course), with node granular being a special case. This would fit the cgroups (and 
virtualization) use cases, but that would be a major change.

Thanks,

	Ingo


* Re: [RFC] sched: unused cpu in affine workload
  2016-04-04  8:59 ` Ingo Molnar
@ 2016-04-04  9:19   ` Ingo Molnar
  2016-04-04  9:38     ` Ingo Molnar
  0 siblings, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2016-04-04  9:19 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, James Hartsock, Rik van Riel, Srivatsa Vaddagiri,
	Kirill Tkhai, linux-kernel


* Ingo Molnar <mingo@kernel.org> wrote:

>  - if you want to come up with a 'complete' solution then please don't put it into
>    hot paths such as wakeup or context switching, or any of the hardirq methods,
>    but try to integrate it with the NUMA scheduling slow path.
> 
> The NUMA balancing slow path is softirq driven and runs at a reasonably low 
> frequency, so it should not cause many performance problems.
> 
> The two problems (NUMA affinity and user affinity) are also loosely related on a 
> conceptual level: the NUMA affinity optimization problem can be considered as a 
> workload-determined, arbitrary 'NUMA mask' being optimized from first 
> principles.
> 
> There's one ABI detail: this is true only as long as SMP affinity masks follow 
> node boundaries - the current NUMA balancing code is very much node granular, so 
> the two can only be merged if the ->cpus_allowed mask follows node boundaries as 
> well.
> 
> A third approach would be to extend the NUMA balancing code to be CPU granular 
> (without changing any task placement behavior of the current NUMA balancing code 
> of course), with node granular being a special case. This would fit the cgroups 
> (and virtualization) use cases, but that would be a major change.

So my thinking here is: if the NUMA balancing code (which is node granular at the 
moment and uses node masks, etc.) is extended to be CPU granular (which is a big 
task in itself), then the two problems can be 'unified':

  - the NUMA balancing code inputs arbitrary CPU (node) affinity masks from the
    MM code into the scheduler.

  - the scheduler syscall ABI (and other configuration sources) inputs arbitrary 
    CPU affinity masks into the scheduler.

It's a similar problem, with two (minor-looking) complications:

 - the NUMA code right now is 'statistical', while ->cpus_allowed are hard 
   constraints that must never be violated. So there always has to be a final 
   layer to implement the hard constraint - which does not exist in the NUMA 
   balancing case (see the sketch below this list). This should be relatively 
   easy I think, as we already do it with the regular balancer.

 - the balancing slowpath would have to be activated on non-NUMA systems as well, 
   so that it can handle ->cpus_allowed balancing.
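
The hard-constraint layer from the first point is conceptually just a final 
mask applied to whatever the statistical balancer suggests - a sketch, with 
'suggested'/'candidates'/'prev_cpu' purely illustrative:

  /* conceptual: the affinity mask always gets the last word */
  cpumask_and(&candidates, &suggested, tsk_cpus_allowed(p));
  if (cpumask_empty(&candidates))
          return prev_cpu;                /* nothing legal: stay put */
  return cpumask_any(&candidates);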

... once all that is solved, I can see several advantages from unifying the NUMA 
balancing and SMP affinity balancing code:

 - the NUMA balancer would improve: cpus_allowed isolation is used more 
   frequently, so fixes from those workloads would benefit the NUMA balancing case 
   as well.

 - testing the NUMA balancer would become easier: we'd simply set cpus_allowed and
   would watch how it balances. No need to coax workloads into actual MM NUMA 
   usage patterns to set up interesting scenarios.

 - our existing half-hearted ways to deal with cpus_allowed balancing could be 
   outsourced to the NUMA slow path, which would simplify the SMP balancing fast 
   path.

But it's a major piece of work, and I might be missing implementational details. 
It would be the biggest new scheduler feature since NUMA balancing for sure.

Thanks,

	Ingo


* Re: [RFC] sched: unused cpu in affine workload
  2016-04-04  9:19   ` Ingo Molnar
@ 2016-04-04  9:38     ` Ingo Molnar
  2016-04-04 13:23       ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2016-04-04  9:38 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, James Hartsock, Rik van Riel, Srivatsa Vaddagiri,
	Kirill Tkhai, linux-kernel


* Ingo Molnar <mingo@kernel.org> wrote:

> So my thinking here is: if the NUMA balancing code (which is node granular at 
> the moment and uses node masks, etc.) is extended to be CPU granular (which is a 
> big task in itself), then the two problems can be 'unified':
> 
>   - the NUMA balancing code inputs arbitrary CPU (node) affinity masks from the
>     MM code into the scheduler.
> 
>   - the scheduler syscall ABI (and other configuration sources) inputs arbitrary 
>     CPU affinity masks into the scheduler.
> 
> It's a similar problem, with two (minor-looking) complications:

btw., this highlights how hard the optimization problem is: the NUMA balancing 
code is (at least ...) O(nr_nodes^2) complex - but we had O(nr_nodes^3) passes too 
in some of the NUMA balancing submissions...

We'd upgrade that to O(nr_cpus^2), which is totally unrealistic with 16,000 CPUs 
even in a slowpath - but it would probably cause problems even with 120 CPUs. It 
will get quadratically worse as the number of CPUs in a system increases on its 
current exponential trajectory ...
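
For scale: 120 CPUs already means 120^2 = 14,400 CPU pairs per pass, while 
16,000 CPUs means 16,000^2 = 256,000,000 pairs - more than four orders of 
magnitude more work.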

So the safest bet would be to restrict any 'perfect' balancing attempts to node 
boundaries. Which won't solve the problem you outlined to begin with.

Thanks,

	Ingo


* Re: [RFC] sched: unused cpu in affine workload
  2016-04-04  9:38     ` Ingo Molnar
@ 2016-04-04 13:23       ` Peter Zijlstra
  2016-04-04 19:45         ` Rik van Riel
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2016-04-04 13:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jiri Olsa, James Hartsock, Rik van Riel, Srivatsa Vaddagiri,
	Kirill Tkhai, linux-kernel

On Mon, Apr 04, 2016 at 11:38:44AM +0200, Ingo Molnar wrote:
> We'd upgrade that to O(nr_cpus^2), which is totally unrealistic with 16,000 CPUs 
> even in a slowpath - but it would probably cause problems even with 120 CPUs. It 
> will get quadratically worse as the number of CPUs in a system increases on its 
> current exponential trajectory ...

The arbitrary affinity thing is, I think, a packing problem, which is
NP-hard IIRC.


* Re: [RFC] sched: unused cpu in affine workload
  2016-04-04 13:23       ` Peter Zijlstra
@ 2016-04-04 19:45         ` Rik van Riel
  2016-04-04 21:34           ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Rik van Riel @ 2016-04-04 19:45 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Jiri Olsa, James Hartsock, Srivatsa Vaddagiri, Kirill Tkhai,
	linux-kernel

On Mon, 2016-04-04 at 15:23 +0200, Peter Zijlstra wrote:
> On Mon, Apr 04, 2016 at 11:38:44AM +0200, Ingo Molnar wrote:
> > We'd upgrade that to O(nr_cpus^2), which is totally unrealistic with
> > 16,000 CPUs even in a slowpath - but it would probably cause problems
> > even with 120 CPUs. It will get quadratically worse as the number of
> > CPUs in a system increases on its current exponential trajectory ...
> The arbitrary affinity thing is, I think, a packing problem, which is
> NP-hard IIRC.

An optimal solution is NP-hard.

Heuristics that "move tasks with pressure" may be
much more doable, and lead to perfectly satisfactory
results, especially if most migrations happen within
a socket (and the same shared L3 cache).
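
Something along those lines, as pure pseudocode (every helper
here is hypothetical):

  /* slow path only: react when a CPU stays overloaded while a
   * CPU it could legally use sits idle
   */
  for_each_overloaded_cpu(src) {
          for_each_runnable_task(p, src) {
                  /* prefer destinations sharing src's L3 */
                  dst = find_idle_cpu(tsk_cpus_allowed(p),
                                      llc_mask(src));
                  if (dst >= 0) {
                          migrate_task_to(p, dst);
                          break;
                  }
          }
  }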

-- 
All Rights Reversed.




* Re: [RFC] sched: unused cpu in affine workload
  2016-04-04 19:45         ` Rik van Riel
@ 2016-04-04 21:34           ` Peter Zijlstra
  2016-04-05  8:56             ` Jiri Olsa
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2016-04-04 21:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Jiri Olsa, James Hartsock, Srivatsa Vaddagiri,
	Kirill Tkhai, linux-kernel

On Mon, Apr 04, 2016 at 03:45:16PM -0400, Rik van Riel wrote:
> An optimal solution is NP-hard.
> 
> Heuristics that "move tasks with pressure" may be
> much more doable, and lead to perfectly satisfactory
> results, especially if most migrations happen within
> a socket (and the same shared L3 cache).

Right; the trick will be finding something that mostly works without
making the regular balance paths more complex.

As per the argument in kernel/sched/fair.c:5694, the current
load balancing averages out to O(n), and I would very much like to keep
it that way.


* Re: [RFC] sched: unused cpu in affine workload
  2016-04-04 21:34           ` Peter Zijlstra
@ 2016-04-05  8:56             ` Jiri Olsa
  0 siblings, 0 replies; 9+ messages in thread
From: Jiri Olsa @ 2016-04-05  8:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Ingo Molnar, James Hartsock, Srivatsa Vaddagiri,
	Kirill Tkhai, linux-kernel

On Mon, Apr 04, 2016 at 11:34:50PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 04, 2016 at 03:45:16PM -0400, Rik van Riel wrote:
> > An optimal solution is NP-hard.
> > 
> > Heuristics that "move tasks with pressure" may be
> > much more doable, and lead to perfectly satisfactory
> > results, especially if most migrations happen within
> > a socket (and the same shared L3 cache).
> 
> Right; the trick will be finding something that mostly works without
> making the regular balance paths more complex.
> 
> As per the argument in kernel/sched/fair.c:5694, the current
> load balancing averages out to O(n), and I would very much like to keep
> it that way.

guys, thanks a lot for all the thoughts and suggestions,
I'll try to come up with something

jirka
