* [RFC] sched: unused cpu in affine workload
@ 2016-04-04 8:23 Jiri Olsa
2016-04-04 8:44 ` Peter Zijlstra
2016-04-04 8:59 ` Ingo Molnar
0 siblings, 2 replies; 9+ messages in thread
From: Jiri Olsa @ 2016-04-04 8:23 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: James Hartsock, Rik van Riel, Srivatsa Vaddagiri, Kirill Tkhai,
linux-kernel
hi,
we've noticed the following issue in one of our workloads.
I have a 24-CPU server with the following sched domains:
domain 0: (pairs)
domain 1: 0-5,12-17 (group1) 6-11,18-23 (group2)
domain 2: 0-23 level NUMA
I run a CPU-hogging workload on the following CPUs:
4,6,14,18,19,20,23
that is:
4,14 CPUs from group1
6,18,19,20,23 CPUs from group2
the workload process gets its affinity set up via 'taskset -c ${CPUs workload ...'
and forks a child for every CPU
very often we notice CPUs 4 and 14 running 3 processes of the workload
while CPUs 6,18,19,20,23 run just 4 processes, leaving one of the
CPUs from group2 idle
AFAICS from the code the reason for this is that the load balancing
follows the domain setup (topology) and does not take affinity setups
like this into account. The code in find_busiest_group running on an idle
CPU from group2 will find group1 as busiest, but its average load will be
smaller than that of the local group, so there's no task pulling.
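To make the average-load comparison concrete, here is a toy model of the arithmetic, with illustrative values only (one CPU hog weighted as 1024, each domain-1 group spanning 12 CPUs as above); this is not kernel code:

```python
NICE_0_LOAD = 1024  # weight of one nice-0 CPU hog (illustrative)

def group_avg_load(ntasks, ncpus):
    """Per-CPU average load of a sched group running ntasks hogs."""
    return ntasks * NICE_0_LOAD / ncpus

# group1 (0-5,12-17): the 3 hogs crowded onto CPUs 4 and 14
busiest_avg = group_avg_load(3, 12)  # 256.0
# group2 (6-11,18-23): the remaining 4 hogs
local_avg = group_avg_load(4, 12)    # ~341.3

# An idle CPU in group2 pulls only when the remote group's average
# load exceeds the local group's, which never happens here, so the
# idle CPU stays idle.
pulls = busiest_avg > local_avg
print(pulls)  # False
```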
It's obvious that the load balancer follows the sched domain topology.
However, is there some sched feature I'm missing that could help
with this? Or do we need to follow the sched domain topology when
we select CPUs for the workload to get even balancing?
thanks,
jirka
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] sched: unused cpu in affine workload
2016-04-04 8:23 [RFC] sched: unused cpu in affine workload Jiri Olsa
@ 2016-04-04 8:44 ` Peter Zijlstra
2016-04-04 8:59 ` Ingo Molnar
1 sibling, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2016-04-04 8:44 UTC (permalink / raw)
To: Jiri Olsa
Cc: Ingo Molnar, James Hartsock, Rik van Riel, Srivatsa Vaddagiri,
Kirill Tkhai, linux-kernel
On Mon, Apr 04, 2016 at 10:23:02AM +0200, Jiri Olsa wrote:
> hi,
> we've noticed the following issue in one of our workloads.
>
> I have a 24-CPU server with the following sched domains:
> domain 0: (pairs)
> domain 1: 0-5,12-17 (group1) 6-11,18-23 (group2)
> domain 2: 0-23 level NUMA
>
> I run a CPU-hogging workload on the following CPUs:
> 4,6,14,18,19,20,23
>
> that is:
> 4,14 CPUs from group1
> 6,18,19,20,23 CPUs from group2
>
> the workload process gets its affinity set up via 'taskset -c ${CPUs workload ...'
> and forks a child for every CPU
>
> very often we notice CPUs 4 and 14 running 3 processes of the workload
> while CPUs 6,18,19,20,23 run just 4 processes, leaving one of the
> CPUs from group2 idle
>
> AFAICS from the code the reason for this is that the load balancing
> follows the domain setup (topology) and does not take affinity setups
> like this into account. The code in find_busiest_group running on an idle
> CPU from group2 will find group1 as busiest, but its average load will be
> smaller than that of the local group, so there's no task pulling.
>
> It's obvious that the load balancer follows the sched domain topology.
> However, is there some sched feature I'm missing that could help
> with this? Or do we need to follow the sched domain topology when
> we select CPUs for the workload to get even balancing?
Yeah, this is 'hard'; there is some code that tries not to totally blow up
with this, but it's all a bit of a mess. See
kernel/sched/fair.c:sg_imbalanced().
The easiest solution is to simply not do this and stick with the topo
like you suggest.
So far I've not come up with a sane/stable solution for this problem.
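The idea behind sg_imbalanced() can be sketched roughly like this (names and logic are illustrative, not the kernel implementation): a balance attempt that fails purely because of task affinity leaves an "imbalance" hint on the group, which lets a later pass pull despite the plain average-load comparison:

```python
class SchedGroupCapacity:
    """Toy stand-in for per-group balance state (illustrative)."""
    def __init__(self):
        # set when a previous, affinity-limited balance attempt failed
        self.imbalance = 0

def should_pull(busiest_avg, local_avg, sgc):
    if busiest_avg > local_avg:
        return True             # the normal load-based decision
    return bool(sgc.imbalance)  # affinity-induced override

sgc = SchedGroupCapacity()
print(should_pull(256, 341, sgc))  # False: normal case, no pull
sgc.imbalance = 1
print(should_pull(256, 341, sgc))  # True: the hint forces a pull attempt
```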
* Re: [RFC] sched: unused cpu in affine workload
2016-04-04 8:23 [RFC] sched: unused cpu in affine workload Jiri Olsa
2016-04-04 8:44 ` Peter Zijlstra
@ 2016-04-04 8:59 ` Ingo Molnar
2016-04-04 9:19 ` Ingo Molnar
1 sibling, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2016-04-04 8:59 UTC (permalink / raw)
To: Jiri Olsa
Cc: Peter Zijlstra, James Hartsock, Rik van Riel, Srivatsa Vaddagiri,
Kirill Tkhai, linux-kernel
* Jiri Olsa <jolsa@redhat.com> wrote:
> hi,
> we've noticed the following issue in one of our workloads.
>
> I have a 24-CPU server with the following sched domains:
> domain 0: (pairs)
> domain 1: 0-5,12-17 (group1) 6-11,18-23 (group2)
> domain 2: 0-23 level NUMA
>
> I run a CPU-hogging workload on the following CPUs:
> 4,6,14,18,19,20,23
>
> that is:
> 4,14 CPUs from group1
> 6,18,19,20,23 CPUs from group2
>
> the workload process gets its affinity set up via 'taskset -c ${CPUs workload ...'
> and forks a child for every CPU
>
> very often we notice CPUs 4 and 14 running 3 processes of the workload
> while CPUs 6,18,19,20,23 run just 4 processes, leaving one of the
> CPUs from group2 idle
>
> AFAICS from the code the reason for this is that the load balancing
> follows the domain setup (topology) and does not take affinity setups
> like this into account. The code in find_busiest_group running on an idle
> CPU from group2 will find group1 as busiest, but its average load will be
> smaller than that of the local group, so there's no task pulling.
>
> It's obvious that the load balancer follows the sched domain topology.
> However, is there some sched feature I'm missing that could help
> with this? Or do we need to follow the sched domain topology when
> we select CPUs for the workload to get even balancing?
Yeah, so the principle with user-pinning of tasks to CPUs was always:
- pinning a task to a single CPU should obviously work fine, it's the primary
usecase for isolation.
- pinning a task to an arbitrary subset of CPUs is a 'hard' problem
mathematically that the scheduler never truly wanted to solve in a frontal
fashion.
... but that principle was set into place well before we did the NUMA scheduling
work, which in itself is a highly non-trivial load optimization problem to begin
with, so we might want to reconsider.
So there's two directions I can suggest:
- if you can come up with workable small-scale solutions to scratch an itch
that comes up in practice then that's obviously good, as long as it does not
regress anything else.
- if you want to come up with a 'complete' solution then please don't put it into
hot paths such as wakeup or context switching, or any of the hardirq methods,
but try to integrate it with the NUMA scheduling slow path.
The NUMA balancing slow path is softirq driven and runs at a reasonably low
frequency, so it does not cause many performance problems.
The two problems (NUMA affinity and user affinity) are also loosely related on a
conceptual level: the NUMA affinity optimization problem can be considered as a
workload determined, arbitrary 'NUMA mask' being optimized from first principles.
There's one ABI detail: this is true only as long as SMP affinity masks follow
node boundaries - the current NUMA balancing code is very much node granular, so
the two can only be merged if the ->cpus_allowed mask follows node boundaries as
well.
A third approach would be to extend the NUMA balancing code to be CPU granular
(without changing any task placement behavior of the current NUMA balancing code,
of course), with node granular being a special case. This would fit the cgroups
(and virtualization) usecases, but that would be a major change.
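The node-boundary condition above can be expressed as a small predicate. This is a sketch under the assumption that the two domain-1 groups stand in for NUMA nodes; `follows_node_boundaries` is a made-up helper name, not a kernel function:

```python
node0 = frozenset(range(0, 6)) | frozenset(range(12, 18))   # group1
node1 = frozenset(range(6, 12)) | frozenset(range(18, 24))  # group2
nodes = [node0, node1]

def follows_node_boundaries(mask, nodes):
    """True iff mask is a union of whole nodes, i.e. node granular."""
    return all(node <= mask or not (node & mask) for node in nodes)

print(follows_node_boundaries(node0 | node1, nodes))               # True
print(follows_node_boundaries(node1, nodes))                       # True
# The affinity mask from this thread straddles both groups:
print(follows_node_boundaries({4, 6, 14, 18, 19, 20, 23}, nodes))  # False
```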
Thanks,
Ingo
* Re: [RFC] sched: unused cpu in affine workload
2016-04-04 8:59 ` Ingo Molnar
@ 2016-04-04 9:19 ` Ingo Molnar
2016-04-04 9:38 ` Ingo Molnar
0 siblings, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2016-04-04 9:19 UTC (permalink / raw)
To: Jiri Olsa
Cc: Peter Zijlstra, James Hartsock, Rik van Riel, Srivatsa Vaddagiri,
Kirill Tkhai, linux-kernel
* Ingo Molnar <mingo@kernel.org> wrote:
> - if you want to come up with a 'complete' solution then please don't put it into
> hot paths such as wakeup or context switching, or any of the hardirq methods,
> but try to integrate it with the NUMA scheduling slow path.
>
> The NUMA balancing slow path is softirq driven and runs at a reasonably low
> frequency, so it does not cause many performance problems.
>
> The two problems (NUMA affinity and user affinity) are also loosely related on a
> conceptual level: the NUMA affinity optimization problem can be considered as a
> workload determined, arbitrary 'NUMA mask' being optimized from first
> principles.
>
> There's one ABI detail: this is true only as long as SMP affinity masks follow
> node boundaries - the current NUMA balancing code is very much node granular, so
> the two can only be merged if the ->cpus_allowed mask follows node boundaries as
> well.
>
> A third approach would be to extend the NUMA balancing code to be CPU granular
> (without changing any task placement behavior of the current NUMA balancing code
> of course), with node granular being a special case. This would fit the cgroups
> (and virtualization) usecases, but that would be a major change.
So my thinking here is: if the NUMA balancing code (which is node granular at the
moment and uses node masks, etc.) is extended to be CPU granular (which is a big
task in itself), then the two problems can be 'unified':
- the NUMA balancing code inputs arbitrary CPU (node) affinity masks from the
MM code into the scheduler.
- the scheduler syscall ABI (and other configuration sources) inputs arbitrary
CPU affinity masks into the scheduler.
it's a similar problem, with two (minor looking) complications:
- the NUMA code right now is 'statistical', while ->cpus_allowed are hard
constraints that must never be violated. So there always has to be a final
layer to implement the hard constraint - which does not exist in the NUMA
balancing case. This should be relatively easy I think as we already do it
with the regular balancer.
- the balancing slowpath would have to be activated on non-NUMA systems as well,
so that it can handle ->cpus_allowed balancing.
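The first complication, the hard ->cpus_allowed constraint, amounts to a final filtering layer over whatever the statistical balancer proposes; a minimal sketch, with a made-up helper name:

```python
def constrain(proposed_cpus, cpus_allowed):
    """Intersect a proposed placement with the task's hard affinity mask."""
    legal = proposed_cpus & cpus_allowed
    # ->cpus_allowed must never be violated; an empty intersection
    # means the whole proposal is rejected.
    return legal or None

print(constrain({0, 1, 2, 3}, {2, 3, 8}))  # {2, 3}
print(constrain({0, 1}, {8, 9}))           # None
```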
... once all that is solved, I can see several advantages from unifying the NUMA
balancing and SMP affinity balancing code:
- the NUMA balancer would improve: cpus_allowed isolation is used more
frequently, so fixes from those workloads would benefit the NUMA balancing case
as well.
- testing the NUMA balancer would become easier: we'd simply set cpus_allowed and
watch how it balances. No need to coax workloads into actual MM NUMA
usage patterns to set up interesting scenarios.
- our existing half-hearted ways to deal with cpus_allowed balancing could be
outsourced to the NUMA slow path, which would simplify the SMP balancing fast
path.
But it's a major piece of work, and I might be missing implementational details.
It would be the biggest new scheduler feature since NUMA balancing for sure.
Thanks,
Ingo
* Re: [RFC] sched: unused cpu in affine workload
2016-04-04 9:19 ` Ingo Molnar
@ 2016-04-04 9:38 ` Ingo Molnar
2016-04-04 13:23 ` Peter Zijlstra
0 siblings, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2016-04-04 9:38 UTC (permalink / raw)
To: Jiri Olsa
Cc: Peter Zijlstra, James Hartsock, Rik van Riel, Srivatsa Vaddagiri,
Kirill Tkhai, linux-kernel
* Ingo Molnar <mingo@kernel.org> wrote:
> So my thinking here is: if the NUMA balancing code (which is node granular at
> the moment and uses node masks, etc.) is extended to be CPU granular (which is a
> big task in itself), then the two problems can be 'unified':
>
> - the NUMA balancing code inputs arbitrary CPU (node) affinity masks from the
> MM code into the scheduler.
>
> - the scheduler syscall ABI (and other configuration sources) inputs arbitrary
> CPU affinity masks into the scheduler.
>
> it's a similar problem, with two (minor looking) complications:
btw., this highlights how hard the optimization problem is: the NUMA balancing
code is (at least ...) O(nr_nodes^2) complex - but we had O(nr_nodes^3) passes too
in some of the NUMA balancing submissions...
We'd upgrade that to O(nr_cpus^2), which is totally unrealistic with 16,000 CPUs
even in a slowpath - but it would probably cause problems even with 120 CPUs. It
will get quadratically worse as the number of CPUs in a system increases on its
current exponential trajectory ...
So the safest bet would be to restrict any 'perfect' balancing attempts to node
boundaries. Which won't solve the problem you outlined to begin with.
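Back-of-the-envelope numbers for an O(n^2) pass (pair counts, purely illustrative arithmetic):

```python
def pairs(n):
    """Number of unordered pairs an O(n^2) pass would consider."""
    return n * (n - 1) // 2

print(pairs(2))      # 1: two nodes, trivial
print(pairs(120))    # 7140: already noticeable, even in a slowpath
print(pairs(16000))  # 127992000: clearly unworkable
```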
Thanks,
Ingo
* Re: [RFC] sched: unused cpu in affine workload
2016-04-04 9:38 ` Ingo Molnar
@ 2016-04-04 13:23 ` Peter Zijlstra
2016-04-04 19:45 ` Rik van Riel
0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2016-04-04 13:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jiri Olsa, James Hartsock, Rik van Riel, Srivatsa Vaddagiri,
Kirill Tkhai, linux-kernel
On Mon, Apr 04, 2016 at 11:38:44AM +0200, Ingo Molnar wrote:
> We'd upgrade that to O(nr_cpus^2), which is totally unrealistic with 16,000 CPUs
> even in a slowpath - but it would probably cause problems even with 120 CPUs. It
> will get quadratically worse as the number of CPUs in a system increases on its
> current exponential trajectory ...
The arbitrary affinity thing is, I think, a packing problem, which is NP-hard
IIRC.
* Re: [RFC] sched: unused cpu in affine workload
2016-04-04 13:23 ` Peter Zijlstra
@ 2016-04-04 19:45 ` Rik van Riel
2016-04-04 21:34 ` Peter Zijlstra
0 siblings, 1 reply; 9+ messages in thread
From: Rik van Riel @ 2016-04-04 19:45 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: Jiri Olsa, James Hartsock, Srivatsa Vaddagiri, Kirill Tkhai,
linux-kernel
On Mon, 2016-04-04 at 15:23 +0200, Peter Zijlstra wrote:
> On Mon, Apr 04, 2016 at 11:38:44AM +0200, Ingo Molnar wrote:
> >
> > We'd upgrade that to O(nr_cpus^2), which is totally unrealistic with 16,000 CPUs
> > even in a slowpath - but it would probably cause problems even with 120 CPUs. It
> > will get quadratically worse as the number of CPUs in a system increases on its
> > current exponential trajectory ...
> The arbitrary affinity thing is, I think, a packing problem, which is NP-hard
> IIRC.
An optimal solution is NP-hard.
Heuristics that "move tasks with pressure" may be
much more doable, and lead to perfectly satisfactory
results, especially if most migrations happen within
a socket (and the same shared L3 cache).
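One possible shape of such a "move tasks under pressure" heuristic, sketched with made-up names (an illustration of the idea, not a proposal for actual kernel code): pick the most loaded CPU in the affinity mask, move one task to the least loaded one (preferring a same-socket destination on ties), and only when the imbalance exceeds one task:

```python
def pick_migration(nr_running, socket_of):
    """Return (src, dst) for one migration, or None if balanced enough."""
    src = max(nr_running, key=nr_running.get)
    # Least-loaded other CPU; ties go to a CPU sharing src's socket
    # (and hence, typically, its L3).
    dst = min((c for c in nr_running if c != src),
              key=lambda c: (nr_running[c], socket_of[c] != socket_of[src]))
    if nr_running[src] - nr_running[dst] > 1:
        return src, dst
    return None

# The scenario from this thread: hogs crowded on CPUs 4/14, CPU 23 idle.
nr_running = {4: 2, 14: 1, 6: 1, 18: 1, 19: 1, 20: 1, 23: 0}
socket_of = {4: 0, 14: 0, 6: 1, 18: 1, 19: 1, 20: 1, 23: 1}
print(pick_migration(nr_running, socket_of))  # (4, 23)
```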
--
All Rights Reversed.
* Re: [RFC] sched: unused cpu in affine workload
2016-04-04 19:45 ` Rik van Riel
@ 2016-04-04 21:34 ` Peter Zijlstra
2016-04-05 8:56 ` Jiri Olsa
0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2016-04-04 21:34 UTC (permalink / raw)
To: Rik van Riel
Cc: Ingo Molnar, Jiri Olsa, James Hartsock, Srivatsa Vaddagiri,
Kirill Tkhai, linux-kernel
On Mon, Apr 04, 2016 at 03:45:16PM -0400, Rik van Riel wrote:
> An optimal solution is NP-hard.
>
> Heuristics that "move tasks with pressure" may be
> much more doable, and lead to perfectly satisfactory
> results, especially if most migrations happen within
> a socket (and the same shared L3 cache).
Right; the trick will be finding something that mostly works without making
the regular balance paths increase in complexity.
As per the argument in kernel/sched/fair.c:5694 the current
load-balancing averages out to O(n), and I would very much like to keep
it that way.
* Re: [RFC] sched: unused cpu in affine workload
2016-04-04 21:34 ` Peter Zijlstra
@ 2016-04-05 8:56 ` Jiri Olsa
0 siblings, 0 replies; 9+ messages in thread
From: Jiri Olsa @ 2016-04-05 8:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rik van Riel, Ingo Molnar, James Hartsock, Srivatsa Vaddagiri,
Kirill Tkhai, linux-kernel
On Mon, Apr 04, 2016 at 11:34:50PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 04, 2016 at 03:45:16PM -0400, Rik van Riel wrote:
> > An optimal solution is NP-hard.
> >
> > Heuristics that "move tasks with pressure" may be
> > much more doable, and lead to perfectly satisfactory
> > results, especially if most migrations happen within
> > a socket (and the same shared L3 cache).
>
> Right; the trick will be finding something that mostly works without making
> the regular balance paths increase in complexity.
>
> As per the argument in kernel/sched/fair.c:5694 the current
> load-balancing averages out to O(n), and I would very much like to keep
> it that way.
guys, thanks a lot for all the thoughts and suggestions,
I'll try to come up with something
jirka
end of thread
Thread overview: 9+ messages
2016-04-04 8:23 [RFC] sched: unused cpu in affine workload Jiri Olsa
2016-04-04 8:44 ` Peter Zijlstra
2016-04-04 8:59 ` Ingo Molnar
2016-04-04 9:19 ` Ingo Molnar
2016-04-04 9:38 ` Ingo Molnar
2016-04-04 13:23 ` Peter Zijlstra
2016-04-04 19:45 ` Rik van Riel
2016-04-04 21:34 ` Peter Zijlstra
2016-04-05 8:56 ` Jiri Olsa