* [RFC] sched/topology: NUMA topology limitations
@ 2020-08-14 10:15 Valentin Schneider
  2020-11-20 12:34 ` [tip: sched/core] sched/topology: Warn when NUMA diameter > 2 tip-bot2 for Valentin Schneider
From: Valentin Schneider @ 2020-08-14 10:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen

(For those of you who already got this one: sorry! I messed up LKML and
Vincent's addresses)

Hi,

Some of you may have noticed me struggling to plug some topology holes in
[1]. While digging in there I realized there are some non-obvious
limitations wrt supported NUMA topologies; IMO if we don't feel like they
should be "fixed", they should at the very least be documented somewhere.

I should get to ramble about this at this year's (v)LPC during the
scheduler µconf, but this isn't really trivial so I figured an email
wouldn't hurt. I tried to package it into reasonable segments - this is
still a pretty big wall of text, so use of coffee/tea/beer is recommended.

Spoiler alert: I don't have solutions for these yet.

Diameter issue
==============

Definition
++++++++++

Each unique NUMA distance gets an index in sched_domains_numa_distance[],
and, as the big wall of text atop topology.c::build_balance_mask() mentions,
this can be seen as an array of hops. For instance, given the following
table:

      0  1
  0: 10 20
  1: 20 10

  sched_domains_numa_distance = { 10, 20 };

distance 10 is 0 hops (i.e. identity distance), distance 20 is 1 hop. This
isn't strictly speaking always true, as e.g. two consecutive distance
values could represent the same number of hops (with one path slightly
"longer" than the other), but it makes explaining that mess a bit easier.

I'm using 'diameter' in the NoC meaning of the term, i.e. the longest
shortest path between any two nodes, in # of hops.
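
Going back to the hop mapping: here's a tiny standalone C sketch -
emphatically not the kernel code, node_distance[] and the printed hop
indices are toy stand-ins - that collects the unique values of such a
distance table into a sorted array, which is conceptually what
sched_domains_numa_distance[] ends up being:

  #include <stdio.h>

  #define NR_NODES 2

  /* Toy stand-in for node_distance(); values from the table above */
  static const int node_distance[NR_NODES][NR_NODES] = {
          { 10, 20 },
          { 20, 10 },
  };

  int main(void)
  {
          int numa_distance[NR_NODES * NR_NODES];
          int nr_levels = 0;

          for (int i = 0; i < NR_NODES; i++) {
                  for (int j = 0; j < NR_NODES; j++) {
                          int d = node_distance[i][j], k;

                          /* Skip distances we've already seen */
                          for (k = 0; k < nr_levels && numa_distance[k] != d; k++)
                                  ;
                          if (k < nr_levels)
                                  continue;

                          /* Insert d, keeping the array sorted */
                          for (k = nr_levels++; k > 0 && numa_distance[k - 1] > d; k--)
                                  numa_distance[k] = numa_distance[k - 1];
                          numa_distance[k] = d;
                  }
          }

          for (int k = 0; k < nr_levels; k++)
                  printf("%d hop(s) <-> distance %d\n", k, numa_distance[k]);

          return 0;
  }

With the 2-node table above, this prints distance 10 at 0 hops and
distance 20 at 1 hop.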

Diameter <= 2
+++++++++++++

AFAIU, one of the issues I'm encountering in [1] is that the scheduler
topology code cannot cope with a NUMA diameter > 2. I've looked into what's
already out there, but haven't found any (x86) machine that violates this
rule. Below are some examples.

AMD Epyc
--------

node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10

Here, diameter=2

Opteron 6174
------------

This is the 4-node one, which looks like this:

node  0  1  2  3
  0: 10 20 20 20
  1: 20 10 20 20
  2: 20 20 10 20
  3: 20 20 20 10

  0 ----- 3
  | \   / |
  |   X   |
  | /   \ |
  1 ----- 2

Clearly this is diameter=1

AMD Bulldozer
-------------

I'm using the topology described in the Decade of Wasted Cores paper [2].
I'm not going to bother reproducing the distance table; the topology looks
like:

   _______________________
  /        _____________  \
  |       /              \|
  0 ----- 4 ----- 5 ----- 1
  | \   / | \   / | \   / |
  |   X   |   X   |   X   |
  | /   \ | /   \ | /   \ |
  6 ----- 2 ----- 3 ----- 7
  |       \______________/|
  \_______________________/

Which is diameter=2 (assuming equal edge weights)

Diameter > 2
++++++++++++

I haven't found any NUMA systems out there with a diameter > 2 other than
the one I'm rambling about in [1]. Nevertheless, I think there are some
theoretical setups that aren't too insane (at least less so than [1]).

One of them is a 6-node mesh, which is diameter=3 and would look like:

  0 ----- 5
  |       |
  1 ----- 4
  |       |
  2 ----- 3

A more convoluted one would be a 12-node "spidergon": a ring with an even
number of nodes N, where each node is connected to both of its neighbours
and there is an interlink between node X and node (X + N/2) % N. I won't
pretend I can draw it in ASCII, but that again gives us a diameter of 3.
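
If anyone fancies double-checking those diameters, below is a throwaway
userspace sketch (plain C, nothing kernel about it; the edge lists are just
hand-transcribed from the two topologies described above) that runs
Floyd-Warshall over the links and spits out the diameter in hops:

  #include <stdio.h>

  #define MAX_NODES 12
  #define INF       (MAX_NODES + 1)

  /* Longest shortest path, in hops, over an undirected edge list */
  static int diameter(int nr_nodes, int edges[][2], int nr_edges)
  {
          int hops[MAX_NODES][MAX_NODES];
          int diam = 0;

          /* Start from the direct links only */
          for (int i = 0; i < nr_nodes; i++)
                  for (int j = 0; j < nr_nodes; j++)
                          hops[i][j] = (i == j) ? 0 : INF;

          for (int e = 0; e < nr_edges; e++) {
                  hops[edges[e][0]][edges[e][1]] = 1;
                  hops[edges[e][1]][edges[e][0]] = 1;
          }

          /* Floyd-Warshall: all-pairs shortest paths in # of hops */
          for (int k = 0; k < nr_nodes; k++)
                  for (int i = 0; i < nr_nodes; i++)
                          for (int j = 0; j < nr_nodes; j++)
                                  if (hops[i][k] + hops[k][j] < hops[i][j])
                                          hops[i][j] = hops[i][k] + hops[k][j];

          for (int i = 0; i < nr_nodes; i++)
                  for (int j = 0; j < nr_nodes; j++)
                          if (hops[i][j] > diam)
                                  diam = hops[i][j];

          return diam;
  }

  int main(void)
  {
          /* 6-node mesh, links read off the diagram above */
          int mesh[][2] = {
                  {0, 5}, {0, 1}, {1, 4}, {1, 2}, {2, 3}, {3, 4}, {4, 5},
          };
          /* 12-node spidergon: ring + interlink to the opposite node */
          int spidergon[18][2];
          int nr_edges = 0;

          for (int i = 0; i < 12; i++) {
                  spidergon[nr_edges][0] = i;
                  spidergon[nr_edges][1] = (i + 1) % 12;
                  nr_edges++;
                  if (i < 6) { /* count each interlink only once */
                          spidergon[nr_edges][0] = i;
                          spidergon[nr_edges][1] = i + 6;
                          nr_edges++;
                  }
          }

          printf("6-node mesh diameter:       %d\n", diameter(6, mesh, 7));
          printf("12-node spidergon diameter: %d\n",
                 diameter(12, spidergon, nr_edges));

          return 0;
  }

Both come out as 3.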

Case study w/ QEMU
++++++++++++++++++

Setup
-----

The topology I'll describe below is a simplified version of the one in [1],
as it is the smallest reproducer of the issue. Morten has been calling this
a "segmented bus".

   1 ----- 0
   |
   |
   |
   2 ----- 3

node   0   1   2   3
  0:  10  20  30  40
  1:  20  10  20  30
  2:  30  20  10  20
  3:  40  30  20  10

Diameter here is 3. I can recreate this with the following qemu
incantation:

  $ qemu-system-aarch64 -kernel Image -initrd rootfs.cpio -append \
                      'console=ttyAMA0 earlycon=pl011,0x9000000 root=/dev/vda debug earlyprintk=serial sched_debug' \
                      -smp cores=4 -nographic -m 512 -cpu cortex-a53 -machine virt \
                      -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
                      -numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
                      -numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
                      -numa dist,src=0,dst=3,val=40, -numa dist,src=1,dst=2,val=20, \
                      -numa dist,src=1,dst=3,val=30, -numa dist,src=2,dst=3,val=20

If you boot the above with CONFIG_SCHED_DEBUG, you'll get:

[    0.245896] CPU0 attaching sched-domain(s):
[    0.246133]  domain-0: span=0-1 level=NUMA
[    0.246592]   groups: 0:{ span=0 cap=1011 }, 1:{ span=1 cap=1008 }
[    0.246998]   domain-1: span=0-2 level=NUMA
[    0.247145]    groups: 0:{ span=0-1 mask=0 cap=2019 }, 2:{ span=1-3 mask=2 cap=3025 }
[    0.247454] ERROR: groups don't span domain->span
[    0.247654]    domain-2: span=0-3 level=NUMA
[    0.247892]     groups: 0:{ span=0-2 mask=0 cap=3021 }, 3:{ span=1-3 mask=3 cap=3047 }

[    0.248915] CPU1 attaching sched-domain(s):
[    0.249050]  domain-0: span=0-2 level=NUMA
[    0.249181]   groups: 1:{ span=1 cap=1008 }, 2:{ span=2 cap=1002 }, 0:{ span=0 cap=1011 }
[    0.249661]   domain-1: span=0-3 level=NUMA
[    0.249810]    groups: 1:{ span=0-2 mask=1 cap=3034 }, 3:{ span=2-3 mask=3 cap=2023 }

[    0.250381] CPU2 attaching sched-domain(s):
[    0.250565]  domain-0: span=1-3 level=NUMA
[    0.250710]   groups: 2:{ span=2 cap=1002 }, 3:{ span=3 cap=999 }, 1:{ span=1 cap=1008 }
[    0.250992]   domain-1: span=0-3 level=NUMA
[    0.251129]    groups: 2:{ span=1-3 mask=2 cap=3025 }, 0:{ span=0-1 mask=0 cap=2019 }

[    0.251474] CPU3 attaching sched-domain(s):
[    0.251606]  domain-0: span=2-3 level=NUMA
[    0.251772]   groups: 3:{ span=3 cap=999 }, 2:{ span=2 cap=1002 }
[    0.252439]   domain-1: span=1-3 level=NUMA
[    0.252587]    groups: 3:{ span=2-3 mask=3 cap=2023 }, 1:{ span=0-2 mask=1 cap=3034 }
[    0.252859] ERROR: groups don't span domain->span
[    0.253009]    domain-2: span=0-3 level=NUMA
[    0.253153]     groups: 3:{ span=1-3 mask=3 cap=3047 }, 0:{ span=0-2 mask=0 cap=3021 }

Why we get this
---------------

The sched_domains we get here are NUMA<=20, NUMA<=30, NUMA<=40.

There's no problem with the sched_domain spans; those are directly derived
from sched_domains_numa_masks[][] and that works just fine. We then build
the sched_groups using those spans, see

  build_overlap_sched_groups()

Let's have a look at what happens for the first domain on CPU0 (node0) and
CPU2 (node2). This would be NUMA<=20 (i.e. 1 hop), whose groups are the
spanned CPUs:

  CPU0 (node0)
  NUMA<=20 span=[0-1], groups=(0), (1)

  CPU2 (node2)
  NUMA<=20 span=[1-3], groups=(2), (3), (1)

So far so good, nothing outlandish here. However, when we get to building
NUMA<=30 (i.e. 2 hops), this falls apart:

  CPU0
  NUMA<=30 span=[0-2], groups=(0-1), (1-3) <<< sched_domain_span(CPU2, NUMA<=20)
                               ^^^
                sched_domain_span(CPU0, NUMA<=20)

CPU3 (node3) is very much *not* <= 30 distance (2 hops) away from CPU0
(node0), but we have a group that spans it anyway, because of the
aforementioned code. IOW, we are creating a completely bogus transitivity
relation that sort of looks like:

  For any nodes i, j, k, and hop count x > 1:
    node_distance(i, j) <= sched_domains_numa_distance[x]
    AND
    node_distance(j, k) <= sched_domains_numa_distance[x-1]
    IMPLIES
    node_distance(i, k) <= sched_domains_numa_distance[x]

Which becomes in our example:

   node_distance(0, 2) <= sched_domains_numa_distance[2] # 30
   AND
   node_distance(2, 3) <= sched_domains_numa_distance[1] # 20
   IMPLIES (bogusly!)
   node_distance(0, 3) <= sched_domains_numa_distance[2] # 30

This actually works for diameters == 2 (given everything is covered by 2
hops anyway) but falls apart for bigger diameters, as is the case here.
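
For the curious, the group construction can be mimicked outside the kernel.
The sketch below is only a rough model of build_overlap_sched_groups() - one
CPU per node, plain bitmasks instead of cpumasks, a domain_span() helper
standing in for sched_domains_numa_masks[][], and no degeneration of levels
whose span matches their child's - but it walks each domain span the same
way (starting at the domain's CPU) and flags groups that leak outside the
domain span:

  #include <stdio.h>

  #define NR_CPUS 4

  static const int node_distance[NR_CPUS][NR_CPUS] = {
          { 10, 20, 30, 40 },
          { 20, 10, 20, 30 },
          { 30, 20, 10, 20 },
          { 40, 30, 20, 10 },
  };

  /* Toy sched_domains_numa_distance[]: one entry per unique distance */
  static const int numa_distance[] = { 10, 20, 30, 40 };

  /* Bitmask of CPUs within numa_distance[level] of @cpu == the domain span */
  static unsigned int domain_span(int level, int cpu)
  {
          unsigned int mask = 0;

          for (int i = 0; i < NR_CPUS; i++)
                  if (node_distance[cpu][i] <= numa_distance[level])
                          mask |= 1U << i;

          return mask;
  }

  /*
   * Roughly what build_overlap_sched_groups() does: walk the span starting
   * at @cpu, and use each not-yet-covered CPU's child domain span as a
   * group span.
   */
  static void build_groups(int level, int cpu)
  {
          unsigned int span = domain_span(level, cpu);
          unsigned int covered = 0;

          printf("CPU%d NUMA<=%d span=%#x groups:", cpu,
                 numa_distance[level], span);

          for (int k = 0; k < NR_CPUS; k++) {
                  int i = (cpu + k) % NR_CPUS;
                  unsigned int group;

                  if (!(span & (1U << i)) || (covered & (1U << i)))
                          continue;

                  group = domain_span(level - 1, i);
                  covered |= group;
                  /* Flag groups leaking outside the domain span with (!) */
                  printf(" %#x%s", group, (group & ~span) ? "(!)" : "");
          }
          printf("\n");
  }

  int main(void)
  {
          for (int cpu = 0; cpu < NR_CPUS; cpu++)
                  for (int level = 1; level < 4; level++)
                          build_groups(level, cpu);

          return 0;
  }

The only two groups it flags are the ones from the dmesg above: (1-3) under
CPU0's NUMA<=30 domain and (0-2) under CPU3's.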

Implications of fixing this
---------------------------

Clearly the current sched_group setup for such topologies is not what we
want: this disturbs load balancing on the 'corrupted' domains.

If we *do* want to support systems like this, then we have other
problems to solve. Currently, on the aforementioned QEMU setup, we have:

  CPU0-domain1
    group0: span=0-1, mask=0
    group2: span=1-3, mask=2
  CPU1-domain1
    group1: span=0-2, mask=1
    group3: span=2-3, mask=3
  CPU2-domain1
    group2: span=1-3, mask=2
    group0: span=0-1, mask=0
  CPU3-domain1
    group3: span=2-3, mask=3
    group1: span=0-2, mask=1

We would want to "fix" this into:

  CPU0-domain1
    group0: span=0-1, mask=0
  - group2: span=1-3, mask=2
  + group?: span=1-2, mask=??
  CPU1-domain1
    group1: span=0-2, mask=1
    group3: span=2-3, mask=3
  CPU2-domain1
    group2: span=1-3, mask=2
    group0: span=0-1, mask=0
  CPU3-domain1
    group3: span=2-3, mask=3
  - group1: span=0-2, mask=1
  + group?: span=1-2, mask=??

I've purposely left a blank for the balance mask, because there isn't
really any valid choice. No CPU will have this (1-2) group as its local
group at domain1 (NUMA<=30); worse than that, we now have 5 unique groups
at this topology level despite having only 4 CPUs: there is no CPU left to
use as the identifier of that "new" group (i.e. the owner of its
sched_group_capacity).

Even if we find some form of unique identifier for that group, it still
isn't the local group of any CPU at any topology level. Since
update_group_capacity() only updates the capacity of the local group, we
don't have any way to update the capacity of such groups - unless we start
updating more than the local group, which will get ugly real quick.


That's as far as I've got. If you made it all the way down here, thank you
for your time, and as always, comments / feedback are most welcome.

Links
=====

[1]: https://lkml.kernel.org/r/20200324125533.17447-1-valentin.schneider@arm.com/
[2]: https://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf


Thanks,
Valentin


* [tip: sched/core] sched/topology: Warn when NUMA diameter > 2
  2020-08-14 10:15 [RFC] sched/topology: NUMA topology limitations Valentin Schneider
@ 2020-11-20 12:34 ` tip-bot2 for Valentin Schneider
From: tip-bot2 for Valentin Schneider @ 2020-11-20 12:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Valentin Schneider, Peter Zijlstra (Intel),
	Mel Gorman, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     b5b217346de85ed1b03fdecd5c5076b34fbb2f0b
Gitweb:        https://git.kernel.org/tip/b5b217346de85ed1b03fdecd5c5076b34fbb2f0b
Author:        Valentin Schneider <valentin.schneider@arm.com>
AuthorDate:    Tue, 10 Nov 2020 18:43:00 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 19 Nov 2020 11:25:46 +01:00

sched/topology: Warn when NUMA diameter > 2

NUMA topologies where the shortest path between some two nodes requires
three or more hops (i.e. diameter > 2) end up being misrepresented in the
scheduler topology structures.

This is currently detected when booting a kernel with CONFIG_SCHED_DEBUG=y
+ sched_debug on the cmdline, although this will only yield a warning about
sched_group spans not matching sched_domain spans:

  ERROR: groups don't span domain->span

Add an explicit warning for that case, triggered regardless of
CONFIG_SCHED_DEBUG, and decorate it with an appropriate comment.

The topology described in the comment can be booted up on QEMU by appending
the following to your usual QEMU incantation:

    -smp cores=4 \
    -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
    -numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
    -numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
    -numa dist,src=0,dst=3,val=40, -numa dist,src=1,dst=2,val=20, \
    -numa dist,src=1,dst=3,val=30, -numa dist,src=2,dst=3,val=20

A somewhat more realistic topology (6-node mesh) with the same affliction
can be conjured with:

    -smp cores=6 \
    -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
    -numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
    -numa node,cpus=4,nodeid=4, -numa node,cpus=5,nodeid=5, \
    -numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
    -numa dist,src=0,dst=3,val=40, -numa dist,src=0,dst=4,val=30, \
    -numa dist,src=0,dst=5,val=20, \
    -numa dist,src=1,dst=2,val=20, -numa dist,src=1,dst=3,val=30, \
    -numa dist,src=1,dst=4,val=20, -numa dist,src=1,dst=5,val=30, \
    -numa dist,src=2,dst=3,val=20, -numa dist,src=2,dst=4,val=30, \
    -numa dist,src=2,dst=5,val=40, \
    -numa dist,src=3,dst=4,val=20, -numa dist,src=3,dst=5,val=30, \
    -numa dist,src=4,dst=5,val=20

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com
---
 kernel/sched/topology.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 90f3e55..b296c1c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -675,6 +675,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	struct sched_domain *tmp;
+	int numa_distance = 0;
 
 	/* Remove the sched domains which do not contribute to scheduling. */
 	for (tmp = sd; tmp; ) {
@@ -706,6 +707,38 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 			sd->child = NULL;
 	}
 
+	for (tmp = sd; tmp; tmp = tmp->parent)
+		numa_distance += !!(tmp->flags & SD_NUMA);
+
+	/*
+	 * FIXME: Diameter >=3 is misrepresented.
+	 *
+	 * Smallest diameter=3 topology is:
+	 *
+	 *   node   0   1   2   3
+	 *     0:  10  20  30  40
+	 *     1:  20  10  20  30
+	 *     2:  30  20  10  20
+	 *     3:  40  30  20  10
+	 *
+	 *   0 --- 1 --- 2 --- 3
+	 *
+	 * NUMA-3	0-3		N/A		N/A		0-3
+	 *  groups:	{0-2},{1-3}					{1-3},{0-2}
+	 *
+	 * NUMA-2	0-2		0-3		0-3		1-3
+	 *  groups:	{0-1},{1-3}	{0-2},{2-3}	{1-3},{0-1}	{2-3},{0-2}
+	 *
+	 * NUMA-1	0-1		0-2		1-3		2-3
+	 *  groups:	{0},{1}		{1},{2},{0}	{2},{3},{1}	{3},{2}
+	 *
+	 * NUMA-0	0		1		2		3
+	 *
+	 * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
+	 * group span isn't a subset of the domain span.
+	 */
+	WARN_ONCE(numa_distance > 2, "Shortest NUMA path spans too many nodes\n");
+
 	sched_domain_debug(sd, cpu);
 
 	rq_attach_root(rq, rd);
