linux-kernel.vger.kernel.org archive mirror
* RE: [RFC] sched/topology: NUMA topology limitations
       [not found] <6a5f06ff4ecb4f34bd7e9890dc07fb99@hisilicon.com>
@ 2020-08-29  5:32 ` Song Bao Hua (Barry Song)
  2020-08-29 12:33   ` Valentin Schneider
  2020-08-29  5:42 ` Song Bao Hua (Barry Song)
  1 sibling, 1 reply; 7+ messages in thread
From: Song Bao Hua (Barry Song) @ 2020-08-29  5:32 UTC (permalink / raw)
  To: valentin.schneider, linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, Linuxarm

> Subject: [RFC] sched/topology: NUMA topology limitations
> 
> (For those of you who already got this one: sorry! I messed up LKML and
> Vincent's addresses)
> 
> Hi,
> 
> Some of you may have noticed me struggling to plug some topology holes in
> [1]. While digging in there I realized there are some non-obvious limitations
> wrt supported NUMA topologies; IMO if we don't feel like they should be
> "fixed", they should at the very least be documented somewhere.
> 
> I should get to ramble about this at this year's (v)LPC during the scheduler
> µconf, but this isn't really trivial so I figured an email wouldn't hurt. I tried to
> package it into reasonable segments - this is still a pretty big wall of text, so
> use of coffee/tea/beer is recommended.
> 
> Spoiler alert: I don't have solutions for these yet.
> 
> Diameter issue
> ==============
> 
> Definition
> ++++++++++
> 
> Each unique NUMA distance gets an index in sched_domains_numa_distance[],
> and as the big wall of text atop topology.c::build_balance_mask() mentions
> this can be seen as an array of hops. i.e. in the following table:
> 
>       0  1
>   0: 10 20
>   1: 20 10
> 
>   sched_domains_numa_distance = { 10, 20 };
> 
> distance 10 is 0 hops (i.e. identity distance), distance 20 is 1 hop. This isn't
> strictly speaking always true, as e.g. two consecutive distance values could
> represent the same number of hops (with one path slightly "longer" than the
> other), but it makes explaining that mess a bit easier.
> 
> I'm using 'diameter' in the NoC meaning of the term, i.e. the longest shortest
> path between any two nodes, in # of hops.
> 
> Diameter <= 2
> +++++++++++++
> 
> AFAIU, one of the issues I'm encountering in [1] is that the scheduler topology
> code cannot cope with a NUMA diameter > 2. I've looked into what's already
> out there, but haven't found any (x86) machine that violates this rule. Below
> are some examples.
> 
> AMD Epyc
> --------
> 
> node   0   1   2   3   4   5   6   7
>   0:  10  16  16  16  32  32  32  32
>   1:  16  10  16  16  32  32  32  32
>   2:  16  16  10  16  32  32  32  32
>   3:  16  16  16  10  32  32  32  32
>   4:  32  32  32  32  10  16  16  16
>   5:  32  32  32  32  16  10  16  16
>   6:  32  32  32  32  16  16  10  16
>   7:  32  32  32  32  16  16  16  10
> 
> Here, diameter=2
> 
> Opteron 6174
> ------------
> 
> This is the 4-node one, which looks like this:
> 
> node  0  1  2  3
>   0: 10 20 20 20
>   1: 20 10 20 20
>   2: 20 20 10 20
>   3: 20 20 20 10
> 
>   0 ----- 3
>   | \   / |
>   |   X   |
>   | /   \ |
>   1 ----- 2
> 
> Clearly this is diameter=1
> 
> AMD Bulldozer
> -------------
> 
> I'm using the topology described in decade of wasted cores [2]; I'm not going
> to bother reproducing the distance table, the topology looks like:
> 
>    _______________________
>   /        _____________  \
>   |       /              \|
>   0 ----- 4 ----- 5 ----- 1
>   | \   / | \   / | \   / |
>   |   X   |   X   |   X   |
>   | /   \ | /   \ | /   \ |
>   6 ----- 2 ----- 3 ----- 7
>   |       \______________/|
>   \_______________________/
> 
> Which is diameter=2 (assuming equal edge weights)
> 
> Diameter > 2
> ++++++++++++
> 
> I haven't found other NUMA systems out there with a diameter > 2 other than
> the one I'm rambling about in [1]. Nevertheless, I think there are some
> theoretical setups that aren't too insane (at least less than [1]).
> 
> One of them is a 6-node mesh, which is diameter=3 and would look like:
> 
>   0 ----- 5
>   |       |
>   1 ----- 4
>   |       |
>   2 ----- 3
> 
> A more convoluted one would be a 12-node "spidergon"; it's a ring with an
> even number of nodes N, each node is connected to both of its neighbours and
> there is an interlink between node X and node (X + N/2) % N. I won't pretend I
> can draw it in ASCII, but that again gives us a diameter of 3.
> 
> Case study w/ QEMU
> ++++++++++++++++++
> 
> Setup
> -----
> 
> The topology I'll describe below is a simplified version of the one in [1], as it is
> the smallest reproducer of the issue. Morten has been calling this a
> "segmented bus".
> 
>    1 ----- 0
>    |
>    |
>    |
>    2 ----- 3
> 
> node   0   1   2   3
>   0:  10  20  30  40
>   1:  20  10  20  30
>   2:  30  20  10  20
>   3:  40  30  20  10
> 
> Diameter here is 3. I can recreate this with the following qemu
> incantation:
> 
>   $ qemu-system-aarch64 -kernel Image -initrd rootfs.cpio -append \
>                       'console=ttyAMA0 earlycon=pl011,0x9000000 root=/dev/vda debug earlyprintk=serial sched_debug' \
>                       -smp cores=4 -nographic -m 512 -cpu cortex-a53 -machine virt \
>                       -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
>                       -numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
>                       -numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
>                       -numa dist,src=0,dst=3,val=40, -numa dist,src=1,dst=2,val=20, \
>                       -numa dist,src=1,dst=3,val=30, -numa dist,src=2,dst=3,val=20
> 
> If you boot the above with CONFIG_SCHED_DEBUG, you'll get:
> 
> [    0.245896] CPU0 attaching sched-domain(s):
> [    0.246133]  domain-0: span=0-1 level=NUMA
> [    0.246592]   groups: 0:{ span=0 cap=1011 }, 1:{ span=1 cap=1008 }
> [    0.246998]   domain-1: span=0-2 level=NUMA
> [    0.247145]    groups: 0:{ span=0-1 mask=0 cap=2019 }, 2:{ span=1-3 mask=2 cap=3025 }
> [    0.247454] ERROR: groups don't span domain->span

Hi Valentin,
Thanks for your email; it is quite clear. May I ask what the real harm is
when group 2 actually extends beyond the 2-hop span here? What will happen when group2
doesn't span cpu0_domain1->span?
In domain-1, will the scheduler fail to do load balancing between group0 and group2?

> [    0.247654]    domain-2: span=0-3 level=NUMA
> [    0.247892]     groups: 0:{ span=0-2 mask=0 cap=3021 }, 3:{ span=1-3 mask=3 cap=3047 }

Here domain-2 includes the full span from 0 to 3, so does that mean we should still be able to do
load balancing in domain-2 between cpu0 and cpu3 even if we fail in domain-1?

> 
> [    0.248915] CPU1 attaching sched-domain(s):
> [    0.249050]  domain-0: span=0-2 level=NUMA
> [    0.249181]   groups: 1:{ span=1 cap=1008 }, 2:{ span=2 cap=1002 }, 0:{ span=0 cap=1011 }
> [    0.249661]   domain-1: span=0-3 level=NUMA
> [    0.249810]    groups: 1:{ span=0-2 mask=1 cap=3034 }, 3:{ span=2-3 mask=3 cap=2023 }
> 
> [    0.250381] CPU2 attaching sched-domain(s):
> [    0.250565]  domain-0: span=1-3 level=NUMA
> [    0.250710]   groups: 2:{ span=2 cap=1002 }, 3:{ span=3 cap=999 }, 1:{ span=1 cap=1008 }
> [    0.250992]   domain-1: span=0-3 level=NUMA
> [    0.251129]    groups: 2:{ span=1-3 mask=2 cap=3025 }, 0:{ span=0-1 mask=0 cap=2019 }
> 
> [    0.251474] CPU3 attaching sched-domain(s):
> [    0.251606]  domain-0: span=2-3 level=NUMA
> [    0.251772]   groups: 3:{ span=3 cap=999 }, 2:{ span=2 cap=1002 }
> [    0.252439]   domain-1: span=1-3 level=NUMA
> [    0.252587]    groups: 3:{ span=2-3 mask=3 cap=2023 }, 1:{ span=0-2 mask=1 cap=3034 }
> [    0.252859] ERROR: groups don't span domain->span
> [    0.253009]    domain-2: span=0-3 level=NUMA
> [    0.253153]     groups: 3:{ span=1-3 mask=3 cap=3047 }, 0:{ span=0-2 mask=0 cap=3021 }
> 
> Why we get this
> ---------------
> 
> The sched_domains we get here are NUMA<=20, NUMA<=30, NUMA<=40.
> 
> There's no problem with the sched_domain spans, those are directly derived
> from sched_domains_numa_masks[][] and that works just fine. We then build
> the sched_groups using those spans, see
> 
>   build_overlap_sched_groups()
> 
> Let's have a look at what happens for the first domain on CPU0 (node0) and
> CPU2 (node2). This would be NUMA<=20 (i.e. 1 hop), whose groups are the
> spanned CPUs:
> 
>   CPU0 (node0)
>   NUMA<=20 span=[0-1], groups=(0), (1)
> 
>   CPU2 (node2)
>   NUMA<=20 span=[1-3], groups=(2), (3), (1)
> 
> So far so good, nothing outlandish here. However, when we get to building
> NUMA<=30 (i.e. 2 hops), this falls apart:
> 
>   CPU0
>   NUMA<=30 span=[0-2], groups=(0-1), (1-3) <<< sched_domain_span(CPU2, NUMA<=20)
>                                ^^^
>                 sched_domain_span(CPU0, NUMA<=20)
> 
> CPU3 (node3) is very much *not* <= 30 distance (2 hops) away from CPU0
> (node0), but we have a group that spans it anyway, because of the
> aforementioned code. IOW, we are creating a completely bogus transitivity
> relation that sort of looks like:
> 
>   For any nodes i, j, k, and hop count x > 1:
>     node_distance(i, j) <= sched_domains_numa_distance[x]
>     AND
>     node_distance(j, k) <= sched_domains_numa_distance[x-1]
>     IMPLIES
>     node_distance(i, k) <= sched_domains_numa_distance[x]
> 
> Which becomes in our example:
> 
>    node_distance(0, 2) <= sched_domains_numa_distance[2] # 30
>    AND
>    node_distance(2, 3) <= sched_domains_numa_distance[1] # 20
>    IMPLIES (bogusly!)
>    node_distance(0, 3) <= sched_domains_numa_distance[2] # 30
> 
> This actually works for diameters == 2 (given everything is covered by 2 hops
> anyway) but falls apart for bigger diameters, as is the case here.
> 
> Implications of fixing this
> ---------------------------
> 
> Clearly the current sched_group setup for such topologies is not what we
> want: this disturbs load balancing on the 'corrupted' domains.
> 
> If we *do* want to support systems like this, then we have other problems to
> solve. Currently, on the aforementioned QEMU setup, we have:
> 
>   CPU0-domain1
>     group0: span=0-2, mask=0
>     group2: span=1-3, mask=2

Your kernel log is:
[    0.246998]   domain-1: span=0-2 level=NUMA
[    0.247145]    groups: 0:{ span=0-1 mask=0 cap=2019 }, 2:{ span=1-3 mask=2 cap=3025 }

It seems group0 should be:
group0: span=0-1, mask=0

but not
group0: span=0-2, mask=0 ?

Is it a typo here?

>   CPU1-domain1
>     group1: span=0-2, mask=1
>     group3: span=2-3, mask=3
>   CPU2-domain1
>     group2: span=1-3, mask=2
>     group0: span=0-1, mask=0
>   CPU3-domain1
>     group3: span=2-3, mask=3
>     group1: span=0-2, mask=1
> 
> We would want to "fix" this into:
> 
>   CPU0-domain1
>     group0: span=0-2, mask=0
>   - group2: span=1-3, mask=2
>   + group?: span=1-2, mask=??

It seems sensible to move node 3 out of the group spans of cpu0-domain1.

>   CPU1-domain1
>     group1: span=0-2, mask=1
>     group3: span=2-3, mask=3
>   CPU2-domain1
>     group2: span=1-3, mask=2
>     group0: span=0-1, mask=0
>   CPU3-domain1
>     group3: span=2-3, mask=3
>   - group1: span=0-2, mask=1
>   + group?: span=1-2, mask=??

Again, it seems sensible to move node 0 out of the group spans of cpu3-domain1.
> 
> I've purposely left a blank for the balance mask, because there isn't really any
> valid choice. No CPU will have this (1-2) group as its local group at domain1
> (NUMA<=30); worse than that, we actually have now 5 unique groups at this
> topology level despite having only 4 CPUs: there are no CPUs left to use as
> identifier of that "new" group (i.e. owner of sched_group_capacity).
> 
> Even if we find some form of unique identifier for that group, it still isn't the
> local group of any CPU at any topology level. Since
> update_group_capacity() only updates the capacity of the local group, we
> don't have any way to update the capacity of such groups - unless we start
> updating more than the local group, which will get ugly real quick.
> 
> 
> That's as far as I got to. If you got to here, thank you for your time, and as
> always, comments / feedback are most welcome.

What is the real harm you have seen in load balancing? Have you ever tried
to make cpu0, cpu1 and cpu2 busy and wake up more tasks on cpu0? Will cpu3
join in to share the load in your qemu topology?

> 
> Links
> =====
> 
> [1]: https://lkml.kernel.org/r/20200324125533.17447-1-valentin.schneider@arm.com/
> [2]: https://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
> 

Thanks
Barry



* RE: [RFC] sched/topology: NUMA topology limitations
       [not found] <6a5f06ff4ecb4f34bd7e9890dc07fb99@hisilicon.com>
  2020-08-29  5:32 ` [RFC] sched/topology: NUMA topology limitations Song Bao Hua (Barry Song)
@ 2020-08-29  5:42 ` Song Bao Hua (Barry Song)
  1 sibling, 0 replies; 7+ messages in thread
From: Song Bao Hua (Barry Song) @ 2020-08-29  5:42 UTC (permalink / raw)
  To: valentin.schneider, linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, Linuxarm

> Subject: RE: [RFC] sched/topology: NUMA topology limitations
> 
> > Subject: [RFC] sched/topology: NUMA topology limitations
> >
> > (For those of you who already got this one: sorry! I messed up LKML and
> > Vincent's addresses)
> >
> > Hi,
> >
> > Some of you may have noticed me struggling to plug some topology holes in
> > [1]. While digging in there I realized there are some non-obvious limitations
> > wrt supported NUMA topologies; IMO if we don't feel like they should be
> > "fixed", they should at the very least be documented somewhere.
> >
> > I should get to ramble about this at this year's (v)LPC during the scheduler
> > µconf, but this isn't really trivial so I figured an email wouldn't hurt. I tried to
> > package it into reasonable segments - this is still a pretty big wall of text, so
> > use of coffee/tea/beer is recommended.
> >
> > Spoiler alert: I don't have solutions for these yet.
> >
> > Diameter issue
> > ==============
> >
> > Definition
> > ++++++++++
> >
> > Each unique NUMA distance gets an index in
> sched_domains_numa_distance[],
> > and as the big wall of text atop topology.c::build_balance_mask() mentions
> > this can be seen as an array of hops. i.e. in the following table:
> >
> >       0  1
> >   0: 10 20
> >   1: 20 10
> >
> >   sched_domains_numa_distance = { 10, 20 };
> >
> > distance 10 is 0 hops (i.e. identity distance), distance 20 is 1 hop. This isn't
> > strictly speaking always true, as e.g. two consecutive distance values could
> > represent the same number of hops (with one path slightly "longer" than the
> > other), but it makes explaining that mess a bit easier.
> >
> > I'm using 'diameter' in the NoC meaning of the term, i.e. the longest shortest
> > path between any two nodes, in # of hops.
> >
> > Diameter <= 2
> > +++++++++++++
> >
> > AFAIU, one of the issues I'm encountering in [1] is that the scheduler
> topology
> > code cannot cope with a NUMA diameter > 2. I've looked into what's already
> > out there, but haven't found any (x86) machine that violates this rule. Below
> > are some examples.
> >
> > AMD Epyc
> > --------
> >
> > node   0   1   2   3   4   5   6   7
> >   0:  10  16  16  16  32  32  32  32
> >   1:  16  10  16  16  32  32  32  32
> >   2:  16  16  10  16  32  32  32  32
> >   3:  16  16  16  10  32  32  32  32
> >   4:  32  32  32  32  10  16  16  16
> >   5:  32  32  32  32  16  10  16  16
> >   6:  32  32  32  32  16  16  10  16
> >   7:  32  32  32  32  16  16  16  10
> >
> > Here, diameter=2
> >
> > Opteron 6174
> > ------------
> >
> > This is the 4-node one, which looks like this:
> >
> > node  0  1  2  3
> >   0: 10 20 20 20
> >   1: 20 10 20 20
> >   2: 20 20 10 20
> >   3: 20 20 20 10
> >
> >   0 ----- 3
> >   | \   / |
> >   |   X   |
> >   | /   \ |
> >   1 ----- 2
> >
> > Clearly this is diameter=1
> >
> > AMD Bulldozer
> > -------------
> >
> > I'm using the topology described in decade of wasted cores [2]; I'm not going
> > to bother reproducing the distance table, the topology looks like:
> >
> >    _______________________
> >   /        _____________  \
> >   |       /              \|
> >   0 ----- 4 ----- 5 ----- 1
> >   | \   / | \   / | \   / |
> >   |   X   |   X   |   X   |
> >   | /   \ | /   \ | /   \ |
> >   6 ----- 2 ----- 3 ----- 7
> >   |       \______________/|
> >   \_______________________/
> >
> > Which is diameter=2 (assuming equal edge weights)
> >
> > Diameter > 2
> > ++++++++++++
> >
> > I haven't found other NUMA systems out there with a diameter > 2 other
> than
> > the one I'm rambling about in [1]. Nevertheless, I think there are some
> > theoretical setups that aren't too insane (at least less than [1]).
> >
> > One of them is a 6-node mesh, which is diameter=3 and would look like:
> >
> >   0 ----- 5
> >   |       |
> >   1 ----- 4
> >   |       |
> >   2 ----- 3
> >
> > A more convoluted one would be a 12-node "spidergon"; it's a ring with an
> > even number of nodes N, each node is connected to both of its neighbours
> and
> > there is an interlink between node X and node (X + N/2) % N. I won't pretend
> I
> > can draw it in ASCII, but that again gives us a diameter of 3.
> >
> > Case study w/ QEMU
> > ++++++++++++++++++
> >
> > Setup
> > -----
> >
> > The topology I'll describe below is a simplified version of the one in [1], as it
> is
> > the smallest reproducer of the issue. Morten has been calling this a
> > "segmented bus".
> >
> >    1 ----- 0
> >    |
> >    |
> >    |
> >    2 ----- 3
> >
> > node   0   1   2   3
> >   0:  10  20  30  40
> >   1:  20  10  20  30
> >   2:  30  20  10  20
> >   3:  40  30  20  10
> >
> > Diameter here is 3. I can recreate this with the following qemu
> > incantation:
> >
> >   $ qemu-system-aarch64 -kernel Image -initrd rootfs.cpio -append \
> >                       'console=ttyAMA0 earlycon=pl011,0x9000000
> > root=/dev/vda debug earlyprintk=serial sched_debug' \
> >                       -smp cores=4 -nographic -m 512 -cpu cortex-a53
> > -machine virt \
> >                       -numa node,cpus=0,nodeid=0 -numa
> > node,cpus=1,nodeid=1, \
> >                       -numa node,cpus=2,nodeid=2, -numa
> > node,cpus=3,nodeid=3, \
> >                       -numa dist,src=0,dst=1,val=20, -numa
> > dist,src=0,dst=2,val=30, \
> >                       -numa dist,src=0,dst=3,val=40, -numa
> > dist,src=1,dst=2,val=20, \
> >                       -numa dist,src=1,dst=3,val=30, -numa
> > dist,src=2,dst=3,val=20
> >
> > If you boot the above with CONFIG_SCHED_DEBUG, you'll get:
> >
> > [    0.245896] CPU0 attaching sched-domain(s):
> > [    0.246133]  domain-0: span=0-1 level=NUMA
> > [    0.246592]   groups: 0:{ span=0 cap=1011 }, 1:{ span=1 cap=1008 }
> > [    0.246998]   domain-1: span=0-2 level=NUMA
> > [    0.247145]    groups: 0:{ span=0-1 mask=0 cap=2019 }, 2:{ span=1-3
> > mask=2 cap=3025 }
> > [    0.247454] ERROR: groups don't span domain->span
> 
> Hi Valentin,
> Thanks for your email. It seems it is quite clear. May I ask what is the real harm
> when group 2 is actually out of the span of diameter 2 here? What will happen
> when group2
> doesn't span cpu0_domain1->span?
> In domain-1, will scheduler fail to do load balance between group0 and
> group2?
> 
On second thought, would the harm be that the scheduler should do load balancing
among cpu0, cpu1 and cpu2 in domain1, but it is actually doing load balancing
among all of cpu0, cpu1, cpu2 and cpu3 since cpu3 is incorrectly put in group2?
So it is possible that the scheduler will make a wrong decision and put tasks on cpu3 while
it actually should only begin to do that in domain2?

Thanks
Barry



* Re: [RFC] sched/topology: NUMA topology limitations
  2020-08-29  5:32 ` [RFC] sched/topology: NUMA topology limitations Song Bao Hua (Barry Song)
@ 2020-08-29 12:33   ` Valentin Schneider
  2020-08-31 10:45     ` Song Bao Hua (Barry Song)
  0 siblings, 1 reply; 7+ messages in thread
From: Valentin Schneider @ 2020-08-29 12:33 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song)
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, vincent.guittot,
	dietmar.eggemann, morten.rasmussen, Linuxarm


Hi Barry,

Thanks for having a look!

On 29/08/20 06:32, Barry Song wrote:
>> If you boot the above with CONFIG_SCHED_DEBUG, you'll get:
>>
>> [    0.245896] CPU0 attaching sched-domain(s):
>> [    0.246133]  domain-0: span=0-1 level=NUMA
>> [    0.246592]   groups: 0:{ span=0 cap=1011 }, 1:{ span=1 cap=1008 }
>> [    0.246998]   domain-1: span=0-2 level=NUMA
>> [    0.247145]    groups: 0:{ span=0-1 mask=0 cap=2019 }, 2:{ span=1-3
>> mask=2 cap=3025 }
>> [    0.247454] ERROR: groups don't span domain->span
>
> Hi Valentin,
> Thanks for your email. It seems it is quite clear. May I ask what is the real harm
> when group 2 is actually out of the span of diameter 2 here? What will happen when group2
> doesn't span cpu0_domain1->span?
> In domain-1, will scheduler fail to do load balance between group0 and group2?
>
>> [    0.247654]    domain-2: span=0-3 level=NUMA
>> [    0.247892]     groups: 0:{ span=0-2 mask=0 cap=3021 }, 3:{ span=1-3
>> mask=3 cap=3047 }
>
> Here domain-2 includes all span from 0 to 3, so that means we should still be able to do
> load balance in domain-2 between cpu0 and cpu3 even we fail in domain-1?
>

[...]

>> Implications of fixing this
>> ---------------------------
>>
>> Clearly the current sched_group setup for such topologies is not what we
>> want: this disturbs load balancing on the 'corrupted' domains.
>>
>> If we *do* want to support systems like this, then we have other problems to
>> solve. Currently, on the aforementioned QEMU setup, we have:
>>
>>   CPU0-domain1
>>     group0: span=0-2, mask=0
>>     group2: span=1-3, mask=2
>
> Your kernel log is:
> [    0.246998]   domain-1: span=0-2 level=NUMA
> [    0.247145]    groups: 0:{ span=0-1 mask=0 cap=2019 }, 2:{ span=1-3
> mask=2 cap=3025 }
>
> it seems group0 should be:
> group0: span=0-1, mask=0
>
> but not
> group0: span=0-2, mask=0 ?
>
> is it a typo here?
>

It is indeed, well spotted.

[...]

>
> What is the real harm you have seen in load balance? Have you even tried
> to make cpu0,cpu1,cpu2 busy and wake-up more tasks in cpu0? Will cpu3
> join to share the loading in your qemu topology?
>

(pasting the comment from your other email here)

> After second thought, would the harm be that scheduler should do load balance
> among cpu0, cpu1 and cpu2 in domain1, but it is actually doing load balance
> among all of cpu0, cpu1, cpu2 and cpu3 since cpu3 is incorrectly put in group2?
> So it is possible that scheduler will make wrong decision to put tasks in cpu3 while
> it actually should only begin to do that in domain2?

Ignoring corner cases where task affinity gets in the way, load balance
will always pull tasks to the local CPU (i.e. the CPU whose sched_domain we
are working on).

If we're balancing load for CPU0-domain1, we would be looking at which CPUs
in [0-2] (i.e. the domain's span) we could (if we should) pull tasks from
to migrate them over to CPU0.

We'll first try to figure out which sched_group has the most load (see
find_busiest_group() & friends), and that's where we may hit issues.

Consider a scenario where CPU3 is noticeably busier than the other
CPUs. We'll end up marking CPU0-domain1-group2 (1-3) as the busiest group,
and compute an imbalance (i.e. amount of load to pull) mostly based on the
status of CPU3.

We'll then go to find_busiest_queue(); the mask of CPUs we iterate over is
restricted by the sched_domain_span (i.e. doesn't include CPU3 here), so
we'll pull things from either CPU1 or CPU2 based on stats we built looking
at CPU3, which is bound to be pretty bogus.

To summarise: we won't pull from the "outsider" node(s) (i.e., nodes
included in the sched_groups but not covered by the sched_domain), but they
will influence the stats and heuristics of the load balance.
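
To make that concrete, here is a rough toy sketch (plain C, made-up load
numbers, hypothetical arrays, not the actual fair.c code) of the mismatch:
the group statistics are accumulated over the sched_group span, which
includes CPU3, while the CPUs we may actually pull from are filtered by
the sched_domain span, which excludes it:

  #include <stdio.h>

  int main(void)
  {
          /* Per-CPU load; CPU3 is the really busy one */
          int load[4]        = { 0, 100, 100, 400 };
          /* CPU0-domain1 (NUMA<=30): domain span = {0,1,2} */
          int domain_span[4] = { 1, 1, 1, 0 };
          /* group2 as built today: span = {1,2,3} */
          int group_span[4]  = { 0, 1, 1, 1 };
          int group_load = 0, cpu, busiest = -1;

          /* find_busiest_group()-like step: stats over the group span */
          for (cpu = 0; cpu < 4; cpu++)
                  if (group_span[cpu])
                          group_load += load[cpu];

          /* find_busiest_queue()-like step: candidates are limited to
           * the domain span, so CPU3 can never be picked here.
           */
          for (cpu = 0; cpu < 4; cpu++)
                  if (group_span[cpu] && domain_span[cpu] &&
                      (busiest < 0 || load[cpu] > load[busiest]))
                          busiest = cpu;

          printf("group load=%d (inflated by CPU3), busiest pullable CPU=%d\n",
                 group_load, busiest);
          return 0;
  }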

HTH


* RE: [RFC] sched/topology: NUMA topology limitations
  2020-08-29 12:33   ` Valentin Schneider
@ 2020-08-31 10:45     ` Song Bao Hua (Barry Song)
  2020-09-01  9:40       ` Valentin Schneider
  0 siblings, 1 reply; 7+ messages in thread
From: Song Bao Hua (Barry Song) @ 2020-08-31 10:45 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, vincent.guittot,
	dietmar.eggemann, morten.rasmussen, Linuxarm



> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> Sent: Sunday, August 30, 2020 12:33 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: linux-kernel@vger.kernel.org; Ingo Molnar <mingo@kernel.org>; Peter
> Zijlstra <peterz@infradead.org>; vincent.guittot@linaro.org;
> dietmar.eggemann@arm.com; morten.rasmussen@arm.com; Linuxarm
> <linuxarm@huawei.com>
> Subject: Re: [RFC] sched/topology: NUMA topology limitations
> 
> 
> Hi Barry,
> 
> Thanks for having a look!
> 
> On 29/08/20 06:32, Barry Song wrote:
> >> If you boot the above with CONFIG_SCHED_DEBUG, you'll get:
> >>
> >> [    0.245896] CPU0 attaching sched-domain(s):
> >> [    0.246133]  domain-0: span=0-1 level=NUMA
> >> [    0.246592]   groups: 0:{ span=0 cap=1011 }, 1:{ span=1 cap=1008 }
> >> [    0.246998]   domain-1: span=0-2 level=NUMA
> >> [    0.247145]    groups: 0:{ span=0-1 mask=0 cap=2019 }, 2:{ span=1-3
> >> mask=2 cap=3025 }
> >> [    0.247454] ERROR: groups don't span domain->span
> >
> > Hi Valentin,
> > Thanks for your email. It seems it is quite clear. May I ask what is the real
> harm
> > when group 2 is actually out of the span of diameter 2 here? What will
> happen when group2
> > doesn't span cpu0_domain1->span?
> > In domain-1, will scheduler fail to do load balance between group0 and
> group2?
> >
> >> [    0.247654]    domain-2: span=0-3 level=NUMA
> >> [    0.247892]     groups: 0:{ span=0-2 mask=0 cap=3021 },
> 3:{ span=1-3
> >> mask=3 cap=3047 }
> >
> > Here domain-2 includes all span from 0 to 3, so that means we should still be
> able to do
> > load balance in domain-2 between cpu0 and cpu3 even we fail in domain-1?
> >
> 
> [...]
> 
> >> Implications of fixing this
> >> ---------------------------
> >>
> >> Clearly the current sched_group setup for such topologies is not what we
> >> want: this disturbs load balancing on the 'corrupted' domains.
> >>
> >> If we *do* want to support systems like this, then we have other problems
> to
> >> solve. Currently, on the aforementioned QEMU setup, we have:
> >>
> >>   CPU0-domain1
> >>     group0: span=0-2, mask=0
> >>     group2: span=1-3, mask=2
> >
> > Your kernel log is:
> > [    0.246998]   domain-1: span=0-2 level=NUMA
> > [    0.247145]    groups: 0:{ span=0-1 mask=0 cap=2019 }, 2:{ span=1-3
> > mask=2 cap=3025 }
> >
> > it seems group0 should be:
> > group0: span=0-1, mask=0
> >
> > but not
> > group0: span=0-2, mask=0 ?
> >
> > is it a typo here?
> >
> 
> It is indeed, well spotted.
> 
> [...]
> 
> >
> > What is the real harm you have seen in load balance? Have you even tried
> > to make cpu0,cpu1,cpu2 busy and wake-up more tasks in cpu0? Will cpu3
> > join to share the loading in your qemu topology?
> >
> 
> (pasting the comment from your other email here)
> 
> > After second thought, would the harm be that scheduler should do load
> balance
> > among cpu0, cpu1 and cpu2 in domain1, but it is actually doing load balance
> > among all of cpu0, cpu1, cpu2 and cpu3 since cpu3 is incorrectly put in
> group2?
> > So it is possible that scheduler will make wrong decision to put tasks in cpu3
> while
> > it actually should only begin to do that in domain2?
> 
> Ignoring corner cases where task affinity gets in the way, load balance
> will always pull tasks to the local CPU (i.e. the CPU who's sched_domain we
> are working on).
> 
> If we're balancing load for CPU0-domain1, we would be looking at which CPUs
> in [0-2] (i.e. the domain's span) we could (if we should) pull tasks from
> to migrate them over to CPU0.
> 
> We'll first try to figure out which sched_group has the more load (see
> find_busiest_group() & friends), and that's where we may hit issues.
> 
> Consider a scenario where CPU3 is noticeably busier than the other
> CPUs. We'll end up marking CPU0-domain1-group2 (1-3) as the busiest group,
> and compute an imbalance (i.e. amount of load to pull) mostly based on the
> status of CPU3.
> 
> We'll then go to find_busiest_queue(); the mask of CPUs we iterate over is
> restricted by the sched_domain_span (i.e. doesn't include CPU3 here), so
> we'll pull things from either CPU1 or CPU2 based on stats we built looking
> at CPU3, which is bound to be pretty bogus.
> 
> To summarise: we won't pull from the "outsider" node(s) (i.e., nodes
> included in the sched_groups but not covered by the sched_domain), but they
> will influence the stats and heuristics of the load balance.

Hi Valentin,
Thanks for your clarification. In many scenarios, to achieve good performance, people would
pin processes to a NUMA node. So the priority for pinning would be the local node first, then domain0 with one hop. Domain1
with two hops is actually too far. Domain2 with three hops would be a disaster. If cpu0 pulls a task from cpu2,
but the memory is still on CPU2's node, 3 hops would be a big problem for memory access and page migration.

However, for automatic NUMA balancing, I would agree we need to fix the group layout to make groups
stay within the span of the sched_domain. Otherwise, it seems the scheduler is running incorrectly when finding the right
cpu to pull a task from.

In case we have
0 tasks on cpu0
1 task on cpu1
1 task on cpu2
4 tasks on cpu3

In sched_domain1, cpu1+cpu3 make the group (1-3) look busy, so cpu0 would try to pull a task from cpu2 of that group
because cpu3 is busy, even though cpu3 is an outsider.

Thanks
Barry


* Re: [RFC] sched/topology: NUMA topology limitations
  2020-08-31 10:45     ` Song Bao Hua (Barry Song)
@ 2020-09-01  9:40       ` Valentin Schneider
  2020-09-04  2:02         ` Song Bao Hua (Barry Song)
  0 siblings, 1 reply; 7+ messages in thread
From: Valentin Schneider @ 2020-09-01  9:40 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song)
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, vincent.guittot,
	dietmar.eggemann, morten.rasmussen, Linuxarm


On 31/08/20 11:45, Barry Song wrote:
>> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
>>
>> Ignoring corner cases where task affinity gets in the way, load balance
>> will always pull tasks to the local CPU (i.e. the CPU who's sched_domain we
>> are working on).
>>
>> If we're balancing load for CPU0-domain1, we would be looking at which CPUs
>> in [0-2] (i.e. the domain's span) we could (if we should) pull tasks from
>> to migrate them over to CPU0.
>>
>> We'll first try to figure out which sched_group has the more load (see
>> find_busiest_group() & friends), and that's where we may hit issues.
>>
>> Consider a scenario where CPU3 is noticeably busier than the other
>> CPUs. We'll end up marking CPU0-domain1-group2 (1-3) as the busiest group,
>> and compute an imbalance (i.e. amount of load to pull) mostly based on the
>> status of CPU3.
>>
>> We'll then go to find_busiest_queue(); the mask of CPUs we iterate over is
>> restricted by the sched_domain_span (i.e. doesn't include CPU3 here), so
>> we'll pull things from either CPU1 or CPU2 based on stats we built looking
>> at CPU3, which is bound to be pretty bogus.
>>
>> To summarise: we won't pull from the "outsider" node(s) (i.e., nodes
>> included in the sched_groups but not covered by the sched_domain), but they
>> will influence the stats and heuristics of the load balance.
>
> Hi Valentin,
> Thanks for your clarification. For many scenarios, to achieve good performance, people would
> pin processes in numa node. So the priority to pin would be local node first, then domain0 with one hop. Domain1
> with two hops is actually too far. Domain2 with three hops would be a disaster. If cpu0 pulls task from cpu2,
> but memory is still one CPU2's node, 3 hops would be a big problem for memory access and page migration.
>

Did you mean CPU3 here?

> However, for automatic numa balance, I would agree we need to fix the groups layout to make groups
> stay in the span of sched_domain. Otherwise, it seems the scheduler is running incorrectly to find the right
> cpu to pull task.
>
> In case we have
> 0 task on cpu0
> 1 task on cpu1
> 1 task on cpu2
> 4 task on cpu3
>
> In sched_domain1, cpu1+cpu3 is busy, so cpu0 would try to pull task from cpu2 of the group(1-3) because cpu3 is busy,
> meanwhile, it is an outsider.
>

Right, we'd pull from either CPU1 or CPU2 (in this case via a tentative
active load balance) because they are in the same group as CPU3 which
inflates the sched_group load stats, but we can't pull from it at this
domain because it's not included in the domain span.

> Thanks
> Barry


* RE: [RFC] sched/topology: NUMA topology limitations
  2020-09-01  9:40       ` Valentin Schneider
@ 2020-09-04  2:02         ` Song Bao Hua (Barry Song)
  0 siblings, 0 replies; 7+ messages in thread
From: Song Bao Hua (Barry Song) @ 2020-09-04  2:02 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, vincent.guittot,
	dietmar.eggemann, morten.rasmussen, Linuxarm



> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> Sent: Tuesday, September 1, 2020 9:41 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: linux-kernel@vger.kernel.org; Ingo Molnar <mingo@kernel.org>; Peter
> Zijlstra <peterz@infradead.org>; vincent.guittot@linaro.org;
> dietmar.eggemann@arm.com; morten.rasmussen@arm.com; Linuxarm
> <linuxarm@huawei.com>
> Subject: Re: [RFC] sched/topology: NUMA topology limitations
> 
> 
> On 31/08/20 11:45, Barry Song wrote:
> >> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> >>
> >> Ignoring corner cases where task affinity gets in the way, load balance
> >> will always pull tasks to the local CPU (i.e. the CPU who's sched_domain we
> >> are working on).
> >>
> >> If we're balancing load for CPU0-domain1, we would be looking at which
> CPUs
> >> in [0-2] (i.e. the domain's span) we could (if we should) pull tasks from
> >> to migrate them over to CPU0.
> >>
> >> We'll first try to figure out which sched_group has the more load (see
> >> find_busiest_group() & friends), and that's where we may hit issues.
> >>
> >> Consider a scenario where CPU3 is noticeably busier than the other
> >> CPUs. We'll end up marking CPU0-domain1-group2 (1-3) as the busiest
> group,
> >> and compute an imbalance (i.e. amount of load to pull) mostly based on the
> >> status of CPU3.
> >>
> >> We'll then go to find_busiest_queue(); the mask of CPUs we iterate over is
> >> restricted by the sched_domain_span (i.e. doesn't include CPU3 here), so
> >> we'll pull things from either CPU1 or CPU2 based on stats we built looking
> >> at CPU3, which is bound to be pretty bogus.
> >>
> >> To summarise: we won't pull from the "outsider" node(s) (i.e., nodes
> >> included in the sched_groups but not covered by the sched_domain), but
> they
> >> will influence the stats and heuristics of the load balance.
> >
> > Hi Valentin,
> > Thanks for your clarification. For many scenarios, to achieve good
> performance, people would
> > pin processes in numa node. So the priority to pin would be local node first,
> then domain0 with one hop. Domain1
> > with two hops is actually too far. Domain2 with three hops would be a
> disaster. If cpu0 pulls task from cpu2,
> > but memory is still one CPU2's node, 3 hops would be a big problem for
> memory access and page migration.
> >
> 
> Did you mean CPU3 here?

Yep. I meant cpu3 here.

> 
> > However, for automatic numa balance, I would agree we need to fix the
> groups layout to make groups
> > stay in the span of sched_domain. Otherwise, it seems the scheduler is
> running incorrectly to find the right
> > cpu to pull task.
> >
> > In case we have
> > 0 task on cpu0
> > 1 task on cpu1
> > 1 task on cpu2
> > 4 task on cpu3
> >
> > In sched_domain1, cpu1+cpu3 is busy, so cpu0 would try to pull task from
> cpu2 of the group(1-3) because cpu3 is busy,
> > meanwhile, it is an outsider.
> >
> 
> Right, we'd pull from either CPU1 or CPU2 (in this case via a tentative
> active load balance) because they are in the same group as CPU3 which
> inflates the sched_group load stats, but we can't pull from it at this
> domain because it's not included in the domain span.
> 
Thanks
Barry


* [RFC] sched/topology: NUMA topology limitations
@ 2020-08-14 10:15 Valentin Schneider
  0 siblings, 0 replies; 7+ messages in thread
From: Valentin Schneider @ 2020-08-14 10:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen

(For those of you who already got this one: sorry! I messed up LKML and
Vincent's addresses)

Hi,

Some of you may have noticed me struggling to plug some topology holes in
[1]. While digging in there I realized there are some non-obvious
limitations wrt supported NUMA topologies; IMO if we don't feel like they
should be "fixed", they should at the very least be documented somewhere.

I should get to ramble about this at this year's (v)LPC during the
scheduler µconf, but this isn't really trivial so I figured an email
wouldn't hurt. I tried to package it into reasonable segments - this is
still a pretty big wall of text, so use of coffee/tea/beer is recommended.

Spoiler alert: I don't have solutions for these yet.

Diameter issue
==============

Definition
++++++++++

Each unique NUMA distance gets an index in sched_domains_numa_distance[],
and as the big wall of text atop topology.c::build_balance_mask() mentions
this can be seen as an array of hops. i.e. in the following table:

      0  1
  0: 10 20
  1: 20 10

  sched_domains_numa_distance = { 10, 20 };

distance 10 is 0 hops (i.e. identity distance), distance 20 is 1 hop. This
isn't strictly speaking always true, as e.g. two consecutive distance
values could represent the same number of hops (with one path slightly
"longer" than the other), but it makes explaining that mess a bit easier.

I'm using 'diameter' in the NoC meaning of the term, i.e. the longest
shortest path between any two nodes, in # of hops.
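
For illustration, here is a minimal user-space sketch (plain C, not kernel
code) that computes this hop-count diameter from a SLIT-style distance
table. It assumes that exactly the node pairs at the smallest non-identity
distance are direct neighbours - true for the examples in this mail, but
not guaranteed in general, per the caveat above. The table used is the
"segmented bus" one from the case study further down:

  /* Standalone sketch: hop-count diameter of a NUMA distance table. */
  #include <stdio.h>

  #define N   4
  #define INF 1000

  static const int dist[N][N] = {
          { 10, 20, 30, 40 },
          { 20, 10, 20, 30 },
          { 30, 20, 10, 20 },
          { 40, 30, 20, 10 },
  };

  int main(void)
  {
          int hops[N][N], i, j, k, min_remote = INF, diameter = 0;

          /* Smallest non-identity distance == 1 hop (assumption) */
          for (i = 0; i < N; i++)
                  for (j = 0; j < N; j++)
                          if (i != j && dist[i][j] < min_remote)
                                  min_remote = dist[i][j];

          for (i = 0; i < N; i++)
                  for (j = 0; j < N; j++)
                          hops[i][j] = (i == j) ? 0 :
                                       (dist[i][j] == min_remote) ? 1 : INF;

          /* Floyd-Warshall: shortest path (in hops) between all pairs */
          for (k = 0; k < N; k++)
                  for (i = 0; i < N; i++)
                          for (j = 0; j < N; j++)
                                  if (hops[i][k] + hops[k][j] < hops[i][j])
                                          hops[i][j] = hops[i][k] + hops[k][j];

          for (i = 0; i < N; i++)
                  for (j = 0; j < N; j++)
                          if (hops[i][j] > diameter)
                                  diameter = hops[i][j];

          printf("diameter = %d hops\n", diameter); /* 3 for this table */
          return 0;
  }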

Diameter <= 2
+++++++++++++

AFAIU, one of the issues I'm encountering in [1] is that the scheduler
topology code cannot cope with a NUMA diameter > 2. I've looked into what's
already out there, but haven't found any (x86) machine that violates this
rule. Below are some examples.

AMD Epyc
--------

node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10

Here, diameter=2

Opteron 6174
------------

This is the 4-node one, which looks like this:

node  0  1  2  3
  0: 10 20 20 20
  1: 20 10 20 20
  2: 20 20 10 20
  3: 20 20 20 10

  0 ----- 3
  | \   / |
  |   X   |
  | /   \ |
  1 ----- 2

Clearly this is diameter=1

AMD Bulldozer
-------------

I'm using the topology described in decade of wasted cores [2]; I'm not
going to bother reproducing the distance table, the topology looks like:

   _______________________
  /        _____________  \
  |       /              \|
  0 ----- 4 ----- 5 ----- 1
  | \   / | \   / | \   / |
  |   X   |   X   |   X   |
  | /   \ | /   \ | /   \ |
  6 ----- 2 ----- 3 ----- 7
  |       \______________/|
  \_______________________/

Which is diameter=2 (assuming equal edge weights)

Diameter > 2
++++++++++++

I haven't found other NUMA systems out there with a diameter > 2 other than
the one I'm rambling about in [1]. Nevertheless, I think there are some
theoretical setups that aren't too insane (at least less than [1]).

One of them is a 6-node mesh, which is diameter=3 and would look like:

  0 ----- 5
  |       |
  1 ----- 4
  |       |
  2 ----- 3

A more convoluted one would be a 12-node "spidergon"; it's a ring with an
even number of nodes N, each node is connected to both of its neighbours
and there is an interlink between node X and node (X + N/2) % N. I won't
pretend I can draw it in ASCII, but that again gives us a diameter of 3.
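
Since that one is hard to draw, here is a trivial sketch (plain C, N=12
chosen purely as an example size) that enumerates the spidergon's links
from the rule above:

  /* Sketch: links of an N-node "spidergon" (N even): a ring plus an
   * interlink between node X and node (X + N/2) % N.
   */
  #include <stdio.h>

  #define N 12

  int main(void)
  {
          int x;

          for (x = 0; x < N; x++)
                  printf("%2d -- %2d (ring)\n", x, (x + 1) % N);
          for (x = 0; x < N / 2; x++)
                  printf("%2d -- %2d (interlink)\n", x, (x + N / 2) % N);
          return 0;
  }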

Case study w/ QEMU
++++++++++++++++++

Setup
-----

The topology I'll describe below is a simplified version of the one in [1],
as it is the smallest reproducer of the issue. Morten has been calling this
a "segmented bus".

   1 ----- 0
   |
   |
   |
   2 ----- 3

node   0   1   2   3
  0:  10  20  30  40
  1:  20  10  20  30
  2:  30  20  10  20
  3:  40  30  20  10

Diameter here is 3. I can recreate this with the following qemu
incantation:

  $ qemu-system-aarch64 -kernel Image -initrd rootfs.cpio -append \
                      'console=ttyAMA0 earlycon=pl011,0x9000000 root=/dev/vda debug earlyprintk=serial sched_debug' \
                      -smp cores=4 -nographic -m 512 -cpu cortex-a53 -machine virt \
                      -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
                      -numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
                      -numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
                      -numa dist,src=0,dst=3,val=40, -numa dist,src=1,dst=2,val=20, \
                      -numa dist,src=1,dst=3,val=30, -numa dist,src=2,dst=3,val=20

If you boot the above with CONFIG_SCHED_DEBUG, you'll get:

[    0.245896] CPU0 attaching sched-domain(s):
[    0.246133]  domain-0: span=0-1 level=NUMA
[    0.246592]   groups: 0:{ span=0 cap=1011 }, 1:{ span=1 cap=1008 }
[    0.246998]   domain-1: span=0-2 level=NUMA
[    0.247145]    groups: 0:{ span=0-1 mask=0 cap=2019 }, 2:{ span=1-3 mask=2 cap=3025 }
[    0.247454] ERROR: groups don't span domain->span
[    0.247654]    domain-2: span=0-3 level=NUMA
[    0.247892]     groups: 0:{ span=0-2 mask=0 cap=3021 }, 3:{ span=1-3 mask=3 cap=3047 }

[    0.248915] CPU1 attaching sched-domain(s):
[    0.249050]  domain-0: span=0-2 level=NUMA
[    0.249181]   groups: 1:{ span=1 cap=1008 }, 2:{ span=2 cap=1002 }, 0:{ span=0 cap=1011 }
[    0.249661]   domain-1: span=0-3 level=NUMA
[    0.249810]    groups: 1:{ span=0-2 mask=1 cap=3034 }, 3:{ span=2-3 mask=3 cap=2023 }

[    0.250381] CPU2 attaching sched-domain(s):
[    0.250565]  domain-0: span=1-3 level=NUMA
[    0.250710]   groups: 2:{ span=2 cap=1002 }, 3:{ span=3 cap=999 }, 1:{ span=1 cap=1008 }
[    0.250992]   domain-1: span=0-3 level=NUMA
[    0.251129]    groups: 2:{ span=1-3 mask=2 cap=3025 }, 0:{ span=0-1 mask=0 cap=2019 }

[    0.251474] CPU3 attaching sched-domain(s):
[    0.251606]  domain-0: span=2-3 level=NUMA
[    0.251772]   groups: 3:{ span=3 cap=999 }, 2:{ span=2 cap=1002 }
[    0.252439]   domain-1: span=1-3 level=NUMA
[    0.252587]    groups: 3:{ span=2-3 mask=3 cap=2023 }, 1:{ span=0-2 mask=1 cap=3034 }
[    0.252859] ERROR: groups don't span domain->span
[    0.253009]    domain-2: span=0-3 level=NUMA
[    0.253153]     groups: 3:{ span=1-3 mask=3 cap=3047 }, 0:{ span=0-2 mask=0 cap=3021 }

Why we get this
---------------

The sched_domains we get here are NUMA<=20, NUMA<=30, NUMA<=40.

There's no problem with the sched_domain spans, those are directly derived
from sched_domains_numa_masks[][] and that works just fine. We then build
the sched_groups using those spans, see

  build_overlap_sched_groups()

Let's have a look at what happens for the first domain on CPU0 (node0) and
CPU2 (node2). This would be NUMA<=20 (i.e. 1 hop), whose groups are the
spanned CPUs:

  CPU0 (node0)
  NUMA<=20 span=[0-1], groups=(0), (1)

  CPU2 (node2)
  NUMA<=20 span=[1-3], groups=(2), (3), (1)

So far so good, nothing outlandish here. However, when we get to building
NUMA<=30 (i.e. 2 hops), this falls apart:

  CPU0
  NUMA<=30 span=[0-2], groups=(0-1), (1-3) <<< sched_domain_span(CPU2, NUMA<=20)
                               ^^^
                sched_domain_span(CPU0, NUMA<=20)

CPU3 (node3) is very much *not* <= 30 distance (2 hops) away from CPU0
(node0), but we have a group that spans it anyway, because of the
aforementioned code. IOW, we are creating a completely bogus transitivity
relation that sort of looks like:

  For any nodes i, j, k, and hop count x > 1:
    node_distance(i, j) <= sched_domains_numa_distance[x]
    AND
    node_distance(j, k) <= sched_domains_numa_distance[x-1]
    IMPLIES
    node_distance(i, k) <= sched_domains_numa_distance[x]

Which becomes in our example:

   node_distance(0, 2) <= sched_domains_numa_distance[2] # 30
   AND
   node_distance(2, 3) <= sched_domains_numa_distance[1] # 20
   IMPLIES (bogusly!)
   node_distance(0, 3) <= sched_domains_numa_distance[2] # 30

This actually works for diameters == 2 (given everything is covered by 2
hops anyway) but falls apart for bigger diameters, as is the case here.
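
That overflow is easy to reproduce with a toy user-space model of the
group construction over the segmented-bus table above: for a CPU's domain
at hop count x, cover each spanned node j with j's span at hop count x-1,
as per the relation above. This is only a simplification of what
build_overlap_sched_groups() does (the real code walks the sibling's child
domain and skips already-covered CPUs), but it exhibits the same problem:

  /* Toy model, not the kernel implementation. */
  #include <stdio.h>

  #define N 4

  static const int dist[N][N] = {
          { 10, 20, 30, 40 },
          { 20, 10, 20, 30 },
          { 30, 20, 10, 20 },
          { 40, 30, 20, 10 },
  };

  /* Equivalent of sched_domains_numa_distance[] */
  static const int numa_dist[] = { 10, 20, 30, 40 };

  /* span(i, x): bitmask of nodes within numa_dist[x] of node i */
  static unsigned int span(int i, int x)
  {
          unsigned int mask = 0;
          int j;

          for (j = 0; j < N; j++)
                  if (dist[i][j] <= numa_dist[x])
                          mask |= 1u << j;
          return mask;
  }

  int main(void)
  {
          int x = 2;   /* NUMA<=30, i.e. 2 hops */
          int cpu = 0, j;
          unsigned int dspan = span(cpu, x), groups = 0;

          /* Cover each spanned node with its span one level below */
          for (j = 0; j < N; j++)
                  if (dspan & (1u << j))
                          groups |= span(j, x - 1);

          printf("CPU%d NUMA<=%d: domain span=%#x, groups cover %#x\n",
                 cpu, numa_dist[x], dspan, groups);
          /* Prints span=0x7 (nodes 0-2) but groups cover 0xf (node 3 too) */
          return 0;
  }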

Implications of fixing this
---------------------------

Clearly the current sched_group setup for such topologies is not what we
want: this disturbs load balancing on the 'corrupted' domains.

If we *do* want to support systems like this, then we have other
problems to solve. Currently, on the aforementioned QEMU setup, we have:

  CPU0-domain1
    group0: span=0-2, mask=0
    group2: span=1-3, mask=2
  CPU1-domain1
    group1: span=0-2, mask=1
    group3: span=2-3, mask=3
  CPU2-domain1
    group2: span=1-3, mask=2
    group0: span=0-1, mask=0
  CPU3-domain1
    group3: span=2-3, mask=3
    group1: span=0-2, mask=1

We would want to "fix" this into:

  CPU0-domain1
    group0: span=0-2, mask=0
  - group2: span=1-3, mask=2
  + group?: span=1-2, mask=??
  CPU1-domain1
    group1: span=0-2, mask=1
    group3: span=2-3, mask=3
  CPU2-domain1
    group2: span=1-3, mask=2
    group0: span=0-1, mask=0
  CPU3-domain1
    group3: span=2-3, mask=3
  - group1: span=0-2, mask=1
  + group?: span=1-2, mask=??

I've purposely left a blank for the balance mask, because there isn't
really any valid choice. No CPU will have this (1-2) group as its local
group at domain1 (NUMA<=30); worse than that, we actually have now 5 unique
groups at this topology level despite having only 4 CPUs: there are no
CPUs left to use as identifier of that "new" group (i.e. owner of
sched_group_capacity).

Even if we find some form of unique identifier for that group, it still
isn't the local group of any CPU at any topology level. Since
update_group_capacity() only updates the capacity of the local group, we
don't have any way to update the capacity of such groups - unless we start
updating more than the local group, which will get ugly real quick.


That's as far as I got to. If you got to here, thank you for your
time, and as always, comments / feedback are most welcome.

Links
=====

[1]: https://lkml.kernel.org/r/20200324125533.17447-1-valentin.schneider@arm.com/
[2]: https://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf


Thanks,
Valentin


Thread overview: 7+ messages
     [not found] <6a5f06ff4ecb4f34bd7e9890dc07fb99@hisilicon.com>
2020-08-29  5:32 ` [RFC] sched/topology: NUMA topology limitations Song Bao Hua (Barry Song)
2020-08-29 12:33   ` Valentin Schneider
2020-08-31 10:45     ` Song Bao Hua (Barry Song)
2020-09-01  9:40       ` Valentin Schneider
2020-09-04  2:02         ` Song Bao Hua (Barry Song)
2020-08-29  5:42 ` Song Bao Hua (Barry Song)
2020-08-14 10:15 Valentin Schneider
