linux-kernel.vger.kernel.org archive mirror
From: "Song Bao Hua (Barry Song)" <song.bao.hua@hisilicon.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "valentin.schneider@arm.com" <valentin.schneider@arm.com>,
	"vincent.guittot@linaro.org" <vincent.guittot@linaro.org>,
	"mgorman@suse.de" <mgorman@suse.de>,
	"mingo@kernel.org" <mingo@kernel.org>,
	"dietmar.eggemann@arm.com" <dietmar.eggemann@arm.com>,
	"morten.rasmussen@arm.com" <morten.rasmussen@arm.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linuxarm@openeuler.org" <linuxarm@openeuler.org>,
	"xuwei (O)" <xuwei5@huawei.com>,
	"Liguozhu (Kenneth)" <liguozhu@hisilicon.com>,
	"tiantao (H)" <tiantao6@hisilicon.com>,
	wanghuiqiang <wanghuiqiang@huawei.com>,
	"Zengtao (B)" <prime.zeng@hisilicon.com>,
	Jonathan Cameron <jonathan.cameron@huawei.com>,
	"guodong.xu@linaro.org" <guodong.xu@linaro.org>,
	Meelis Roos <mroos@linux.ee>
Subject: RE: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
Date: Tue, 9 Feb 2021 20:58:15 +0000
Message-ID: <4bdaa3e1a54f445fa8e629ea392e7bce@hisilicon.com>
In-Reply-To: <YCKGVBnXzRsE6/Er@hirez.programming.kicks-ass.net>



> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Wednesday, February 10, 2021 1:56 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: valentin.schneider@arm.com; vincent.guittot@linaro.org; mgorman@suse.de;
> mingo@kernel.org; dietmar.eggemann@arm.com; morten.rasmussen@arm.com;
> linux-kernel@vger.kernel.org; linuxarm@openeuler.org; xuwei (O)
> <xuwei5@huawei.com>; Liguozhu (Kenneth) <liguozhu@hisilicon.com>; tiantao (H)
> <tiantao6@hisilicon.com>; wanghuiqiang <wanghuiqiang@huawei.com>; Zengtao (B)
> <prime.zeng@hisilicon.com>; Jonathan Cameron <jonathan.cameron@huawei.com>;
> guodong.xu@linaro.org; Meelis Roos <mroos@linux.ee>
> Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
> 
> On Thu, Feb 04, 2021 at 12:12:01AM +1300, Barry Song wrote:
> > As long as NUMA diameter > 2, building sched_domain by sibling's child
> > domain will definitely create a sched_domain with sched_group which will
> > span out of the sched_domain:
> >
> >                +------+         +------+        +-------+       +------+
> >                | node |  12     |node  | 20     | node  |  12   |node  |
> >                |  0   +---------+1     +--------+ 2     +-------+3     |
> >                +------+         +------+        +-------+       +------+
> >
> > domain0        node0            node1            node2          node3
> >
> > domain1        node0+1          node0+1          node2+3        node2+3
> >                                                  +
> > domain2        node0+1+2                         |
> >              group: node0+1                      |
> >                group:node2+3 <-------------------+
> >
> > when node2 is added into the domain2 of node0, kernel is using the child
> > domain of node2's domain2, which is domain1(node2+3). Node 3 is outside
> > the span of the domain including node0+1+2.
> >
> > This will make load_balance() run based on screwed avg_load and group_type
> > in the sched_group spanning out of the sched_domain, and it also makes
> > select_task_rq_fair() pick an idle CPU out of the sched_domain.
> >
> > Real servers which suffer from this problem include Kunpeng920 and 8-node
> > Sun Fire X4600-M2, at least.
> >
> > Here we move to use the *child* domain of the *child* domain of node2's
> > domain2 as the new added sched_group. At the same time, we re-use the
> > lower level sgc directly.
> >
> >                +------+         +------+        +-------+       +------+
> >                | node |  12     |node  | 20     | node  |  12   |node  |
> >                |  0   +---------+1     +--------+ 2     +-------+3     |
> >                +------+         +------+        +-------+       +------+
> >
> > domain0        node0            node1          +- node2          node3
> >                                                |
> > domain1        node0+1          node0+1        | node2+3        node2+3
> >                                                |
> > domain2        node0+1+2                       |
> >              group: node0+1                    |
> >                group:node2 <-------------------+
> >
> 
> I've finally had a moment to think about this, would it make sense to
> also break up group: node0+1, such that we then end up with 3 groups of
> equal size?

We used to create the sched_groups of sched_domain[n] of node[m] from
1. local group: sched_domain[n-1] of node[m]
2. remote groups: sched_domain[n-1] of node[m]'s siblings
at the same level.
Since the sched_domain[n-1] of only some of node[m]'s siblings can
already cover the whole span of sched_domain[n] of node[m], there is
no need to scan all siblings of node[m]: once sched_domain[n] of
node[m] has been covered, we can stop making more sched_groups. So
the number of sched_groups is small.
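
To illustrate the construction described above, here is a rough Python
sketch (not the kernel code). The SPANS table and build_groups() are
hypothetical names; the spans are copied by hand from the 4-node
diagram in the patch description rather than derived from a real
distance table.

```python
# Hypothetical spans from the 4-node diagram in the patch description.
# SPANS[level][node] is the set of nodes in sched_domain[level] of node.
SPANS = {
    1: {0: {0, 1}, 1: {0, 1}, 2: {2, 3}, 3: {2, 3}},
    2: {0: {0, 1, 2}},
}


def build_groups(node, level):
    """Build the groups of sched_domain[level] of `node` from the
    sched_domain[level - 1] of each not-yet-covered sibling."""
    span = SPANS[level][node]
    covered = set()
    groups = []
    # Walk the span starting at `node`, so the local group comes first.
    for i in sorted(span, key=lambda n: (n < node, n)):
        if i in covered:
            continue  # an earlier group already covers this sibling
        group = SPANS[level - 1][i]  # child domain of sibling i
        groups.append(sorted(group))
        covered |= group
    return groups


print(build_groups(0, 2))  # [[0, 1], [2, 3]]
```

Only two groups are created for node0's domain2 because node1 is
already covered by the local group. Note that the second group also
contains node 3, which is not in the span of that domain.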

So historically, the code has never tried to make the sched_groups
equal in size. And it permits the local group and the remote groups
to overlap.

One issue we face in the original code is that once the topology
reaches 3-hop NUMA, sched_domain[n-1] of node[m]'s siblings might
span outside the range of sched_domain[n] of node[m]. My approach
here is to find a descendant of the sibling to build the remote
groups, which fixes the issue on machines that have this problem and
leaves machines without the 3-hop issue untouched.
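
A minimal Python sketch of the descendant idea, assuming the same
4-node diagram as in the patch description. SPANS and
build_groups_descendant() are hypothetical names used only for
illustration; the real change lives in the kernel's topology code.

```python
# Hypothetical spans from the 4-node diagram in the patch description.
# SPANS[level][node] is the set of nodes in sched_domain[level] of node.
SPANS = {
    0: {0: {0}, 1: {1}, 2: {2}, 3: {3}},
    1: {0: {0, 1}, 1: {0, 1}, 2: {2, 3}, 3: {2, 3}},
    2: {0: {0, 1, 2}},
}


def build_groups_descendant(node, level):
    """Like the original scheme, but descend through sibling i's lower
    domains until one fits inside the span being built."""
    span = SPANS[level][node]
    covered = set()
    groups = []
    for i in sorted(span, key=lambda n: (n < node, n)):
        if i in covered:
            continue
        child = level - 1
        # Step down while the sibling's domain spans outside our span.
        while child > 0 and not SPANS[child][i] <= span:
            child -= 1
        group = SPANS[child][i]
        groups.append(sorted(group))
        covered |= group
    return groups


print(build_groups_descendant(0, 2))  # [[0, 1], [2]]
```

The local group is unchanged because node0's domain1 already fits
inside the span, so a machine where every sibling's child domain fits
gets exactly the groups it had before.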

Valentin sent another RFC that breaks up all remote groups so that
each includes only the remote node instead of sched_domain[n-1] of
the siblings; this eliminates the problem from the very beginning.
One side effect is that it changes all machines, including those
without the 3-hop issue, by creating many more remote sched_groups.
So we both agreed to start with the descendant-sibling (grandchild)
approach first.
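
To make the side effect concrete, here is a small Python comparison on
a made-up 4-node machine that does not have the 3-hop issue. All names
and the topology are assumptions for illustration, and the per-node
variant only follows the one-line description above, not the actual
RFC code.

```python
# Hypothetical machine without the 3-hop issue: two pairs of nodes,
# and every node sees all four nodes at the top level.
LEVEL1 = {0: {0, 1}, 1: {0, 1}, 2: {2, 3}, 3: {2, 3}}
TOP_SPAN = {0, 1, 2, 3}


def groups_from_sibling_children(node):
    """Remote groups are the lower-level domains of the siblings."""
    covered = set()
    groups = []
    for i in sorted(TOP_SPAN, key=lambda n: (n < node, n)):
        if i not in covered:
            groups.append(sorted(LEVEL1[i]))
            covered |= LEVEL1[i]
    return groups


def groups_per_remote_node(node):
    """Local group as before, then one group per remote node."""
    local = LEVEL1[node]
    remote = [[i] for i in sorted(TOP_SPAN - local)]
    return [sorted(local)] + remote


print(groups_from_sibling_children(0))  # [[0, 1], [2, 3]]
print(groups_per_remote_node(0))        # [[0, 1], [2], [3]]
```

Even on this small machine the top-level domain goes from two groups
to three, although nothing was wrong with the original two.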

What you are suggesting seems to be breaking up the local sched_group,
which would create many more local groups. It sounds like a huge
change, even beyond the scope of the original issue we are trying to
fix :-)

> 
> > w/ patch, we don't get "groups don't span domain->span" any more:
> > [    1.486271] CPU0 attaching sched-domain(s):
> > [    1.486820]  domain-0: span=0-1 level=MC
> > [    1.500924]   groups: 0:{ span=0 cap=980 }, 1:{ span=1 cap=994 }
> > [    1.515717]   domain-1: span=0-3 level=NUMA
> > [    1.515903]    groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 }
> > [    1.516989]    domain-2: span=0-5 level=NUMA
> > [    1.517124]     groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 }
> 
> 		     groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3, cap=1989 },
> 4:{ span=4-5, cap=1949 }
> 
> > [    1.517369]     domain-3: span=0-7 level=NUMA
> > [    1.517423]      groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 }
> 
> Let me continue to think about this... it's been a while :/

Sure, thanks!

Barry


