From: "Song Bao Hua (Barry Song)" <song.bao.hua@hisilicon.com>
To: Valentin Schneider <valentin.schneider@arm.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Morten Rasmussen <morten.rasmussen@arm.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	"linuxarm@openeuler.org" <linuxarm@openeuler.org>
Subject: RE: [RFC PATCH] sched/fair: first try to fix the scheduling impact of NUMA diameter > 2
Date: Mon, 25 Jan 2021 21:55:40 +0000
Message-ID: <803c439c1d1f435bb22a6ef6c0c2d99e@hisilicon.com>
In-Reply-To: <jhjwnw11ak2.mognet@arm.com>



> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> Sent: Tuesday, January 26, 2021 1:11 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; Vincent Guittot
> <vincent.guittot@linaro.org>; Mel Gorman <mgorman@suse.de>
> Cc: Ingo Molnar <mingo@kernel.org>; Peter Zijlstra <peterz@infradead.org>;
> Dietmar Eggemann <dietmar.eggemann@arm.com>; Morten Rasmussen
> <morten.rasmussen@arm.com>; linux-kernel <linux-kernel@vger.kernel.org>;
> linuxarm@openeuler.org
> Subject: RE: [RFC PATCH] sched/fair: first try to fix the scheduling impact
> of NUMA diameter > 2
> 
> On 25/01/21 03:13, Song Bao Hua (Barry Song) wrote:
> > As long as NUMA diameter > 2, building the sched_domain from the sibling's
> > child domain will definitely create a sched_domain containing a sched_group
> > which spans outside the sched_domain:
> >                +------+         +------+        +-------+       +------+
> >                | node |  12     |node  | 20     | node  |  12   |node  |
> >                |  0   +---------+1     +--------+ 2     +-------+3     |
> >                +------+         +------+        +-------+       +------+
> >
> > domain0        node0            node1            node2          node3
> >
> > domain1        node0+1          node0+1          node2+3        node2+3
> >                                                  +
> > domain2        node0+1+2                         |
> >              group: node0+1                      |
> >                group:node2+3 <-------------------+
> >
> > When node2 is added into domain2 of node0, the kernel uses the child
> > domain of node2's domain2, which is domain1 (node2+3). Node 3 is outside
> > the span of node0+1+2.
> >
> > Will we move to use the *child* domain of the *child* domain of node2's
> > domain2 to build the sched_group?
> >
> > I mean:
> >                +------+         +------+        +-------+       +------+
> >                | node |  12     |node  | 20     | node  |  12   |node  |
> >                |  0   +---------+1     +--------+ 2     +-------+3     |
> >                +------+         +------+        +-------+       +------+
> >
> > domain0        node0            node1          +- node2          node3
> >                                                |
> > domain1        node0+1          node0+1        | node2+3        node2+3
> >                                                |
> > domain2        node0+1+2                       |
> >              group: node0+1                    |
> >                group:node2 <-------------------+
> >
> > In this way, it seems we don't have to create a new group as we are just
> > reusing the existing group?
> >
> 
> One thing I've been musing over is pretty much this; that is to say we
> would make all non-local NUMA sched_groups span a single node. This would
> let us reuse an existing span+sched_group_capacity: the local group of that
> node at its first NUMA topology level.
> 
> Essentially this means getting rid of the overlapping groups, and the
> balance mask is handled the same way as for !NUMA, i.e. it's the local
> group span. I've not gone far enough through the thought experiment to see
> where it miserably falls apart... It is at the very least violating the
> expectation that a group span is a child domain's span - here it can be a
> grand^x child domain's span.
> 
> 
> If we take your topology, we currently have:
> 
> | tl\node | 0            | 1             | 2             | 3            |
> |---------+--------------+---------------+---------------+--------------|
> | NUMA0   | (0)->(1)     | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2)     |
> | NUMA1   | (0-1)->(1-3) | (0-2)->(2-3)  | (1-3)->(0-1)  | (2-3)->(0-2) |
> | NUMA2   | (0-2)->(1-3) | N/A           | N/A           | (1-3)->(0-2) |
> 
> With the current overlapping group scheme, we would need to make it look
> like so:
> 
> | tl\node | 0             | 1             | 2             | 3             |
> |---------+---------------+---------------+---------------+---------------|
> | NUMA0   | (0)->(1)      | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2)      |
> | NUMA1   | (0-1)->(1-2)* | (0-2)->(2-3)  | (1-3)->(0-1)  | (2-3)->(1-2)* |
> | NUMA2   | (0-2)->(1-3)  | N/A           | N/A           | (1-3)->(0-2)  |
> 
> But as already discussed, that's tricky to make work. With the node-span
> groups thing, we would turn this into:
> 
> | tl\node | 0          | 1             | 2             | 3          |
> |---------+------------+---------------+---------------+------------|
> | NUMA0   | (0)->(1)   | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2)   |
> | NUMA1   | (0-1)->(2) | (0-2)->(3)    | (1-3)->(0)    | (2-3)->(1) |
> | NUMA2   | (0-2)->(3) | N/A           | N/A           | (1-3)->(0) |

Actually I didn't mean going that far. What I was thinking is that we
only fix a sched_domain when its sched_group isn't a subset of the
sched_domain; sched_domains which don't have the group-span issue are
left untouched. So NUMA1 would change as in your diagram, but NUMA2
would stay the same. The concept is something like:

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1040,6 +1040,19 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
                }

                sg_span = sched_group_span(sg);
+#if 1
+               /*
+                * If the group built from this sibling spans outside the
+                * domain being built, rebuild it from the sibling's child
+                * domain so the group stays inside the domain's span.
+                */
+               if (sibling->child && !cpumask_subset(sg_span, span)) {
+                       sg = build_group_from_child_sched_domain(sibling->child, cpu);
+                       ...
+                       sg_span = sched_group_span(sg);
+               }
+#endif
                cpumask_or(covered, covered, sg_span);

Thanks
Barry
