From: Peter Zijlstra <peterz@infradead.org>
To: mingo@kernel.org, lvenanci@redhat.com
Cc: lwang@redhat.com, riel@redhat.com, efault@gmx.de,
	tglx@linutronix.de, linux-kernel@vger.kernel.org,
	peterz@infradead.org
Subject: [PATCH 14/14] sched/topology: Add a few comments
Date: Fri, 28 Apr 2017 15:20:12 +0200
Message-ID: <20170428132503.059625674@infradead.org>
In-Reply-To: <20170428131958.893188882@infradead.org>

Try and describe what this code is about.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/topology.c |  169 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 162 insertions(+), 7 deletions(-)

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -495,11 +495,96 @@ enum s_alloc {
 };
 
 /*
+ * Return the canonical balance CPU for this group; this is the first CPU
+ * of this group that's also in the iteration mask.
+ *
+ * The iteration mask contains all those CPUs that could actually end up at this
+ * group. See build_group_mask().
+ *
+ * Also see should_we_balance().
+ */
+int group_balance_cpu(struct sched_group *sg)
+{
+	return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
+}
+
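(For illustration: a minimal user-space sketch of the same selection rule,
with plain bitmasks standing in for struct cpumask; the names and values
below are made up and are not kernel API.)

	#include <stdio.h>

	/* First CPU common to the group span and the iteration mask. */
	static int first_and(unsigned long group_span, unsigned long iter_mask)
	{
		unsigned long common = group_span & iter_mask;

		return common ? __builtin_ctzl(common) : -1; /* -1: no CPU */
	}

	int main(void)
	{
		/* Group spans CPUs {0,1,3}; only {1,3} can reach it. */
		printf("balance_cpu = %d\n", first_and(0x0b, 0x0a)); /* 1 */
		return 0;
	}
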
+
+/*
+ * NUMA topology (first read the regular topology blurb below)
+ *
+ * Given a node-distance table, for example:
+ *
+ *   node   0   1   2   3
+ *     0:  10  20  30  20
+ *     1:  20  10  20  30
+ *     2:  30  20  10  20
+ *     3:  20  30  20  10
+ *
+ * which represents a 4 node ring topology like:
+ *
+ *   0 ----- 1
+ *   |       |
+ *   |       |
+ *   |       |
+ *   3 ----- 2
+ *
+ * We want to construct domains and groups to represent this. The way we go
+ * about doing this is to build the domains on 'hops'. For each NUMA level we
+ * construct the mask of all nodes reachable in @level hops.
+ *
+ * For the above NUMA topology that gives 3 levels:
+ *
+ * NUMA-2	0-3		0-3		0-3		0-3
+ *  groups:	{0-1,3},{1-3}	{0-2},{0,2-3}	{1-3},{0-1,3}	{0,2-3},{0-2}
+ *
+ * NUMA-1	0-1,3		0-2		1-3		0,2-3
+ *  groups:	{0},{1},{3}	{0},{1},{2}	{1},{2},{3}	{0},{2},{3}
+ *
+ * NUMA-0	0		1		2		3
+ *
+ *
+ * As can be seen, things don't line up as nicely as with the regular
+ * topology. When we iterate a domain in child-domain chunks, some nodes can
+ * be represented multiple times -- hence the "overlap" naming for this part
+ * of the topology.
+ *
+ * In order to minimize this overlap, we only build enough groups to cover the
+ * domain. For instance, Node-0 NUMA-2 would only get the groups 0-1,3 and 1-3.
+ *
+ * Because:
+ *
+ *  - the first group of each domain is its child domain; this
+ *    gets us the first 0-1,3
+ *  - the only uncovered node is 2, whose child domain is 1-3.
+ *
+ * However, because of the overlap, computing a unique CPU for each group is
+ * more complicated. Consider for instance the groups of Node-1 NUMA-2: both
+ * groups include the CPUs of Node-0, while those CPUs would not in fact ever
+ * end up at those groups (they would end up in group 0-1,3).
+ *
+ * To correct this we have to introduce the group iteration mask. This mask
+ * will contain those CPUs in the group that can reach this group given the
+ * (child) domain tree.
+ *
+ * With this we can once again compute balance_cpu and sched_group_capacity
+ * relations.
+ *
+ * XXX include words on how balance_cpu is unique and therefore can be
+ * used for sched_group_capacity links.
+ */
+
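(For illustration: the per-node masks above can be reproduced from the
distance table with a short user-space sketch; the 10/20/30 level distances
are taken from the example table, and none of this is kernel code.)

	#include <stdio.h>

	#define N 4

	static const int dist[N][N] = {
		{ 10, 20, 30, 20 },
		{ 20, 10, 20, 30 },
		{ 30, 20, 10, 20 },
		{ 20, 30, 20, 10 },
	};

	int main(void)
	{
		static const int level_dist[] = { 10, 20, 30 };
		int level, node, peer;

		/* Level k: all nodes within the k-th unique distance. */
		for (level = 0; level < 3; level++) {
			printf("NUMA-%d:", level);
			for (node = 0; node < N; node++) {
				printf("  {");
				for (peer = 0; peer < N; peer++)
					if (dist[node][peer] <= level_dist[level])
						printf(" %d", peer);
				printf(" }");
			}
			printf("\n");
		}
		return 0;
	}
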
+
+/*
  * Build an iteration mask that can exclude certain CPUs from the upwards
  * domain traversal.
  *
  * Only CPUs that can arrive at this group should be considered to continue
  * balancing.
+ *
+ * We do this during the group creation pass; the group information therefore
+ * isn't complete yet. However, since each group represents a (child) domain,
+ * we can fully construct the mask from the sched_domain bits (which are
+ * already complete).
  */
 static void
 build_group_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask)
@@ -534,14 +619,10 @@ build_group_mask(struct sched_domain *sd
 }
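(For illustration: a user-space sketch of this rule applied to the 4-node
ring example, assuming one CPU per node -- a CPU enters the iteration mask
only if its own child-domain span equals the group's span. Bitmasks stand in
for cpumasks; nothing here is kernel API.)

	#include <stdio.h>

	/* NUMA-1 spans per node in the ring example: 0-1,3  0-2  1-3  0,2-3 */
	static const unsigned long child_span[4] = { 0x0b, 0x07, 0x0e, 0x0d };

	static unsigned long group_mask(unsigned long sg_span)
	{
		unsigned long mask = 0;
		int i;

		for (i = 0; i < 4; i++) {
			if (!(sg_span & (1UL << i)))
				continue;	/* CPU not in this group */
			if (child_span[i] != sg_span)
				continue;	/* would not end up here */
			mask |= 1UL << i;
		}
		return mask;
	}

	int main(void)
	{
		/* The two Node-1 NUMA-2 groups from the blurb above: */
		printf("mask of {0-1,3} = %#lx\n", group_mask(0x0b)); /* 0x1 */
		printf("mask of {0-2}   = %#lx\n", group_mask(0x07)); /* 0x2 */
		return 0;
	}
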
 
 /*
- * Return the canonical balance CPU for this group, this is the first CPU
- * of this group that's also in the iteration mask.
+ * XXX: This creates per-node group entries; since the load-balancer will
+ * immediately access remote memory to construct this group's load-balance
+ * statistics, keeping the groups node-local is of dubious benefit.
  */
-int group_balance_cpu(struct sched_group *sg)
-{
-	return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg));
-}
-
 static struct sched_group *
 build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
 {
@@ -577,6 +658,8 @@ static void init_overlap_sched_group(str
 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 	if (atomic_inc_return(&sg->sgc->ref) == 1)
 		cpumask_copy(sched_group_mask(sg), mask);
+	else
+		WARN_ON_ONCE(!cpumask_equal(sched_group_mask(sg), mask));
 
 	/*
 	 * Initialize sgc->capacity such that even if we mess up the
@@ -647,6 +730,78 @@ build_overlap_sched_groups(struct sched_
 	return -ENOMEM;
 }
 
+
+/*
+ * Package topology (also see the load-balance blurb in fair.c)
+ *
+ * The scheduler builds a tree structure to represent a number of important
+ * topology features. By default (default_topology[]) these include:
+ *
+ *  - Simultaneous multithreading (SMT)
+ *  - Multi-Core Cache (MC)
+ *  - Package (DIE)
+ *
+ * Where the last one more or less denotes everything up to a NUMA node.
+ *
+ * The tree consists of 3 primary data structures:
+ *
+ *	sched_domain -> sched_group -> sched_group_capacity
+ *	    ^ ^             ^ ^
+ *          `-'             `-'
+ *
+ * The sched_domains are per-CPU and have a two-way link (parent & child) and
+ * denote the ever-growing mask of CPUs belonging to that level of topology.
+ *
+ * Each sched_domain has a circular doubly linked list of sched_groups, each
+ * denoting the domains of the level below (or individual CPUs in case of the
+ * first domain level). The sched_group linked by a sched_domain includes the
+ * CPU of that sched_domain [*].
+ *
+ * Take for instance a part with 2 threads/core, 2 cores/cache, 2 caches:
+ *
+ * CPU   0   1   2   3   4   5   6   7
+ *
+ * DIE  [                             ]
+ * MC   [             ] [             ]
+ * SMT  [     ] [     ] [     ] [     ]
+ *
+ *  - or -
+ *
+ * DIE  0-7 0-7 0-7 0-7 0-7 0-7 0-7 0-7
+ * MC   0-3 0-3 0-3 0-3 4-7 4-7 4-7 4-7
+ * SMT  0-1 0-1 2-3 2-3 4-5 4-5 6-7 6-7
+ *
+ * CPU   0   1   2   3   4   5   6   7
+ *
+ * One way to think about it is: sched_domain moves you up and down among these
+ * topology levels, while sched_group moves you sideways through it, at child
+ * domain granularity.
+ *
+ * sched_group_capacity ensures each unique sched_group has shared storage.
+ *
+ * There are two related construction problems, both of which require a CPU
+ * that uniquely identifies each group (for a given domain):
+ *
+ *  - The first is the balance_cpu (see should_we_balance() and the
+ *    load-balance blurb in fair.c); for each group we only want 1 CPU to
+ *    continue balancing at a higher domain.
+ *
+ *  - The second is the sched_group_capacity; we want all identical groups
+ *    to share a single sched_group_capacity.
+ *
+ * These topologies are exclusive by construction: it is impossible for an
+ * SMT thread to belong to multiple cores, or for a core to be part of
+ * multiple caches. Hence there is a very clear and unique location for each
+ * CPU in the hierarchy.
+ *
+ * Therefore computing a unique CPU for each group is trivial (the iteration
+ * mask is redundant and set to all 1s; all CPUs in a group will end up at
+ * _that_ group): we can simply pick the first CPU in each group.
+ *
+ *
+ * [*] in other words, the first group of each domain is its child domain.
+ */
+
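(For illustration: a user-space sketch of why the exclusive case is trivial.
The groups of a domain partition its span, so the first CPU of each group
already identifies it uniquely; the masks encode the 8-CPU example above and
the names are made up, not kernel API.)

	#include <stdio.h>

	static void show(const char *dom, const unsigned long *groups, int nr)
	{
		int g;

		printf("%s:", dom);
		for (g = 0; g < nr; g++)
			printf("  %#04lx -> balance_cpu %d",
			       groups[g], __builtin_ctzl(groups[g]));
		printf("\n");
	}

	int main(void)
	{
		const unsigned long die_groups[] = { 0x0f, 0xf0 }; /* MC spans  */
		const unsigned long mc0_groups[] = { 0x03, 0x0c }; /* SMT spans */

		show("DIE 0-7", die_groups, 2);
		show("MC 0-3", mc0_groups, 2);
		return 0;
	}
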
 static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
 {
 	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
