From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Valentin Schneider <valentin.schneider@arm.com>
Cc: Ingo Molnar <mingo@kernel.org>, Peter Zijlstra <peterz@infradead.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	LKML <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	Rik van Riel <riel@surriel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	linuxppc-dev@lists.ozlabs.org,
	Nathan Lynch <nathanl@linux.ibm.com>,
	Gautham R Shenoy <ego@linux.vnet.ibm.com>,
	Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>,
	Laurent Dufour <ldufour@linux.ibm.com>
Subject: Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
Date: Tue, 10 Aug 2021 17:17:27 +0530
Message-ID: <20210810114727.GB21942@linux.vnet.ibm.com>
In-Reply-To: <875yweaig9.mognet@arm.com>

* Valentin Schneider <valentin.schneider@arm.com> [2021-08-09 13:52:38]:

> On 09/08/21 12:22, Srikar Dronamraju wrote:
> > * Valentin Schneider <valentin.schneider@arm.com> [2021-08-08 16:56:47]:
> >> Wait, doesn't the distance matrix (without any offline node) say
> >>
> >>   distance(0, 3) == 40
> >>
> >> ? We should have at the very least:
> >>
> >>   node   0   1   2   3
> >>     0:  10  20  ??  40
> >>     1:  20  20  ??  40
> >>     2:  ??  ??  ??  ??
> >>     3:  40  40  ??  10
> >>
> >
> > Before onlining node 3 and CPU 3 (node/CPU 0 and 1 are already online)
> > Note: Node 2-7 and CPU 2-7 are still offline.
> >
> > node   0   1   2   3
> >   0:  10  20  40  10
> >   1:  20  20  40  10
> >   2:  40  40  10  10
> >   3:  10  10  10  10
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(2) == 0
> > NODE->mask(3) == 0
> >
> > Note: This is with updating Node 2's distance as 40 for figuring out
> > the number of numa levels. Since we have all possible distances, we
> > dont update Node 3 distance, so it will be as if its local to node 0.
> >
> > Now when Node 3 and CPU 3 are onlined
> > Note: Node 2, 3-7 and CPU 2, 3-7 are still offline.
> >
> > node   0   1   2   3
> >   0:  10  20  40  40
> >   1:  20  20  40  40
> >   2:  40  40  10  40
> >   3:  40  40  40  10
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(2) == 0
> > NODE->mask(3) == 0,3
> >
> > CPU 0 continues to be part of Node->mask(3) because when we online and
> > we find the right distance, there is no API to reset the numa mask of
> > 3 to remove CPU 0 from the numa masks.
> >
> > If we had an API to clear/set sched_domains_numa_masks[node][] when
> > the node state changes, we could probably plug-in to clear/set the
> > node masks whenever node state changes.
> >
>
> Gotcha, this is now coming back to me...
>
> [...]
>
> >> Ok, so it looks like we really can't do without that part - even if we get
> >> "sensible" distance values for the online nodes, we can't divine values for
> >> the offline ones.
> >>
> >
> > Yes
> >
>
> Argh, while your approach does take care of the masks, it leaves
> sched_numa_topology_type unchanged. You *can* force an update of it, but
> yuck :(
>
> I got to the below...
>

Yes, I completely missed that we should update sched_numa_topology_type.

> ---
> From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Date: Thu, 1 Jul 2021 09:45:51 +0530
> Subject: [PATCH 1/1] sched/topology: Skip updating masks for non-online nodes
>
> The scheduler currently expects NUMA node distances to be stable from init
> onwards, and as a consequence builds the related data structures
> once-and-for-all at init (see sched_init_numa()).
>
> Unfortunately, on some architectures node distance is unreliable for
> offline nodes and may very well change upon onlining.
>
> Skip over offline nodes during sched_init_numa(). Track nodes that have
> been onlined at least once, and trigger a build of a node's NUMA masks when
> it is first onlined post-init.
>

Your version is much much better than mine.
And I have verified that it works as expected.

> Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/topology.c | 65 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 65 insertions(+)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b77ad49dc14f..cba95793a9b7 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1482,6 +1482,8 @@ int sched_max_numa_distance;
>  static int *sched_domains_numa_distance;
>  static struct cpumask ***sched_domains_numa_masks;
>  int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
> +
> +static unsigned long __read_mostly *sched_numa_onlined_nodes;
>  #endif
>
>  /*
> @@ -1833,6 +1835,16 @@ void sched_init_numa(void)
>  			sched_domains_numa_masks[i][j] = mask;
>
>  			for_each_node(k) {
> +				/*
> +				 * Distance information can be unreliable for
> +				 * offline nodes, defer building the node
> +				 * masks to its bringup.
> +				 * This relies on all unique distance values
> +				 * still being visible at init time.
> +				 */
> +				if (!node_online(j))
> +					continue;
> +
>  				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
>  					sched_numa_warn("Node-distance not symmetric");
>
> @@ -1886,6 +1898,53 @@ void sched_init_numa(void)
>  	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
>
>  	init_numa_topology_type();
> +
> +	sched_numa_onlined_nodes = bitmap_alloc(nr_node_ids, GFP_KERNEL);
> +	if (!sched_numa_onlined_nodes)
> +		return;
> +
> +	bitmap_zero(sched_numa_onlined_nodes, nr_node_ids);
> +	for_each_online_node(i)
> +		bitmap_set(sched_numa_onlined_nodes, i, 1);
> +}
> +
> +void __sched_domains_numa_masks_set(unsigned int node)
> +{
> +	int i, j;
> +
> +	/*
> +	 * NUMA masks are not built for offline nodes in sched_init_numa().
> +	 * Thus, when a CPU of a never-onlined-before node gets plugged in,
> +	 * adding that new CPU to the right NUMA masks is not sufficient: the
> +	 * masks of that CPU's node must also be updated.
> +	 */
> +	if (test_bit(node, sched_numa_onlined_nodes))
> +		return;
> +
> +	bitmap_set(sched_numa_onlined_nodes, node, 1);
> +
> +	for (i = 0; i < sched_domains_numa_levels; i++) {
> +		for (j = 0; j < nr_node_ids; j++) {
> +			if (!node_online(j) || node == j)
> +				continue;
> +
> +			if (node_distance(j, node) > sched_domains_numa_distance[i])
> +				continue;
> +
> +			/* Add remote nodes in our masks */
> +			cpumask_or(sched_domains_numa_masks[i][node],
> +				   sched_domains_numa_masks[i][node],
> +				   sched_domains_numa_masks[0][j]);
> +		}
> +	}
> +
> +	/*
> +	 * A new node has been brought up, potentially changing the topology
> +	 * classification.
> +	 *
> +	 * Note that this is racy vs any use of sched_numa_topology_type :/
> +	 */
> +	init_numa_topology_type();
>  }
>
>  void sched_domains_numa_masks_set(unsigned int cpu)
> @@ -1893,8 +1952,14 @@ void sched_domains_numa_masks_set(unsigned int cpu)
>  	int node = cpu_to_node(cpu);
>  	int i, j;
>
> +	__sched_domains_numa_masks_set(node);
> +
>  	for (i = 0; i < sched_domains_numa_levels; i++) {
>  		for (j = 0; j < nr_node_ids; j++) {
> +			if (!node_online(j))
> +				continue;
> +
> +			/* Set ourselves in the remote node's masks */
>  			if (node_distance(j, node) <= sched_domains_numa_distance[i])
>  				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
>  		}
> --
> 2.25.1
>

-- 
Thanks and Regards
Srikar Dronamraju