All of lore.kernel.org
 help / color / mirror / Atom feed
From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Valentin Schneider <valentin.schneider@arm.com>
Cc: Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	LKML <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	Rik van Riel <riel@surriel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	linuxppc-dev@lists.ozlabs.org,
	Nathan Lynch <nathanl@linux.ibm.com>,
	Gautham R Shenoy <ego@linux.vnet.ibm.com>,
	Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>,
	Laurent Dufour <ldufour@linux.ibm.com>
Subject: Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
Date: Tue, 10 Aug 2021 17:17:27 +0530	[thread overview]
Message-ID: <20210810114727.GB21942@linux.vnet.ibm.com> (raw)
In-Reply-To: <875yweaig9.mognet@arm.com>

* Valentin Schneider <valentin.schneider@arm.com> [2021-08-09 13:52:38]:

> On 09/08/21 12:22, Srikar Dronamraju wrote:
> > * Valentin Schneider <valentin.schneider@arm.com> [2021-08-08 16:56:47]:
> >> Wait, doesn't the distance matrix (without any offline node) say
> >>
> >>   distance(0, 3) == 40
> >>
> >> ? We should have at the very least:
> >>
> >>   node   0   1   2   3
> >>     0:  10  20  ??  40
> >>     1:  20  20  ??  40
> >>     2:  ??  ??  ??  ??
> >>     3:  40  40  ??  10
> >>
> >
> > Before onlining node 3 and CPU 3 (node/CPU 0 and 1 are already online)
> > Note: Node 2-7 and CPU 2-7 are still offline.
> >
> > node   0   1   2   3
> >   0:  10  20  40  10
> >   1:  20  20  40  10
> >   2:  40  40  10  10
> >   3:  10  10  10  10
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(2) == 0
> > NODE->mask(3) == 0
> >
> > Note: This is with updating Node 2's distance as 40 for figuring out
> > the number of numa levels. Since we have all possible distances, we
> > dont update Node 3 distance, so it will be as if its local to node 0.
> >
> > Now when Node 3 and CPU 3 are onlined
> > Note: Node 2, 3-7 and CPU 2, 3-7 are still offline.
> >
> > node   0   1   2   3
> >   0:  10  20  40  40
> >   1:  20  20  40  40
> >   2:  40  40  10  40
> >   3:  40  40  40  10
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(2) == 0
> > NODE->mask(3) == 0,3
> >
> > CPU 0 continues to be part of Node->mask(3) because when we online and
> > we find the right distance, there is no API to reset the numa mask of
> > 3 to remove CPU 0 from the numa masks.
> >
> > If we had an API to clear/set sched_domains_numa_masks[node][] when
> > the node state changes, we could probably plug-in to clear/set the
> > node masks whenever node state changes.
> >
> 
> Gotcha, this is now coming back to me...
> 
> [...]
> 
> >> Ok, so it looks like we really can't do without that part - even if we get
> >> "sensible" distance values for the online nodes, we can't divine values for
> >> the offline ones.
> >>
> >
> > Yes
> >
> 
> Argh, while your approach does take care of the masks, it leaves
> sched_numa_topology_type unchanged. You *can* force an update of it, but
> yuck :(
> 
> I got to the below...
> 

Yes, I completely missed that we should update sched_numa_topology_type.


> ---
> From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Date: Thu, 1 Jul 2021 09:45:51 +0530
> Subject: [PATCH 1/1] sched/topology: Skip updating masks for non-online nodes
> 
> The scheduler currently expects NUMA node distances to be stable from init
> onwards, and as a consequence builds the related data structures
> once-and-for-all at init (see sched_init_numa()).
> 
> Unfortunately, on some architectures node distance is unreliable for
> offline nodes and may very well change upon onlining.
> 
> Skip over offline nodes during sched_init_numa(). Track nodes that have
> been onlined at least once, and trigger a build of a node's NUMA masks when
> it is first onlined post-init.
> 

Your version is much much better than mine.
And I have verified that it works as expected.


> Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/topology.c | 65 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 65 insertions(+)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b77ad49dc14f..cba95793a9b7 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1482,6 +1482,8 @@ int				sched_max_numa_distance;
>  static int			*sched_domains_numa_distance;
>  static struct cpumask		***sched_domains_numa_masks;
>  int __read_mostly		node_reclaim_distance = RECLAIM_DISTANCE;
> +
> +static unsigned long __read_mostly *sched_numa_onlined_nodes;
>  #endif
> 
>  /*
> @@ -1833,6 +1835,16 @@ void sched_init_numa(void)
>  			sched_domains_numa_masks[i][j] = mask;
> 
>  			for_each_node(k) {
> +				/*
> +				 * Distance information can be unreliable for
> +				 * offline nodes, defer building the node
> +				 * masks to its bringup.
> +				 * This relies on all unique distance values
> +				 * still being visible at init time.
> +				 */
> +				if (!node_online(j))
> +					continue;
> +
>  				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
>  					sched_numa_warn("Node-distance not symmetric");
> 
> @@ -1886,6 +1898,53 @@ void sched_init_numa(void)
>  	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
> 
>  	init_numa_topology_type();
> +
> +	sched_numa_onlined_nodes = bitmap_alloc(nr_node_ids, GFP_KERNEL);
> +	if (!sched_numa_onlined_nodes)
> +		return;
> +
> +	bitmap_zero(sched_numa_onlined_nodes, nr_node_ids);
> +	for_each_online_node(i)
> +		bitmap_set(sched_numa_onlined_nodes, i, 1);
> +}
> +
> +void __sched_domains_numa_masks_set(unsigned int node)
> +{
> +	int i, j;
> +
> +	/*
> +	 * NUMA masks are not built for offline nodes in sched_init_numa().
> +	 * Thus, when a CPU of a never-onlined-before node gets plugged in,
> +	 * adding that new CPU to the right NUMA masks is not sufficient: the
> +	 * masks of that CPU's node must also be updated.
> +	 */
> +	if (test_bit(node, sched_numa_onlined_nodes))
> +		return;
> +
> +	bitmap_set(sched_numa_onlined_nodes, node, 1);
> +
> +	for (i = 0; i < sched_domains_numa_levels; i++) {
> +		for (j = 0; j < nr_node_ids; j++) {
> +			if (!node_online(j) || node == j)
> +				continue;
> +
> +			if (node_distance(j, node) > sched_domains_numa_distance[i])
> +				continue;
> +
> +			/* Add remote nodes in our masks */
> +			cpumask_or(sched_domains_numa_masks[i][node],
> +				   sched_domains_numa_masks[i][node],
> +				   sched_domains_numa_masks[0][j]);
> +		}
> +	}
> +
> +	/*
> +	 * A new node has been brought up, potentially changing the topology
> +	 * classification.
> +	 *
> +	 * Note that this is racy vs any use of sched_numa_topology_type :/
> +	 */
> +	init_numa_topology_type();
>  }
> 
>  void sched_domains_numa_masks_set(unsigned int cpu)
> @@ -1893,8 +1952,14 @@ void sched_domains_numa_masks_set(unsigned int cpu)
>  	int node = cpu_to_node(cpu);
>  	int i, j;
> 
> +	__sched_domains_numa_masks_set(node);
> +
>  	for (i = 0; i < sched_domains_numa_levels; i++) {
>  		for (j = 0; j < nr_node_ids; j++) {
> +			if (!node_online(j))
> +				continue;
> +
> +			/* Set ourselves in the remote node's masks */
>  			if (node_distance(j, node) <= sched_domains_numa_distance[i])
>  				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
>  		}
> -- 
> 2.25.1
> 

-- 
Thanks and Regards
Srikar Dronamraju

WARNING: multiple messages have this Message-ID (diff)
From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Valentin Schneider <valentin.schneider@arm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>,
	Gautham R Shenoy <ego@linux.vnet.ibm.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Rik van Riel <riel@surriel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linuxppc-dev@lists.ozlabs.org,
	Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Laurent Dufour <ldufour@linux.ibm.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Ingo Molnar <mingo@kernel.org>
Subject: Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
Date: Tue, 10 Aug 2021 17:17:27 +0530	[thread overview]
Message-ID: <20210810114727.GB21942@linux.vnet.ibm.com> (raw)
In-Reply-To: <875yweaig9.mognet@arm.com>

* Valentin Schneider <valentin.schneider@arm.com> [2021-08-09 13:52:38]:

> On 09/08/21 12:22, Srikar Dronamraju wrote:
> > * Valentin Schneider <valentin.schneider@arm.com> [2021-08-08 16:56:47]:
> >> Wait, doesn't the distance matrix (without any offline node) say
> >>
> >>   distance(0, 3) == 40
> >>
> >> ? We should have at the very least:
> >>
> >>   node   0   1   2   3
> >>     0:  10  20  ??  40
> >>     1:  20  20  ??  40
> >>     2:  ??  ??  ??  ??
> >>     3:  40  40  ??  10
> >>
> >
> > Before onlining node 3 and CPU 3 (node/CPU 0 and 1 are already online)
> > Note: Node 2-7 and CPU 2-7 are still offline.
> >
> > node   0   1   2   3
> >   0:  10  20  40  10
> >   1:  20  20  40  10
> >   2:  40  40  10  10
> >   3:  10  10  10  10
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(2) == 0
> > NODE->mask(3) == 0
> >
> > Note: This is with updating Node 2's distance as 40 for figuring out
> > the number of numa levels. Since we have all possible distances, we
> > dont update Node 3 distance, so it will be as if its local to node 0.
> >
> > Now when Node 3 and CPU 3 are onlined
> > Note: Node 2, 3-7 and CPU 2, 3-7 are still offline.
> >
> > node   0   1   2   3
> >   0:  10  20  40  40
> >   1:  20  20  40  40
> >   2:  40  40  10  40
> >   3:  40  40  40  10
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(2) == 0
> > NODE->mask(3) == 0,3
> >
> > CPU 0 continues to be part of Node->mask(3) because when we online and
> > we find the right distance, there is no API to reset the numa mask of
> > 3 to remove CPU 0 from the numa masks.
> >
> > If we had an API to clear/set sched_domains_numa_masks[node][] when
> > the node state changes, we could probably plug-in to clear/set the
> > node masks whenever node state changes.
> >
> 
> Gotcha, this is now coming back to me...
> 
> [...]
> 
> >> Ok, so it looks like we really can't do without that part - even if we get
> >> "sensible" distance values for the online nodes, we can't divine values for
> >> the offline ones.
> >>
> >
> > Yes
> >
> 
> Argh, while your approach does take care of the masks, it leaves
> sched_numa_topology_type unchanged. You *can* force an update of it, but
> yuck :(
> 
> I got to the below...
> 

Yes, I completely missed that we should update sched_numa_topology_type.


> ---
> From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Date: Thu, 1 Jul 2021 09:45:51 +0530
> Subject: [PATCH 1/1] sched/topology: Skip updating masks for non-online nodes
> 
> The scheduler currently expects NUMA node distances to be stable from init
> onwards, and as a consequence builds the related data structures
> once-and-for-all at init (see sched_init_numa()).
> 
> Unfortunately, on some architectures node distance is unreliable for
> offline nodes and may very well change upon onlining.
> 
> Skip over offline nodes during sched_init_numa(). Track nodes that have
> been onlined at least once, and trigger a build of a node's NUMA masks when
> it is first onlined post-init.
> 

Your version is much much better than mine.
And I have verified that it works as expected.


> Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/topology.c | 65 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 65 insertions(+)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b77ad49dc14f..cba95793a9b7 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1482,6 +1482,8 @@ int				sched_max_numa_distance;
>  static int			*sched_domains_numa_distance;
>  static struct cpumask		***sched_domains_numa_masks;
>  int __read_mostly		node_reclaim_distance = RECLAIM_DISTANCE;
> +
> +static unsigned long __read_mostly *sched_numa_onlined_nodes;
>  #endif
> 
>  /*
> @@ -1833,6 +1835,16 @@ void sched_init_numa(void)
>  			sched_domains_numa_masks[i][j] = mask;
> 
>  			for_each_node(k) {
> +				/*
> +				 * Distance information can be unreliable for
> +				 * offline nodes, defer building the node
> +				 * masks to its bringup.
> +				 * This relies on all unique distance values
> +				 * still being visible at init time.
> +				 */
> +				if (!node_online(j))
> +					continue;
> +
>  				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
>  					sched_numa_warn("Node-distance not symmetric");
> 
> @@ -1886,6 +1898,53 @@ void sched_init_numa(void)
>  	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
> 
>  	init_numa_topology_type();
> +
> +	sched_numa_onlined_nodes = bitmap_alloc(nr_node_ids, GFP_KERNEL);
> +	if (!sched_numa_onlined_nodes)
> +		return;
> +
> +	bitmap_zero(sched_numa_onlined_nodes, nr_node_ids);
> +	for_each_online_node(i)
> +		bitmap_set(sched_numa_onlined_nodes, i, 1);
> +}
> +
> +void __sched_domains_numa_masks_set(unsigned int node)
> +{
> +	int i, j;
> +
> +	/*
> +	 * NUMA masks are not built for offline nodes in sched_init_numa().
> +	 * Thus, when a CPU of a never-onlined-before node gets plugged in,
> +	 * adding that new CPU to the right NUMA masks is not sufficient: the
> +	 * masks of that CPU's node must also be updated.
> +	 */
> +	if (test_bit(node, sched_numa_onlined_nodes))
> +		return;
> +
> +	bitmap_set(sched_numa_onlined_nodes, node, 1);
> +
> +	for (i = 0; i < sched_domains_numa_levels; i++) {
> +		for (j = 0; j < nr_node_ids; j++) {
> +			if (!node_online(j) || node == j)
> +				continue;
> +
> +			if (node_distance(j, node) > sched_domains_numa_distance[i])
> +				continue;
> +
> +			/* Add remote nodes in our masks */
> +			cpumask_or(sched_domains_numa_masks[i][node],
> +				   sched_domains_numa_masks[i][node],
> +				   sched_domains_numa_masks[0][j]);
> +		}
> +	}
> +
> +	/*
> +	 * A new node has been brought up, potentially changing the topology
> +	 * classification.
> +	 *
> +	 * Note that this is racy vs any use of sched_numa_topology_type :/
> +	 */
> +	init_numa_topology_type();
>  }
> 
>  void sched_domains_numa_masks_set(unsigned int cpu)
> @@ -1893,8 +1952,14 @@ void sched_domains_numa_masks_set(unsigned int cpu)
>  	int node = cpu_to_node(cpu);
>  	int i, j;
> 
> +	__sched_domains_numa_masks_set(node);
> +
>  	for (i = 0; i < sched_domains_numa_levels; i++) {
>  		for (j = 0; j < nr_node_ids; j++) {
> +			if (!node_online(j))
> +				continue;
> +
> +			/* Set ourselves in the remote node's masks */
>  			if (node_distance(j, node) <= sched_domains_numa_distance[i])
>  				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
>  		}
> -- 
> 2.25.1
> 

-- 
Thanks and Regards
Srikar Dronamraju

  reply	other threads:[~2021-08-10 11:48 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-01  4:15 [PATCH v2 0/2] Skip numa distance for offline nodes Srikar Dronamraju
2021-07-01  4:15 ` Srikar Dronamraju
2021-07-01  4:15 ` [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes Srikar Dronamraju
2021-07-01  4:15   ` Srikar Dronamraju
2021-07-01 14:28   ` Valentin Schneider
2021-07-01 14:28     ` Valentin Schneider
2021-07-12 12:48     ` Srikar Dronamraju
2021-07-12 12:48       ` Srikar Dronamraju
2021-07-13 16:32       ` Valentin Schneider
2021-07-13 16:32         ` Valentin Schneider
2021-07-23 14:39         ` Srikar Dronamraju
2021-07-23 14:39           ` Srikar Dronamraju
2021-08-04 10:01           ` Srikar Dronamraju
2021-08-04 10:01             ` Srikar Dronamraju
2021-08-04 10:20             ` Valentin Schneider
2021-08-04 10:20               ` Valentin Schneider
2021-08-08 15:56           ` Valentin Schneider
2021-08-08 15:56             ` Valentin Schneider
2021-08-09  6:52             ` Srikar Dronamraju
2021-08-09  6:52               ` Srikar Dronamraju
2021-08-09 12:52               ` Valentin Schneider
2021-08-09 12:52                 ` Valentin Schneider
2021-08-10 11:47                 ` Srikar Dronamraju [this message]
2021-08-10 11:47                   ` Srikar Dronamraju
2021-08-16 10:33                   ` Srikar Dronamraju
2021-08-16 10:33                     ` Srikar Dronamraju
2021-08-17  0:01                     ` Valentin Schneider
2021-08-17  0:01                       ` Valentin Schneider
2021-07-01  4:15 ` [PATCH v2 2/2] powerpc/numa: Fill distance_lookup_table for offline nodes Srikar Dronamraju
2021-07-01  4:15   ` Srikar Dronamraju
2021-07-01  9:36   ` kernel test robot
2021-07-01  9:36     ` kernel test robot
2021-07-01 10:20   ` kernel test robot
2021-07-01 10:20     ` kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210810114727.GB21942@linux.vnet.ibm.com \
    --to=srikar@linux.vnet.ibm.com \
    --cc=Geetika.Moolchandani1@ibm.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=ego@linux.vnet.ibm.com \
    --cc=ldufour@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@kernel.org \
    --cc=mpe@ellerman.id.au \
    --cc=nathanl@linux.ibm.com \
    --cc=peterz@infradead.org \
    --cc=riel@surriel.com \
    --cc=tglx@linutronix.de \
    --cc=valentin.schneider@arm.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.