All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Huang, Ying" <ying.huang@intel.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org,
	Valentin Schneider <valentin.schneider@arm.com>,
	Ingo Molnar <mingo@redhat.com>, Mel Gorman <mgorman@suse.de>,
	Rik van Riel <riel@surriel.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Subject: Re: [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes
Date: Tue, 15 Feb 2022 09:29:02 +0800	[thread overview]
Message-ID: <87mtisgboh.fsf@yhuang6-desk2.ccr.corp.intel.com> (raw)
In-Reply-To: <YgpvyE7oV1lZDRQL@hirez.programming.kicks-ass.net> (Peter Zijlstra's message of "Mon, 14 Feb 2022 16:05:44 +0100")

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Feb 14, 2022 at 08:15:52PM +0800, Huang Ying wrote:
>
>> This isn't a practical problem now yet.  Because the PMEM nodes (node
>> 2 and node 3 in example system) are offlined by default during system
>> boot.  So init_numa_topology_type() called during system boot will
>> ignore them and set sched_numa_topology_type to NUMA_DIRECT.  And
>> init_numa_topology_type() is only called at runtime when a CPU of a
>> never-onlined-before node gets plugged in.  And there's no CPU in the
>> PMEM nodes.  But it appears better to fix this to make the code more
>> robust.
>
> IIRC there are pre-existing issues with this; namely the distance_map is
> created for all nodes, online or not, therefore the levels and
> max_distance include the pmem stuff.
>
> At the same time, the numa_topolog_type() uses those values, and the
> only reason it 'worked' is because the combination of arguments fails to
> hit any of the existing types and exits without setting a type,
> defaulting to NUMA_DIRECT by 'accident' of that being type 0 and
> bss/data being 0 initialized.
>
> Also, Power (and possibly other architectures) already have CPU-less
> nodes and are similarly suffering issues.
>
> Anyway, aside from this the patches look like they should do.
>
> There's a few niggles, like using READ_ONCE() on sched_max_numa_distance
> without using WRITE_ONCE() (see below) and having
> sched_domains_numa_distance and sched_domains_numa_masks separate RCU
> variables (that could go side-ways if there were a function using both,
> afaict there isn't and I couldn't be bothered changing that, but it's
> something to keep in mind).

If you think that's necessary, I can work on it with a follow up patch.

> I'll go queue these, thanks!

Thanks!

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1259,11 +1259,10 @@ static bool numa_is_active_node(int nid,
>  
>  /* Handle placement on systems where not all nodes are directly connected. */
>  static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
> -					int maxdist, bool task)
> +					int lim_dist, bool task)
>  {
>  	unsigned long score = 0;
> -	int node;
> -	int sys_max_dist;
> +	int node, max_dist;
>  
>  	/*
>  	 * All nodes are directly connected, and the same distance
> @@ -1273,7 +1272,7 @@ static unsigned long score_nearby_nodes(
>  		return 0;
>  
>  	/* sched_max_numa_distance may be changed in parallel. */
> -	sys_max_dist = READ_ONCE(sched_max_numa_distance);
> +	max_dist = READ_ONCE(sched_max_numa_distance);
>  	/*
>  	 * This code is called for each node, introducing N^2 complexity,
>  	 * which should be ok given the number of nodes rarely exceeds 8.
> @@ -1286,7 +1285,7 @@ static unsigned long score_nearby_nodes(
>  		 * The furthest away nodes in the system are not interesting
>  		 * for placement; nid was already counted.
>  		 */
> -		if (dist >= sys_max_dist || node == nid)
> +		if (dist >= max_dist || node == nid)
>  			continue;
>  
>  		/*
> @@ -1296,8 +1295,7 @@ static unsigned long score_nearby_nodes(
>  		 * "hoplimit", only nodes closer by than "hoplimit" are part
>  		 * of each group. Skip other nodes.
>  		 */
> -		if (sched_numa_topology_type == NUMA_BACKPLANE &&
> -					dist >= maxdist)
> +		if (sched_numa_topology_type == NUMA_BACKPLANE && dist >= lim_dist)
>  			continue;
>  
>  		/* Add up the faults from nearby nodes. */
> @@ -1315,8 +1313,8 @@ static unsigned long score_nearby_nodes(
>  		 * This seems to result in good task placement.
>  		 */
>  		if (sched_numa_topology_type == NUMA_GLUELESS_MESH) {
> -			faults *= (sys_max_dist - dist);
> -			faults /= (sys_max_dist - LOCAL_DISTANCE);
> +			faults *= (max_dist - dist);
> +			faults /= (max_dist - LOCAL_DISTANCE);
>  		}
>  
>  		score += faults;
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1927,7 +1927,7 @@ void sched_init_numa(int offline_node)
>  	sched_domain_topology = tl;
>  
>  	sched_domains_numa_levels = nr_levels;
> -	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
> +	WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
>  
>  	init_numa_topology_type(offline_node);
>  }

Sorry, 0-Day building robot reported another small issue, can you fold
the following patch too?

Best Regards,
Huang, Ying

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 490acd4b835d..d72128151e56 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1947,7 +1947,7 @@ void sched_init_numa(int offline_node)
 		sched_max_numa_distance);
 }
 
-void sched_reset_numa(void)
+static void sched_reset_numa(void)
 {
 	int nr_levels, *distances;
 	struct cpumask ***masks;
-- 
2.30.2


  reply	other threads:[~2022-02-15  1:29 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-14 12:15 [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes Huang Ying
2022-02-14 12:15 ` [PATCH -V3 2/2] NUMA balancing: avoid to migrate task to CPU-less node Huang Ying
2022-02-17 18:56   ` [tip: sched/core] sched/numa: Avoid migrating " tip-bot2 for Huang Ying
2022-03-01 20:54     ` Qian Cai
2022-03-01 20:54       ` Qian Cai
2022-03-02  0:59       ` Huang, Ying
2022-03-02  0:59         ` Huang, Ying
2022-03-02 12:37         ` Qian Cai
2022-03-02 12:37           ` Qian Cai
2022-03-07  5:51         ` Huang, Ying
2022-03-07  5:51           ` Huang, Ying
2022-03-07 13:53           ` Qian Cai
2022-03-07 13:53             ` Qian Cai
2022-03-08  0:40             ` Huang, Ying
2022-03-08  0:40               ` Huang, Ying
2022-03-08  2:05   ` [PATCH -V3 2/2 UPDATE] NUMA balancing: avoid to migrate " Huang, Ying
2022-03-08  2:11     ` Huang, Ying
2022-03-16  0:37       ` Huang, Ying
2022-02-14 15:05 ` [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes Peter Zijlstra
2022-02-15  1:29   ` Huang, Ying [this message]
2022-02-17 18:56 ` [tip: sched/core] sched/numa: Fix " tip-bot2 for Huang Ying

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87mtisgboh.fsf@yhuang6-desk2.ccr.corp.intel.com \
    --to=ying.huang@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=riel@surriel.com \
    --cc=srikar@linux.vnet.ibm.com \
    --cc=valentin.schneider@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.