From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>, linuxppc-dev@lists.ozlabs.org
Subject: Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
Date: Tue, 4 Aug 2020 12:55:07 +0530
Message-ID: <20200804072507.GI24375@linux.vnet.ibm.com>
In-Reply-To: <87h7tl162y.fsf@linux.ibm.com>

* Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [2020-08-02 19:51:41]:
> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > * Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [2020-07-31 16:49:14]:
> >
> >
> > If it's just to eliminate node 0, then we have 2 other, probably better,
> > solutions:
> > 1. Don't mark node 0 as special (currently still in the mm tree and as a
> > result in linux-next)
> > 2. powerpc specific: explicitly clear node 0 during numa bringup.
> >
> 
> 
> I am not sure I consider them better. But yes, those patches are good
> and also resolve the node 0 initialization when the firmware didn't
> indicate the presence of such a node.
> 
> This patch in addition makes sure that we get the same topology report
> across reboots on virtualized partitions as long as the cpu/memory
> ratio per PowerVM domain remains the same. This should also help to
> avoid confusion after an LPM migration once we start applying topology
> updates.
> 

What do we mean by cpu/memory ratio? The topology across reboots would have
changed only if PowerVM had allocated resources differently by
scrambling/unscrambling. We no longer process topology updates at
runtime. As far as I know, after LPM, the source topology is maintained.

> >> This can be resolved by mapping the firmware provided group id to a logical Linux
> >> NUMA id. In this patch, we do this only for pseries platforms considering the
> >
> > On PowerVM, as you would know, the nid is already a logical or flattened
> > chip-id and not the actual hardware chip-id.
> 
> Yes. But then they are derived from PowerVM resources, AKA domains.
> Now, based on the available resources on a system, we could end up with
> different node numbers for the same topology across reboots. Making it
> logical at the OS level prevents that.

The above statement kind of gives the impression that topology changes
across every reboot. We end up with different node numbers if and only
if the underlying topology has changed, and that case is very rare. Or am I
missing something?

> 
> >> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> >> index e437a9ac4956..6c659aada55b 100644
> >> --- a/arch/powerpc/mm/numa.c
> >> +++ b/arch/powerpc/mm/numa.c
> >> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid,
> >>  	}
> >>  }
> >> 
> >> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] =  NUMA_NO_NODE};
> >> +
> >> +int firmware_group_id_to_nid(int firmware_gid)
> >> +{
> >> +	static int last_nid = 0;
> >> +
> >> +	/*
> >> +	 * For PowerNV we don't change the node id. This helps to avoid
> >> +	 * confusion w.r.t the expected node ids. On pseries, node numbers
> >> +	 * are virtualized. Hence do logical node id for pseries.
> >> +	 */
> >> +	if (!firmware_has_feature(FW_FEATURE_LPAR))
> >> +		return firmware_gid;
> >> +
> >> +	if (firmware_gid == -1)
> >> +		return NUMA_NO_NODE;
> >> +
> >> +	if (nid_map[firmware_gid] == NUMA_NO_NODE)
> >> +		nid_map[firmware_gid] = last_nid++;
> >
> > How do we ensure 2 simultaneous firmware_group_id_to_nid() calls don't end
> > up at this place in parallel?
> 
> Do we have a code path where we do that? All the node id init should
> happen early and there should not be two cpus doing node init at the
> same time. I might be mistaken. Can you point to the code path where you
> expect this to be called in parallel?
> 

associativity_to_nid gets called the first time a cpu is made present
from offline, so it need not be in the boot path. We need to verify whether
cpu hotplug and dlpar operations are synchronized. For example, are a memory
hot-add and a cpu hotplug synchronized? I am not sure that they are at this
time.
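
If these call sites turn out not to be serialized, here is a minimal sketch
of one way to close the window, assuming a spinlock is acceptable in these
paths and reusing the patch's nid_map. The lock is my invention for
illustration, not something in the posted patch; it would need
<linux/spinlock.h>:

static DEFINE_SPINLOCK(nid_map_lock);

int firmware_group_id_to_nid(int firmware_gid)
{
	static int last_nid;
	int nid;

	if (!firmware_has_feature(FW_FEATURE_LPAR))
		return firmware_gid;

	if (firmware_gid == -1)
		return NUMA_NO_NODE;

	/*
	 * Serialize the check-then-assign so two callers cannot both
	 * observe NUMA_NO_NODE and hand out two different logical ids
	 * for the same firmware_gid.
	 */
	spin_lock(&nid_map_lock);
	if (nid_map[firmware_gid] == NUMA_NO_NODE)
		nid_map[firmware_gid] = last_nid++;
	nid = nid_map[firmware_gid];
	spin_unlock(&nid_map_lock);

	return nid;
}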

> >
> >> +
> >> +	return nid_map[firmware_gid];
> >> +}
> >> +
> >>  /* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
> >>   * info is found.
> >>   */
> >>  static int associativity_to_nid(const __be32 *associativity)
> >>  {
> >>  	int nid = NUMA_NO_NODE;
> >> +	int firmware_gid = -1;
> >> 
> >>  	if (!numa_enabled)
> >>  		goto out;
> >> 
> >>  	if (of_read_number(associativity, 1) >= min_common_depth)
> >> -		nid = of_read_number(&associativity[min_common_depth], 1);
> >> +		firmware_gid = of_read_number(&associativity[min_common_depth], 1);
> >> 
> >>  	/* POWER4 LPAR uses 0xffff as invalid node */
> >> -	if (nid == 0xffff || nid >= MAX_NUMNODES)
> >> -		nid = NUMA_NO_NODE;
> >> +	if (firmware_gid == 0xffff || firmware_gid >= MAX_NUMNODES)
> >> +		firmware_gid = -1;
> >
> > Let's assume two or more invocations of associativity_to_nid for the same
> > associativity end up with -1. In each case aren't we giving out different
> > nids?
> 
> 
> I didn't quite get the comment here. But I assume you are indicating the
> same one you mentioned above?
> 

No, it's not related to the above comment.
We increment the nid_map counter for every unique firmware_gid or for
every -1 (aka invalid associativity). If a sufficiently large number of
associativities end up being invalid, don't we quickly overflow the
nid_map table? Beyond the overflow, an 8-node machine may soon look like
an 80-node machine.
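
To make the assignment behaviour concrete, here is a small userspace sketch
of the mapping loop from the patch (the gid sequence is invented and
MAX_NUMNODES is shrunk for illustration); note that each distinct valid
firmware_gid consumes one new logical id on first use, so the resulting
nids depend on the order in which gids are first looked up:

#include <stdio.h>

#define MAX_NUMNODES	8
#define NUMA_NO_NODE	(-1)

static int nid_map[MAX_NUMNODES] = {
	[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE	/* GCC range initializer */
};

static int firmware_group_id_to_nid(int firmware_gid)
{
	static int last_nid;

	if (firmware_gid == -1)
		return NUMA_NO_NODE;
	if (nid_map[firmware_gid] == NUMA_NO_NODE)
		nid_map[firmware_gid] = last_nid++;
	return nid_map[firmware_gid];
}

int main(void)
{
	int gids[] = { 4, 2, 4, 7, -1 };
	size_t i;

	for (i = 0; i < sizeof(gids) / sizeof(gids[0]); i++)
		printf("gid %2d -> nid %d\n", gids[i],
		       firmware_group_id_to_nid(gids[i]));
	/* prints: 4 -> 0, 2 -> 1, 4 -> 0, 7 -> 2, -1 -> -1 */
	return 0;
}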


-- 
Thanks and Regards
Srikar Dronamraju
