LinuxPPC-Dev Archive on lore.kernel.org
 help / color / Atom feed
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>, linuxppc-dev@lists.ozlabs.org
Subject: Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
Date: Thu, 06 Aug 2020 16:14:21 +0530
Message-ID: <87bljoqcje.fsf@linux.ibm.com> (raw)
In-Reply-To: <20200804072507.GI24375@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:

> * Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [2020-08-02 19:51:41]:
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > * Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [2020-07-31 16:49:14]:
>> >
>> >
>> > If its just to eliminate node 0, then we have 2 other probably better
>> > solutions.
>> > 1. Dont mark node 0 as spl (currently still in mm-tree and a result in
>> > linux-next)
>> > 2. powerpc specific: explicitly clear node 0 during numa bringup.
>> >
>> 
>> 
>> I am not sure I consider them better. But yes, those patches are good
>> and also resolves the node 0 initialization when the firmware didn't
>> indicate the presence of such a node.
>> 
>> This patch in addition make sure that we get the same topolgy report
>> across reboot on a virtualized partitions as longs as the cpu/memory
>> ratio per powervm domains remain the same. This should also help to
>> avoid confusion after an LPM migration once we start applying topology
>> updates. 
>> 
>
> What do we mean by cpu/memory ratio. The topology across reboot would have
> changed only if PowerVM would have allocated resources differently by
> scrambling/unscrambling. We are no more processing topology updates at
> runtime. As far as I know, after LPM, the source topology is maintained.

A LPAR running with one numa node and 10GB of memory on PowerVM domain
10 will report node 10 and 10GB memory in the current scheme. After LPM
migration or a CEC shutdown/reboot if the domain from which the resource
allocated becomes 11, then the LPAR will report node 11 and 10GB memory.

Having a logical node number means in the both the above cases we report
node 0, 10GB memory. 


>
>> >> This can  be resolved by mapping the firmware provided group id to a logical Linux
>> >> NUMA id. In this patch, we do this only for pseries platforms considering the
>> >
>> > On PowerVM, as you would know the nid is already a logical or a flattened
>> > chip-id and not the actual hardware chip-id.
>> 
>> Yes. But then they are derived based on PowerVM resources AKA domains.
>> Now based on the available resource on a system, we could end up with
>> different node numbers with same toplogy across reboots. Making it
>> logical at OS level prevent that. 
>
> The above statement kind of gives an impression, that topology changes
> across every reboot.  We only end up with different node numbers if and only
> if the underlying topology has changed and that case is very rare. Or am I
> missing something?

IIUC it also depends on availability of resources within the
domain at the time of LPAR start.

>
>> 
>> >> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> >> index e437a9ac4956..6c659aada55b 100644
>> >> --- a/arch/powerpc/mm/numa.c
>> >> +++ b/arch/powerpc/mm/numa.c
>> >> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid,
>> >>  	}
>> >>  }
>> >> 
>> >> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] =  NUMA_NO_NODE};
>> >> +
>> >> +int firmware_group_id_to_nid(int firmware_gid)
>> >> +{
>> >> +	static int last_nid = 0;
>> >> +
>> >> +	/*
>> >> +	 * For PowerNV we don't change the node id. This helps to avoid
>> >> +	 * confusion w.r.t the expected node ids. On pseries, node numbers
>> >> +	 * are virtualized. Hence do logical node id for pseries.
>> >> +	 */
>> >> +	if (!firmware_has_feature(FW_FEATURE_LPAR))
>> >> +		return firmware_gid;
>> >> +
>> >> +	if (firmware_gid ==  -1)
>> >> +		return NUMA_NO_NODE;
>> >> +
>> >> +	if (nid_map[firmware_gid] == NUMA_NO_NODE)
>> >> +		nid_map[firmware_gid] = last_nid++;
>> >
>> > How do we ensure 2 simultaneous firmware_group_id_to_nid() calls dont end up
>> > at this place in parallel?
>> 
>> Do we have a code path where we do that? All the node id init should
>> happen early and there should not be two cpus doing node init at the
>> same time. I might be mistaken. Can you point to the code path where you
>> expect this to be called in parallel?
>> 
>
> associativity_to_nid gets called the first time a cpu is being made present
> from offline. So it need not be in boot path. We may to verify if cpu
> hotplug, dlpar, operations are synchronized. For example a memory hotadd and
> cpu hotplug are they synchronized? I am not sure if they are synchronized at
> this time.

But you don't online cpu or memory to a non existent node post boot
right?. If the node is existent we have already initialized the nid_map.

However i am not sure whether we do a parallel initialization of devices. ie,
of_device_add getting called in parallel. if it can then we need the
below?

@@ -226,6 +226,7 @@ static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] =  NUMA_NO_NODE};
 int firmware_group_id_to_nid(int firmware_gid)
 {
        static int last_nid = 0;
+       static DEFINE_SPINLOCK(node_id_lock);
 
        /*
         * For PowerNV we don't change the node id. This helps to avoid
@@ -238,8 +239,13 @@ int firmware_group_id_to_nid(int firmware_gid)
        if (firmware_gid ==  -1)
                return NUMA_NO_NODE;
 
-       if (nid_map[firmware_gid] == NUMA_NO_NODE)
-               nid_map[firmware_gid] = last_nid++;
+       if (nid_map[firmware_gid] == NUMA_NO_NODE) {
+               spin_lock(&node_id_lock);
+               /*  recheck with lock held */
+               if (nid_map[firmware_gid] == NUMA_NO_NODE)
+                       nid_map[firmware_gid] = last_nid++;
+               spin_unlock(&node_id_lock);
+       }
 
        return nid_map[firmware_gid];
 }



>
>> >
>> >> +
>> >> +	return nid_map[firmware_gid];
>> >> +}
>> >> +
>> >>  /* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
>> >>   * info is found.
>> >>   */
>> >>  static int associativity_to_nid(const __be32 *associativity)
>> >>  {
>> >>  	int nid = NUMA_NO_NODE;
>> >> +	int firmware_gid = -1;
>> >> 
>> >>  	if (!numa_enabled)
>> >>  		goto out;
>> >> 
>> >>  	if (of_read_number(associativity, 1) >= min_common_depth)
>> >> -		nid = of_read_number(&associativity[min_common_depth], 1);
>> >> +		firmware_gid = of_read_number(&associativity[min_common_depth], 1);
>> >> 
>> >>  	/* POWER4 LPAR uses 0xffff as invalid node */
>> >> -	if (nid == 0xffff || nid >= MAX_NUMNODES)
>> >> -		nid = NUMA_NO_NODE;
>> >> +	if (firmware_gid == 0xffff || firmware_gid >= MAX_NUMNODES)
>> >> +		firmware_gid = -1;
>> >
>> > Lets assume two or more invocations of associativity_to_nid for the same
>> > associativity, end up with -1, In each case aren't giving different
>> > nids?
>> 
>> 
>> I didn't quiet get the comment here. But I assume you are indicating the
>> same one you mentioned above?
>> 
>
> No its not related to the above comment.
> We are incrementing the nid_map table for every unique firmware_gid or for
> every -1 (aka invalid associativities). If there are sufficiently large
> number of associativities that end up returning invalid associativities,
> then don't we quickly overflow the nid_map table? Not only about the
> overflow but a 8 node machine may soon look like a 80 node machine.

Not sure I follow. What does a large number of associativies imply? Are
you looking at ibm,associativity-lookup-arrays that got entries which
are invalid? Even there we are not parsing the full array, we lookup
only a specific firmware_gid (in case of lookup-arrays we use aa_index
value from drmem_lmb).

I will also add a las_nid > MAX_NUMNODES check in
firmware_group_id_to_nid() to handle the case where we find more numa
nodes than MAX_NUMANODES in device tree.

-aneesh

  reply index

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-31 11:19 Aneesh Kumar K.V
2020-07-31 11:22 ` [RFC PATCH 2/2] powerpc/powernv/cpufreq: Don't assume chip id is same as Linux node id Aneesh Kumar K.V
2020-08-04  7:47   ` Gautham R Shenoy
2020-08-01  5:20 ` [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id Srikar Dronamraju
2020-08-02 14:21   ` Aneesh Kumar K.V
2020-08-04  7:25     ` Srikar Dronamraju
2020-08-06 10:44       ` Aneesh Kumar K.V [this message]
2020-08-10  8:05         ` Srikar Dronamraju
2020-08-07  4:24 ` Nathan Lynch
2020-08-07  5:02   ` Aneesh Kumar K.V
2020-08-07 20:45     ` Nathan Lynch
2020-08-09 14:12       ` Aneesh Kumar K.V
2020-08-09 18:40         ` Aneesh Kumar K.V
2020-08-13 22:53           ` Nathan Lynch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87bljoqcje.fsf@linux.ibm.com \
    --to=aneesh.kumar@linux.ibm.com \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=nathanl@linux.ibm.com \
    --cc=srikar@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LinuxPPC-Dev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linuxppc-dev/0 linuxppc-dev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linuxppc-dev linuxppc-dev/ https://lore.kernel.org/linuxppc-dev \
		linuxppc-dev@lists.ozlabs.org linuxppc-dev@ozlabs.org
	public-inbox-index linuxppc-dev

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.ozlabs.lists.linuxppc-dev


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git