From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>, linuxppc-dev@lists.ozlabs.org
Subject: Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
Date: Tue, 4 Aug 2020 12:55:07 +0530
Message-ID: <20200804072507.GI24375@linux.vnet.ibm.com>
In-Reply-To: <87h7tl162y.fsf@linux.ibm.com>

* Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [2020-08-02 19:51:41]:
> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > * Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [2020-07-31 16:49:14]:
> >
> >
> > If it's just to eliminate node 0, then we have 2 other, probably better,
> > solutions:
> > 1. Don't mark node 0 as special (currently still in the mm tree and as a
> > result in linux-next)
> > 2. powerpc specific: explicitly clear node 0 during numa bringup.
> >
> 
> 
> I am not sure I consider them better. But yes, those patches are good
> and also resolve the node 0 initialization when the firmware didn't
> indicate the presence of such a node.
> 
> This patch in addition makes sure that we get the same topology report
> across reboots on virtualized partitions as long as the cpu/memory
> ratio per PowerVM domain remains the same. This should also help to
> avoid confusion after an LPM migration once we start applying topology
> updates.
> 

What do we mean by cpu/memory ratio? The topology across reboots would have
changed only if PowerVM had allocated resources differently by
scrambling/unscrambling. We no longer process topology updates at
runtime. As far as I know, after LPM, the source topology is maintained.

> >> This can be resolved by mapping the firmware provided group id to a logical Linux
> >> NUMA id. In this patch, we do this only for pseries platforms considering the
> >
> > On PowerVM, as you would know, the nid is already a logical or flattened
> > chip-id and not the actual hardware chip-id.
> 
> Yes. But then they are derived from PowerVM resources, AKA domains.
> Now, based on the available resources on a system, we could end up with
> different node numbers for the same topology across reboots. Making it
> logical at the OS level prevents that.

The above statement kind of gives the impression that topology changes
across every reboot. We end up with different node numbers if and only
if the underlying topology has changed, and that case is very rare. Or am I
missing something?

> 
> >> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> >> index e437a9ac4956..6c659aada55b 100644
> >> --- a/arch/powerpc/mm/numa.c
> >> +++ b/arch/powerpc/mm/numa.c
> >> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid,
> >>  	}
> >>  }
> >> 
> >> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] =  NUMA_NO_NODE};
> >> +
> >> +int firmware_group_id_to_nid(int firmware_gid)
> >> +{
> >> +	static int last_nid = 0;
> >> +
> >> +	/*
> >> +	 * For PowerNV we don't change the node id. This helps to avoid
> >> +	 * confusion w.r.t the expected node ids. On pseries, node numbers
> >> +	 * are virtualized. Hence do logical node id for pseries.
> >> +	 */
> >> +	if (!firmware_has_feature(FW_FEATURE_LPAR))
> >> +		return firmware_gid;
> >> +
> >> +	if (firmware_gid == -1)
> >> +		return NUMA_NO_NODE;
> >> +
> >> +	if (nid_map[firmware_gid] == NUMA_NO_NODE)
> >> +		nid_map[firmware_gid] = last_nid++;
> >
> > How do we ensure 2 simultaneous firmware_group_id_to_nid() calls don't end
> > up at this place in parallel?
> 
> Do we have a code path where we do that? All the node id init should
> happen early and there should not be two cpus doing node init at the
> same time. I might be mistaken. Can you point to the code path where you
> expect this to be called in parallel?
> 

associativity_to_nid gets called the first time a cpu is made present
from offline, so it need not be in the boot path. We need to verify whether
cpu hotplug and dlpar operations are synchronized. For example, are a memory
hot-add and a cpu hotplug synchronized? I am not sure that they are at this
time.
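
If these call sites turn out not to be serialized, here is a minimal sketch
of one way to close the window, assuming a spinlock is acceptable in these
paths and reusing the patch's nid_map. The lock is my invention for
illustration, not something in the posted patch; it would need
<linux/spinlock.h>:

static DEFINE_SPINLOCK(nid_map_lock);

int firmware_group_id_to_nid(int firmware_gid)
{
	static int last_nid;
	int nid;

	if (!firmware_has_feature(FW_FEATURE_LPAR))
		return firmware_gid;

	if (firmware_gid == -1)
		return NUMA_NO_NODE;

	/*
	 * Serialize the check-then-assign so two callers cannot both
	 * observe NUMA_NO_NODE and hand out two different logical ids
	 * for the same firmware_gid.
	 */
	spin_lock(&nid_map_lock);
	if (nid_map[firmware_gid] == NUMA_NO_NODE)
		nid_map[firmware_gid] = last_nid++;
	nid = nid_map[firmware_gid];
	spin_unlock(&nid_map_lock);

	return nid;
}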

> >
> >> +
> >> +	return nid_map[firmware_gid];
> >> +}
> >> +
> >>  /* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
> >>   * info is found.
> >>   */
> >>  static int associativity_to_nid(const __be32 *associativity)
> >>  {
> >>  	int nid = NUMA_NO_NODE;
> >> +	int firmware_gid = -1;
> >> 
> >>  	if (!numa_enabled)
> >>  		goto out;
> >> 
> >>  	if (of_read_number(associativity, 1) >= min_common_depth)
> >> -		nid = of_read_number(&associativity[min_common_depth], 1);
> >> +		firmware_gid = of_read_number(&associativity[min_common_depth], 1);
> >> 
> >>  	/* POWER4 LPAR uses 0xffff as invalid node */
> >> -	if (nid == 0xffff || nid >= MAX_NUMNODES)
> >> -		nid = NUMA_NO_NODE;
> >> +	if (firmware_gid == 0xffff || firmware_gid >= MAX_NUMNODES)
> >> +		firmware_gid = -1;
> >
> > Let's assume two or more invocations of associativity_to_nid for the same
> > associativity end up with -1. In each case aren't we giving out different
> > nids?
> 
> 
> I didn't quite get the comment here. But I assume you are indicating the
> same one you mentioned above?
> 

No, it's not related to the above comment.
We increment the nid_map counter for every unique firmware_gid or for
every -1 (aka invalid associativity). If a sufficiently large number of
associativities end up being invalid, don't we quickly overflow the
nid_map table? Beyond the overflow, an 8-node machine may soon look like
an 80-node machine.
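
To make the assignment behaviour concrete, here is a small userspace sketch
of the mapping loop from the patch (the gid sequence is invented and
MAX_NUMNODES is shrunk for illustration); note that each distinct valid
firmware_gid consumes one new logical id on first use, so the resulting
nids depend on the order in which gids are first looked up:

#include <stdio.h>

#define MAX_NUMNODES	8
#define NUMA_NO_NODE	(-1)

static int nid_map[MAX_NUMNODES] = {
	[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE	/* GCC range initializer */
};

static int firmware_group_id_to_nid(int firmware_gid)
{
	static int last_nid;

	if (firmware_gid == -1)
		return NUMA_NO_NODE;
	if (nid_map[firmware_gid] == NUMA_NO_NODE)
		nid_map[firmware_gid] = last_nid++;
	return nid_map[firmware_gid];
}

int main(void)
{
	int gids[] = { 4, 2, 4, 7, -1 };
	size_t i;

	for (i = 0; i < sizeof(gids) / sizeof(gids[0]); i++)
		printf("gid %2d -> nid %d\n", gids[i],
		       firmware_group_id_to_nid(gids[i]));
	/* prints: 4 -> 0, 2 -> 1, 4 -> 0, 7 -> 2, -1 -> -1 */
	return 0;
}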


-- 
Thanks and Regards
Srikar Dronamraju
