From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>, linuxppc-dev@lists.ozlabs.org
Subject: Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
Date: Thu, 06 Aug 2020 16:14:21 +0530
Message-ID: <87bljoqcje.fsf@linux.ibm.com>
In-Reply-To: <20200804072507.GI24375@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [2020-08-02 19:51:41]:
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > * Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [2020-07-31 16:49:14]:
>> >
>> >
>> > If it's just to eliminate node 0, then we have 2 other, probably better,
>> > solutions:
>> > 1. Don't mark node 0 as special (currently still in mm-tree and as a
>> >    result in linux-next)
>> > 2. powerpc specific: explicitly clear node 0 during numa bringup.
>> >
>>
>>
>> I am not sure I consider them better. But yes, those patches are good
>> and also resolve node 0 initialization when the firmware didn't
>> indicate the presence of such a node.
>>
>> This patch, in addition, makes sure that we get the same topology
>> report across reboots on virtualized partitions as long as the
>> cpu/memory ratio per PowerVM domain remains the same. This should also
>> help to avoid confusion after an LPM migration once we start applying
>> topology updates.
>>
>
> What do we mean by cpu/memory ratio? The topology across reboots would have
> changed only if PowerVM had allocated resources differently by
> scrambling/unscrambling. We no longer process topology updates at
> runtime. As far as I know, after LPM, the source topology is maintained.
An LPAR running with one numa node and 10GB of memory on PowerVM domain
10 will report node 10 and 10GB of memory in the current scheme. After
an LPM migration or a CEC shutdown/reboot, if the domain from which the
resources are allocated becomes 11, then the LPAR will report node 11
and 10GB of memory. Having a logical node number means that in both of
the above cases we report node 0 and 10GB of memory.
>
>> >> This can be resolved by mapping the firmware provided group id to a logical Linux
>> >> NUMA id. In this patch, we do this only for pseries platforms considering the
>> >
>> > On PowerVM, as you would know the nid is already a logical or a flattened
>> > chip-id and not the actual hardware chip-id.
>>
>> Yes. But then they are derived based on PowerVM resources, AKA domains.
>> Now, based on the available resources on a system, we could end up with
>> different node numbers for the same topology across reboots. Making it
>> logical at the OS level prevents that.
>
> The above statement kind of gives the impression that the topology changes
> across every reboot. We only end up with different node numbers if the
> underlying topology has changed, and that case is very rare. Or am I
> missing something?
IIUC it also depends on the availability of resources within the domain
at the time of LPAR start.
>
>>
>> >> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> >> index e437a9ac4956..6c659aada55b 100644
>> >> --- a/arch/powerpc/mm/numa.c
>> >> +++ b/arch/powerpc/mm/numa.c
>> >> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid,
>> >> }
>> >> }
>> >>
>> >> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};
>> >> +
>> >> +int firmware_group_id_to_nid(int firmware_gid)
>> >> +{
>> >> + static int last_nid = 0;
>> >> +
>> >> + /*
>> >> + * For PowerNV we don't change the node id. This helps to avoid
>> >> + * confusion w.r.t the expected node ids. On pseries, node numbers
>> >> + * are virtualized. Hence do logical node id for pseries.
>> >> + */
>> >> + if (!firmware_has_feature(FW_FEATURE_LPAR))
>> >> + return firmware_gid;
>> >> +
>> >> + if (firmware_gid == -1)
>> >> + return NUMA_NO_NODE;
>> >> +
>> >> + if (nid_map[firmware_gid] == NUMA_NO_NODE)
>> >> + nid_map[firmware_gid] = last_nid++;
>> >
>> > How do we ensure 2 simultaneous firmware_group_id_to_nid() calls dont end up
>> > at this place in parallel?
>>
>> Do we have a code path where we do that? All the node id init should
>> happen early and there should not be two cpus doing node init at the
>> same time. I might be mistaken. Can you point to the code path where you
>> expect this to be called in parallel?
>>
>
> associativity_to_nid gets called the first time a cpu is being made present
> from offline. So it need not be in the boot path. We may need to verify
> whether cpu hotplug and dlpar operations are synchronized. For example, are
> a memory hot-add and a cpu hotplug synchronized? I am not sure if they are
> synchronized at this time.
But you don't online cpu or memory to a non-existent node post boot,
right? If the node exists we have already initialized the nid_map.
However, I am not sure whether we do a parallel initialization of
devices, i.e., of_device_add getting called in parallel. If it can, then
we need something like the below:
@@ -226,6 +226,7 @@ static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};
 int firmware_group_id_to_nid(int firmware_gid)
 {
 	static int last_nid = 0;
+	static DEFINE_SPINLOCK(node_id_lock);
 
 	/*
 	 * For PowerNV we don't change the node id. This helps to avoid
@@ -238,8 +239,13 @@ int firmware_group_id_to_nid(int firmware_gid)
 	if (firmware_gid == -1)
 		return NUMA_NO_NODE;
 
-	if (nid_map[firmware_gid] == NUMA_NO_NODE)
-		nid_map[firmware_gid] = last_nid++;
+	if (nid_map[firmware_gid] == NUMA_NO_NODE) {
+		spin_lock(&node_id_lock);
+		/* recheck with lock held */
+		if (nid_map[firmware_gid] == NUMA_NO_NODE)
+			nid_map[firmware_gid] = last_nid++;
+		spin_unlock(&node_id_lock);
+	}
 
 	return nid_map[firmware_gid];
 }
>
>> >
>> >> +
>> >> + return nid_map[firmware_gid];
>> >> +}
>> >> +
>> >> /* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
>> >> * info is found.
>> >> */
>> >> static int associativity_to_nid(const __be32 *associativity)
>> >> {
>> >> int nid = NUMA_NO_NODE;
>> >> + int firmware_gid = -1;
>> >>
>> >> if (!numa_enabled)
>> >> goto out;
>> >>
>> >> if (of_read_number(associativity, 1) >= min_common_depth)
>> >> - nid = of_read_number(&associativity[min_common_depth], 1);
>> >> + firmware_gid = of_read_number(&associativity[min_common_depth], 1);
>> >>
>> >> /* POWER4 LPAR uses 0xffff as invalid node */
>> >> - if (nid == 0xffff || nid >= MAX_NUMNODES)
>> >> - nid = NUMA_NO_NODE;
>> >> + if (firmware_gid == 0xffff || firmware_gid >= MAX_NUMNODES)
>> >> + firmware_gid = -1;
>> >
>> > Let's assume two or more invocations of associativity_to_nid for the same
>> > associativity end up with -1. In each case aren't we giving different
>> > nids?
>>
>>
>> I didn't quite get the comment here. But I assume you are referring to
>> the same issue you mentioned above?
>>
>
> No, it's not related to the above comment.
> We are incrementing the nid_map table for every unique firmware_gid or for
> every -1 (aka invalid associativity). If there are a sufficiently large
> number of associativities that end up being invalid, then don't we quickly
> overflow the nid_map table? It's not only the overflow: an 8 node machine
> may soon look like an 80 node machine.
Not sure I follow. What does a large number of associativities imply?
Are you looking at ibm,associativity-lookup-arrays that have invalid
entries? Even there we are not parsing the full array; we look up only a
specific firmware_gid (in the case of lookup-arrays we use the aa_index
value from drmem_lmb).
I will also add a last_nid >= MAX_NUMNODES check in
firmware_group_id_to_nid() to handle the case where the device tree
reports more numa nodes than MAX_NUMNODES.
-aneesh