From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>, linuxppc-dev@lists.ozlabs.org
Subject: Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
Date: Sun, 02 Aug 2020 19:51:41 +0530 [thread overview]
Message-ID: <87h7tl162y.fsf@linux.ibm.com> (raw)
In-Reply-To: <20200801052059.GA24375@linux.vnet.ibm.com>
Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [2020-07-31 16:49:14]:
>
>> We use ibm,associativity and ibm,associativity-lookup-arrays to derive the numa
>> node numbers. These device tree properties are firmware indicated grouping of
>> resources based on their hierarchy in the platform. These numbers (group id) are
>> not sequential and hypervisor/firmware can follow different numbering schemes.
>> For ex: on powernv platforms, we group them in the below order.
>>
>> * - CCM node ID
>> * - HW card ID
>> * - HW module ID
>> * - Chip ID
>> * - Core ID
>>
>> Based on ibm,associativity-reference-points we use one of the above group ids as
>> Linux NUMA node id. (On PowerNV platform Chip ID is used). This results
>> in Linux reporting non-linear NUMA node id and which also results in Linux
>> reporting empty node 0 NUMA nodes.
>>
>
> If its just to eliminate node 0, then we have 2 other probably better
> solutions.
> 1. Dont mark node 0 as spl (currently still in mm-tree and a result in
> linux-next)
> 2. powerpc specific: explicitly clear node 0 during numa bringup.
>
I am not sure I consider them better. But yes, those patches are good
and also resolves the node 0 initialization when the firmware didn't
indicate the presence of such a node.
This patch in addition make sure that we get the same topolgy report
across reboot on a virtualized partitions as longs as the cpu/memory
ratio per powervm domains remain the same. This should also help to
avoid confusion after an LPM migration once we start applying topology
updates.
>> This can be resolved by mapping the firmware provided group id to a logical Linux
>> NUMA id. In this patch, we do this only for pseries platforms considering the
>
> On PowerVM, as you would know the nid is already a logical or a flattened
> chip-id and not the actual hardware chip-id.
Yes. But then they are derived based on PowerVM resources AKA domains.
Now based on the available resource on a system, we could end up with
different node numbers with same toplogy across reboots. Making it
logical at OS level prevent that.
>
>> firmware group id is a virtualized entity and users would not have drawn any
>> conclusion based on the Linux Numa Node id.
>>
>> On PowerNV platform since we have historically mapped Chip ID as Linux NUMA node
>> id, we keep the existing Linux NUMA node id numbering.
>>
>> Before Fix:
>> # numactl -H
>> available: 2 nodes (0-1)
>> node 0 cpus:
>> node 0 size: 0 MB
>> node 0 free: 0 MB
>> node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
>> node 1 size: 50912 MB
>> node 1 free: 45248 MB
>> node distances:
>> node 0 1
>> 0: 10 40
>> 1: 40 10
>>
>> after fix
>> # numactl -H
>> available: 1 nodes (0)
>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
>> node 0 size: 50912 MB
>> node 0 free: 49724 MB
>> node distances:
>> node 0
>> 0: 10
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>> arch/powerpc/include/asm/topology.h | 1 +
>> arch/powerpc/mm/numa.c | 49 ++++++++++++++++++++++-------
>> 2 files changed, 39 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
>> index f0b6300e7dd3..15b0424a27a8 100644
>> --- a/arch/powerpc/include/asm/topology.h
>> +++ b/arch/powerpc/include/asm/topology.h
>> @@ -118,5 +118,6 @@ int get_physical_package_id(int cpu);
>> #endif
>> #endif
>>
>> +int firmware_group_id_to_nid(int firmware_gid);
>> #endif /* __KERNEL__ */
>> #endif /* _ASM_POWERPC_TOPOLOGY_H */
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index e437a9ac4956..6c659aada55b 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid,
>> }
>> }
>>
>> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};
>> +
>> +int firmware_group_id_to_nid(int firmware_gid)
>> +{
>> + static int last_nid = 0;
>> +
>> + /*
>> + * For PowerNV we don't change the node id. This helps to avoid
>> + * confusion w.r.t the expected node ids. On pseries, node numbers
>> + * are virtualized. Hence do logical node id for pseries.
>> + */
>> + if (!firmware_has_feature(FW_FEATURE_LPAR))
>> + return firmware_gid;
>> +
>> + if (firmware_gid == -1)
>> + return NUMA_NO_NODE;
>> +
>> + if (nid_map[firmware_gid] == NUMA_NO_NODE)
>> + nid_map[firmware_gid] = last_nid++;
>
> How do we ensure 2 simultaneous firmware_group_id_to_nid() calls dont end up
> at this place in parallel?
Do we have a code path where we do that? All the node id init should
happen early and there should not be two cpus doing node init at the
same time. I might be mistaken. Can you point to the code path where you
expect this to be called in parallel?
>
>> +
>> + return nid_map[firmware_gid];
>> +}
>> +
>> /* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
>> * info is found.
>> */
>> static int associativity_to_nid(const __be32 *associativity)
>> {
>> int nid = NUMA_NO_NODE;
>> + int firmware_gid = -1;
>>
>> if (!numa_enabled)
>> goto out;
>>
>> if (of_read_number(associativity, 1) >= min_common_depth)
>> - nid = of_read_number(&associativity[min_common_depth], 1);
>> + firmware_gid = of_read_number(&associativity[min_common_depth], 1);
>>
>> /* POWER4 LPAR uses 0xffff as invalid node */
>> - if (nid == 0xffff || nid >= MAX_NUMNODES)
>> - nid = NUMA_NO_NODE;
>> + if (firmware_gid == 0xffff || firmware_gid >= MAX_NUMNODES)
>> + firmware_gid = -1;
>
> Lets assume two or more invocations of associativity_to_nid for the same
> associativity, end up with -1, In each case aren't giving different
> nids?
I didn't quiet get the comment here. But I assume you are indicating the
same one you mentioned above?
>
>
>> +
>> + nid = firmware_group_id_to_nid(firmware_gid);
>>
>> if (nid > 0 &&
>> - of_read_number(associativity, 1) >= distance_ref_points_depth) {
>> + of_read_number(associativity, 1) >= distance_ref_points_depth) {
>> /*
>> * Skip the length field and send start of associativity array
>> */
>> @@ -432,24 +458,25 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
>> static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
>> {
>> struct assoc_arrays aa = { .arrays = NULL };
>> - int default_nid = NUMA_NO_NODE;
>> - int nid = default_nid;
>> + int nid = NUMA_NO_NODE, firmware_gid;
>> int rc, index;
>>
>> if ((min_common_depth < 0) || !numa_enabled)
>> - return default_nid;
>> + return NUMA_NO_NODE;
>>
>> rc = of_get_assoc_arrays(&aa);
>> if (rc)
>> - return default_nid;
>> + return NUMA_NO_NODE;
>
> https://lore.kernel.org/linuxppc-dev/87lfjc1b5f.fsf@linux.ibm.com/t/#u
Not sure what I should conclude on that. I am changing the function here
and would like to make NUMA_NO_NODE as the error return.
>
>>
>> if (min_common_depth <= aa.array_sz &&
>> !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
>> index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
>> - nid = of_read_number(&aa.arrays[index], 1);
>> + firmware_gid = of_read_number(&aa.arrays[index], 1);
>>
>> - if (nid == 0xffff || nid >= MAX_NUMNODES)
>> - nid = default_nid;
>> + if (firmware_gid == 0xffff || firmware_gid >= MAX_NUMNODES)
>> + firmware_gid = -1;
>
> Same case as above, How do we ensure that we return unique nid for a
> similar assoc_array?
Can you ellaborate this?
>
>> +
>> + nid = firmware_group_id_to_nid(firmware_gid);
>>
>> if (nid > 0) {
>> index = lmb->aa_index * aa.array_sz;
>> --
>> 2.26.2
>>
-aneesh
next prev parent reply other threads:[~2020-08-02 14:24 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-07-31 11:19 [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id Aneesh Kumar K.V
2020-07-31 11:22 ` [RFC PATCH 2/2] powerpc/powernv/cpufreq: Don't assume chip id is same as Linux node id Aneesh Kumar K.V
2020-08-04 7:47 ` Gautham R Shenoy
2020-08-01 5:20 ` [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id Srikar Dronamraju
2020-08-02 14:21 ` Aneesh Kumar K.V [this message]
2020-08-04 7:25 ` Srikar Dronamraju
2020-08-06 10:44 ` Aneesh Kumar K.V
2020-08-10 8:05 ` Srikar Dronamraju
2020-08-07 4:24 ` Nathan Lynch
2020-08-07 5:02 ` Aneesh Kumar K.V
2020-08-07 20:45 ` Nathan Lynch
2020-08-09 14:12 ` Aneesh Kumar K.V
2020-08-09 18:40 ` Aneesh Kumar K.V
2020-08-13 22:53 ` Nathan Lynch
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87h7tl162y.fsf@linux.ibm.com \
--to=aneesh.kumar@linux.ibm.com \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=nathanl@linux.ibm.com \
--cc=srikar@linux.vnet.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).