From: Daniel Henrique Barboza <danielhb413@gmail.com>
To: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	David Gibson <david@gibson.dropbear.id.au>
Cc: Nathan Lynch <nathanl@linux.ibm.com>, linuxppc-dev@lists.ozlabs.org
Subject: Re: [RFC PATCH 8/8] powerpc/papr_scm: Use FORM2 associativity details
Date: Thu, 17 Jun 2021 17:00:01 -0300
Message-ID: <5e180af8-9d48-7519-0b35-967065f8e3e1@gmail.com>
In-Reply-To: <87r1h0n3u6.fsf@linux.ibm.com>



On 6/17/21 8:11 AM, Aneesh Kumar K.V wrote:
> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> 
>> On 6/17/21 4:46 AM, David Gibson wrote:
>>> On Tue, Jun 15, 2021 at 12:35:17PM +0530, Aneesh Kumar K.V wrote:
>>>> David Gibson <david@gibson.dropbear.id.au> writes:
>>>>
>>>>> On Tue, Jun 15, 2021 at 11:27:50AM +0530, Aneesh Kumar K.V wrote:
>>>>>> David Gibson <david@gibson.dropbear.id.au> writes:
>>>>>>
>>>>>>> On Mon, Jun 14, 2021 at 10:10:03PM +0530, Aneesh Kumar K.V wrote:
>>>>>>>> FORM2 introduces a concept of secondary domain which is identical to the
>>>>>>>> concept of FORM1 primary domain. Use secondary domain as the numa node
>>>>>>>> when using persistent memory device. For DAX kmem use the logical domain
>>>>>>>> id introduced in FORM2. This new numa node
>>>>>>>>
>>>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>>>> ---
>>>>>>>>    arch/powerpc/mm/numa.c                    | 28 +++++++++++++++++++++++
>>>>>>>>    arch/powerpc/platforms/pseries/papr_scm.c | 26 +++++++++++++--------
>>>>>>>>    arch/powerpc/platforms/pseries/pseries.h  |  1 +
>>>>>>>>    3 files changed, 45 insertions(+), 10 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>>>>>>>> index 86cd2af014f7..b9ac6d02e944 100644
>>>>>>>> --- a/arch/powerpc/mm/numa.c
>>>>>>>> +++ b/arch/powerpc/mm/numa.c
>>>>>>>> @@ -265,6 +265,34 @@ static int associativity_to_nid(const __be32 *associativity)
>>>>>>>>    	return nid;
>>>>>>>>    }
>>>>>>>>    
>>>>>>>> +int get_primary_and_secondary_domain(struct device_node *node, int *primary, int *secondary)
>>>>>>>> +{
>>>>>>>> +	int secondary_index;
>>>>>>>> +	const __be32 *associativity;
>>>>>>>> +
>>>>>>>> +	if (!numa_enabled) {
>>>>>>>> +		*primary = NUMA_NO_NODE;
>>>>>>>> +		*secondary = NUMA_NO_NODE;
>>>>>>>> +		return 0;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +	associativity = of_get_associativity(node);
>>>>>>>> +	if (!associativity)
>>>>>>>> +		return -ENODEV;
>>>>>>>> +
>>>>>>>> +	if (of_read_number(associativity, 1) >= primary_domain_index) {
>>>>>>>> +		*primary = of_read_number(&associativity[primary_domain_index], 1);
>>>>>>>> +		secondary_index = of_read_number(&distance_ref_points[1], 1);
>>>>>>>
>>>>>>> Secondary ID is always the second reference point, but primary depends
>>>>>>> on the length of resources?  That seems very weird.
>>>>>>
>>>>>> primary_domain_index is distance_ref_point[0]. With Form2 we would find
>>>>>> both primary and secondary domain ID same for all resources other than
>>>>>> persistent memory device. The usage w.r.t. persistent memory is
>>>>>> explained in patch 7.
>>>>>
>>>>> Right, I misunderstood
>>>>>
>>>>>>
>>>>>> With Form2 the primary domainID and secondary domainID are used to identify the NUMA nodes
>>>>>> the kernel should use when using persistent memory devices.
>>>>>
>>>>> This seems kind of bogus.  With Form1, the primary/secondary ID are a
>>>>> sort of hierarchy of distance (things with same primary ID are very
>>>>> close, things with same secondary are kinda-close, etc.).  With Form2,
>>>>> it's referring to their effective node for different purposes.
>>>>>
>>>>> Using the same terms for different meanings seems unnecessarily
>>>>> confusing.
>>>>
>>>> They are essentially domainIDs. The interpretation of them is different
>>>> between Form1 and Form2. Hence I kept referring to them as primary and
>>>> secondary domainID. Any suggestion on what to name them with Form2?
>>>
>>> My point is that reusing associativity-reference-points for something
>>> with completely unrelated semantics seems like a very poor choice.
>>
>>
>> I agree that this reuse can be confusing. I could argue that there is
>> precedent for that in PAPR - FORM0 puts a different spin on the same
>> property as well - but there is no need to keep following existing PAPR
>> practices in new spec (and some might argue it's best not to).
>>
>> As far as QEMU goes, renaming this property to "numa-associativity-mode"
>> (just an example) is a quick change to do since we separated FORM1 and FORM2
>> code over there.
>>
>> Doing such a rename can also help with the issue of having to describe new
>> FORM2 semantics using "least significant boundary" or "primary domain" or
>> any FORM0|FORM1 related terminology.
>>
> 
> It is not just changing the name, we will then have to explain the
> meaning of ibm,associativity-reference-points with FORM2 right?

Hmmmm why? My idea over there was to add a new property that indicates that a
resource might have a different NUMA affinity based on the mode of operation
(like PMEM), and get rid of ibm,associativity-reference-points altogether.

The NUMA distances already express the topology. Smaller distances indicate
closer proximity, larger distances indicate otherwise. Having
"associativity-reference-points" reflect an associativity domain
relationship, when you already have all the distances from each node, is
somewhat redundant.

The concept of 'associativity domain' was necessary in FORM1 because we had no
other way of telling the distance between NUMA nodes. We needed to rely on these
overly complex and convoluted subdomain abstractions to say that "nodeA belongs
to the same third-level domain as nodeB, and to the same second-level domain as
nodeC". The kernel would read that and calculate that each level doubles
the distance from the level before, with local_distance being 10, so:

distAA = 10  distAB = 20  distAC = 40
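
To make the FORM1 arithmetic above concrete, here is a minimal sketch. It is
not the kernel's code: form1_distance(), MAX_LEVELS and the domain IDs in
node_a/node_b/node_c are made up for illustration, and the levels are stored
closest-first (index 0 is the tightest boundary), which is an assumption of
this sketch rather than the PAPR numbering.

```c
#define LOCAL_DISTANCE 10
#define MAX_LEVELS 3

/*
 * Illustrative data only: one domain ID per level, closest level first.
 * A and B differ at level 0 but share level 1; A and C only share level 2.
 */
const int node_a[MAX_LEVELS] = { 1, 10, 100 };
const int node_b[MAX_LEVELS] = { 2, 10, 100 };
const int node_c[MAX_LEVELS] = { 3, 20, 100 };

/*
 * Hypothetical helper: start at LOCAL_DISTANCE and double the distance
 * for every level at which the two nodes are in different domains,
 * stopping at the first level they share.
 */
int form1_distance(const int *a, const int *b)
{
	int distance = LOCAL_DISTANCE;
	int i;

	for (i = 0; i < MAX_LEVELS; i++) {
		if (a[i] == b[i])
			break;
		distance *= 2;
	}
	return distance;
}
```

With this data, form1_distance(node_a, node_a) is 10,
form1_distance(node_a, node_b) is 20 and form1_distance(node_a, node_c) is 40,
matching the distAA/distAB/distAC values above.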

With FORM2, if this information is already explicit in ibm,numa-distance-table,
why bother calculating associativity domains? If you want to know whether
PROCA is closer to PROCB or PROCX, just look at the NUMA distance table and
see which one is closer.
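
A rough sketch of that lookup follows. The table contents, the node
numbering (0 = PROCA, 1 = PROCB, 2 = PROCX) and the helper names are all
hypothetical, and the matrix is a simplified stand-in for whatever the kernel
would build after parsing ibm,numa-distance-table, not the property's actual
encoding.

```c
#define NR_NODES 3

/*
 * Illustrative distances only: row = "from" node, column = "to" node.
 * Index 0 is PROCA, 1 is PROCB, 2 is PROCX.
 */
const unsigned char example_distance_table[NR_NODES][NR_NODES] = {
	{ 10, 20, 40 },
	{ 20, 10, 40 },
	{ 40, 40, 10 },
};

/* Hypothetical helper: the distance is simply read from the table. */
int form2_distance(int from, int to)
{
	return example_distance_table[from][to];
}

/* Return whichever of b or x is closer to a (b wins a tie). */
int closer_node(int a, int b, int x)
{
	return form2_distance(a, x) < form2_distance(a, b) ? x : b;
}
```

No associativity domain is consulted at any point: closer_node(0, 1, 2)
picks PROCB purely from the two table entries.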

  

> 
> With FORM2 we want to represent the topology better
> 
>   --------------------------------------------------------------------------------
> |                                                         domainID 20            |
> |   ---------------------------------------                                      |
> |  |                            NUMA node1 |                                     |
> |  |                                       |            --------------------     |
> |  |    ProcB -------> MEMC                |           |        NUMA node40 |    |
> |  |    |                                  |           |                    |    |
> |  |    ---------------------------------- |-------->  |  PMEMD             |    |
> |  |                                       |            --------------------     |
> |  |                                       |                                     |
> |   ---------------------------------------                                      |
>   --------------------------------------------------------------------------------
> 
> ibm,associativity:
>          { 20, 1, 40}  -> PMEMD
>          { 20, 1, 1}  -> PROCB/MEMC
> 
> is the suggested FORM2 representation.


The way I see it, the '20' over there is not needed at all. What utility does
it bring? And why create an associativity domain '1' in the MEMC associativity
at 0x3?

What the current QEMU FORM2 implementation is doing would be this:

           { 0, 0, 1, 40}  -> PMEMD
           { 0, 0, 0, 1}  -> PROCB/MEMC


PMEMD has a pointer to the NUMA node in which it would run as persistent
memory, node 1. All the memory/cpu nodes of node1 would be oblivious
to what PMEMD is doing.
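
One way to read the two arrays above, as a sketch only: it assumes that 0x3
and 0x4 are 1-based positions in the arrays exactly as written here, that
0x3 is consulted only for a PMEM device running as persistent memory, and
that 0x4 is the node used in every other case. resource_node() and the
*_POS macros are hypothetical names, not QEMU or kernel identifiers.

```c
#include <stdbool.h>

#define ASSOC_CELLS      4
#define PMEM_NODE_POS    3	/* "0x3": node used in persistent memory mode */
#define LOGICAL_NODE_POS 4	/* "0x4": logical NUMA node of the resource */

/* The two example arrays from this mail. */
const unsigned int pmemd_assoc[ASSOC_CELLS] = { 0, 0, 1, 40 };
const unsigned int memc_assoc[ASSOC_CELLS]  = { 0, 0, 0, 1 };

/*
 * Hypothetical helper: pick the node for a resource. Only a PMEM device
 * in persistent memory mode looks at position 0x3; everything else,
 * including regular memory and CPUs, reads position 0x4.
 */
unsigned int resource_node(const unsigned int *assoc, bool as_persistent_memory)
{
	int pos = as_persistent_memory ? PMEM_NODE_POS : LOGICAL_NODE_POS;

	return assoc[pos - 1];	/* positions are 1-based in this sketch */
}
```

Under those assumptions PMEMD resolves to node 1 as persistent memory and to
node 40 otherwise, while PROCB/MEMC always resolves to node 1 and never needs
to look at 0x3.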

I don't see the need for creating an associativity domain between node1
and node40 in 0x3. Besides, a device_add operation of a PMEM that wants
to use nodeN as the node for persistent memory would trigger a massive
ibm,associativity update on all LMBs that belong to nodeN, because then
everyone needs to have the same third-level associativity domain as the
hotplugged PMEM. To avoid that, if the idea is to 'just duplicate the
logical_domain_id in 0x3 for all non-PMEM devices', then what's the
difference from looking into the logical_numa_id at 0x4 in the first
place?



In fact, the more I speak about this PMEM scenario the more I wonder:
why doesn't the PMEM driver, when switching from persistent to regular
memory and vice-versa, take care of all the necessary updates in the
numa-distance-table and kernel internals to reflect the current distances
of its current mode? Is this a technical limitation?



Thanks


Daniel


> 
> -aneesh
> 
