On Tue, Jun 15, 2021 at 01:10:27PM +0530, Aneesh Kumar K.V wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
> 
> > On Tue, Jun 15, 2021 at 10:58:42AM +0530, Aneesh Kumar K.V wrote:
> >> David Gibson <david@gibson.dropbear.id.au> writes:
> >> 
> >> > On Mon, Jun 14, 2021 at 10:10:02PM +0530, Aneesh Kumar K.V wrote:
> >> >> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> >> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> >> >> ---
> >> >>  Documentation/powerpc/associativity.rst   | 139 ++++++++++++++++++++
> >> >>  arch/powerpc/include/asm/firmware.h       |   3 +-
> >> >>  arch/powerpc/include/asm/prom.h           |   1 +
> >> >>  arch/powerpc/kernel/prom_init.c           |   3 +-
> >> >>  arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
> >> >>  arch/powerpc/platforms/pseries/firmware.c |   1 +
> >> >>  6 files changed, 290 insertions(+), 6 deletions(-)
> >> >>  create mode 100644 Documentation/powerpc/associativity.rst
> >> >> 
> >> >> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
> >> >> new file mode 100644
> >> >> index 000000000000..58abedea81d7
> >> >> --- /dev/null
> >> >> +++ b/Documentation/powerpc/associativity.rst
> >> >> @@ -0,0 +1,139 @@
> >> >> +===========================
> >> >> +NUMA resource associativity
> >> >> +===========================
> >> >> +
> >> >> +Associativity represents the groupings of the various platform resources into
> >> >> +domains of substantially similar mean performance relative to resources outside
> >> >> +of that domain. Resources subsets of a given domain that exhibit better
> >> >> +performance relative to each other than relative to other resources subsets
> >> >> +are represented as being members of a sub-grouping domain. This performance
> >> >> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
> >> >> +From the platform view, these groups are also referred to as domains.
> >> >> +
> >> >> +PAPR interface currently supports two different ways of communicating these resource
> >> >
> >> > You describe form 2 below as well, which contradicts this.
> >> 
> >> Fixed as below.
> >> 
> >> PAPR interface currently supports different ways of communicating these resource
> >> grouping details to the OS. These are referred to as Form 0, Form 1 and Form 2
> >> associativity grouping. Form 0 is the older format and is now considered deprecated.
> >> 
> >> Hypervisor indicates the type/form of associativity used via the "ibm,architecture-vec-5" property.
> >> Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
> >> A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
> >> bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
> >
> > LGTM.
> >
> >> >> +grouping details to the OS. These are referred to as Form 0 and Form 1 associativity grouping.
> >> >> +Form 0 is the older format and is now considered deprecated.
> >> >> +
> >> >> +Hypervisor indicates the type/form of associativity used via the "ibm,architecture-vec-5" property.
> >> >> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
> >> >> +A value of 1 indicates the usage of Form 1 associativity.
> >> >> +
> >> >> +Form 0
> >> >> +------
> >> >> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
> >> >> +
> >> >> +Form 1
> >> >> +------
> >> >> +With Form 1 a combination of ibm,associativity-reference-points and ibm,associativity
> >> >> +device tree properties are used to determine the NUMA distance between resource groups/domains. 
> >> >> +
> >> >> +The “ibm,associativity” property contains one or more lists of numbers (domainID)
> >> >> +representing the resource’s platform grouping domains.
> >> >> +
> >> >> +The “ibm,associativity-reference-points” property contains one or more list of numbers
> >> >> +(domain index) that represents the 1 based ordinal in the associativity lists of the most
> >> >> +significant boundary, with subsequent entries indicating progressively less significant boundaries.
> >> >> +
> >> >> +Linux kernel uses the domain id of the most significant boundary (aka primary domain)
> >> >
> >> > I thought we used the *least* significant boundary (the smallest
> >> > grouping, not the largest).  That is, the last index, not the first.
> >> >
> >> > Actually... come to think of it, I'm not even sure how to interpret
> >> > "most significant".  Does that mean a change in grouping at that "most
> >> > significant" level results in the largest performance difference?
> >> 
> >> PAPR defines "most significant" as below
> >> 
> >> When the “ibm,architecture-vec-5” property byte 5 bit 0 has the value of one, the
> >> “ibm,associativity-reference-points” property indicates boundaries between
> >> associativity domains presented by the “ibm,associativity” property containing
> >> “near” and “far” resources. The first such boundary in the list represents the
> >> 1 based ordinal in the associativity lists of the most significant boundary,
> >> with subsequent entries indicating progressively less significant boundaries
> >
> > No... that's not a definition.  Like your draft, PAPR uses the term
> > while entirely failing to define it.  From what I can tell about how
> > it is used the "most significant" boundary corresponds to what Linux
> > simply thinks of as the node id.  But intuitively, I'd think of that
> > as the "least significant" boundary, since that's basically the
> > smallest granularity at which we care about NUMA distances.
> >
> >
> >> I would interpret it as the boundary where we start defining NUMA
> >> nodes.
> >
> > That isn't any clearer to me.
> 
> How about calling it least significant boundary then?

Heck, no.  My whole point here is that the meaning is unclear: my
first guess at the meaning is different from whoever wrote that text.
We need to come up with a way of describing it that's clearer.

> The “ibm,associativity-reference-points” property contains one or more list of numbers
> (domainID index) that represents the 1 based ordinal in the associativity lists of the
> least significant boundary, with subsequent entries indicating progressively higher
> significant boundaries.
> 
> ex:
> { primary domainID index, secondary domainID index, tertiary domainID index.. }
> 
> Linux kernel uses the domainID of the least significant boundary (aka primary domain)
> as the NUMA node id. Linux kernel computes NUMA distance between two domains by
> recursively comparing if they belong to the same higher-level domains. For mismatch
> at every higher level of the resource group, the kernel doubles the NUMA distance between
> the comparing domains.
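
For what it's worth, this is how I read the doubling rule, as a rough C
sketch.  It assumes the reference points are ordered from the node-level
index to progressively coarser groupings, and that cell 0 of each
associativity array is the count of domainIDs that follow.
form1_distance() and the example_* names are made up for illustration,
they are not from the patch.

```c
#define LOCAL_DISTANCE	10

/* Example data taken from the arrays quoted in this thread. */
static const int example_ref_points[] = { 4, 3, 2 };
static const int example_node0[]  = { 4, 6, 7, 0, 0 };
static const int example_node8[]  = { 4, 6, 9, 8, 8 };
static const int example_node40[] = { 4, 6, 7, 0, 40 };

/*
 * Hypothetical helper: walk the reference points in order and double
 * the distance for every level at which the two resources sit in
 * different domains.  Because cell 0 holds the count, a 1 based
 * reference point can be used directly as an array subscript.
 */
static int form1_distance(const int *assoc_a, const int *assoc_b,
			  const int *ref_points, int nr_ref_points)
{
	int distance = LOCAL_DISTANCE;
	int i;

	for (i = 0; i < nr_ref_points; i++) {
		if (assoc_a[ref_points[i]] == assoc_b[ref_points[i]])
			break;
		distance *= 2;
	}

	return distance;
}
```

With that reading, nodes 0 and 40 come out at distance 20 (they differ
only at index 4), and nodes 0 and 8 at distance 80 (they differ at all
three reference points).  If that is the intended behaviour, the text
should say which end of the reference-points list is the node level
rather than rely on "most" or "least" significant.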
> 
> >
> >> >> +as the NUMA node id. Linux kernel computes NUMA distance between two domains by
> >> >> +recursively comparing if they belong to the same higher-level domains. For mismatch
> >> >> +at every higher level of the resource group, the kernel doubles the NUMA distance between
> >> >> +the comparing domains.
> >> >> +
> >> >> +Form 2
> >> >> +-------
> >> >> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
> >> >> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
> >> >> +domain numbering. With numa distance computation now detached from the index value of
> >> >> +"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
> >> >> +same domain index representing resource groups of different performance/latency characteristics.
> >> >
> >> > The meaning of "domain index" is not clear to me here.
> >> 
> >> Sorry for the confusion there. domain index is the index where domainID
> >> is appearing. W.r.t "ibm,associativity"  we have
> >
> > Ok, I think I eventually deduced that.  We should start out clearly
> > defining both domainID and index here.
> >
> > Also.. I think we need to find more distinct terms, because "index" is
> > being used for both where the ID appears in an associativity array,
> > and also when an ID appears in the Form2 "lookup-index-table" and the
> > two usages are totally unconnected.
> >
> >> The “ibm,associativity” property contains one or more lists of numbers (domainID)
> >> representing the resource’s platform grouping domains. If we can look at
> >> an example property.
> >> 
> >> { 4, 6, 7, 0, 0}
> >> { 4, 6, 7, 0, 40}
> >> 
> >> With Form 1 both NUMA node 0 and 40 will appear at the same distance.
> >> They both are at domain index 4. With Form 2 we can represent them with
> >> different NUMA distance values.
> >
> > Ok.  Note that PAPR was never clear about what space domain IDs need
> > to be unique within: do they need to be (a) globally unique (not true
> > in practice), (b) unique at their index level or (c) unique only
> > within their "parent" node at the previous index level.
> >
> > We should take the opportunity with Form2 to make that clearer.
> >
> > My understanding is that with Form2 it should be entirely feasible to
> > build a dt with associativity arrays that are always of length 1.  Is
> > that correct?
> 
> Correct, unless you have persistent memory device attached in which case
> you need two entries.
> 
> >
> >> >> +
> >> >> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
> >> >> +"ibm,architecture-vec-5" property.
> >> >> +
> >> >> +"ibm,numa-lookup-index-table" property contains one or more list numbers representing
> >> >> +the domainIDs present in the system. The offset of the domainID in this property is considered
> >> >> +the domainID index.
> >> >
> >> > You haven't really introduced the term "domainID".  Is "domainID
> >> > index" the same as "domain index" above?  It's not clear to me.
> >> 
> >> The earlier part of the document said
> >> 
> >> The “ibm,associativity” property contains one or more lists of numbers (domainID)
> >> representing the resource’s platform grouping domains.
> >> 
> >> I will update domain index to domainID index. 
> >> 
> >> >
> >> > The distinction between "domain index" and "primary domain id" is also
> >> > not clear to me.
> >> 
> >> primary domain id is the domainID appearing in the primary domainID
> >> index. Linux kenrel also use that as the NUMA node number.
> >
> > nit s/kenrel/kernel/
> >
> >> Primary domainID index is defined by ibm,associativity-reference-points
> >> and we consider that as the most significant resource group boundary.
> >> 
> >> ibm,associativity-reference-points can be looked at as
> >> { primary domainID index, secondary domainID index, tertiary domainID index.. }
> >
> > Ok, explicitly stating that in the doc would help a lot.
> >
> >> >
> >> >> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
> >> >> +N domainID encoded as with encode-int
> >> >> +
> >> >> +For ex:
> >> >> +ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index for domainID 8 is 1.
> >> >
> >> > Above you say "Form 2 allows a large number of primary domain ids at
> >> > the same domain index", but this encoding doesn't appear to permit
> >> > that.
> >> 
> >> I didn't follow that question.
> >
> > Ah, that's because I was thinking of index here as the index within
> > the lookup-index-table, not the index within the associativity
> > arrays.
> >
> >> >
> >> >> +"ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
> >> >> +distance between resource groups/domains present in the system.
> >> >> +
> >> >> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> >> >> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
> >> >> +
> >> >> +For ex:
> >> >> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
> >> >> +ibm,numa-distance-table     =  {9, 1, 2, 8, 2, 1, 16, 8, 16, 1}
> >> >> +
> >> >> +  | 0    8   40
> >> >> +--|------------
> >> >> +  |
> >> >> +0 | 10   20  80
> >> >> +  |
> >> >> +8 | 20   10  160
> >> >> +  |
> >> >> +40| 80   160  10
> >> >
> >> > What's the reason for multiplying the values by 10 in the expanded
> >> > table version?
> >> 
> >> That was me missing a document update. Since we used 8 bits to encode
> >> distance at some point we were looking at a SCALE factor. But later
> >> realized other architectures also restrict distance to 8 bits. I will
> >> update ibm,numa-distance-table in the document.
> >
> > Ok.
> >
> >> >> +
> >> >> +With Form2 "ibm,associativity" for resources is listed as below:
> >> >> +
> >> >> +"ibm,associativity" property for resources in node 0, 8 and 40
> >> >> +{ 4, 6, 7, 0, 0}
> >> >> +{ 4, 6, 9, 8, 8}
> >> >> +{ 4, 6, 7, 0, 40}
> >> >> +
> >> >> +With "ibm,associativity-reference-points"  { 0x4, 0x3, 0x2 }
> >> >> +
> >> >> +With Form2 the primary domainID and secondary domainID are used to identify the NUMA nodes
> >> >
> >> > What the heck is the secondary domainID
> >> 
> >> domainID appearing at the secondary domainID index.
> >
> > I understand that from the clarifications you've made above, but
> > second domainID index wasn't any more defined in the original draft.
> >
> >> ibm,associativity-reference-points gives an indication of different
> >> hierarchy of resource grouping as below.
> >> 
> >> ibm,associativity-reference-points can be looked at as
> >> { primary domainID index, secondary domainID index, tertiary domainID index.. }
> >> 
> >> >
> >> >> +the kernel should use when using persistent memory devices. Persistent memory devices
> >> >> +can also be used as regular memory using DAX KMEM driver and primary domainID indicates
> >> >> +the numa node number OS should use when using these devices as regular memory. Secondary
> >> >> +domainID is the numa node number that should be used when using this device as
> >> >> +persistent memory. In the later case, we are interested in the locality of the
> >> >> +device to an established numa node. In the above example, if the last row represents a
> >> >> +persistent memory device/resource, NUMA node number 40 will be used when using the device
> >> >> +as regular memory and NUMA node number 0 will be the device numa node when using it as
> >> >> +a persistent memory device.
> >> >> +
> >> >> +Each resource (drcIndex) now also supports additional optional device tree properties.
> >> >> +These properties are marked optional because the platform can choose not to export
> >> >> +them and provide the system topology details using the earlier defined device tree
> >> >> +properties alone. The optional device tree properties are used when adding new resources
> >> >> +(DLPAR) and when the platform didn't provide the topology details of the domain which
> >> >> +contains the newly added resource during boot.
> >> >> +
> >> >> +"ibm,numa-lookup-index" property contains a number representing the domainID index to be used
> >> >> +when building the NUMA distance of the numa node to which this resource belongs. The domain id
> >> >> +of the new resource can be obtained from the existing "ibm,associativity" property. This
> >> >> +can be used to build distance information of a newly onlined NUMA node via DLPAR operation.
> >> >> +The value is 1 based array index value.
> >> >
> >> > Am I correct in thinking that if we have an entirely form2 world, we'd
> >> > only need this and the ibm,associativity properties could be dropped?
> >> 
> >> Not really. ibm,numa-lookup-index-table was added to have a concise
> >> representation of numa distance via ibm,numa-distance-table. 
> >> 
> >> For ex: With domainID 0, 4, 5 we could do a 5x5 matrix to represent the
> >> numa distance. Instead ibm,numa-lookup-index-table allows us to present
> >> the same in a 3x3 matrix  distance[index0][index1] is the  distance
> >> between NUMA node 0 and 4 and distance[index0][index2] is the distance
> >> between NUMA node 0 and 5
> >
> > Right, I get the purpose of it, and I realized I misphrased my
> > question.  My point is that in a Form2 world, the *only* thing the
> > associativity array is used for is to deduce its position in
> > lookup-index-table.  Once you have that for each resource, you
> > have everything you need, yes?
> 
> 
> ibm,associativity is used to find the domainID/NUMA node id of the
> resource.
> 
> ibm,numa-lookup-index-table is used to compute the distance information between
> NUMA nodes using ibm,numa-distance-table.

I get that you need to use lookup-index-table to work out how to
interpret numa-distance-table.  My point is that IIUC once you've done
the lookup in lookup-index-table once for each associativity array
value, the number you get out (which is just a compacted version of the
node id) should be all you need ever again.
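
To make that concrete, a rough sketch of what I mean.  The function
names are mine, the indexes are 0 based offsets past the leading count
cell, and the distance values assume the table gets updated to carry
the unscaled numbers from the expanded matrix, as discussed above.

```c
#define NR_DOMAINS	3

/* Example tables from the draft, with the leading count cell dropped. */
static const unsigned int lookup_index_table[NR_DOMAINS] = { 0, 8, 40 };
static const unsigned char distance_table[NR_DOMAINS * NR_DOMAINS] = {
	10,  20,  80,
	20,  10, 160,
	80, 160,  10,
};

/*
 * Hypothetical helper: done once per associativity array.  Returns the
 * position of domain_id in lookup_index_table, or -1 if it is absent.
 */
static int domain_to_index(unsigned int domain_id)
{
	int i;

	for (i = 0; i < NR_DOMAINS; i++)
		if (lookup_index_table[i] == domain_id)
			return i;

	return -1;
}

/*
 * Hypothetical helper: distance between two compacted indexes.  Nothing
 * here needs the associativity arrays or the domainIDs any more.
 */
static int index_distance(int a, int b)
{
	return distance_table[a * NR_DOMAINS + b];
}
```

So domain_to_index(40) gives 2 once, and after that
index_distance(0, 2) is all anyone needs to ask.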

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson