On Tue, Jun 15, 2021 at 01:10:27PM +0530, Aneesh Kumar K.V wrote:
> David Gibson writes:
>
> > On Tue, Jun 15, 2021 at 10:58:42AM +0530, Aneesh Kumar K.V wrote:
> >> David Gibson writes:
> >>
> >> > On Mon, Jun 14, 2021 at 10:10:02PM +0530, Aneesh Kumar K.V wrote:
> >> >> Signed-off-by: Daniel Henrique Barboza
> >> >> Signed-off-by: Aneesh Kumar K.V
> >> >> ---
> >> >>  Documentation/powerpc/associativity.rst   | 139 ++++++++++++++++++++
> >> >>  arch/powerpc/include/asm/firmware.h       |   3 +-
> >> >>  arch/powerpc/include/asm/prom.h           |   1 +
> >> >>  arch/powerpc/kernel/prom_init.c           |   3 +-
> >> >>  arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
> >> >>  arch/powerpc/platforms/pseries/firmware.c |   1 +
> >> >>  6 files changed, 290 insertions(+), 6 deletions(-)
> >> >>  create mode 100644 Documentation/powerpc/associativity.rst
> >> >>
> >> >> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
> >> >> new file mode 100644
> >> >> index 000000000000..58abedea81d7
> >> >> --- /dev/null
> >> >> +++ b/Documentation/powerpc/associativity.rst
> >> >> @@ -0,0 +1,139 @@
> >> >> +============================
> >> >> +NUMA resource associativity
> >> >> +=============================
> >> >> +
> >> >> +Associativity represents the groupings of the various platform resources into
> >> >> +domains of substantially similar mean performance relative to resources outside
> >> >> +of that domain. Resources subsets of a given domain that exhibit better
> >> >> +performance relative to each other than relative to other resources subsets
> >> >> +are represented as being members of a sub-grouping domain. This performance
> >> >> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
> >> >> +From the platform view, these groups are also referred to as domains.
> >> >> +
> >> >> +PAPR interface currently supports two different ways of communicating these resource
> >> >
> >> > You describe form 2 below as well, which contradicts this.
> >>
> >> Fixed as below.
> >>
> >> PAPR interface currently supports different ways of communicating these resource
> >> grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
> >> associativity grouping. Form 0 is the older format and is now considered deprecated.
> >>
> >> Hypervisor indicates the type/form of associativity used via "ibm,arcitecture-vec-5 property".
> >> Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
> >> A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
> >> bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
> >
> > LGTM.
> >
> >> >> +grouping details to the OS. These are referred to as Form 0 and Form 1 associativity grouping.
> >> >> +Form 0 is the older format and is now considered deprecated.
> >> >> +
> >> >> +Hypervisor indicates the type/form of associativity used via "ibm,arcitecture-vec-5 property".
> >> >> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
> >> >> +A value of 1 indicates the usage of Form 1 associativity.
> >> >> +
> >> >> +Form 0
> >> >> +-----
> >> >> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
> >> >> +
> >> >> +Form 1
> >> >> +-----
> >> >> +With Form 1 a combination of ibm,associativity-reference-points and ibm,associativity
> >> >> +device tree properties are used to determine the NUMA distance between resource groups/domains.
> >> >> +
> >> >> +The “ibm,associativity” property contains one or more lists of numbers (domainID)
> >> >> +representing the resource’s platform grouping domains.
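
Going back to the vec-5 bits a few paragraphs up: here is a minimal
sketch of how I read that encoding, in plain C rather than kernel code.
It assumes PAPR's usual convention that bit 0 is the most significant
bit of a byte, and that "byte 5" is counted from 0 in the raw property
value. The enum, the macros and the function name are all made up for
illustration, and preferring Form 2 when both bits are set is my own
assumption, not something the draft states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: PAPR numbers bits MSB-first, so "bit 0" is 0x80. */
#define VEC5_BIT(n)		(0x80u >> (n))
#define VEC5_ASSOC_BYTE		5	/* byte holding the associativity bits */

enum assoc_form { ASSOC_FORM0, ASSOC_FORM1, ASSOC_FORM2 };

/* Hypothetical helper: pick the associativity form advertised in vec-5. */
static enum assoc_form assoc_form_from_vec5(const uint8_t *vec5, size_t len)
{
	assert(vec5 != NULL);

	if (len <= VEC5_ASSOC_BYTE)
		return ASSOC_FORM0;	/* byte 5 absent: nothing advertised */
	if (vec5[VEC5_ASSOC_BYTE] & VEC5_BIT(2))
		return ASSOC_FORM2;	/* bit 2 of byte 5 (assumed to win) */
	if (vec5[VEC5_ASSOC_BYTE] & VEC5_BIT(0))
		return ASSOC_FORM1;	/* bit 0 of byte 5 */
	return ASSOC_FORM0;
}
```

If the doc spelled out the bit numbering convention explicitly, a reader
would not have to guess whether "bit 0" means 0x80 or 0x01.
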
> >> >> +
> >> >> +The “ibm,associativity-reference-points” property contains one or more list of numbers
> >> >> +(domain index) that represents the 1 based ordinal in the associativity lists of the most
> >> >> +significant boundary, with subsequent entries indicating progressively less significant boundaries.
> >> >> +
> >> >> +Linux kernel uses the domain id of the most significant boundary (aka primary domain)
> >> >
> >> > I thought we used the *least* significant boundary (the smallest
> >> > grouping, not the largest). That is, the last index, not the first.
> >> >
> >> > Actually... come to think of it, I'm not even sure how to interpret
> >> > "most significant". Does that mean a change in grouping at that "most
> >> > significant" level results in the largest perfomance difference?
> >>
> >> PAPR defines "most significant" as below
> >>
> >> When the “ibm,architecture-vec-5” property byte 5 bit 0 has the value of one, the
> >> “ibm,associativity-reference-points” property indicates boundaries between associativity
> >> domains presented by the
> >> “ibm,associativity” property containing “near” and “far” resources. The
> >> first such boundary in the list represents the 1 based ordinal in the
> >> associativity lists of the most significant boundary, with subsequent
> >> entries indicating progressively less significant boundaries
> >
> > No... that's not a definition. Like your draft PAPR uses the term
> > while entirely failing to define it. From what I can tell about how
> > it is used the "most significant" boundary corresponds to what Linux
> > simply thinks of as the node id. But intuitively, I'd think of that
> > as the "least significant" boundary, since that's basically the
> > smallest granularity at which we care about NUMA distances.
> >
> >
> >> I would interpret it as the boundary where we start defining NUMA
> >> nodes.
> >
> > That isn't any clearer to me.
>
> How about calling it least significant boundary then?

Heck, no.
My whole point here is that the meaning is unclear: my first guess at
the meaning is different from that of whoever wrote that text. We need
to come up with a way of describing it that's clearer.

> The “ibm,associativity-reference-points” property contains one or more list of numbers
> (domainID index) that represents the 1 based ordinal in the associativity lists of the
> least significant boundary, with subsequent entries indicating progressively higher
> significant boundaries.
>
> ex:
> { primary domainID index, secondary domainID index, tertiary domainID index.. }
>
> Linux kernel uses the domainID of the least significant boundary (aka primary domain)
> as the NUMA node id. Linux kernel computes NUMA distance between two domains by
> recursively comparing if they belong to the same higher-level domains. For mismatch
> at every higher level of the resource group, the kernel doubles the NUMA distance between
> the comparing domains.
>
> >
> >> >> +as the NUMA node id. Linux kernel computes NUMA distance between two domains by
> >> >> +recursively comparing if they belong to the same higher-level domains. For mismatch
> >> >> +at every higher level of the resource group, the kernel doubles the NUMA distance between
> >> >> +the comparing domains.
> >> >> +
> >> >> +Form 2
> >> >> +-------
> >> >> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
> >> >> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
> >> >> +domain numbering. With numa distance computation now detached from the index value of
> >> >> +"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
> >> >> +same domain index representing resource groups of different
> >> >> performance/latency characteristics.
> >> >
> >> > The meaning of "domain index" is not clear to me here.
> >>
> >> Sorry for the confusion there. domain index is the index where domainID
> >> is appearing.
> >> W.r.t "ibm,associativity" we have
> >
> > Ok, I think I eventually deduced that. We should start out clearly
> > defining both domainID and index here.
> >
> > Also.. I think we need to find more distinct terms, because "index" is
> > being used for both where the ID appears in an associativity array,
> > and also when an ID appears in the Form2 "lookup-index-table" and the
> > two usages are totally unconnected.
> >
> >> The “ibm,associativity” property contains one or more lists of numbers (domainID)
> >> representing the resource’s platform grouping domains. If we can look at
> >> an example property.
> >>
> >> { 4, 6, 7, 0, 0}
> >> { 4, 6, 7, 0, 40}
> >>
> >> With Form 1 both NUMA node 0 and 40 will appear at the same distance.
> >> They both are at domain index 4. With Form 2 we can represent them with
> >> different NUMA distance values.
> >
> > Ok. Note that PAPR was never clear about what space domain IDs need
> > to be unique within: do they need to be (a) globally unique (not true
> > in practice), (b) unique at their index level or (c) unique only
> > within their "parent" node at the previous index level.
> >
> > We should take the opportunity with Form2 to make that clearer.
> >
> > My understanding is that with Form2 it should be entirely feasible to
> > built a dt have associativity arrays that are always of length 1. Is
> > that correct?
>
> Correct, unless you have persistent memory device attached in which case
> you need two entries.
>
> >
> >> >> +
> >> >> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
> >> >> +"ibm,architecture-vec-5" property.
> >> >> +
> >> >> +"ibm,numa-lookup-index-table" property contains one or more list numbers representing
> >> >> +the domainIDs present in the system. The offset of the domainID in this property is considered
> >> >> +the domainID index.
> >> >
> >> > You haven't really introduced the term "domainID". Is "domainID
> >> > index" the same as "domain index" above?
> >> > It's not clear to me.
> >>
> >> The earlier part of the documented said
> >>
> >> The “ibm,associativity” property contains one or more lists of numbers (domainID)
> >> representing the resource’s platform grouping domains.
> >>
> >> I will update domain index to domainID index.
> >>
> >> >
> >> > The distinction between "domain index" and "primary domain id" is also
> >> > not clear to me.
> >>
> >> primary domain id is the domainID appearing in the primary domainID
> >> index. Linux kenrel also use that as the NUMA node number.
> >
> > nit s/kenrel/kernel/
> >
> >> Primary domainID index is defined by ibm,associativity-reference-points
> >> and we consider that as the most significant resource group boundary.
> >>
> >> ibm,associativity-reference-points can be looked at as
> >> { primary domainID index, secondary domainID index, tertiary domainID index.. }
> >
> > Ok, explicitly stating that in the doc would help a lot.
> >
> >> >
> >> >> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
> >> >> +N domainID encoded as with encode-int
> >> >> +
> >> >> +For ex:
> >> >> +ibm,numa-lookup-index-table = {4, 0, 8, 250, 252}, domainID index for domainID 8 is 1.
> >> >
> >> > Above you say "Form 2 allows a large number of primary domain ids at
> >> > the same domain index", but this encoding doesn't appear to permit
> >> > that.
> >>
> >> I didn't follow that question.
> >
> > Ah, that's because I was thinking of index here as the index within
> > the lookup-index-table, not the index within the associativity
> > arrays.
> >
> >> >
> >> >> +"ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
> >> >> +distance between resource groups/domains present in the system.
> >> >> +
> >> >> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> >> >> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
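
To check that I'm reading these two encodings the same way you are,
here is a minimal sketch in plain C (not kernel code; every name in it
is made up for illustration). It assumes the properties have already
been decoded into arrays with the leading count cells stripped, that
positions in the lookup table are counted from 0 as in your
{4, 0, 8, 250, 252} example, and that the distance table is an N x N
row-major matrix indexed by those positions. The row-major layout in
particular is my guess; the draft doesn't say.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: position of domain_id in the lookup table, or -1.
 * 'ids' holds the n domainIDs that follow the leading count cell.
 */
static int lookup_index(const uint32_t *ids, size_t n, uint32_t domain_id)
{
	for (size_t i = 0; i < n; i++)
		if (ids[i] == domain_id)
			return (int)i;
	return -1;
}

/*
 * Hypothetical helper: distance between two domainIDs, or -1 if either
 * one is not in the lookup table.  Assumes 'dist' is an n x n row-major
 * matrix of one-byte distances (so no value can exceed 255).
 */
static int node_distance(const uint32_t *ids, size_t n,
			 const uint8_t *dist, size_t dist_len,
			 uint32_t a, uint32_t b)
{
	int ia = lookup_index(ids, n, a);
	int ib = lookup_index(ids, n, b);

	assert(dist_len == n * n);
	if (ia < 0 || ib < 0)
		return -1;
	return dist[(size_t)ia * n + (size_t)ib];
}
```

With ids = {0, 8, 40} and the nine distance bytes from your example just
below, node_distance() gives 8 for the pair (0, 40).
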
> >> >> +
> >> >> +For ex:
> >> >> +ibm,numa-lookup-index-table = {3, 0, 8, 40}
> >> >> +ibm,numa-distance-table = {9, 1, 2, 8, 2, 1, 16, 8, 16, 1}
> >> >> +
> >> >> +  | 0    8    40
> >> >> +--|------------
> >> >> +  |
> >> >> +0 | 10   20   80
> >> >> +  |
> >> >> +8 | 20   10   160
> >> >> +  |
> >> >> +40| 80   160  10
> >> >
> >> > What's the reason for multiplying the values by 10 in the expanded
> >> > table version?
> >>
> >> That was me missing a document update. Since we used 8 bits to encode
> >> distance at some point we were looking at a SCALE factor. But later
> >> realized other architectures also restrict distance to 8 bits. I will
> >> update ibm,numa-distance-table in the document.
> >
> > Ok.
> >
> >> >> +
> >> >> +With Form2 "ibm,associativity" for resources is listed as below:
> >> >> +
> >> >> +"ibm,associativity" property for resources in node 0, 8 and 40
> >> >> +{ 4, 6, 7, 0, 0}
> >> >> +{ 4, 6, 9, 8, 8}
> >> >> +{ 4, 6, 7, 0, 40}
> >> >> +
> >> >> +With "ibm,associativity-reference-points" { 0x4, 0x3, 0x2 }
> >> >> +
> >> >> +With Form2 the primary domainID and secondary domainID are used to identify the NUMA nodes
> >> >
> >> > What the heck is the secondary domainID
> >>
> >> domainID appearing the secondary domainID index.
> >
> > I understand that from the clarifications you've made about, but
> > second domainID index wasn't any more defined in the original draft.
> >
> >> ibm,associativity-reference-points gives an indication of different
> >> hierachy of resource grouping as below.
> >>
> >> ibm,associativity-reference-points can be looked at as
> >> { primary domainID index, secondary domainID index, tertiary domainID index.. }
> >>
> >> >
> >> >> +the kernel should use when using persistent memory devices. Persistent memory devices
> >> >> +can also be used as regular memory using DAX KMEM driver and primary domainID indicates
> >> >> +the numa node number OS should use when using these devices as regular memory.
> >> >> +Secondary domainID is the numa node number that should be used when using this device as
> >> >> +persistent memory. In the later case, we are interested in the locality of the
> >> >> +device to an established numa node. In the above example, if the last row represents a
> >> >> +persistent memory device/resource, NUMA node number 40 will be used when using the device
> >> >> +as regular memory and NUMA node number 0 will be the device numa node when using it as
> >> >> +a persistent memory device.
> >> >> +
> >> >> +Each resource (drcIndex) now also supports additional optional device tree properties.
> >> >> +These properties are marked optional because the platform can choose not to export
> >> >> +them and provide the system topology details using the earlier defined device tree
> >> >> +properties alone. The optional device tree properties are used when adding new resources
> >> >> +(DLPAR) and when the platform didn't provide the topology details of the domain which
> >> >> +contains the newly added resource during boot.
> >> >> +
> >> >> +"ibm,numa-lookup-index" property contains a number representing the domainID index to be used
> >> >> +when building the NUMA distance of the numa node to which this resource belongs. The domain id
> >> >> +of the new resource can be obtained from the existing "ibm,associativity" property. This
> >> >> +can be used to build distance information of a newly onlined NUMA node via DLPAR operation.
> >> >> +The value is 1 based array index value.
> >> >
> >> > Am I correct in thinking that if we have an entirely form2 world, we'd
> >> > only need this and the ibm,associativity properties could be dropped?
> >>
> >> Not really. ibm,numa-lookup-index-table was added to have a concise
> >> representation of numa distance via ibm,numa-distance-table.
> >>
> >> For ex: With domainID 0, 4, 5 we could do a 5x5 matrix to represent the
> >> numa distance.
> >> Instead ibm,numa-lookup-index-table allows us to present
> >> the same in a 3x3 matrix distance[index0][index1] is the distance
> >> between NUMA node 0 and 4 and distance[index0][index2] is the distance
> >> between NUMA node 0 and 5
> >
> > Right, I get the purpose of it, and I realized I misphrashed my
> > question. My point is that in a Form2 world, the *only* thing the
> > associativity array is used for is to deduce its position in
> > lookup-index-table. Once you have have that for each resource, you
> > have everything you need, yes?
>
>
> ibm,associativity is used find the domainID/NUMA node id of the
> resource.
>
> ibm,lookup-index-table is used compute the distance information between
> NUMA nodes using ibm,numa-distance-table.

I get that you need to use lookup-index-table to work out how to
interpret numa-distance-table. My point is that IIUC once you've done
the lookup in lookup-index-table once for each associativity array
value, the number you get out (which is just a compacted version of the
node id) should be all you need ever again.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you. NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson