* [PATCH v4 0/7] Add support for FORM2 associativity
@ 2021-06-17 16:50 Aneesh Kumar K.V
  2021-06-17 16:50 ` [PATCH v4 1/7] powerpc/pseries: rename min_common_depth to primary_domain_index Aneesh Kumar K.V
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-17 16:50 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, David Gibson, Daniel Henrique Barboza, nvdimm,
	dan.j.williams, Aneesh Kumar K.V

Form2 associativity adds a much more flexible NUMA topology layout
than what is provided by Form1. More details can be found in patch 7.

$ numactl -H
...
node distances:
node   0   1   2   3 
  0:  10  11  222  33 
  1:  44  10  55  66 
  2:  77  88  10  99 
  3:  101  121  132  10 
$

After adding DAX kmem memory:
# numactl -H
available: 5 nodes (0-4)
...
node distances:
node   0   1   2   3   4 
  0:  10  11  222  33  240 
  1:  44  10  55  66  255 
  2:  77  88  10  99  255 
  3:  101  121  132  10  230 
  4:  255  255  255  230  10 


PAPR SCM now uses the NUMA distance details to find the numa_node and target_node
for the device.

kvaneesh@ubuntu-guest:~$ ndctl  list -N -v 
[
  {
    "dev":"namespace0.0",
    "mode":"devdax",
    "map":"dev",
    "size":1071644672,
    "uuid":"d333d867-3f57-44c8-b386-d4d3abdc2bf2",
    "raw_uuid":"915361ad-fe6a-42dd-848f-d6dc9f5af362",
    "daxregion":{
      "id":0,
      "size":1071644672,
      "devices":[
        {
          "chardev":"dax0.0",
          "size":1071644672,
          "target_node":4,
          "mode":"devdax"
        }
      ]
    },
    "align":2097152,
    "numa_node":3
  }
]
kvaneesh@ubuntu-guest:~$ 


The above output is with the following Qemu command line:

-numa node,nodeid=4 \
-numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=222 -numa dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=240 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \
-numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \
-numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=230 \
-numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=230 \
-object memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE}  \
-device nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4

Qemu changes can be found at https://lore.kernel.org/qemu-devel/20210616011944.2996399-1-danielhb413@gmail.com/
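For reference, a minimal sketch (not part of this series) of the generic kernel
pattern that the numa_node lookup above relies on: of_node_to_nid() is the
standard helper, backed on pseries by the associativity parsing these patches
touch. The fallback policy in the sketch is an assumption for illustration,
not something the papr_scm driver mandates.

#include <linux/of.h>
#include <linux/numa.h>
#include <linux/nodemask.h>

/* Derive a usable NUMA node id for a device from its device-tree node. */
static int example_node_to_nid(struct device_node *dn)
{
	int nid = of_node_to_nid(dn);	/* parses "ibm,associativity" on pseries */

	/* Fallback choice is this sketch's assumption, not the driver's. */
	if (nid == NUMA_NO_NODE || !node_online(nid))
		nid = first_online_node;

	return nid;
}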

Changes from v3:
* Drop PAPR SCM specific changes and depend completely on NUMA distance information.

Changes from v2:
* Add nvdimm list to Cc:
* update PATCH 8 commit message.

Changes from v1:
* Update FORM2 documentation.
* rename max_domain_index to max_associativity_domain_index

Aneesh Kumar K.V (7):
  powerpc/pseries: rename min_common_depth to primary_domain_index
  powerpc/pseries: rename distance_ref_points_depth to
    max_associativity_domain_index
  powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
  powerpc/pseries: Consolidate DLPAR NUMA distance update
  powerpc/pseries: Consolidate NUMA distance update during boot
  powerpc/pseries: Add a helper for form1 cpu distance
  powerpc/pseries: Add support for FORM2 associativity

 Documentation/powerpc/associativity.rst       | 135 ++++++
 arch/powerpc/include/asm/firmware.h           |   7 +-
 arch/powerpc/include/asm/prom.h               |   3 +-
 arch/powerpc/kernel/prom_init.c               |   3 +-
 arch/powerpc/mm/numa.c                        | 410 ++++++++++++++----
 arch/powerpc/platforms/pseries/firmware.c     |   3 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c        |   2 +
 arch/powerpc/platforms/pseries/pseries.h      |   1 +
 9 files changed, 474 insertions(+), 92 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

-- 
2.31.1



* [PATCH v4 1/7] powerpc/pseries: rename min_common_depth to primary_domain_index
  2021-06-17 16:50 [PATCH v4 0/7] Add support for FORM2 associativity Aneesh Kumar K.V
@ 2021-06-17 16:50 ` Aneesh Kumar K.V
  2021-06-24  1:46   ` David Gibson
  2021-06-17 16:51 ` [PATCH v4 2/7] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index Aneesh Kumar K.V
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-17 16:50 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, David Gibson, Daniel Henrique Barboza, nvdimm,
	dan.j.williams, Aneesh Kumar K.V

No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/numa.c | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..8365b298ec48 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -51,7 +51,7 @@ EXPORT_SYMBOL(numa_cpu_lookup_table);
 EXPORT_SYMBOL(node_to_cpumask_map);
 EXPORT_SYMBOL(node_data);
 
-static int min_common_depth;
+static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
@@ -232,8 +232,8 @@ static int associativity_to_nid(const __be32 *associativity)
 	if (!numa_enabled)
 		goto out;
 
-	if (of_read_number(associativity, 1) >= min_common_depth)
-		nid = of_read_number(&associativity[min_common_depth], 1);
+	if (of_read_number(associativity, 1) >= primary_domain_index)
+		nid = of_read_number(&associativity[primary_domain_index], 1);
 
 	/* POWER4 LPAR uses 0xffff as invalid node */
 	if (nid == 0xffff || nid >= nr_node_ids)
@@ -284,9 +284,9 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
-static int __init find_min_common_depth(void)
+static int __init find_primary_domain_index(void)
 {
-	int depth;
+	int index;
 	struct device_node *root;
 
 	if (firmware_has_feature(FW_FEATURE_OPAL))
@@ -326,7 +326,7 @@ static int __init find_min_common_depth(void)
 	}
 
 	if (form1_affinity) {
-		depth = of_read_number(distance_ref_points, 1);
+		index = of_read_number(distance_ref_points, 1);
 	} else {
 		if (distance_ref_points_depth < 2) {
 			printk(KERN_WARNING "NUMA: "
@@ -334,7 +334,7 @@ static int __init find_min_common_depth(void)
 			goto err;
 		}
 
-		depth = of_read_number(&distance_ref_points[1], 1);
+		index = of_read_number(&distance_ref_points[1], 1);
 	}
 
 	/*
@@ -348,7 +348,7 @@ static int __init find_min_common_depth(void)
 	}
 
 	of_node_put(root);
-	return depth;
+	return index;
 
 err:
 	of_node_put(root);
@@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 	int nid = default_nid;
 	int rc, index;
 
-	if ((min_common_depth < 0) || !numa_enabled)
+	if ((primary_domain_index < 0) || !numa_enabled)
 		return default_nid;
 
 	rc = of_get_assoc_arrays(&aa);
 	if (rc)
 		return default_nid;
 
-	if (min_common_depth <= aa.array_sz &&
+	if (primary_domain_index <= aa.array_sz &&
 	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
-		index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
+		index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
 		nid = of_read_number(&aa.arrays[index], 1);
 
 		if (nid == 0xffff || nid >= nr_node_ids)
@@ -708,18 +708,18 @@ static int __init parse_numa_properties(void)
 		return -1;
 	}
 
-	min_common_depth = find_min_common_depth();
+	primary_domain_index = find_primary_domain_index();
 
-	if (min_common_depth < 0) {
+	if (primary_domain_index < 0) {
 		/*
-		 * if we fail to parse min_common_depth from device tree
+		 * if we fail to parse primary_domain_index from device tree
 		 * mark the numa disabled, boot with numa disabled.
 		 */
 		numa_enabled = false;
-		return min_common_depth;
+		return primary_domain_index;
 	}
 
-	dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
+	dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
 
 	/*
 	 * Even though we connect cpus to numa domains later in SMP
@@ -919,14 +919,14 @@ static void __init find_possible_nodes(void)
 			goto out;
 	}
 
-	max_nodes = of_read_number(&domains[min_common_depth], 1);
+	max_nodes = of_read_number(&domains[primary_domain_index], 1);
 	for (i = 0; i < max_nodes; i++) {
 		if (!node_possible(i))
 			node_set(i, node_possible_map);
 	}
 
 	prop_length /= sizeof(int);
-	if (prop_length > min_common_depth + 2)
+	if (prop_length > primary_domain_index + 2)
 		coregroup_enabled = 1;
 
 out:
@@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu)
 		goto out;
 
 	index = of_read_number(associativity, 1);
-	if (index > min_common_depth + 1)
+	if (index > primary_domain_index + 1)
 		return of_read_number(&associativity[index - 1], 1);
 
 out:
-- 
2.31.1



* [PATCH v4 2/7] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index
  2021-06-17 16:50 [PATCH v4 0/7] Add support for FORM2 associativity Aneesh Kumar K.V
  2021-06-17 16:50 ` [PATCH v4 1/7] powerpc/pseries: rename min_common_depth to primary_domain_index Aneesh Kumar K.V
@ 2021-06-17 16:51 ` Aneesh Kumar K.V
  2021-06-24  3:10   ` David Gibson
  2021-06-17 16:51 ` [PATCH v4 3/7] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY Aneesh Kumar K.V
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-17 16:51 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, David Gibson, Daniel Henrique Barboza, nvdimm,
	dan.j.williams, Aneesh Kumar K.V

No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/numa.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8365b298ec48..132813dd1a6c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -56,7 +56,7 @@ static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
 #define MAX_DISTANCE_REF_POINTS 4
-static int distance_ref_points_depth;
+static int max_associativity_domain_index;
 static const __be32 *distance_ref_points;
 static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
 
@@ -169,7 +169,7 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 
 	int i, index;
 
-	for (i = 0; i < distance_ref_points_depth; i++) {
+	for (i = 0; i < max_associativity_domain_index; i++) {
 		index = be32_to_cpu(distance_ref_points[i]);
 		if (cpu1_assoc[index] == cpu2_assoc[index])
 			break;
@@ -193,7 +193,7 @@ int __node_distance(int a, int b)
 	if (!form1_affinity)
 		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
-	for (i = 0; i < distance_ref_points_depth; i++) {
+	for (i = 0; i < max_associativity_domain_index; i++) {
 		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
 			break;
 
@@ -213,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
 	if (!form1_affinity)
 		return;
 
-	for (i = 0; i < distance_ref_points_depth; i++) {
+	for (i = 0; i < max_associativity_domain_index; i++) {
 		const __be32 *entry;
 
 		entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
@@ -240,7 +240,7 @@ static int associativity_to_nid(const __be32 *associativity)
 		nid = NUMA_NO_NODE;
 
 	if (nid > 0 &&
-		of_read_number(associativity, 1) >= distance_ref_points_depth) {
+		of_read_number(associativity, 1) >= max_associativity_domain_index) {
 		/*
 		 * Skip the length field and send start of associativity array
 		 */
@@ -310,14 +310,14 @@ static int __init find_primary_domain_index(void)
 	 */
 	distance_ref_points = of_get_property(root,
 					"ibm,associativity-reference-points",
-					&distance_ref_points_depth);
+					&max_associativity_domain_index);
 
 	if (!distance_ref_points) {
 		dbg("NUMA: ibm,associativity-reference-points not found.\n");
 		goto err;
 	}
 
-	distance_ref_points_depth /= sizeof(int);
+	max_associativity_domain_index /= sizeof(int);
 
 	if (firmware_has_feature(FW_FEATURE_OPAL) ||
 	    firmware_has_feature(FW_FEATURE_TYPE1_AFFINITY)) {
@@ -328,7 +328,7 @@ static int __init find_primary_domain_index(void)
 	if (form1_affinity) {
 		index = of_read_number(distance_ref_points, 1);
 	} else {
-		if (distance_ref_points_depth < 2) {
+		if (max_associativity_domain_index < 2) {
 			printk(KERN_WARNING "NUMA: "
 				"short ibm,associativity-reference-points\n");
 			goto err;
@@ -341,10 +341,10 @@ static int __init find_primary_domain_index(void)
 	 * Warn and cap if the hardware supports more than
 	 * MAX_DISTANCE_REF_POINTS domains.
 	 */
-	if (distance_ref_points_depth > MAX_DISTANCE_REF_POINTS) {
+	if (max_associativity_domain_index > MAX_DISTANCE_REF_POINTS) {
 		printk(KERN_WARNING "NUMA: distance array capped at "
 			"%d entries\n", MAX_DISTANCE_REF_POINTS);
-		distance_ref_points_depth = MAX_DISTANCE_REF_POINTS;
+		max_associativity_domain_index = MAX_DISTANCE_REF_POINTS;
 	}
 
 	of_node_put(root);
-- 
2.31.1



* [PATCH v4 3/7] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
  2021-06-17 16:50 [PATCH v4 0/7] Add support for FORM2 associativity Aneesh Kumar K.V
  2021-06-17 16:50 ` [PATCH v4 1/7] powerpc/pseries: rename min_common_depth to primary_domain_index Aneesh Kumar K.V
  2021-06-17 16:51 ` [PATCH v4 2/7] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index Aneesh Kumar K.V
@ 2021-06-17 16:51 ` Aneesh Kumar K.V
  2021-06-17 16:51 ` [PATCH v4 4/7] powerpc/pseries: Consolidate DLPAR NUMA distance update Aneesh Kumar K.V
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-17 16:51 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, David Gibson, Daniel Henrique Barboza, nvdimm,
	dan.j.williams, Aneesh Kumar K.V

Also make related code cleanups that will allow adding FORM2_AFFINITY in
later patches. No functional change in this patch.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/firmware.h       |  4 +--
 arch/powerpc/include/asm/prom.h           |  2 +-
 arch/powerpc/kernel/prom_init.c           |  2 +-
 arch/powerpc/mm/numa.c                    | 35 ++++++++++++++---------
 arch/powerpc/platforms/pseries/firmware.c |  2 +-
 5 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 7604673787d6..60b631161360 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -44,7 +44,7 @@
 #define FW_FEATURE_OPAL		ASM_CONST(0x0000000010000000)
 #define FW_FEATURE_SET_MODE	ASM_CONST(0x0000000040000000)
 #define FW_FEATURE_BEST_ENERGY	ASM_CONST(0x0000000080000000)
-#define FW_FEATURE_TYPE1_AFFINITY ASM_CONST(0x0000000100000000)
+#define FW_FEATURE_FORM1_AFFINITY ASM_CONST(0x0000000100000000)
 #define FW_FEATURE_PRRN		ASM_CONST(0x0000000200000000)
 #define FW_FEATURE_DRMEM_V2	ASM_CONST(0x0000000400000000)
 #define FW_FEATURE_DRC_INFO	ASM_CONST(0x0000000800000000)
@@ -69,7 +69,7 @@ enum {
 		FW_FEATURE_SPLPAR | FW_FEATURE_LPAR |
 		FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
 		FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
-		FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
+		FW_FEATURE_FORM1_AFFINITY | FW_FEATURE_PRRN |
 		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
 		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
 		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 324a13351749..df9fec9d232c 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -147,7 +147,7 @@ extern int of_read_drc_info_cell(struct property **prop,
 #define OV5_MSI			0x0201	/* PCIe/MSI support */
 #define OV5_CMO			0x0480	/* Cooperative Memory Overcommitment */
 #define OV5_XCMO		0x0440	/* Page Coalescing */
-#define OV5_TYPE1_AFFINITY	0x0580	/* Type 1 NUMA affinity */
+#define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
 #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
 #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
 #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 41ed7e33d897..64b9593038a7 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1070,7 +1070,7 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
 #else
 		0,
 #endif
-		.associativity = OV5_FEAT(OV5_TYPE1_AFFINITY) | OV5_FEAT(OV5_PRRN),
+		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
 		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
 		.micro_checkpoint = 0,
 		.reserved0 = 0,
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 132813dd1a6c..0ec16999beef 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -53,7 +53,10 @@ EXPORT_SYMBOL(node_data);
 
 static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
-static int form1_affinity;
+
+#define FORM0_AFFINITY 0
+#define FORM1_AFFINITY 1
+static int affinity_form;
 
 #define MAX_DISTANCE_REF_POINTS 4
 static int max_associativity_domain_index;
@@ -190,7 +193,7 @@ int __node_distance(int a, int b)
 	int i;
 	int distance = LOCAL_DISTANCE;
 
-	if (!form1_affinity)
+	if (affinity_form == FORM0_AFFINITY)
 		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
 	for (i = 0; i < max_associativity_domain_index; i++) {
@@ -210,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
 {
 	int i;
 
-	if (!form1_affinity)
+	if (affinity_form != FORM1_AFFINITY)
 		return;
 
 	for (i = 0; i < max_associativity_domain_index; i++) {
@@ -289,6 +292,17 @@ static int __init find_primary_domain_index(void)
 	int index;
 	struct device_node *root;
 
+	/*
+	 * Check for which form of affinity.
+	 */
+	if (firmware_has_feature(FW_FEATURE_OPAL)) {
+		affinity_form = FORM1_AFFINITY;
+	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
+		dbg("Using form 1 affinity\n");
+		affinity_form = FORM1_AFFINITY;
+	} else
+		affinity_form = FORM0_AFFINITY;
+
 	if (firmware_has_feature(FW_FEATURE_OPAL))
 		root = of_find_node_by_path("/ibm,opal");
 	else
@@ -318,23 +332,16 @@ static int __init find_primary_domain_index(void)
 	}
 
 	max_associativity_domain_index /= sizeof(int);
-
-	if (firmware_has_feature(FW_FEATURE_OPAL) ||
-	    firmware_has_feature(FW_FEATURE_TYPE1_AFFINITY)) {
-		dbg("Using form 1 affinity\n");
-		form1_affinity = 1;
-	}
-
-	if (form1_affinity) {
-		index = of_read_number(distance_ref_points, 1);
-	} else {
+	if (affinity_form == FORM0_AFFINITY) {
 		if (max_associativity_domain_index < 2) {
 			printk(KERN_WARNING "NUMA: "
-				"short ibm,associativity-reference-points\n");
+			       "short ibm,associativity-reference-points\n");
 			goto err;
 		}
 
 		index = of_read_number(&distance_ref_points[1], 1);
+	} else {
+		index = of_read_number(distance_ref_points, 1);
 	}
 
 	/*
diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
index 4c7b7f5a2ebc..5d4c2bc20bba 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -119,7 +119,7 @@ struct vec5_fw_feature {
 
 static __initdata struct vec5_fw_feature
 vec5_fw_features_table[] = {
-	{FW_FEATURE_TYPE1_AFFINITY,	OV5_TYPE1_AFFINITY},
+	{FW_FEATURE_FORM1_AFFINITY,	OV5_FORM1_AFFINITY},
 	{FW_FEATURE_PRRN,		OV5_PRRN},
 	{FW_FEATURE_DRMEM_V2,		OV5_DRMEM_V2},
 	{FW_FEATURE_DRC_INFO,		OV5_DRC_INFO},
-- 
2.31.1



* [PATCH v4 4/7] powerpc/pseries: Consolidate DLPAR NUMA distance update
  2021-06-17 16:50 [PATCH v4 0/7] Add support for FORM2 associativity Aneesh Kumar K.V
                   ` (2 preceding siblings ...)
  2021-06-17 16:51 ` [PATCH v4 3/7] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY Aneesh Kumar K.V
@ 2021-06-17 16:51 ` Aneesh Kumar K.V
  2021-06-24  3:11   ` David Gibson
  2021-06-17 16:51 ` [PATCH v4 5/7] powerpc/pseries: Consolidate NUMA distance update during boot Aneesh Kumar K.V
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-17 16:51 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, David Gibson, Daniel Henrique Barboza, nvdimm,
	dan.j.williams, Aneesh Kumar K.V

The associativity details of the newly added resources are collected from
the hypervisor via the "ibm,configure-connector" RTAS call. Update the NUMA
distance details of the newly added NUMA node after the above call. A later
patch will remove the NUMA distance update done while looking up the node id
from the associativity array.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/numa.c                        | 41 +++++++++++++++++++
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |  2 +
 .../platforms/pseries/hotplug-memory.c        |  2 +
 arch/powerpc/platforms/pseries/pseries.h      |  1 +
 4 files changed, 46 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0ec16999beef..645a95e3a7ea 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -287,6 +287,47 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
+static void __initialize_form1_numa_distance(const __be32 *associativity)
+{
+	int i, nid;
+
+	if (of_read_number(associativity, 1) >= primary_domain_index) {
+		nid = of_read_number(&associativity[primary_domain_index], 1);
+
+		for (i = 0; i < max_domain_index; i++) {
+			const __be32 *entry;
+
+			entry = &associativity[be32_to_cpu(distance_ref_points[i])];
+			distance_lookup_table[nid][i] = of_read_number(entry, 1);
+		}
+	}
+}
+
+static void initialize_form1_numa_distance(struct device_node *node)
+{
+	const __be32 *associativity;
+
+	associativity = of_get_associativity(node);
+	if (!associativity)
+		return;
+
+	__initialize_form1_numa_distance(associativity);
+	return;
+}
+
+/*
+ * Used to update distance information w.r.t newly added node.
+ */
+void update_numa_distance(struct device_node *node)
+{
+	if (affinity_form == FORM0_AFFINITY)
+		return;
+	else if (affinity_form == FORM1_AFFINITY) {
+		initialize_form1_numa_distance(node);
+		return;
+	}
+}
+
 static int __init find_primary_domain_index(void)
 {
 	int index;
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 7e970f81d8ff..778b6ab35f0d 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -498,6 +498,8 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
 		return saved_rc;
 	}
 
+	update_numa_distance(dn);
+
 	rc = dlpar_online_cpu(dn);
 	if (rc) {
 		saved_rc = rc;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..0e602c3b01ea 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -180,6 +180,8 @@ static int update_lmb_associativity_index(struct drmem_lmb *lmb)
 		return -ENODEV;
 	}
 
+	update_numa_distance(lmb_node);
+
 	dr_node = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
 	if (!dr_node) {
 		dlpar_free_cc_nodes(lmb_node);
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 1f051a786fb3..663a0859cf13 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -113,4 +113,5 @@ extern u32 pseries_security_flavor;
 void pseries_setup_security_mitigations(void);
 void pseries_lpar_read_hblkrm_characteristics(void);
 
+void update_numa_distance(struct device_node *node);
 #endif /* _PSERIES_PSERIES_H */
-- 
2.31.1



* [PATCH v4 5/7] powerpc/pseries: Consolidate NUMA distance update during boot
  2021-06-17 16:50 [PATCH v4 0/7] Add support for FORM2 associativity Aneesh Kumar K.V
                   ` (3 preceding siblings ...)
  2021-06-17 16:51 ` [PATCH v4 4/7] powerpc/pseries: Consolidate DLPAR NUMA distance update Aneesh Kumar K.V
@ 2021-06-17 16:51 ` Aneesh Kumar K.V
  2021-06-17 16:51 ` [PATCH v4 6/7] powerpc/pseries: Add a helper for form1 cpu distance Aneesh Kumar K.V
  2021-06-17 16:51 ` [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity Aneesh Kumar K.V
  6 siblings, 0 replies; 20+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-17 16:51 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, David Gibson, Daniel Henrique Barboza, nvdimm,
	dan.j.williams, Aneesh Kumar K.V

Instead of updating the NUMA distance every time we look up a node id
from the associativity property, add helpers that can be used
during boot which do this only once. Also remove the distance
update from the node id lookup helpers.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/numa.c | 135 +++++++++++++++++++++++++++--------------
 1 file changed, 88 insertions(+), 47 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 645a95e3a7ea..c481f08d565b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -208,22 +208,6 @@ int __node_distance(int a, int b)
 }
 EXPORT_SYMBOL(__node_distance);
 
-static void initialize_distance_lookup_table(int nid,
-		const __be32 *associativity)
-{
-	int i;
-
-	if (affinity_form != FORM1_AFFINITY)
-		return;
-
-	for (i = 0; i < max_associativity_domain_index; i++) {
-		const __be32 *entry;
-
-		entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
-		distance_lookup_table[nid][i] = of_read_number(entry, 1);
-	}
-}
-
 /*
  * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
  * info is found.
@@ -241,15 +225,6 @@ static int associativity_to_nid(const __be32 *associativity)
 	/* POWER4 LPAR uses 0xffff as invalid node */
 	if (nid == 0xffff || nid >= nr_node_ids)
 		nid = NUMA_NO_NODE;
-
-	if (nid > 0 &&
-		of_read_number(associativity, 1) >= max_associativity_domain_index) {
-		/*
-		 * Skip the length field and send start of associativity array
-		 */
-		initialize_distance_lookup_table(nid, associativity + 1);
-	}
-
 out:
 	return nid;
 }
@@ -291,10 +266,13 @@ static void __initialize_form1_numa_distance(const __be32 *associativity)
 {
 	int i, nid;
 
+	if (affinity_form != FORM1_AFFINITY)
+		return;
+
 	if (of_read_number(associativity, 1) >= primary_domain_index) {
 		nid = of_read_number(&associativity[primary_domain_index], 1);
 
-		for (i = 0; i < max_domain_index; i++) {
+		for (i = 0; i < max_associativity_domain_index; i++) {
 			const __be32 *entry;
 
 			entry = &associativity[be32_to_cpu(distance_ref_points[i])];
@@ -474,6 +452,48 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
 	return 0;
 }
 
+static int get_nid_and_numa_distance(struct drmem_lmb *lmb)
+{
+	struct assoc_arrays aa = { .arrays = NULL };
+	int default_nid = NUMA_NO_NODE;
+	int nid = default_nid;
+	int rc, index;
+
+	if ((primary_domain_index < 0) || !numa_enabled)
+		return default_nid;
+
+	rc = of_get_assoc_arrays(&aa);
+	if (rc)
+		return default_nid;
+
+	if (primary_domain_index <= aa.array_sz &&
+	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
+		index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
+		nid = of_read_number(&aa.arrays[index], 1);
+
+		if (nid == 0xffff || nid >= nr_node_ids)
+			nid = default_nid;
+		if (nid > 0 && affinity_form == FORM1_AFFINITY) {
+			int i;
+			const __be32 *associativity;
+
+			index = lmb->aa_index * aa.array_sz;
+			associativity = &aa.arrays[index];
+			/*
+			 * Lookup array associativity entries have a different format:
+			 * there is no array length as the first element.
+			 */
+			for (i = 0; i < max_associativity_domain_index; i++) {
+				const __be32 *entry;
+
+				entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
+				distance_lookup_table[nid][i] = of_read_number(entry, 1);
+			}
+		}
+	}
+	return nid;
+}
+
 /*
  * This is like of_node_to_nid_single() for memory represented in the
  * ibm,dynamic-reconfiguration-memory node.
@@ -499,21 +519,14 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 
 		if (nid == 0xffff || nid >= nr_node_ids)
 			nid = default_nid;
-
-		if (nid > 0) {
-			index = lmb->aa_index * aa.array_sz;
-			initialize_distance_lookup_table(nid,
-							&aa.arrays[index]);
-		}
 	}
-
 	return nid;
 }
 
 #ifdef CONFIG_PPC_SPLPAR
-static int vphn_get_nid(long lcpu)
+
+static int __vphn_get_associativity(long lcpu, __be32 *associativity)
 {
-	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
 	long rc, hwid;
 
 	/*
@@ -533,10 +546,22 @@ static int vphn_get_nid(long lcpu)
 
 		rc = hcall_vphn(hwid, VPHN_FLAG_VCPU, associativity);
 		if (rc == H_SUCCESS)
-			return associativity_to_nid(associativity);
+			return 0;
 	}
 
+	return -1;
+}
+
+static int vphn_get_nid(long lcpu)
+{
+	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+
+
+	if (!__vphn_get_associativity(lcpu, associativity))
+		return associativity_to_nid(associativity);
+
 	return NUMA_NO_NODE;
+
 }
 #else
 static int vphn_get_nid(long unused)
@@ -733,7 +758,7 @@ static int __init numa_setup_drmem_lmb(struct drmem_lmb *lmb,
 			size = read_n_cells(n_mem_size_cells, usm);
 		}
 
-		nid = of_drconf_to_nid_single(lmb);
+		nid = get_nid_and_numa_distance(lmb);
 		fake_numa_create_new_node(((base + size) >> PAGE_SHIFT),
 					  &nid);
 		node_set_online(nid);
@@ -750,6 +775,7 @@ static int __init parse_numa_properties(void)
 	struct device_node *memory;
 	int default_nid = 0;
 	unsigned long i;
+	const __be32 *associativity;
 
 	if (numa_enabled == 0) {
 		printk(KERN_WARNING "NUMA disabled by user\n");
@@ -775,18 +801,30 @@ static int __init parse_numa_properties(void)
 	 * each node to be onlined must have NODE_DATA etc backing it.
 	 */
 	for_each_present_cpu(i) {
+		__be32 vphn_assoc[VPHN_ASSOC_BUFSIZE];
 		struct device_node *cpu;
-		int nid = vphn_get_nid(i);
+		int nid = NUMA_NO_NODE;
 
-		/*
-		 * Don't fall back to default_nid yet -- we will plug
-		 * cpus into nodes once the memory scan has discovered
-		 * the topology.
-		 */
-		if (nid == NUMA_NO_NODE) {
+		memset(vphn_assoc, 0, VPHN_ASSOC_BUFSIZE * sizeof(__be32));
+
+		if (__vphn_get_associativity(i, vphn_assoc) == 0) {
+			nid = associativity_to_nid(vphn_assoc);
+			__initialize_form1_numa_distance(vphn_assoc);
+		} else {
+
+			/*
+			 * Don't fall back to default_nid yet -- we will plug
+			 * cpus into nodes once the memory scan has discovered
+			 * the topology.
+			 */
 			cpu = of_get_cpu_node(i, NULL);
 			BUG_ON(!cpu);
-			nid = of_node_to_nid_single(cpu);
+
+			associativity = of_get_associativity(cpu);
+			if (associativity) {
+				nid = associativity_to_nid(associativity);
+				__initialize_form1_numa_distance(associativity);
+			}
 			of_node_put(cpu);
 		}
 
@@ -822,8 +860,11 @@ static int __init parse_numa_properties(void)
 		 * have associativity properties.  If none, then
 		 * everything goes to default_nid.
 		 */
-		nid = of_node_to_nid_single(memory);
-		if (nid < 0)
+		associativity = of_get_associativity(memory);
+		if (associativity) {
+			nid = associativity_to_nid(associativity);
+			__initialize_form1_numa_distance(associativity);
+		} else
 			nid = default_nid;
 
 		fake_numa_create_new_node(((start + size) >> PAGE_SHIFT), &nid);
-- 
2.31.1



* [PATCH v4 6/7] powerpc/pseries: Add a helper for form1 cpu distance
  2021-06-17 16:50 [PATCH v4 0/7] Add support for FORM2 associativity Aneesh Kumar K.V
                   ` (4 preceding siblings ...)
  2021-06-17 16:51 ` [PATCH v4 5/7] powerpc/pseries: Consolidate NUMA distance update during boot Aneesh Kumar K.V
@ 2021-06-17 16:51 ` Aneesh Kumar K.V
  2021-06-17 16:51 ` [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity Aneesh Kumar K.V
  6 siblings, 0 replies; 20+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-17 16:51 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, David Gibson, Daniel Henrique Barboza, nvdimm,
	dan.j.williams, Aneesh Kumar K.V

This helper is only used with the dispatch trace log collection.
A later patch will add Form2 affinity support and this change helps
in keeping that simpler. Also add a comment explaining that we don't expect
the code to be called with FORM0.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/numa.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c481f08d565b..d32729f235b8 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -166,7 +166,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
-int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
 	int dist = 0;
 
@@ -182,6 +182,14 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 	return dist;
 }
 
+int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+{
+	/* We should not get called with FORM0 */
+	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
+
+	return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
+}
+
 /* must hold reference to node during call */
 static const __be32 *of_get_associativity(struct device_node *dev)
 {
-- 
2.31.1



* [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity
  2021-06-17 16:50 [PATCH v4 0/7] Add support for FORM2 associativity Aneesh Kumar K.V
                   ` (5 preceding siblings ...)
  2021-06-17 16:51 ` [PATCH v4 6/7] powerpc/pseries: Add a helper for form1 cpu distance Aneesh Kumar K.V
@ 2021-06-17 16:51 ` Aneesh Kumar K.V
  2021-06-21 22:02   ` Daniel Henrique Barboza
                     ` (2 more replies)
  6 siblings, 3 replies; 20+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-17 16:51 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, David Gibson, Daniel Henrique Barboza, nvdimm,
	dan.j.williams, Aneesh Kumar K.V

The PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 Documentation/powerpc/associativity.rst   | 135 ++++++++++++++++++++
 arch/powerpc/include/asm/firmware.h       |   3 +-
 arch/powerpc/include/asm/prom.h           |   1 +
 arch/powerpc/kernel/prom_init.c           |   3 +-
 arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 6 files changed, 286 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
new file mode 100644
index 000000000000..93be604ac54d
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,135 @@
+===========================
+NUMA resource associativity
+===========================
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resources subsets of a given domain that exhibit better
+performance relative to each other than relative to other resources subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux kernel.
+From the platform view, these groups are also referred to as domains.
+
+The PAPR interface currently supports different ways of communicating these resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form 2
+associativity grouping. Form 0 is the older format and is now considered deprecated.
+
+The hypervisor indicates the type/form of associativity used via the "ibm,architecture-vec-5" property.
+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity,
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+------
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+------
+With Form 1, a combination of the ibm,associativity-reference-points and ibm,associativity
+device tree properties is used to determine the NUMA distance between resource groups/domains.
+
+The "ibm,associativity" property contains one or more lists of numbers (domainIDs)
+representing the resource's platform grouping domains.
+
+The "ibm,associativity-reference-points" property contains one or more lists of numbers
+(domainID indexes) that represent the 1-based ordinal in the associativity lists.
+The list of domainID indexes represents an increasing hierarchy of resource grouping.
+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+The Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
+The Linux kernel computes the NUMA distance between two domains by recursively comparing
+whether they belong to the same higher-level domains. For a mismatch at every higher
+level of the resource group, the kernel doubles the NUMA distance between the
+domains being compared.
+
+Form 2
+-------
+The Form 2 associativity format adds separate device tree properties representing NUMA node distance,
+thereby making the node distance computation flexible. Form 2 also allows flexible primary
+domain numbering. With the NUMA distance computation now detached from the index value of the
+"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
+same domainID index, representing resource groups of different performance/latency characteristics.
+
+The hypervisor indicates the usage of Form 2 associativity using bit 2 of byte 5 in the
+"ibm,architecture-vec-5" property.
+
+"ibm,numa-lookup-index-table" property contains one or more list numbers representing
+the domainIDs present in the system. The offset of the domainID in this property is considered
+the domainID index.
+
+prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
+N domainID encoded as with encode-int
+
+For ex:
+ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index for domainID 8 is 1.
+
+"ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
+distance between resource groups/domains present in the system.
+
+prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
+N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
+
+For ex:
+ibm,numa-lookup-index-table =  {3, 0, 8, 40}
+ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
+
+  | 0    8   40
+--|------------
+  |
+0 | 10   20  80
+  |
+8 | 20   10  160
+  |
+40| 80   160  10
+
+
+"ibm,associativity" property for resources in node 0, 8 and 40
+
+{ 3, 6, 7, 0 }
+{ 3, 6, 9, 8 }
+{ 3, 6, 7, 40}
+
+With "ibm,associativity-reference-points"  { 0x3 }
+
+Each resource (drcIndex) now also supports additional optional device tree properties.
+These properties are marked optional because the platform can choose not to export
+them and provide the system topology details using the earlier defined device tree
+properties alone. The optional device tree properties are used when adding new resources
+(DLPAR) and when the platform didn't provide the topology details of the domain which
+contains the newly added resource during boot.
+
+"ibm,numa-lookup-index" property contains a number representing the domainID index to be used
+when building the NUMA distance of the numa node to which this resource belongs. This can
+be looked at as the index at which this new domainID would have appeared in
+"ibm,numa-lookup-index-table" if the domain was present during boot. The domainID
+of the new resource can be obtained from the existing "ibm,associativity" property. This
+can be used to build distance information of a newly onlined NUMA node via DLPAR operation.
+The value is 1 based array index value.
+
+prop-encoded-array: An integer encoded as with encode-int specifying the domainID index
+
+"ibm,numa-distance" property contains one or more list of numbers presenting the NUMA distance
+from this resource domain to other resources.
+
+prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
+N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
+
+For ex:
+ibm,associativity     = { 4, 5, 10, 50}
+ibm,numa-lookup-index = { 4 }
+ibm,numa-distance   =  {8, 160, 255, 80, 10, 160, 255, 80, 10}
+
+resulting in a new topology as below.
+  | 0    8   40   50
+--|------------------
+  |
+0 | 10   20  80   160
+  |
+8 | 20   10  160  255
+  |
+40| 80   160  10  80
+  |
+50| 160  255  80  10
+
diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 60b631161360..97a3bd9ffeb9 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -53,6 +53,7 @@
 #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
 #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
 #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
+#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
 
 #ifndef __ASSEMBLY__
 
@@ -73,7 +74,7 @@ enum {
 		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
 		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
 		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
-		FW_FEATURE_RPT_INVALIDATE,
+		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
 	FW_FEATURE_PSERIES_ALWAYS = 0,
 	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
 	FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index df9fec9d232c..5c80152e8f18 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
 #define OV5_XCMO		0x0440	/* Page Coalescing */
 #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
 #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
+#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
 #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
 #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
 #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 64b9593038a7..496fdac54c29 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1070,7 +1070,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
 #else
 		0,
 #endif
-		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
+		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
+		OV5_FEAT(OV5_FORM2_AFFINITY),
 		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
 		.micro_checkpoint = 0,
 		.reserved0 = 0,
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d32729f235b8..5a7d94960fb7 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
 
 #define FORM0_AFFINITY 0
 #define FORM1_AFFINITY 1
+#define FORM2_AFFINITY 2
 static int affinity_form;
 
 #define MAX_DISTANCE_REF_POINTS 4
 static int max_associativity_domain_index;
 static const __be32 *distance_ref_points;
 static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
+static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
+	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
+};
+static int numa_id_index_table[MAX_NUMNODES];
 
 /*
  * Allocate node_to_cpumask_map based on number of available nodes
@@ -166,6 +171,27 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
+/*
+ * With FORM2, if we are not using logical domain ids, we will find
+ * both primary and secondary domains with the same value. Hence always
+ * start the comparison from the secondary domain.
+ */
+static int __cpu_form2_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+{
+	int dist = 0;
+
+	int i, index;
+
+	for (i = 1; i < max_associativity_domain_index; i++) {
+		index = be32_to_cpu(distance_ref_points[i]);
+		if (cpu1_assoc[index] == cpu2_assoc[index])
+			break;
+		dist++;
+	}
+
+	return dist;
+}
+
 static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
 	int dist = 0;
@@ -178,7 +204,6 @@ static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 			break;
 		dist++;
 	}
-
 	return dist;
 }
 
@@ -186,8 +211,9 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
 	/* We should not get called with FORM0 */
 	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
-
-	return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
+	if (affinity_form == FORM1_AFFINITY)
+		return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
+	return __cpu_form2_distance(cpu1_assoc, cpu2_assoc);
 }
 
 /* must hold reference to node during call */
@@ -201,7 +227,9 @@ int __node_distance(int a, int b)
 	int i;
 	int distance = LOCAL_DISTANCE;
 
-	if (affinity_form == FORM0_AFFINITY)
+	if (affinity_form == FORM2_AFFINITY)
+		return numa_distance_table[a][b];
+	else if (affinity_form == FORM0_AFFINITY)
 		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
 	for (i = 0; i < max_associativity_domain_index; i++) {
@@ -303,15 +331,116 @@ static void initialize_form1_numa_distance(struct device_node *node)
 
 /*
  * Used to update distance information w.r.t newly added node.
+ * ibm,numa-lookup-index -> 4
+ * ibm,numa-distance -> {5, 20, 40, 60, 80, 10 }
  */
 void update_numa_distance(struct device_node *node)
 {
+	int i, nid, other_nid, other_nid_index = 0;
+	const __be32 *numa_indexp;
+	const __u8  *numa_distancep;
+	int numa_index, max_numa_index, numa_distance;
+
 	if (affinity_form == FORM0_AFFINITY)
 		return;
 	else if (affinity_form == FORM1_AFFINITY) {
 		initialize_form1_numa_distance(node);
 		return;
 	}
+	/* FORM2 affinity  */
+
+	nid = of_node_to_nid_single(node);
+	if (nid == NUMA_NO_NODE)
+		return;
+
+	/* Already initialized */
+	if (numa_distance_table[nid][nid] != -1)
+		return;
+	/*
+	 * update node distance if not already populated.
+	 */
+	numa_distancep = of_get_property(node, "ibm,numa-distance", NULL);
+	if (!numa_distancep)
+		return;
+
+	numa_indexp = of_get_property(node, "ibm,numa-lookup-index", NULL);
+	if (!numa_indexp)
+		return;
+
+	numa_index = of_read_number(numa_indexp, 1);
+	/*
+	 * Update the numa_id_index_table. The device tree treats the index
+	 * table as using 1-based array indexing.
+	 */
+	numa_id_index_table[numa_index - 1] = nid;
+
+	max_numa_index = of_read_number((const __be32 *)numa_distancep, 1);
+	VM_WARN_ON(max_numa_index != 2 * numa_index);
+	/* Skip the size which is encoded int */
+	numa_distancep += sizeof(__be32);
+
+	/*
+	 * First fill the distance information from other node to this node.
+	 */
+	other_nid_index = 0;
+	for (i = 0; i < numa_index; i++) {
+		numa_distance = numa_distancep[i];
+		other_nid = numa_id_index_table[other_nid_index++];
+		numa_distance_table[other_nid][nid] = numa_distance;
+	}
+
+	other_nid_index = 0;
+	for (; i < max_numa_index; i++) {
+		numa_distance = numa_distancep[i];
+		other_nid = numa_id_index_table[other_nid_index++];
+		numa_distance_table[nid][other_nid] = numa_distance;
+	}
+}
+
+/*
+ * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
+ * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
+ */
+static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
+{
+	const __u8 *numa_dist_table;
+	const __be32 *numa_lookup_index;
+	int numa_dist_table_length;
+	int max_numa_index, distance_index;
+	int i, curr_row = 0, curr_column = 0;
+
+	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
+	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
+
+	/* first element of the array is the size and is encode-int */
+	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
+	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
+	/* Skip the size which is encoded int */
+	numa_dist_table += sizeof(__be32);
+
+	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d \n",
+		 numa_dist_table_length, max_numa_index);
+
+	for (i = 0; i < max_numa_index; i++)
+		/* +1 skip the max_numa_index in the property */
+		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
+
+
+	VM_WARN_ON(numa_dist_table_length != max_numa_index * max_numa_index);
+
+	for (distance_index = 0; distance_index < numa_dist_table_length; distance_index++) {
+		int nodeA = numa_id_index_table[curr_row];
+		int nodeB = numa_id_index_table[curr_column++];
+
+		numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index];
+
+		pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
+		if (curr_column >= max_numa_index) {
+			curr_row++;
+			/* reset the column */
+			curr_column = 0;
+		}
+	}
 }
 
 static int __init find_primary_domain_index(void)
@@ -324,6 +453,9 @@ static int __init find_primary_domain_index(void)
 	 */
 	if (firmware_has_feature(FW_FEATURE_OPAL)) {
 		affinity_form = FORM1_AFFINITY;
+	} else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) {
+		dbg("Using form 2 affinity\n");
+		affinity_form = FORM2_AFFINITY;
 	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
 		dbg("Using form 1 affinity\n");
 		affinity_form = FORM1_AFFINITY;
@@ -368,8 +500,17 @@ static int __init find_primary_domain_index(void)
 
 		index = of_read_number(&distance_ref_points[1], 1);
 	} else {
+		/*
+		 * Both FORM1 and FORM2 affinity find the primary domain details
+		 * at the same offset.
+		 */
 		index = of_read_number(distance_ref_points, 1);
 	}
+	/*
+	 * If it is FORM2 also initialize the distance table here.
+	 */
+	if (affinity_form == FORM2_AFFINITY)
+		initialize_form2_numa_distance_lookup_table(root);
 
 	/*
 	 * Warn and cap if the hardware supports more than
diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
index 5d4c2bc20bba..f162156b7b68 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -123,6 +123,7 @@ vec5_fw_features_table[] = {
 	{FW_FEATURE_PRRN,		OV5_PRRN},
 	{FW_FEATURE_DRMEM_V2,		OV5_DRMEM_V2},
 	{FW_FEATURE_DRC_INFO,		OV5_DRC_INFO},
+	{FW_FEATURE_FORM2_AFFINITY,	OV5_FORM2_AFFINITY},
 };
 
 static void __init fw_vec5_feature_init(const char *vec5, unsigned long len)
-- 
2.31.1
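As a side note for readers of the documentation added above: the Form 1
"double the distance per mismatched reference-point level" rule can be pictured
with a small standalone sketch (plain C, illustrative only; the array layout and
the LOCAL_DISTANCE value of 10 are assumptions of the sketch, not taken from the
patch):

/*
 * Form 1: walk the reference-point levels starting from the primary one;
 * the distance starts at LOCAL_DISTANCE and doubles for every level at
 * which the two associativity lists disagree.
 */
static int form1_node_distance(const unsigned int *assoc_a,
			       const unsigned int *assoc_b,
			       const unsigned int *ref_points,
			       int nr_ref_points)
{
	int level, distance = 10;	/* LOCAL_DISTANCE */

	for (level = 0; level < nr_ref_points; level++) {
		unsigned int idx = ref_points[level];

		if (assoc_a[idx] == assoc_b[idx])
			break;		/* same domain at this level */
		distance *= 2;
	}
	return distance;
}

For example, with three reference points, two resources that differ at all three
levels end up at distance 80 (10 -> 20 -> 40 -> 80), matching the doubling rule
described in the documentation.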


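Similarly, the Form 2 "ibm,numa-lookup-index-table" / "ibm,numa-distance-table"
pair from the documentation can be decoded outside the kernel with a few lines of
plain C. This sketch uses the documentation's example values, which are
illustrative rather than real firmware output:

#include <stdio.h>

int main(void)
{
	/* ibm,numa-lookup-index-table = {3, 0, 8, 40}: N followed by the domainIDs */
	int lookup_index[] = { 0, 8, 40 };
	int n = 3;
	/* ibm,numa-distance-table = {9, 10, 20, ...}: N followed by an N x N matrix */
	unsigned char dist[] = { 10, 20, 80, 20, 10, 160, 80, 160, 10 };

	for (int row = 0; row < n; row++)
		for (int col = 0; col < n; col++)
			printf("distance(node %d -> node %d) = %d\n",
			       lookup_index[row], lookup_index[col],
			       dist[row * n + col]);
	return 0;
}

This prints the same 3x3 matrix shown in the documentation (10 on the diagonal,
20/80/160 off the diagonal). The per-resource "ibm,numa-distance" property used
at DLPAR time follows the same byte encoding but carries the new node's column
followed by its row, which is how the update_numa_distance() hunk above parses it.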

* Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity
  2021-06-17 16:51 ` [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity Aneesh Kumar K.V
@ 2021-06-21 22:02   ` Daniel Henrique Barboza
  2021-06-22 12:07     ` Aneesh Kumar K.V
  2021-06-24  3:08   ` David Gibson
  2021-06-24 10:33   ` Laurent Dufour
  2 siblings, 1 reply; 20+ messages in thread
From: Daniel Henrique Barboza @ 2021-06-21 22:02 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linuxppc-dev, mpe
  Cc: Nathan Lynch, David Gibson, nvdimm, dan.j.williams



On 6/17/21 1:51 PM, Aneesh Kumar K.V wrote:
> PAPR interface currently supports two different ways of communicating resource
> grouping details to the OS. These are referred to as Form 0 and Form 1
> associativity grouping. Form 0 is the older format and is now considered
> deprecated. This patch adds another resource grouping named FORM2.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>   Documentation/powerpc/associativity.rst   | 135 ++++++++++++++++++++
>   arch/powerpc/include/asm/firmware.h       |   3 +-
>   arch/powerpc/include/asm/prom.h           |   1 +
>   arch/powerpc/kernel/prom_init.c           |   3 +-
>   arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
>   arch/powerpc/platforms/pseries/firmware.c |   1 +
>   6 files changed, 286 insertions(+), 6 deletions(-)
>   create mode 100644 Documentation/powerpc/associativity.rst
> 
> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
> new file mode 100644
> index 000000000000..93be604ac54d
> --- /dev/null
> +++ b/Documentation/powerpc/associativity.rst
> @@ -0,0 +1,135 @@
> +============================
> +NUMA resource associativity
> +=============================
> +
> +Associativity represents the groupings of the various platform resources into
> +domains of substantially similar mean performance relative to resources outside
> +of that domain. Resources subsets of a given domain that exhibit better
> +performance relative to each other than relative to other resources subsets
> +are represented as being members of a sub-grouping domain. This performance
> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
> +From the platform view, these groups are also referred to as domains.
> +
> +PAPR interface currently supports different ways of communicating these resource
> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
> +associativity grouping. Form 0 is the older format and is now considered deprecated.
> +
> +Hypervisor indicates the type/form of associativity used via "ibm,arcitecture-vec-5 property".
> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
> +
> +Form 0
> +-----
> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
> +
> +Form 1
> +-----
> +With Form 1 a combination of ibm,associativity-reference-points and ibm,associativity
> +device tree properties are used to determine the NUMA distance between resource groups/domains.
> +
> +The “ibm,associativity” property contains one or more lists of numbers (domainID)
> +representing the resource’s platform grouping domains.
> +
> +The “ibm,associativity-reference-points” property contains one or more list of numbers
> +(domainID index) that represents the 1 based ordinal in the associativity lists.
> +The list of domainID index represnets increasing hierachy of resource grouping.
> +
> +ex:
> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
> +
> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
> +Linux kernel computes NUMA distance between two domains by recursively comparing
> +if they belong to the same higher-level domains. For mismatch at every higher
> +level of the resource group, the kernel doubles the NUMA distance between the
> +comparing domains.
> +
> +Form 2
> +-------
> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
> +domain numbering. With numa distance computation now detached from the index value of
> +"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
> +same domainID index representing resource groups of different performance/latency characteristics.
> +
> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
> +"ibm,architecture-vec-5" property.
> +
> +"ibm,numa-lookup-index-table" property contains one or more list numbers representing
> +the domainIDs present in the system. The offset of the domainID in this property is considered
> +the domainID index.
> +
> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
> +N domainID encoded as with encode-int
> +
> +For ex:
> +ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index for domainID 8 is 1.
> +
> +"ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
> +distance between resource groups/domains present in the system.
> +
> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
> +
> +For ex:
> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
> +
> +  | 0    8   40
> +--|------------
> +  |
> +0 | 10   20  80
> +  |
> +8 | 20   10  160
> +  |
> +40| 80   160  10
> +
> +
> +"ibm,associativity" property for resources in node 0, 8 and 40
> +
> +{ 3, 6, 7, 0 }
> +{ 3, 6, 9, 8 }
> +{ 3, 6, 7, 40}
> +
> +With "ibm,associativity-reference-points"  { 0x3 }

With this configuration, would the following ibm,associativity arrays
also be valid?


{ 3, 0, 0, 0 }
{ 3, 0, 0, 8 }
{ 3, 0, 0, 40}

If yes, then we need a way to tell that the associativity domain assignments
are optional, and that FORM2 relies solely on finding out the domainID of the
resource (0, 8 and 40) to retrieve the domainID index; with this index all
performance metrics can be retrieved from the numa-* properties
(numa-distance-table, numa-bandwidth-table ...).

Retrieving the resource domainID is done by using ibm,associativity-reference-points.

This will allow the platform to implement FORM2 such as:

{ 1, 0 }
{ 1, 8 }
{ 1, 40 }
  
- ref-points: { 0x1 }

If the platform chooses to do so.
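
To make the lookup chain I'm describing concrete, here is a minimal
userspace sketch (not the kernel code; property values hard-coded from
the example above, helper names are mine):

#include <stdio.h>

/* ibm,associativity-reference-points */
static const unsigned int ref_points[] = { 3 };
/* ibm,numa-lookup-index-table = { 3, 0, 8, 40 } */
static const unsigned int lookup_index[] = { 3, 0, 8, 40 };
/* the 9 distance values from ibm,numa-distance-table */
static const int dist[] = { 10,  20,  80,
                            20,  10, 160,
                            80, 160,  10 };

/* the domainID (node id) sits at the 1-based index named by ref-points */
static unsigned int nid_of(const unsigned int *assoc)
{
        return assoc[ref_points[0]];
}

/* the domainID index is the position of the node id in the lookup table */
static int index_of(unsigned int nid)
{
        unsigned int i, n = lookup_index[0];

        for (i = 0; i < n; i++)
                if (lookup_index[i + 1] == nid)
                        return i;
        return -1;
}

int main(void)
{
        /* the "sparse" associativity arrays from above */
        unsigned int res_a[] = { 3, 0, 0, 8 };
        unsigned int res_b[] = { 3, 0, 0, 40 };
        int ia = index_of(nid_of(res_a));
        int ib = index_of(nid_of(res_b));

        /* prints dist(8, 40) = 160 */
        printf("dist(8, 40) = %d\n", dist[ia * lookup_index[0] + ib]);
        return 0;
}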


> +
> +Each resource (drcIndex) now also supports additional optional device tree properties.
> +These properties are marked optional because the platform can choose not to export
> +them and provide the system topology details using the earlier defined device tree
> +properties alone. The optional device tree properties are used when adding new resources
> +(DLPAR) and when the platform didn't provide the topology details of the domain which
> +contains the newly added resource during boot.
> +
> +"ibm,numa-lookup-index" property contains a number representing the domainID index to be used
> +when building the NUMA distance of the numa node to which this resource belongs. This can
> +be looked at as the index at which this new domainID would have appeared in
> +"ibm,numa-lookup-index-table" if the domain was present during boot. The domainID
> +of the new resource can be obtained from the existing "ibm,associativity" property. This
> +can be used to build distance information of a newly onlined NUMA node via DLPAR operation.
> +The value is 1 based array index value.
> +
> +prop-encoded-array: An integer encoded as with encode-int specifying the domainID index
> +
> +"ibm,numa-distance" property contains one or more list of numbers presenting the NUMA distance
> +from this resource domain to other resources.
> +
> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
> +
> +For ex:
> +ibm,associativity     = { 4, 5, 10, 50}
> +ibm,numa-lookup-index = { 4 }
> +ibm,numa-distance   =  {8, 160, 255, 80, 10, 160, 255, 80, 10}
> +
> +resulting in a new toplogy as below.
> +  | 0    8   40   50
> +--|------------------
> +  |
> +0 | 10   20  80   160
> +  |
> +8 | 20   10  160  255
> +  |
> +40| 80   160  10  80
> +  |
> +50| 160  255  80  10
> +

I see there is no mention of the special PAPR SCM handling. I saw in
one of your replies to v1:

"Another option is to make sure that numa-distance-value is populated
such that PMEMB distance indicates it is closer to node0 when compared
to node1. ie, node_distance[40][0] < node_distance[40][1]. One could
possibly infer the grouping based on the distance value and not deepend
on ibm,associativity for that purpose."


Is that what we're supposed to do with PAPR SCM? I'm not sure how that
affects NVDIMM support in QEMU with FORM2.
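
(For my own understanding, I read that v1 suggestion as: infer the
"closest regular node" purely from the distance values, roughly along
the lines of the sketch below - this is not actual kernel code, just
the idea:)

        /* pmem_node: the pmem-only node, e.g. node 40 in the example */
        int nid, target = NUMA_NO_NODE, best = INT_MAX;

        for_each_online_node(nid) {
                if (nid == pmem_node)
                        continue;
                if (node_distance(pmem_node, nid) < best) {
                        best = node_distance(pmem_node, nid);
                        target = nid;
                }
        }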



Thanks,



Daniel



> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
> index 60b631161360..97a3bd9ffeb9 100644
> --- a/arch/powerpc/include/asm/firmware.h
> +++ b/arch/powerpc/include/asm/firmware.h
> @@ -53,6 +53,7 @@
>   #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
>   #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
>   #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
> +#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
>   
>   #ifndef __ASSEMBLY__
>   
> @@ -73,7 +74,7 @@ enum {
>   		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
>   		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
>   		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
> -		FW_FEATURE_RPT_INVALIDATE,
> +		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
>   	FW_FEATURE_PSERIES_ALWAYS = 0,
>   	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
>   	FW_FEATURE_POWERNV_ALWAYS = 0,
> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
> index df9fec9d232c..5c80152e8f18 100644
> --- a/arch/powerpc/include/asm/prom.h
> +++ b/arch/powerpc/include/asm/prom.h
> @@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
>   #define OV5_XCMO		0x0440	/* Page Coalescing */
>   #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
>   #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
> +#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
>   #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
>   #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
>   #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
> index 64b9593038a7..496fdac54c29 100644
> --- a/arch/powerpc/kernel/prom_init.c
> +++ b/arch/powerpc/kernel/prom_init.c
> @@ -1070,7 +1070,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
>   #else
>   		0,
>   #endif
> -		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
> +		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
> +		OV5_FEAT(OV5_FORM2_AFFINITY),
>   		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
>   		.micro_checkpoint = 0,
>   		.reserved0 = 0,
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index d32729f235b8..5a7d94960fb7 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
>   
>   #define FORM0_AFFINITY 0
>   #define FORM1_AFFINITY 1
> +#define FORM2_AFFINITY 2
>   static int affinity_form;
>   
>   #define MAX_DISTANCE_REF_POINTS 4
>   static int max_associativity_domain_index;
>   static const __be32 *distance_ref_points;
>   static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
> +static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
> +	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
> +};
> +static int numa_id_index_table[MAX_NUMNODES];
>   
>   /*
>    * Allocate node_to_cpumask_map based on number of available nodes
> @@ -166,6 +171,27 @@ static void unmap_cpu_from_node(unsigned long cpu)
>   }
>   #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>   
> +/*
> + * With FORM2 if we are not using logical domain ids, we will find
> + * both primary and seconday domains with same value. Hence always
> + * start comparison from secondary domains
> + */
> +static int __cpu_form2_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> +{
> +	int dist = 0;
> +
> +	int i, index;
> +
> +	for (i = 1; i < max_associativity_domain_index; i++) {
> +		index = be32_to_cpu(distance_ref_points[i]);
> +		if (cpu1_assoc[index] == cpu2_assoc[index])
> +			break;
> +		dist++;
> +	}
> +
> +	return dist;
> +}
> +
>   static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>   {
>   	int dist = 0;
> @@ -178,7 +204,6 @@ static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>   			break;
>   		dist++;
>   	}
> -
>   	return dist;
>   }
>   
> @@ -186,8 +211,9 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>   {
>   	/* We should not get called with FORM0 */
>   	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
> -
> -	return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
> +	if (affinity_form == FORM1_AFFINITY)
> +		return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
> +	return __cpu_form2_distance(cpu1_assoc, cpu2_assoc);
>   }
>   
>   /* must hold reference to node during call */
> @@ -201,7 +227,9 @@ int __node_distance(int a, int b)
>   	int i;
>   	int distance = LOCAL_DISTANCE;
>   
> -	if (affinity_form == FORM0_AFFINITY)
> +	if (affinity_form == FORM2_AFFINITY)
> +		return numa_distance_table[a][b];
> +	else if (affinity_form == FORM0_AFFINITY)
>   		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
>   
>   	for (i = 0; i < max_associativity_domain_index; i++) {
> @@ -303,15 +331,116 @@ static void initialize_form1_numa_distance(struct device_node *node)
>   
>   /*
>    * Used to update distance information w.r.t newly added node.
> + * ibm,numa-lookup-index -> 4
> + * ibm,numa-distance -> {5, 20, 40, 60, 80, 10 }
>    */
>   void update_numa_distance(struct device_node *node)
>   {
> +	int i, nid, other_nid, other_nid_index = 0;
> +	const __be32 *numa_indexp;
> +	const __u8  *numa_distancep;
> +	int numa_index, max_numa_index, numa_distance;
> +
>   	if (affinity_form == FORM0_AFFINITY)
>   		return;
>   	else if (affinity_form == FORM1_AFFINITY) {
>   		initialize_form1_numa_distance(node);
>   		return;
>   	}
> +	/* FORM2 affinity  */
> +
> +	nid = of_node_to_nid_single(node);
> +	if (nid == NUMA_NO_NODE)
> +		return;
> +
> +	/* Already initialized */
> +	if (numa_distance_table[nid][nid] != -1)
> +		return;
> +	/*
> +	 * update node distance if not already populated.
> +	 */
> +	numa_distancep = of_get_property(node, "ibm,numa-distance", NULL);
> +	if (!numa_distancep)
> +		return;
> +
> +	numa_indexp = of_get_property(node, "ibm,numa-lookup-index", NULL);
> +	if (!numa_indexp)
> +		return;
> +
> +	numa_index = of_read_number(numa_indexp, 1);
> +	/*
> +	 * update the numa_id_index_table. Device tree look at index table as
> +	 * 1 based array indexing.
> +	 */
> +	numa_id_index_table[numa_index - 1] = nid;
> +
> +	max_numa_index = of_read_number((const __be32 *)numa_distancep, 1);
> +	VM_WARN_ON(max_numa_index != 2 * numa_index);
> +	/* Skip the size which is encoded int */
> +	numa_distancep += sizeof(__be32);
> +
> +	/*
> +	 * First fill the distance information from other node to this node.
> +	 */
> +	other_nid_index = 0;
> +	for (i = 0; i < numa_index; i++) {
> +		numa_distance = numa_distancep[i];
> +		other_nid = numa_id_index_table[other_nid_index++];
> +		numa_distance_table[other_nid][nid] = numa_distance;
> +	}
> +
> +	other_nid_index = 0;
> +	for (; i < max_numa_index; i++) {
> +		numa_distance = numa_distancep[i];
> +		other_nid = numa_id_index_table[other_nid_index++];
> +		numa_distance_table[nid][other_nid] = numa_distance;
> +	}
> +}
> +
> +/*
> + * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
> + * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
> + */
> +static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
> +{
> +	const __u8 *numa_dist_table;
> +	const __be32 *numa_lookup_index;
> +	int numa_dist_table_length;
> +	int max_numa_index, distance_index;
> +	int i, curr_row = 0, curr_column = 0;
> +
> +	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
> +	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
> +
> +	/* first element of the array is the size and is encode-int */
> +	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
> +	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
> +	/* Skip the size which is encoded int */
> +	numa_dist_table += sizeof(__be32);
> +
> +	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d \n",
> +		 numa_dist_table_length, max_numa_index);
> +
> +	for (i = 0; i < max_numa_index; i++)
> +		/* +1 skip the max_numa_index in the property */
> +		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
> +
> +
> +	VM_WARN_ON(numa_dist_table_length != max_numa_index * max_numa_index);
> +
> +	for (distance_index = 0; distance_index < numa_dist_table_length; distance_index++) {
> +		int nodeA = numa_id_index_table[curr_row];
> +		int nodeB = numa_id_index_table[curr_column++];
> +
> +		numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index];
> +
> +		pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
> +		if (curr_column >= max_numa_index) {
> +			curr_row++;
> +			/* reset the column */
> +			curr_column = 0;
> +		}
> +	}
>   }
>   
>   static int __init find_primary_domain_index(void)
> @@ -324,6 +453,9 @@ static int __init find_primary_domain_index(void)
>   	 */
>   	if (firmware_has_feature(FW_FEATURE_OPAL)) {
>   		affinity_form = FORM1_AFFINITY;
> +	} else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) {
> +		dbg("Using form 2 affinity\n");
> +		affinity_form = FORM2_AFFINITY;
>   	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
>   		dbg("Using form 1 affinity\n");
>   		affinity_form = FORM1_AFFINITY;
> @@ -368,8 +500,17 @@ static int __init find_primary_domain_index(void)
>   
>   		index = of_read_number(&distance_ref_points[1], 1);
>   	} else {
> +		/*
> +		 * Both FORM1 and FORM2 affinity find the primary domain details
> +		 * at the same offset.
> +		 */
>   		index = of_read_number(distance_ref_points, 1);
>   	}
> +	/*
> +	 * If it is FORM2 also initialize the distance table here.
> +	 */
> +	if (affinity_form == FORM2_AFFINITY)
> +		initialize_form2_numa_distance_lookup_table(root);
>   
>   	/*
>   	 * Warn and cap if the hardware supports more than
> diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
> index 5d4c2bc20bba..f162156b7b68 100644
> --- a/arch/powerpc/platforms/pseries/firmware.c
> +++ b/arch/powerpc/platforms/pseries/firmware.c
> @@ -123,6 +123,7 @@ vec5_fw_features_table[] = {
>   	{FW_FEATURE_PRRN,		OV5_PRRN},
>   	{FW_FEATURE_DRMEM_V2,		OV5_DRMEM_V2},
>   	{FW_FEATURE_DRC_INFO,		OV5_DRC_INFO},
> +	{FW_FEATURE_FORM2_AFFINITY,	OV5_FORM2_AFFINITY},
>   };
>   
>   static void __init fw_vec5_feature_init(const char *vec5, unsigned long len)
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity
  2021-06-21 22:02   ` Daniel Henrique Barboza
@ 2021-06-22 12:07     ` Aneesh Kumar K.V
  2021-06-22 16:04       ` Daniel Henrique Barboza
  0 siblings, 1 reply; 20+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-22 12:07 UTC (permalink / raw)
  To: Daniel Henrique Barboza, linuxppc-dev, mpe
  Cc: Nathan Lynch, David Gibson, nvdimm, dan.j.williams

Daniel Henrique Barboza <danielhb413@gmail.com> writes:

> On 6/17/21 1:51 PM, Aneesh Kumar K.V wrote:
>> PAPR interface currently supports two different ways of communicating resource
>> grouping details to the OS. These are referred to as Form 0 and Form 1
>> associativity grouping. Form 0 is the older format and is now considered
>> deprecated. This patch adds another resource grouping named FORM2.
>> 
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   Documentation/powerpc/associativity.rst   | 135 ++++++++++++++++++++
>>   arch/powerpc/include/asm/firmware.h       |   3 +-
>>   arch/powerpc/include/asm/prom.h           |   1 +
>>   arch/powerpc/kernel/prom_init.c           |   3 +-
>>   arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
>>   arch/powerpc/platforms/pseries/firmware.c |   1 +
>>   6 files changed, 286 insertions(+), 6 deletions(-)
>>   create mode 100644 Documentation/powerpc/associativity.rst
>> 
>> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
>> new file mode 100644
>> index 000000000000..93be604ac54d
>> --- /dev/null
>> +++ b/Documentation/powerpc/associativity.rst
>> @@ -0,0 +1,135 @@
>> +============================
>> +NUMA resource associativity
>> +=============================
>> +
>> +Associativity represents the groupings of the various platform resources into
>> +domains of substantially similar mean performance relative to resources outside
>> +of that domain. Resources subsets of a given domain that exhibit better
>> +performance relative to each other than relative to other resources subsets
>> +are represented as being members of a sub-grouping domain. This performance
>> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
>> +From the platform view, these groups are also referred to as domains.
>> +
>> +PAPR interface currently supports different ways of communicating these resource
>> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
>> +associativity grouping. Form 0 is the older format and is now considered deprecated.
>> +
>> +Hypervisor indicates the type/form of associativity used via "ibm,arcitecture-vec-5 property".
>> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
>> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
>> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>> +
>> +Form 0
>> +-----
>> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
>> +
>> +Form 1
>> +-----
>> +With Form 1 a combination of ibm,associativity-reference-points and ibm,associativity
>> +device tree properties are used to determine the NUMA distance between resource groups/domains.
>> +
>> +The “ibm,associativity” property contains one or more lists of numbers (domainID)
>> +representing the resource’s platform grouping domains.
>> +
>> +The “ibm,associativity-reference-points” property contains one or more list of numbers
>> +(domainID index) that represents the 1 based ordinal in the associativity lists.
>> +The list of domainID index represnets increasing hierachy of resource grouping.
>> +
>> +ex:
>> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
>> +
>> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
>> +Linux kernel computes NUMA distance between two domains by recursively comparing
>> +if they belong to the same higher-level domains. For mismatch at every higher
>> +level of the resource group, the kernel doubles the NUMA distance between the
>> +comparing domains.
>> +
>> +Form 2
>> +-------
>> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
>> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
>> +domain numbering. With numa distance computation now detached from the index value of
>> +"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
>> +same domainID index representing resource groups of different performance/latency characteristics.
>> +
>> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
>> +"ibm,architecture-vec-5" property.
>> +
>> +"ibm,numa-lookup-index-table" property contains one or more list numbers representing
>> +the domainIDs present in the system. The offset of the domainID in this property is considered
>> +the domainID index.
>> +
>> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
>> +N domainID encoded as with encode-int
>> +
>> +For ex:
>> +ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index for domainID 8 is 1.
>> +
>> +"ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
>> +distance between resource groups/domains present in the system.
>> +
>> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>> +
>> +For ex:
>> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
>> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
>> +
>> +  | 0    8   40
>> +--|------------
>> +  |
>> +0 | 10   20  80
>> +  |
>> +8 | 20   10  160
>> +  |
>> +40| 80   160  10
>> +
>> +
>> +"ibm,associativity" property for resources in node 0, 8 and 40
>> +
>> +{ 3, 6, 7, 0 }
>> +{ 3, 6, 9, 8 }
>> +{ 3, 6, 7, 40}
>> +
>> +With "ibm,associativity-reference-points"  { 0x3 }
>
> With this configuration, would the following ibm,associativity arrays
> also be valid?
>
>
> { 3, 0, 0, 0 }
> { 3, 0, 0, 8 }
> { 3, 0, 0, 40}
>

Yes

> If yes, then we need a way to tell that the associativity domain assignments
> are optional, and that FORM2 relies solely on finding out the domainID of the
> resource (0, 8 and 40) to retrieve the domainID index; with this index all
> performance metrics can be retrieved from the numa-* properties
> (numa-distance-table, numa-bandwidth-table ...).
>

Where do you suggest we clarify that? I agree that it is not explicitly
mentioned. But we describe the details of how we find the numa distance
with an example in the document.

> Retrieving the resource domainID is done by using ibm,associativity-reference-points.
>
> This will allow the platform to implement FORM2 such as:
>
> { 1, 0 }
> { 1, 8 }
> { 1, 40 }
>   
> - ref-points: { 0x1 }
>
> If the platform chooses to do so.
>

That is correct.

>
>> +
>> +Each resource (drcIndex) now also supports additional optional device tree properties.
>> +These properties are marked optional because the platform can choose not to export
>> +them and provide the system topology details using the earlier defined device tree
>> +properties alone. The optional device tree properties are used when adding new resources
>> +(DLPAR) and when the platform didn't provide the topology details of the domain which
>> +contains the newly added resource during boot.
>> +
>> +"ibm,numa-lookup-index" property contains a number representing the domainID index to be used
>> +when building the NUMA distance of the numa node to which this resource belongs. This can
>> +be looked at as the index at which this new domainID would have appeared in
>> +"ibm,numa-lookup-index-table" if the domain was present during boot. The domainID
>> +of the new resource can be obtained from the existing "ibm,associativity" property. This
>> +can be used to build distance information of a newly onlined NUMA node via DLPAR operation.
>> +The value is 1 based array index value.
>> +
>> +prop-encoded-array: An integer encoded as with encode-int specifying the domainID index
>> +
>> +"ibm,numa-distance" property contains one or more list of numbers presenting the NUMA distance
>> +from this resource domain to other resources.
>> +
>> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>> +
>> +For ex:
>> +ibm,associativity     = { 4, 5, 10, 50}
>> +ibm,numa-lookup-index = { 4 }
>> +ibm,numa-distance   =  {8, 160, 255, 80, 10, 160, 255, 80, 10}
>> +
>> +resulting in a new toplogy as below.
>> +  | 0    8   40   50
>> +--|------------------
>> +  |
>> +0 | 10   20  80   160
>> +  |
>> +8 | 20   10  160  255
>> +  |
>> +40| 80   160  10  80
>> +  |
>> +50| 160  255  80  10
>> +
>
> I see there is no mention of the special PAPR SCM handling. I saw in
> one of your replies to v1:
>
> "Another option is to make sure that numa-distance-value is populated
> such that PMEMB distance indicates it is closer to node0 when compared
> to node1. ie, node_distance[40][0] < node_distance[40][1]. One could
> possibly infer the grouping based on the distance value and not deepend
> on ibm,associativity for that purpose."
>
>
> Is that what we're supposed to do with PAPR SCM? I'm not sure how that
> affects NVDIMM support in QEMU with FORM2.
>
>

yes, that is what we are doing with this version (v4) of the patchset.
We can drop the nvdimm-specific changes from Qemu.

-aneesh

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity
  2021-06-22 12:07     ` Aneesh Kumar K.V
@ 2021-06-22 16:04       ` Daniel Henrique Barboza
  0 siblings, 0 replies; 20+ messages in thread
From: Daniel Henrique Barboza @ 2021-06-22 16:04 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linuxppc-dev, mpe
  Cc: Nathan Lynch, David Gibson, nvdimm, dan.j.williams



On 6/22/21 9:07 AM, Aneesh Kumar K.V wrote:
> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> 
>> On 6/17/21 1:51 PM, Aneesh Kumar K.V wrote:
>>> PAPR interface currently supports two different ways of communicating resource
>>> grouping details to the OS. These are referred to as Form 0 and Form 1
>>> associativity grouping. Form 0 is the older format and is now considered
>>> deprecated. This patch adds another resource grouping named FORM2.
>>>
>>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> ---
>>>    Documentation/powerpc/associativity.rst   | 135 ++++++++++++++++++++
>>>    arch/powerpc/include/asm/firmware.h       |   3 +-
>>>    arch/powerpc/include/asm/prom.h           |   1 +
>>>    arch/powerpc/kernel/prom_init.c           |   3 +-
>>>    arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
>>>    arch/powerpc/platforms/pseries/firmware.c |   1 +
>>>    6 files changed, 286 insertions(+), 6 deletions(-)
>>>    create mode 100644 Documentation/powerpc/associativity.rst
>>>
>>> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
>>> new file mode 100644
>>> index 000000000000..93be604ac54d
>>> --- /dev/null
>>> +++ b/Documentation/powerpc/associativity.rst
>>> @@ -0,0 +1,135 @@
>>> +============================
>>> +NUMA resource associativity
>>> +=============================
>>> +
>>> +Associativity represents the groupings of the various platform resources into
>>> +domains of substantially similar mean performance relative to resources outside
>>> +of that domain. Resources subsets of a given domain that exhibit better
>>> +performance relative to each other than relative to other resources subsets
>>> +are represented as being members of a sub-grouping domain. This performance
>>> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
>>> +From the platform view, these groups are also referred to as domains.
>>> +
>>> +PAPR interface currently supports different ways of communicating these resource
>>> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
>>> +associativity grouping. Form 0 is the older format and is now considered deprecated.
>>> +
>>> +Hypervisor indicates the type/form of associativity used via "ibm,arcitecture-vec-5 property".
>>> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
>>> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
>>> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>>> +
>>> +Form 0
>>> +-----
>>> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
>>> +
>>> +Form 1
>>> +-----
>>> +With Form 1 a combination of ibm,associativity-reference-points and ibm,associativity
>>> +device tree properties are used to determine the NUMA distance between resource groups/domains.
>>> +
>>> +The “ibm,associativity” property contains one or more lists of numbers (domainID)
>>> +representing the resource’s platform grouping domains.
>>> +
>>> +The “ibm,associativity-reference-points” property contains one or more list of numbers
>>> +(domainID index) that represents the 1 based ordinal in the associativity lists.
>>> +The list of domainID index represnets increasing hierachy of resource grouping.
>>> +
>>> +ex:
>>> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
>>> +
>>> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
>>> +Linux kernel computes NUMA distance between two domains by recursively comparing
>>> +if they belong to the same higher-level domains. For mismatch at every higher
>>> +level of the resource group, the kernel doubles the NUMA distance between the
>>> +comparing domains.
>>> +
>>> +Form 2
>>> +-------
>>> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
>>> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
>>> +domain numbering. With numa distance computation now detached from the index value of
>>> +"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
>>> +same domainID index representing resource groups of different performance/latency characteristics.
>>> +
>>> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
>>> +"ibm,architecture-vec-5" property.
>>> +
>>> +"ibm,numa-lookup-index-table" property contains one or more list numbers representing
>>> +the domainIDs present in the system. The offset of the domainID in this property is considered
>>> +the domainID index.
>>> +
>>> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
>>> +N domainID encoded as with encode-int
>>> +
>>> +For ex:
>>> +ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index for domainID 8 is 1.
>>> +
>>> +"ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
>>> +distance between resource groups/domains present in the system.
>>> +
>>> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>>> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>>> +
>>> +For ex:
>>> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
>>> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
>>> +
>>> +  | 0    8   40
>>> +--|------------
>>> +  |
>>> +0 | 10   20  80
>>> +  |
>>> +8 | 20   10  160
>>> +  |
>>> +40| 80   160  10
>>> +
>>> +
>>> +"ibm,associativity" property for resources in node 0, 8 and 40
>>> +
>>> +{ 3, 6, 7, 0 }
>>> +{ 3, 6, 9, 8 }
>>> +{ 3, 6, 7, 40}
>>> +
>>> +With "ibm,associativity-reference-points"  { 0x3 }
>>
>> With this configuration, would the following ibm,associativity arrays
>> also be valid?
>>
>>
>> { 3, 0, 0, 0 }
>> { 3, 0, 0, 8 }
>> { 3, 0, 0, 40}
>>
> 
> Yes
> 
>> If yes, then we need a way to tell that the associativity domain assignments
>> are optional, and that FORM2 relies solely on finding out the domainID of the
>> resource (0, 8 and 40) to retrieve the domainID index; with this index all
>> performance metrics can be retrieved from the numa-* properties
>> (numa-distance-table, numa-bandwidth-table ...).
>>
> 
> Where do you suggest we clarify that? I agree that it is not explicitly
> mentioned. But we describe the details of how we find the numa distance
> with an example in the document.


Perhaps something like this, right in the middle of the example:


----------------

(...)

+  | 0    8   40
+--|------------
+  |
+0 | 10   20  80
+  |
+8 | 20   10  160
+  |
+40| 80   160  10
+
+

With "ibm,associativity-reference-points" equal to { 0x3 }, the domainID of
each resource is located at index 3 of each ibm,associativity property:

+{ 3, 6, 7, 0 }
+{ 3, 6, 9, 8 }
+{ 3, 6, 7, 40 }


FORM2 requires the ibm,associativity array to contain the domainID of the
resource at the index defined by ibm,associativity-reference-points.
Calculating the associativity domains of the remaining ibm,associativity
elements is not obligatory. In this example, the following ibm,associativity
arrays are also valid:

{ 3, 0, 0, 0 }
{ 3, 0, 0, 8 }
{ 3, 0, 0, 40 }

(...)

-------------


> 
>> Retrieving the resource domainID is done by using ibm,associativity-reference-points.
>>
>> This will allow the platform to implement FORM2 such as:
>>
>> { 1, 0 }
>> { 1, 8 }
>> { 1, 40 }
>>    
>> - ref-points: { 0x1 }
>>
>> If the platform chooses to do so.
>>
> 
> That is correct.
> 
>>
>>> +
>>> +Each resource (drcIndex) now also supports additional optional device tree properties.
>>> +These properties are marked optional because the platform can choose not to export
>>> +them and provide the system topology details using the earlier defined device tree
>>> +properties alone. The optional device tree properties are used when adding new resources
>>> +(DLPAR) and when the platform didn't provide the topology details of the domain which
>>> +contains the newly added resource during boot.
>>> +
>>> +"ibm,numa-lookup-index" property contains a number representing the domainID index to be used
>>> +when building the NUMA distance of the numa node to which this resource belongs. This can
>>> +be looked at as the index at which this new domainID would have appeared in
>>> +"ibm,numa-lookup-index-table" if the domain was present during boot. The domainID
>>> +of the new resource can be obtained from the existing "ibm,associativity" property. This
>>> +can be used to build distance information of a newly onlined NUMA node via DLPAR operation.
>>> +The value is 1 based array index value.
>>> +
>>> +prop-encoded-array: An integer encoded as with encode-int specifying the domainID index
>>> +
>>> +"ibm,numa-distance" property contains one or more list of numbers presenting the NUMA distance
>>> +from this resource domain to other resources.
>>> +
>>> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>>> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>>> +
>>> +For ex:
>>> +ibm,associativity     = { 4, 5, 10, 50}
>>> +ibm,numa-lookup-index = { 4 }
>>> +ibm,numa-distance   =  {8, 160, 255, 80, 10, 160, 255, 80, 10}
>>> +
>>> +resulting in a new toplogy as below.
>>> +  | 0    8   40   50
>>> +--|------------------
>>> +  |
>>> +0 | 10   20  80   160
>>> +  |
>>> +8 | 20   10  160  255
>>> +  |
>>> +40| 80   160  10  80
>>> +  |
>>> +50| 160  255  80  10
>>> +
>>
>> I see there is no mention of the special PAPR SCM handling. I saw in
>> one of your replies to v1:
>>
>> "Another option is to make sure that numa-distance-value is populated
>> such that PMEMB distance indicates it is closer to node0 when compared
>> to node1. ie, node_distance[40][0] < node_distance[40][1]. One could
>> possibly infer the grouping based on the distance value and not deepend
>> on ibm,associativity for that purpose."
>>
>>
>> Is that what we're supposed to do with PAPR SCM? I'm not sure how that
>> affects NVDIMM support in QEMU with FORM2.
>>
>>
> 
> yes, that is what we are doing with this version (v4) of the patchset.
> We can drop the nvdimm-specific changes from Qemu.


I see. I'll drop the NVDIMM changes in the QEMU POC of FORM2 then.



Thanks,


Daniel

> 
> -aneesh
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/7] powerpc/pseries: rename min_common_depth to primary_domain_index
  2021-06-17 16:50 ` [PATCH v4 1/7] powerpc/pseries: rename min_common_depth to primary_domain_index Aneesh Kumar K.V
@ 2021-06-24  1:46   ` David Gibson
  0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2021-06-24  1:46 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linuxppc-dev, mpe, Nathan Lynch, Daniel Henrique Barboza, nvdimm,
	dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 4977 bytes --]

On Thu, Jun 17, 2021 at 10:20:59PM +0530, Aneesh Kumar K.V wrote:
> No functional change in this patch.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/mm/numa.c | 38 +++++++++++++++++++-------------------
>  1 file changed, 19 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index f2bf98bdcea2..8365b298ec48 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -51,7 +51,7 @@ EXPORT_SYMBOL(numa_cpu_lookup_table);
>  EXPORT_SYMBOL(node_to_cpumask_map);
>  EXPORT_SYMBOL(node_data);
>  
> -static int min_common_depth;
> +static int primary_domain_index;
>  static int n_mem_addr_cells, n_mem_size_cells;
>  static int form1_affinity;
>  
> @@ -232,8 +232,8 @@ static int associativity_to_nid(const __be32 *associativity)
>  	if (!numa_enabled)
>  		goto out;
>  
> -	if (of_read_number(associativity, 1) >= min_common_depth)
> -		nid = of_read_number(&associativity[min_common_depth], 1);
> +	if (of_read_number(associativity, 1) >= primary_domain_index)
> +		nid = of_read_number(&associativity[primary_domain_index], 1);
>  
>  	/* POWER4 LPAR uses 0xffff as invalid node */
>  	if (nid == 0xffff || nid >= nr_node_ids)
> @@ -284,9 +284,9 @@ int of_node_to_nid(struct device_node *device)
>  }
>  EXPORT_SYMBOL(of_node_to_nid);
>  
> -static int __init find_min_common_depth(void)
> +static int __init find_primary_domain_index(void)
>  {
> -	int depth;
> +	int index;
>  	struct device_node *root;
>  
>  	if (firmware_has_feature(FW_FEATURE_OPAL))
> @@ -326,7 +326,7 @@ static int __init find_min_common_depth(void)
>  	}
>  
>  	if (form1_affinity) {
> -		depth = of_read_number(distance_ref_points, 1);
> +		index = of_read_number(distance_ref_points, 1);
>  	} else {
>  		if (distance_ref_points_depth < 2) {
>  			printk(KERN_WARNING "NUMA: "
> @@ -334,7 +334,7 @@ static int __init find_min_common_depth(void)
>  			goto err;
>  		}
>  
> -		depth = of_read_number(&distance_ref_points[1], 1);
> +		index = of_read_number(&distance_ref_points[1], 1);
>  	}
>  
>  	/*
> @@ -348,7 +348,7 @@ static int __init find_min_common_depth(void)
>  	}
>  
>  	of_node_put(root);
> -	return depth;
> +	return index;
>  
>  err:
>  	of_node_put(root);
> @@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
>  	int nid = default_nid;
>  	int rc, index;
>  
> -	if ((min_common_depth < 0) || !numa_enabled)
> +	if ((primary_domain_index < 0) || !numa_enabled)
>  		return default_nid;
>  
>  	rc = of_get_assoc_arrays(&aa);
>  	if (rc)
>  		return default_nid;
>  
> -	if (min_common_depth <= aa.array_sz &&
> +	if (primary_domain_index <= aa.array_sz &&
>  	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
> -		index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
> +		index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
>  		nid = of_read_number(&aa.arrays[index], 1);
>  
>  		if (nid == 0xffff || nid >= nr_node_ids)
> @@ -708,18 +708,18 @@ static int __init parse_numa_properties(void)
>  		return -1;
>  	}
>  
> -	min_common_depth = find_min_common_depth();
> +	primary_domain_index = find_primary_domain_index();
>  
> -	if (min_common_depth < 0) {
> +	if (primary_domain_index < 0) {
>  		/*
> -		 * if we fail to parse min_common_depth from device tree
> +		 * if we fail to parse primary_domain_index from device tree
>  		 * mark the numa disabled, boot with numa disabled.
>  		 */
>  		numa_enabled = false;
> -		return min_common_depth;
> +		return primary_domain_index;
>  	}
>  
> -	dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
> +	dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
>  
>  	/*
>  	 * Even though we connect cpus to numa domains later in SMP
> @@ -919,14 +919,14 @@ static void __init find_possible_nodes(void)
>  			goto out;
>  	}
>  
> -	max_nodes = of_read_number(&domains[min_common_depth], 1);
> +	max_nodes = of_read_number(&domains[primary_domain_index], 1);
>  	for (i = 0; i < max_nodes; i++) {
>  		if (!node_possible(i))
>  			node_set(i, node_possible_map);
>  	}
>  
>  	prop_length /= sizeof(int);
> -	if (prop_length > min_common_depth + 2)
> +	if (prop_length > primary_domain_index + 2)
>  		coregroup_enabled = 1;
>  
>  out:
> @@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu)
>  		goto out;
>  
>  	index = of_read_number(associativity, 1);
> -	if (index > min_common_depth + 1)
> +	if (index > primary_domain_index + 1)
>  		return of_read_number(&associativity[index - 1], 1);
>  
>  out:

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity
  2021-06-17 16:51 ` [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity Aneesh Kumar K.V
  2021-06-21 22:02   ` Daniel Henrique Barboza
@ 2021-06-24  3:08   ` David Gibson
  2021-06-24  8:20     ` Aneesh Kumar K.V
  2021-06-24 10:33   ` Laurent Dufour
  2 siblings, 1 reply; 20+ messages in thread
From: David Gibson @ 2021-06-24  3:08 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linuxppc-dev, mpe, Nathan Lynch, Daniel Henrique Barboza, nvdimm,
	dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 22772 bytes --]

On Thu, Jun 17, 2021 at 10:21:05PM +0530, Aneesh Kumar K.V wrote:
> PAPR interface currently supports two different ways of communicating resource
> grouping details to the OS. These are referred to as Form 0 and Form 1
> associativity grouping. Form 0 is the older format and is now considered
> deprecated. This patch adds another resource grouping named FORM2.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  Documentation/powerpc/associativity.rst   | 135 ++++++++++++++++++++
>  arch/powerpc/include/asm/firmware.h       |   3 +-
>  arch/powerpc/include/asm/prom.h           |   1 +
>  arch/powerpc/kernel/prom_init.c           |   3 +-
>  arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>  6 files changed, 286 insertions(+), 6 deletions(-)
>  create mode 100644 Documentation/powerpc/associativity.rst
> 
> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
> new file mode 100644
> index 000000000000..93be604ac54d
> --- /dev/null
> +++ b/Documentation/powerpc/associativity.rst
> @@ -0,0 +1,135 @@
> +============================
> +NUMA resource associativity
> +=============================
> +
> +Associativity represents the groupings of the various platform resources into
> +domains of substantially similar mean performance relative to resources outside
> +of that domain. Resources subsets of a given domain that exhibit better
> +performance relative to each other than relative to other resources subsets
> +are represented as being members of a sub-grouping domain. This performance
> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
> +From the platform view, these groups are also referred to as domains.
> +
> +PAPR interface currently supports different ways of communicating these resource
> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
> +associativity grouping. Form 0 is the older format and is now considered deprecated.
> +
> +Hypervisor indicates the type/form of associativity used via "ibm,arcitecture-vec-5 property".
> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
> +
> +Form 0
> +-----
> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
> +
> +Form 1
> +-----
> +With Form 1 a combination of ibm,associativity-reference-points and ibm,associativity
> +device tree properties are used to determine the NUMA distance between resource groups/domains.
> +
> +The “ibm,associativity” property contains one or more lists of numbers (domainID)
> +representing the resource’s platform grouping domains.
> +
> +The “ibm,associativity-reference-points” property contains one or more list of numbers
> +(domainID index) that represents the 1 based ordinal in the associativity lists.
> +The list of domainID index represnets increasing hierachy of
> resource grouping.

Typo "represnets".  Also s/hierachy/hierarchy/

> +
> +ex:
> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }

> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
> +Linux kernel computes NUMA distance between two domains by recursively comparing
> +if they belong to the same higher-level domains. For mismatch at every higher
> +level of the resource group, the kernel doubles the NUMA distance between the
> +comparing domains.

The Form1 description is still kinda confusing, but I don't really
care.  Form1 *is* confusing, it's Form2 that I hope will be clearer.

> +
> +Form 2
> +-------
> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
> +domain numbering. With numa distance computation now detached from the index value of
> +"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
> +same domainID index representing resource groups of different performance/latency characteristics.

So, I see you've removed the special handling of secondary IDs for pmem
- big improvement, thanks.  IIUC, in this revised version, for Form2
there's really no reason for ibm,associativity-reference-points to
have >1 entry.  Is that right?

In Form2 everything revolves around the primary domain ID - so much
that I suggest we come up with a short name for it.  How about just
"node id" since that's how Linux uses it.

> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
> +"ibm,architecture-vec-5" property.
> +
> +"ibm,numa-lookup-index-table" property contains one or more list numbers representing
> +the domainIDs present in the system. The offset of the domainID in this property is considered
> +the domainID index.

The lookup-index-table is all about compactifying the representation
of the distance matrix.  Because node ids are sparse, the matrix of
distances is also effectively sparse.  You don't want to have a huge
matrix in the DT with mostly meaningless entries, so you use the
lookup table to compact the node ids.

It's a bit confusing though, because you now have two representations
of exactly the same information: the "full" node id (== primary
domainID) and the "short" node id (== domainID lookup table offset).

I can see three ways to clarify this:

  A) Give the short node id an explicit name rather than difficult to
     parse "domainID index" or "domainID offset" which gets easily
     muddled with other uses of index and offset.  Use that name
     throughout.

  B) Eliminate the short ID entirely, and use an explicit sparse
     matrix representation for the distance table e.g. a list of
     (nodeA, nodeB, distance) tuples

     That does have the disadvantage that it's not structurally
     guaranteed that you have entries for every pair of "active" node
     ids.

  C) Eliminate the "long ID" entirely.  In the Form2 world, is there
     actually any point allowing sparse node ids?  In this case you'd
     have the node id read from associativity and use that directly to
     index the distance table
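
For (B), roughly what I have in mind - made-up property name, values
lifted from the example in the document, and assuming symmetric
distances so each pair is listed only once:

  ibm,numa-distance-tuples = { 3,          /* number of tuples */
                               0,  8,  20,
                               0, 40,  80,
                               8, 40, 160 }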

> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
> +N domainID encoded as with encode-int
> +
> +For ex:
> +ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index
> for domainID 8 is 1.

As noted "domainID index" is confusing here, because it's different
from the "domainID index" values in reference-points.

> +
> +"ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
> +distance between resource groups/domains present in the system.
> +
> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.

The N here always has to be the square of the N in the
lookup-index-table, yes?  In which case do we actually need to include
it?

> +For ex:
> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
> +
> +  | 0    8   40
> +--|------------
> +  |
> +0 | 10   20  80
> +  |
> +8 | 20   10  160
> +  |
> +40| 80   160  10
> +
> +
> +"ibm,associativity" property for resources in node 0, 8 and 40
> +
> +{ 3, 6, 7, 0 }
> +{ 3, 6, 9, 8 }
> +{ 3, 6, 7, 40}
> +
> +With "ibm,associativity-reference-points"  { 0x3 }
> +
> +Each resource (drcIndex) now also supports additional optional device tree properties.
> +These properties are marked optional because the platform can choose not to export
> +them and provide the system topology details using the earlier defined device tree
> +properties alone. The optional device tree properties are used when adding new resources
> +(DLPAR) and when the platform didn't provide the topology details of the domain which
> +contains the newly added resource during boot.

In what circumstances would you envisage a hot-add creating a new NUMA
node, as opposed to adding resources to an existing (possibly empty)
node?  Do you need any provision for removing NUMA nodes again?

> +"ibm,numa-lookup-index" property contains a number representing the domainID index to be used
> +when building the NUMA distance of the numa node to which this resource belongs. This can
> +be looked at as the index at which this new domainID would have appeared in
> +"ibm,numa-lookup-index-table" if the domain was present during boot. The domainID

Again, confusing use of "index" where it's used both for offsets in
ibm,associativity properties and for offsets in ibm,numa-lookup-index-table.

> +of the new resource can be obtained from the existing "ibm,associativity" property. This
> +can be used to build distance information of a newly onlined NUMA node via DLPAR operation.
> +The value is 1 based array index value.
> +
> +prop-encoded-array: An integer encoded as with encode-int specifying the domainID index
> +
> +"ibm,numa-distance" property contains one or more list of numbers presenting the NUMA distance
> +from this resource domain to other resources.
> +
> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.

Again, does N have to equal 2 * (existing number of nodes + 1)?  If so
you should say so, if not you should specify how incomplete
information is interpreted.

> +For ex:
> +ibm,associativity     = { 4, 5, 10, 50}
> +ibm,numa-lookup-index = { 4 }
> +ibm,numa-distance   =  {8, 160, 255, 80, 10, 160, 255, 80, 10}
> +
> +resulting in a new toplogy as below.
> +  | 0    8   40   50
> +--|------------------
> +  |
> +0 | 10   20  80   160
> +  |
> +8 | 20   10  160  255
> +  |
> +40| 80   160  10  80
> +  |
> +50| 160  255  80  10
> +
> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
> index 60b631161360..97a3bd9ffeb9 100644
> --- a/arch/powerpc/include/asm/firmware.h
> +++ b/arch/powerpc/include/asm/firmware.h
> @@ -53,6 +53,7 @@
>  #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
>  #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
>  #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
> +#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
>  
>  #ifndef __ASSEMBLY__
>  
> @@ -73,7 +74,7 @@ enum {
>  		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
>  		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
>  		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
> -		FW_FEATURE_RPT_INVALIDATE,
> +		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
>  	FW_FEATURE_PSERIES_ALWAYS = 0,
>  	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
>  	FW_FEATURE_POWERNV_ALWAYS = 0,
> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
> index df9fec9d232c..5c80152e8f18 100644
> --- a/arch/powerpc/include/asm/prom.h
> +++ b/arch/powerpc/include/asm/prom.h
> @@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
>  #define OV5_XCMO		0x0440	/* Page Coalescing */
>  #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
>  #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
> +#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
>  #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
>  #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
>  #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
> index 64b9593038a7..496fdac54c29 100644
> --- a/arch/powerpc/kernel/prom_init.c
> +++ b/arch/powerpc/kernel/prom_init.c
> @@ -1070,7 +1070,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
>  #else
>  		0,
>  #endif
> -		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
> +		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
> +		OV5_FEAT(OV5_FORM2_AFFINITY),
>  		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
>  		.micro_checkpoint = 0,
>  		.reserved0 = 0,
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index d32729f235b8..5a7d94960fb7 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
>  
>  #define FORM0_AFFINITY 0
>  #define FORM1_AFFINITY 1
> +#define FORM2_AFFINITY 2
>  static int affinity_form;
>  
>  #define MAX_DISTANCE_REF_POINTS 4
>  static int max_associativity_domain_index;
>  static const __be32 *distance_ref_points;
>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
> +static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
> +	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
> +};
> +static int numa_id_index_table[MAX_NUMNODES];
>  
>  /*
>   * Allocate node_to_cpumask_map based on number of available nodes
> @@ -166,6 +171,27 @@ static void unmap_cpu_from_node(unsigned long cpu)
>  }
>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>  
> +/*
> + * With FORM2 if we are not using logical domain ids, we will find
> + * both primary and seconday domains with same value. Hence always
> + * start comparison from secondary domains


IIUC, in this new draft, secondary domains are no longer a meaningful
thing in Form2, so this comment and code seem outdated.

> + */
> +static int __cpu_form2_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> +{
> +	int dist = 0;
> +
> +	int i, index;
> +
> +	for (i = 1; i < max_associativity_domain_index; i++) {
> +		index = be32_to_cpu(distance_ref_points[i]);
> +		if (cpu1_assoc[index] == cpu2_assoc[index])
> +			break;
> +		dist++;
> +	}
> +
> +	return dist;
> +}
> +
>  static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>  {
>  	int dist = 0;
> @@ -178,7 +204,6 @@ static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>  			break;
>  		dist++;
>  	}
> -
>  	return dist;
>  }
>  
> @@ -186,8 +211,9 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>  {
>  	/* We should not get called with FORM0 */
>  	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
> -
> -	return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
> +	if (affinity_form == FORM1_AFFINITY)
> +		return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
> +	return __cpu_form2_distance(cpu1_assoc, cpu2_assoc);
>  }
>  
>  /* must hold reference to node during call */
> @@ -201,7 +227,9 @@ int __node_distance(int a, int b)
>  	int i;
>  	int distance = LOCAL_DISTANCE;
>  
> -	if (affinity_form == FORM0_AFFINITY)
> +	if (affinity_form == FORM2_AFFINITY)
> +		return numa_distance_table[a][b];
> +	else if (affinity_form == FORM0_AFFINITY)
>  		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
>  
>  	for (i = 0; i < max_associativity_domain_index; i++) {
> @@ -303,15 +331,116 @@ static void initialize_form1_numa_distance(struct device_node *node)
>  
>  /*
>   * Used to update distance information w.r.t newly added node.
> + * ibm,numa-lookup-index -> 4
> + * ibm,numa-distance -> {5, 20, 40, 60, 80, 10 }
>   */
>  void update_numa_distance(struct device_node *node)
>  {
> +	int i, nid, other_nid, other_nid_index = 0;
> +	const __be32 *numa_indexp;
> +	const __u8  *numa_distancep;
> +	int numa_index, max_numa_index, numa_distance;
> +
>  	if (affinity_form == FORM0_AFFINITY)
>  		return;
>  	else if (affinity_form == FORM1_AFFINITY) {
>  		initialize_form1_numa_distance(node);
>  		return;
>  	}
> +	/* FORM2 affinity  */
> +
> +	nid = of_node_to_nid_single(node);
> +	if (nid == NUMA_NO_NODE)
> +		return;
> +
> +	/* Already initialized */
> +	if (numa_distance_table[nid][nid] != -1)

IIUC, this should exactly correspond to the case where the new
resource lies in an existing NUMA node, yes?

> +		return;
> +	/*
> +	 * update node distance if not already populated.
> +	 */
> +	numa_distancep = of_get_property(node, "ibm,numa-distance", NULL);
> +	if (!numa_distancep)
> +		return;
> +
> +	numa_indexp = of_get_property(node, "ibm,numa-lookup-index", NULL);
> +	if (!numa_indexp)
> +		return;
> +
> +	numa_index = of_read_number(numa_indexp, 1);
> +	/*
> +	 * update the numa_id_index_table. Device tree look at index table as
> +	 * 1 based array indexing.
> +	 */
> +	numa_id_index_table[numa_index - 1] = nid;
> +
> +	max_numa_index = of_read_number((const __be32 *)numa_distancep, 1);
> +	VM_WARN_ON(max_numa_index != 2 * numa_index);

You WARN_ON(), but you don't actually bail out to avoid indexing past
the end of the property below.

AFAICT none of this really works unless numa_index == (previous number
of numa nodes + 1), which makes all the use of these different
variables kind of confusing.  If you had a variable that you just set
equal to the new number of numa nodes (previous number plus the one
being added), then I think you can replace all numa_index and
max_numa_index references with that and it will be clearer.


> +	/* Skip the size which is encoded int */
> +	numa_distancep += sizeof(__be32);
> +
> +	/*
> +	 * First fill the distance information from other node to this node.
> +	 */
> +	other_nid_index = 0;
> +	for (i = 0; i < numa_index; i++) {
> +		numa_distance = numa_distancep[i];
> +		other_nid = numa_id_index_table[other_nid_index++];
> +		numa_distance_table[other_nid][nid] = numa_distance;
> +	}
> +
> +	other_nid_index = 0;
> +	for (; i < max_numa_index; i++) {

Again, splitting this loop and carrying i over seems a confusing way
to code this.  It's basically two loops of N, one writing a row of the
distance matrix, one writing a column (other_nid will even go through
the same values in each loop).

> +		numa_distance = numa_distancep[i];
> +		other_nid = numa_id_index_table[other_nid_index++];
> +		numa_distance_table[nid][other_nid] = numa_distance;
> +	}
> +}
> +
> +/*
> + * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
> + * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
> + */
> +static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
> +{
> +	const __u8 *numa_dist_table;
> +	const __be32 *numa_lookup_index;
> +	int numa_dist_table_length;
> +	int max_numa_index, distance_index;
> +	int i, curr_row = 0, curr_column = 0;
> +
> +	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
> +	max_numa_index = of_read_number(&numa_lookup_index[0], 1);

max_numa_index here has a different meaning to max_numa_index in the
previous function, which is pointlessly confusing.

> +	/* first element of the array is the size and is encode-int */
> +	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
> +	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
> +	/* Skip the size which is encoded int */
> +	numa_dist_table += sizeof(__be32);
> +
> +	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d \n",
> +		 numa_dist_table_length, max_numa_index);
> +
> +	for (i = 0; i < max_numa_index; i++)
> +		/* +1 skip the max_numa_index in the property */
> +		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
> +
> +
> +	VM_WARN_ON(numa_dist_table_length != max_numa_index * max_numa_index);

Again, you don't actually bail out in this case.  And if it has to
have this value, what's the point of encoding it into the property.

> +	for (distance_index = 0; distance_index < numa_dist_table_length; distance_index++) {
> +		int nodeA = numa_id_index_table[curr_row];
> +		int nodeB = numa_id_index_table[curr_column++];

You've already (sort of) verified that the distance table has size
N^2, in which case you can just to a simple two dimensional loop
rather than having to do ugly calculations of row and column.

> +
> +		numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index];
> +
> +		pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
> +		if (curr_column >= max_numa_index) {
> +			curr_row++;
> +			/* reset the column */
> +			curr_column = 0;
> +		}
> +	}
>  }
>  
>  static int __init find_primary_domain_index(void)
> @@ -324,6 +453,9 @@ static int __init find_primary_domain_index(void)
>  	 */
>  	if (firmware_has_feature(FW_FEATURE_OPAL)) {
>  		affinity_form = FORM1_AFFINITY;
> +	} else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) {
> +		dbg("Using form 2 affinity\n");
> +		affinity_form = FORM2_AFFINITY;
>  	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
>  		dbg("Using form 1 affinity\n");
>  		affinity_form = FORM1_AFFINITY;
> @@ -368,8 +500,17 @@ static int __init find_primary_domain_index(void)
>  
>  		index = of_read_number(&distance_ref_points[1], 1);
>  	} else {
> +		/*
> +		 * Both FORM1 and FORM2 affinity find the primary domain details
> +		 * at the same offset.
> +		 */
>  		index = of_read_number(distance_ref_points, 1);
>  	}
> +	/*
> +	 * If it is FORM2 also initialize the distance table here.
> +	 */
> +	if (affinity_form == FORM2_AFFINITY)
> +		initialize_form2_numa_distance_lookup_table(root);
>  
>  	/*
>  	 * Warn and cap if the hardware supports more than
> diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
> index 5d4c2bc20bba..f162156b7b68 100644
> --- a/arch/powerpc/platforms/pseries/firmware.c
> +++ b/arch/powerpc/platforms/pseries/firmware.c
> @@ -123,6 +123,7 @@ vec5_fw_features_table[] = {
>  	{FW_FEATURE_PRRN,		OV5_PRRN},
>  	{FW_FEATURE_DRMEM_V2,		OV5_DRMEM_V2},
>  	{FW_FEATURE_DRC_INFO,		OV5_DRC_INFO},
> +	{FW_FEATURE_FORM2_AFFINITY,	OV5_FORM2_AFFINITY},
>  };
>  
>  static void __init fw_vec5_feature_init(const char *vec5, unsigned long len)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 2/7] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index
  2021-06-17 16:51 ` [PATCH v4 2/7] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index Aneesh Kumar K.V
@ 2021-06-24  3:10   ` David Gibson
  0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2021-06-24  3:10 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linuxppc-dev, mpe, Nathan Lynch, Daniel Henrique Barboza, nvdimm,
	dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 4088 bytes --]

On Thu, Jun 17, 2021 at 10:21:00PM +0530, Aneesh Kumar K.V wrote:
> No functional change in this patch

I've been convinced of your other rename, but I'm not yet convinced
this one actually clarifies anything.

> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/mm/numa.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 8365b298ec48..132813dd1a6c 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -56,7 +56,7 @@ static int n_mem_addr_cells, n_mem_size_cells;
>  static int form1_affinity;
>  
>  #define MAX_DISTANCE_REF_POINTS 4
> -static int distance_ref_points_depth;
> +static int max_associativity_domain_index;
>  static const __be32 *distance_ref_points;
>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
>  
> @@ -169,7 +169,7 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>  
>  	int i, index;
>  
> -	for (i = 0; i < distance_ref_points_depth; i++) {
> +	for (i = 0; i < max_associativity_domain_index; i++) {
>  		index = be32_to_cpu(distance_ref_points[i]);
>  		if (cpu1_assoc[index] == cpu2_assoc[index])
>  			break;
> @@ -193,7 +193,7 @@ int __node_distance(int a, int b)
>  	if (!form1_affinity)
>  		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
>  
> -	for (i = 0; i < distance_ref_points_depth; i++) {
> +	for (i = 0; i < max_associativity_domain_index; i++) {
>  		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
>  			break;
>  
> @@ -213,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
>  	if (!form1_affinity)
>  		return;
>  
> -	for (i = 0; i < distance_ref_points_depth; i++) {
> +	for (i = 0; i < max_associativity_domain_index; i++) {
>  		const __be32 *entry;
>  
>  		entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
> @@ -240,7 +240,7 @@ static int associativity_to_nid(const __be32 *associativity)
>  		nid = NUMA_NO_NODE;
>  
>  	if (nid > 0 &&
> -		of_read_number(associativity, 1) >= distance_ref_points_depth) {
> +		of_read_number(associativity, 1) >= max_associativity_domain_index) {
>  		/*
>  		 * Skip the length field and send start of associativity array
>  		 */
> @@ -310,14 +310,14 @@ static int __init find_primary_domain_index(void)
>  	 */
>  	distance_ref_points = of_get_property(root,
>  					"ibm,associativity-reference-points",
> -					&distance_ref_points_depth);
> +					&max_associativity_domain_index);
>  
>  	if (!distance_ref_points) {
>  		dbg("NUMA: ibm,associativity-reference-points not found.\n");
>  		goto err;
>  	}
>  
> -	distance_ref_points_depth /= sizeof(int);
> +	max_associativity_domain_index /= sizeof(int);
>  
>  	if (firmware_has_feature(FW_FEATURE_OPAL) ||
>  	    firmware_has_feature(FW_FEATURE_TYPE1_AFFINITY)) {
> @@ -328,7 +328,7 @@ static int __init find_primary_domain_index(void)
>  	if (form1_affinity) {
>  		index = of_read_number(distance_ref_points, 1);
>  	} else {
> -		if (distance_ref_points_depth < 2) {
> +		if (max_associativity_domain_index < 2) {
>  			printk(KERN_WARNING "NUMA: "
>  				"short ibm,associativity-reference-points\n");
>  			goto err;
> @@ -341,10 +341,10 @@ static int __init find_primary_domain_index(void)
>  	 * Warn and cap if the hardware supports more than
>  	 * MAX_DISTANCE_REF_POINTS domains.
>  	 */
> -	if (distance_ref_points_depth > MAX_DISTANCE_REF_POINTS) {
> +	if (max_associativity_domain_index > MAX_DISTANCE_REF_POINTS) {
>  		printk(KERN_WARNING "NUMA: distance array capped at "
>  			"%d entries\n", MAX_DISTANCE_REF_POINTS);
> -		distance_ref_points_depth = MAX_DISTANCE_REF_POINTS;
> +		max_associativity_domain_index = MAX_DISTANCE_REF_POINTS;
>  	}
>  
>  	of_node_put(root);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 4/7] powerpc/pseries: Consolidate DLPAR NUMA distance update
  2021-06-17 16:51 ` [PATCH v4 4/7] powerpc/pseries: Consolidate DLPAR NUMA distance update Aneesh Kumar K.V
@ 2021-06-24  3:11   ` David Gibson
  0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2021-06-24  3:11 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linuxppc-dev, mpe, Nathan Lynch, Daniel Henrique Barboza, nvdimm,
	dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 4131 bytes --]

On Thu, Jun 17, 2021 at 10:21:02PM +0530, Aneesh Kumar K.V wrote:
> The associativity details of the newly added resourced are collected from
> the hypervisor via "ibm,configure-connector" rtas call. Update the numa
> distance details of the newly added numa node after the above call. In
> later patch we will remove updating NUMA distance when we are looking
> for node id from associativity array.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

I think this patch and the next would be easier to review if merged
together.  That would make the fact that this is (half of) a code
motion clearer.

> ---
>  arch/powerpc/mm/numa.c                        | 41 +++++++++++++++++++
>  arch/powerpc/platforms/pseries/hotplug-cpu.c  |  2 +
>  .../platforms/pseries/hotplug-memory.c        |  2 +
>  arch/powerpc/platforms/pseries/pseries.h      |  1 +
>  4 files changed, 46 insertions(+)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0ec16999beef..645a95e3a7ea 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -287,6 +287,47 @@ int of_node_to_nid(struct device_node *device)
>  }
>  EXPORT_SYMBOL(of_node_to_nid);
>  
> +static void __initialize_form1_numa_distance(const __be32 *associativity)
> +{
> +	int i, nid;
> +
> +	if (of_read_number(associativity, 1) >= primary_domain_index) {
> +		nid = of_read_number(&associativity[primary_domain_index], 1);
> +
> +		for (i = 0; i < max_domain_index; i++) {
> +			const __be32 *entry;
> +
> +			entry = &associativity[be32_to_cpu(distance_ref_points[i])];
> +			distance_lookup_table[nid][i] = of_read_number(entry, 1);
> +		}
> +	}
> +}
> +
> +static void initialize_form1_numa_distance(struct device_node *node)
> +{
> +	const __be32 *associativity;
> +
> +	associativity = of_get_associativity(node);
> +	if (!associativity)
> +		return;
> +
> +	__initialize_form1_numa_distance(associativity);
> +	return;
> +}
> +
> +/*
> + * Used to update distance information w.r.t newly added node.
> + */
> +void update_numa_distance(struct device_node *node)
> +{
> +	if (affinity_form == FORM0_AFFINITY)
> +		return;
> +	else if (affinity_form == FORM1_AFFINITY) {
> +		initialize_form1_numa_distance(node);
> +		return;
> +	}
> +}
> +
>  static int __init find_primary_domain_index(void)
>  {
>  	int index;
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 7e970f81d8ff..778b6ab35f0d 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -498,6 +498,8 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
>  		return saved_rc;
>  	}
>  
> +	update_numa_distance(dn);
> +
>  	rc = dlpar_online_cpu(dn);
>  	if (rc) {
>  		saved_rc = rc;
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index 8377f1f7c78e..0e602c3b01ea 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -180,6 +180,8 @@ static int update_lmb_associativity_index(struct drmem_lmb *lmb)
>  		return -ENODEV;
>  	}
>  
> +	update_numa_distance(lmb_node);
> +
>  	dr_node = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
>  	if (!dr_node) {
>  		dlpar_free_cc_nodes(lmb_node);
> diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
> index 1f051a786fb3..663a0859cf13 100644
> --- a/arch/powerpc/platforms/pseries/pseries.h
> +++ b/arch/powerpc/platforms/pseries/pseries.h
> @@ -113,4 +113,5 @@ extern u32 pseries_security_flavor;
>  void pseries_setup_security_mitigations(void);
>  void pseries_lpar_read_hblkrm_characteristics(void);
>  
> +void update_numa_distance(struct device_node *node);
>  #endif /* _PSERIES_PSERIES_H */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity
  2021-06-24  3:08   ` David Gibson
@ 2021-06-24  8:20     ` Aneesh Kumar K.V
  2021-06-28  6:18       ` David Gibson
  0 siblings, 1 reply; 20+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-24  8:20 UTC (permalink / raw)
  To: David Gibson
  Cc: linuxppc-dev, mpe, Nathan Lynch, Daniel Henrique Barboza, nvdimm,
	dan.j.williams

David Gibson <david@gibson.dropbear.id.au> writes:

> On Thu, Jun 17, 2021 at 10:21:05PM +0530, Aneesh Kumar K.V wrote:
>> PAPR interface currently supports two different ways of communicating resource
>> grouping details to the OS. These are referred to as Form 0 and Form 1
>> associativity grouping. Form 0 is the older format and is now considered
>> deprecated. This patch adds another resource grouping named FORM2.
>> 
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  Documentation/powerpc/associativity.rst   | 135 ++++++++++++++++++++
>>  arch/powerpc/include/asm/firmware.h       |   3 +-
>>  arch/powerpc/include/asm/prom.h           |   1 +
>>  arch/powerpc/kernel/prom_init.c           |   3 +-
>>  arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
>>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>>  6 files changed, 286 insertions(+), 6 deletions(-)
>>  create mode 100644 Documentation/powerpc/associativity.rst
>> 
>> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
>> new file mode 100644
>> index 000000000000..93be604ac54d
>> --- /dev/null
>> +++ b/Documentation/powerpc/associativity.rst
>> @@ -0,0 +1,135 @@
>> +============================
>> +NUMA resource associativity
>> +=============================
>> +
>> +Associativity represents the groupings of the various platform resources into
>> +domains of substantially similar mean performance relative to resources outside
>> +of that domain. Resources subsets of a given domain that exhibit better
>> +performance relative to each other than relative to other resources subsets
>> +are represented as being members of a sub-grouping domain. This performance
>> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
>> +From the platform view, these groups are also referred to as domains.
>> +
>> +PAPR interface currently supports different ways of communicating these resource
>> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
>> +associativity grouping. Form 0 is the older format and is now considered deprecated.
>> +
>> +Hypervisor indicates the type/form of associativity used via "ibm,arcitecture-vec-5 property".
>> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
>> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
>> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>> +
>> +Form 0
>> +-----
>> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
>> +
>> +Form 1
>> +-----
>> +With Form 1 a combination of ibm,associativity-reference-points and ibm,associativity
>> +device tree properties are used to determine the NUMA distance between resource groups/domains.
>> +
>> +The “ibm,associativity” property contains one or more lists of numbers (domainID)
>> +representing the resource’s platform grouping domains.
>> +
>> +The “ibm,associativity-reference-points” property contains one or more list of numbers
>> +(domainID index) that represents the 1 based ordinal in the associativity lists.
>> +The list of domainID index represnets increasing hierachy of
>> resource grouping.
>
> Typo "represnets".  Also s/hierachy/hierarchy/
>
>> +
>> +ex:
>> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
>
>> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
>> +Linux kernel computes NUMA distance between two domains by recursively comparing
>> +if they belong to the same higher-level domains. For mismatch at every higher
>> +level of the resource group, the kernel doubles the NUMA distance between the
>> +comparing domains.
>
> The Form1 description is still kinda confusing, but I don't really
> care.  Form1 *is* confusing, it's Form2 that I hope will be clearer.
>
>> +
>> +Form 2
>> +-------
>> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
>> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
>> +domain numbering. With numa distance computation now detached from the index value of
>> +"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
>> +same domainID index representing resource groups of different performance/latency characteristics.
>
> So, see you've removed the special handling of secondary IDs for pmem
> - big improvement, thanks.  IIUC, in this revised version, for Form2
> there's really no reason for ibm,associativity-reference-points to
> have >1 entry.  Is that right?
>
> In Form2 everything revolves around the primary domain ID - so much
> that I suggest we come up with a short name for it.  How about just
> "node id" since that's how Linux uses it.

We don't really refer to the primary domain ID in the rest of the FORM2
details. We do refer to the entries as domainIDs. Would renaming domainID to
NUMA node id be better?

>
>> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
>> +"ibm,architecture-vec-5" property.
>> +
>> +"ibm,numa-lookup-index-table" property contains one or more list numbers representing
>> +the domainIDs present in the system. The offset of the domainID in this property is considered
>> +the domainID index.
>
> The lookup-index-table is all about compactifying the representation
> of the distance matrix.  Because node ids are sparse, the matrix of
> distances is also effectively sparse.  You don't want to have a huge
> matrix in the DT with mostly meaningless entries, so you use the
> lookup table to compact the node ids.
>
> It's a bit confusing though, because you now have two representations
> of exactly the same information the "full" node id (== primary
> domainID) and the "short" node id (==domainID lookup table offset).
>
> I can see three ways to clarify this:
>
>   A) Give the short node id an explicit name rather than difficult to
>      parse "domainID index" or "domainID offset" which gets easily
>      muddled with other uses of index and offset.  Use that name
>      throughout.

I dropped the domainID index and started referring to that as the domain
distance offset.

>
>   B) Eliminate the short ID entirely, and use an explicit sparse
>      matrix representation for the distance table e.g. a list of
>      (nodeA, nodeB, distance) tuples
>
>      That does have the disadvantage that it's not structurally
>      guaranteed that you have entries for every pair of "active" node
>      ids.

Other hypervisors would want to have a large possible domainID value and
mostly want to publish the full topology details during boot. Using the
above format makes even a 16-node config have a large "ibm,numa-distance-table"
property.

>
>   C) Eliminate the "long ID" entirely.  In the Form2 world, is there
>      actually any point allowing sparse node ids?  In this case you'd
>      have the node id read from associativity and use that directly to
>      index the distance table

Yes, other hypervisors would like to partition the domainID space for
different resources.

>
>> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
>> +N domainID encoded as with encode-int
>> +
>> +For ex:
>> +ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index
>> for domainID 8 is 1.
>
> As noted "domainID index" is confusing here, because it's different
> from the "domainID index" values in reference-points.
>
>> +
>> +"ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
>> +distance between resource groups/domains present in the system.
>> +
>> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>
> The N here always has to be the square of the N in the
> lookup-index-table, yes?  In which case do we actually need to include
> it?

Most PAPR device tree properties follow the pattern {N, .... N elements}.
This follows the same convention.
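
Purely as an illustration (the helper below is made up for this mail, not
part of the series), a property laid out that way decodes along these lines:

static void dump_form2_bytes_property(struct device_node *node, const char *name)
{
	const __u8 *prop;
	int i, count;

	/* { N, N byte-sized elements }, N is encode-int */
	prop = of_get_property(node, name, NULL);
	if (!prop)
		return;

	count = of_read_number((const __be32 *)prop, 1);
	/* skip the element count */
	prop += sizeof(__be32);

	for (i = 0; i < count; i++)
		pr_debug("%s[%d] = %d\n", name, i, prop[i]);
}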

>
>> +For ex:
>> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
>> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
>> +
>> +  | 0    8   40
>> +--|------------
>> +  |
>> +0 | 10   20  80
>> +  |
>> +8 | 20   10  160
>> +  |
>> +40| 80   160  10
>> +
>> +
>> +"ibm,associativity" property for resources in node 0, 8 and 40
>> +
>> +{ 3, 6, 7, 0 }
>> +{ 3, 6, 9, 8 }
>> +{ 3, 6, 7, 40}
>> +
>> +With "ibm,associativity-reference-points"  { 0x3 }
>> +
>> +Each resource (drcIndex) now also supports additional optional device tree properties.
>> +These properties are marked optional because the platform can choose not to export
>> +them and provide the system topology details using the earlier defined device tree
>> +properties alone. The optional device tree properties are used when adding new resources
>> +(DLPAR) and when the platform didn't provide the topology details of the domain which
>> +contains the newly added resource during boot.
>
> In what circumstances would you envisage a hot-add creating a new NUMA
> node, as opposed to adding resources to an existing (possibly empty)
> node?  Do you need any provision for removing NUMA nodes again?

Both Qemu and PowerVM don't intend to use this at this point. Both will
provide the full possible NUMA node details during boot. But I was not
sure whether we really need to rule out the possibility of a new node
being added. This can happen with new memory devices that get hotplugged
where the latency of the device is such that we never expressed it in the
distance table during boot.



>
>> +"ibm,numa-lookup-index" property contains a number representing the domainID index to be used
>> +when building the NUMA distance of the numa node to which this resource belongs. This can
>> +be looked at as the index at which this new domainID would have appeared in
>> +"ibm,numa-lookup-index-table" if the domain was present during boot. The domainID
>
> Again, confusing use of "index" where it's used both for offsets in
> ibm,associativity properties and for offsets in ibm,numa-lookup-index-table

I guess we can drop the ibm,numa-lookup-index and state that the number of
distance elements in "ibm,numa-distance" implies the index at which this
NUMA node should appear when computing the distance details.
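For example, if "ibm,numa-distance" carries 8 distance elements, the new
domain's distance offset works out to 8 / 2 = 4, i.e. it becomes the fourth
entry in the lookup index table.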

>
>> +of the new resource can be obtained from the existing "ibm,associativity" property. This
>> +can be used to build distance information of a newly onlined NUMA node via DLPAR operation.
>> +The value is 1 based array index value.
>> +
>> +prop-encoded-array: An integer encoded as with encode-int specifying the domainID index
>> +
>> +"ibm,numa-distance" property contains one or more list of numbers presenting the NUMA distance
>> +from this resource domain to other resources.
>> +
>> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>
> Again, does N have to equal 2 * (existing number of nodes + 1)?  If so
> you should say so, if not you should specify how incomplete
> information is interpreted.
>
>> +For ex:
>> +ibm,associativity     = { 4, 5, 10, 50}
>> +ibm,numa-lookup-index = { 4 }
>> +ibm,numa-distance   =  {8, 160, 255, 80, 10, 160, 255, 80, 10}
>> +
>> +resulting in a new toplogy as below.
>> +  | 0    8   40   50
>> +--|------------------
>> +  |
>> +0 | 10   20  80   160
>> +  |
>> +8 | 20   10  160  255
>> +  |
>> +40| 80   160  10  80
>> +  |
>> +50| 160  255  80  10
>> +

Here is the relevant updated part of the doc.

Form 2 associativity format adds separate device tree properties representing NUMA node distance,
thereby making the node distance computation flexible. Form 2 also allows flexible primary
domain numbering. With NUMA distance computation now detached from the index value in the
"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
ids at the same domainID index, representing resource groups of different performance/latency
characteristics.

Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
"ibm,architecture-vec-5" property.

"ibm,numa-lookup-index-table" property contains one or more list numbers representing
the domainIDs present in the system. The offset of the domainID in this property is
used as an index while computing numa distance information via "ibm,numa-distance-table".

prop-encoded-array: The number N of domainIDs encoded as with encode-int, followed by
N domainIDs encoded as with encode-int.

For ex:
"ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. Offset of domainID 8 (2) is used when
computing the distance of domain 8 from other domains present in the system. For rest of
this document this offset will be referred to as domain distance offset.

"ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
distance between resource groups/domains present in the system.

prop-encoded-array: The number N of distance values encoded as with encode-int, followed by
N distance values encoded as with encode-bytes. The maximum distance value that can be encoded is 255.

For ex:
ibm,numa-lookup-index-table =  {3, 0, 8, 40}
ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}

  | 0    8   40
--|------------
  |
0 | 10   20  80
  |
8 | 20   10  160
  |
40| 80   160  10

A possible "ibm,associativity" property for resources in node 0, 8 and 40

{ 3, 6, 7, 0 }
{ 3, 6, 9, 8 }
{ 3, 6, 7, 40}

With "ibm,associativity-reference-points"  { 0x3 }

"ibm,lookup-index-table" helps in having a compact representation of distance matrix.
Since domainID can be sparse, the matrix of distances can also be effectively sparse.
With "ibm,lookup-index-table" we are able to achieve a compact representation of
distance information.
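For the lookup table above ({4, 0, 8, 250, 252}), a matrix indexed directly by
domainID would have to cover 253 x 253 entries just because domainID 252 exists,
whereas the lookup table collapses it to a 4 x 4 distance table.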

Each resource (drcIndex) now also supports additional optional device tree properties.
These properties are marked optional because the platform can choose not to export
them and provide the system topology details using the earlier defined device tree
properties alone. The optional device tree properties are used when adding new resources
(DLPAR) and when the platform didn't provide the topology details of the domain which
contains the newly added resource during boot.

"ibm,numa-distance" property contains one or more list of numbers presenting the NUMA distance
from this resource domain to other resources.

prop-encoded-array: The number N of distance values encoded as with encode-int, followed by
N distance values encoded as with encode-bytes. The maximum distance value that can be encoded is 255.

If the system currently has 3 domains and a DLPAR operation is adding one additional
domain, the "ibm,numa-distance" property should have 2 * (3 + 1) = 8 distance elements,
as shown below. In other words, the domain distance offset value for the newly added domain
is the number of distance value elements in the "ibm,numa-distance" property divided by 2.
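
For ex: with existing domains {0, 8, 40} and a new domain 50 being added
(carrying over the example from the earlier draft):

ibm,numa-distance   =  {8, 160, 255, 80, 10, 160, 255, 80, 10}

The first four values are the distances from domains 0, 8, 40 and 50 to the
new domain and the last four are the distances from the new domain back to
each of them, so the new domain takes domain distance offset 8 / 2 = 4.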


>> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
>> index 60b631161360..97a3bd9ffeb9 100644
>> --- a/arch/powerpc/include/asm/firmware.h
>> +++ b/arch/powerpc/include/asm/firmware.h
>> @@ -53,6 +53,7 @@
>>  #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
>>  #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
>>  #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
>> +#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
>>  
>>  #ifndef __ASSEMBLY__
>>  
>> @@ -73,7 +74,7 @@ enum {
>>  		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
>>  		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
>>  		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
>> -		FW_FEATURE_RPT_INVALIDATE,
>> +		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
>>  	FW_FEATURE_PSERIES_ALWAYS = 0,
>>  	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
>>  	FW_FEATURE_POWERNV_ALWAYS = 0,
>> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
>> index df9fec9d232c..5c80152e8f18 100644
>> --- a/arch/powerpc/include/asm/prom.h
>> +++ b/arch/powerpc/include/asm/prom.h
>> @@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
>>  #define OV5_XCMO		0x0440	/* Page Coalescing */
>>  #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
>>  #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
>> +#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
>>  #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
>>  #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
>>  #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
>> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
>> index 64b9593038a7..496fdac54c29 100644
>> --- a/arch/powerpc/kernel/prom_init.c
>> +++ b/arch/powerpc/kernel/prom_init.c
>> @@ -1070,7 +1070,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
>>  #else
>>  		0,
>>  #endif
>> -		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
>> +		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
>> +		OV5_FEAT(OV5_FORM2_AFFINITY),
>>  		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
>>  		.micro_checkpoint = 0,
>>  		.reserved0 = 0,
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index d32729f235b8..5a7d94960fb7 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
>>  
>>  #define FORM0_AFFINITY 0
>>  #define FORM1_AFFINITY 1
>> +#define FORM2_AFFINITY 2
>>  static int affinity_form;
>>  
>>  #define MAX_DISTANCE_REF_POINTS 4
>>  static int max_associativity_domain_index;
>>  static const __be32 *distance_ref_points;
>>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
>> +static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
>> +	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
>> +};
>> +static int numa_id_index_table[MAX_NUMNODES];
>>  
>>  /*
>>   * Allocate node_to_cpumask_map based on number of available nodes
>> @@ -166,6 +171,27 @@ static void unmap_cpu_from_node(unsigned long cpu)
>>  }
>>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>>  
>> +/*
>> + * With FORM2 if we are not using logical domain ids, we will find
>> + * both primary and seconday domains with same value. Hence always
>> + * start comparison from secondary domains
>
>
> IIUC, in this new draft, secondary domains are no longer a meaningful
> thing in Form2, so this comment and code seem outdated.

This needed a fixup. With FORM2 our associativity array can be just one
element. I fixed it up as below.

static int __cpu_form2_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
{
	int node1, node2;
	int dist = 0;

	node1 = associativity_to_nid(cpu1_assoc);
	node2 = associativity_to_nid(cpu2_assoc);

	dist = numa_distance_table[node1][node2];
	if (dist == LOCAL_DISTANCE)
		return 0;
	else if (dist == REMOTE_DISTANCE)
		return 1;
	else
		return 2;
}

I also renamed cpu_distance to cpu_relative_distance.


>> + */
>> +static int __cpu_form2_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> +{
>> +	int dist = 0;
>> +
>> +	int i, index;
>> +
>> +	for (i = 1; i < max_associativity_domain_index; i++) {
>> +		index = be32_to_cpu(distance_ref_points[i]);
>> +		if (cpu1_assoc[index] == cpu2_assoc[index])
>> +			break;
>> +		dist++;
>> +	}
>> +
>> +	return dist;
>> +}
>> +
>>  static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>>  {
>>  	int dist = 0;
>> @@ -178,7 +204,6 @@ static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>>  			break;
>>  		dist++;
>>  	}
>> -
>>  	return dist;
>>  }
>>  
>> @@ -186,8 +211,9 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>>  {
>>  	/* We should not get called with FORM0 */
>>  	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
>> -
>> -	return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
>> +	if (affinity_form == FORM1_AFFINITY)
>> +		return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
>> +	return __cpu_form2_distance(cpu1_assoc, cpu2_assoc);
>>  }
>>  
>>  /* must hold reference to node during call */
>> @@ -201,7 +227,9 @@ int __node_distance(int a, int b)
>>  	int i;
>>  	int distance = LOCAL_DISTANCE;
>>  
>> -	if (affinity_form == FORM0_AFFINITY)
>> +	if (affinity_form == FORM2_AFFINITY)
>> +		return numa_distance_table[a][b];
>> +	else if (affinity_form == FORM0_AFFINITY)
>>  		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
>>  
>>  	for (i = 0; i < max_associativity_domain_index; i++) {
>> @@ -303,15 +331,116 @@ static void initialize_form1_numa_distance(struct device_node *node)
>>  
>>  /*
>>   * Used to update distance information w.r.t newly added node.
>> + * ibm,numa-lookup-index -> 4
>> + * ibm,numa-distance -> {5, 20, 40, 60, 80, 10 }
>>   */
>>  void update_numa_distance(struct device_node *node)
>>  {
>> +	int i, nid, other_nid, other_nid_index = 0;
>> +	const __be32 *numa_indexp;
>> +	const __u8  *numa_distancep;
>> +	int numa_index, max_numa_index, numa_distance;
>> +
>>  	if (affinity_form == FORM0_AFFINITY)
>>  		return;
>>  	else if (affinity_form == FORM1_AFFINITY) {
>>  		initialize_form1_numa_distance(node);
>>  		return;
>>  	}
>> +	/* FORM2 affinity  */
>> +
>> +	nid = of_node_to_nid_single(node);
>> +	if (nid == NUMA_NO_NODE)
>> +		return;
>> +
>> +	/* Already initialized */
>> +	if (numa_distance_table[nid][nid] != -1)
>
> IIUC, this should exactly correspond to the case where the new
> resource lies in an existing NUMA node, yes?

yes.

>
>> +		return;
>> +	/*
>> +	 * update node distance if not already populated.
>> +	 */
>> +	numa_distancep = of_get_property(node, "ibm,numa-distance", NULL);
>> +	if (!numa_distancep)
>> +		return;
>> +
>> +	numa_indexp = of_get_property(node, "ibm,numa-lookup-index", NULL);
>> +	if (!numa_indexp)
>> +		return;
>> +
>> +	numa_index = of_read_number(numa_indexp, 1);
>> +	/*
>> +	 * update the numa_id_index_table. Device tree look at index table as
>> +	 * 1 based array indexing.
>> +	 */
>> +	numa_id_index_table[numa_index - 1] = nid;
>> +
>> +	max_numa_index = of_read_number((const __be32 *)numa_distancep, 1);
>> +	VM_WARN_ON(max_numa_index != 2 * numa_index);
>
> You WARN_ON(), but you don't actually bail out to avoid indexing past
> the end of the property below.
>
> AFAICT none of this really works unless numa_index == (previous number
> of numa nodes + 1), which makes all the use of these different
> variables kind of confusing.  If you had a variable that you just set
> equal to the new number of numa nodes (previous number plus the one
> being added), then I think you can replace all numa_index and
> max_numa_index references with that and it will be clearer.

I rewrote that as below.

	numa_distancep = of_get_property(node, "ibm,numa-distance", NULL);
	if (!numa_distancep)
		return;

	max_dist_element = of_read_number((const __be32 *)numa_distancep, 1);

	numa_id_index = max_dist_element >> 1;
	/*
	 * update the numa_id_index_table. Device tree look at index table as
	 * 1 based array indexing.
	 */
	if (numa_id_index_table[numa_id_index - 1] != -1) {
		WARN(1, "Wrong NUMA distance information\n");
		return;
	}

	numa_id_index_table[numa_id_index - 1] = nid;

	/* Skip the size which is encoded int */
	numa_distancep += sizeof(__be32);

	/*
	 * First fill the distance information from other node to this node.
	 */
	distance_index = 0;
	for (i = 0; i < numa_id_index; i++) {
		numa_distance = numa_distancep[distance_index++];
		other_nid = numa_id_index_table[i];
		if (other_nid == NUMA_NO_NODE)
			continue;
		numa_distance_table[other_nid][nid] = numa_distance;
	}

	for (i = 0; i < numa_id_index; i++) {
		numa_distance = numa_distancep[distance_index++];
		other_nid = numa_id_index_table[i];
		if (other_nid == NUMA_NO_NODE)
			continue;
		numa_distance_table[nid][other_nid] = numa_distance;
	}
}


>
>
>> +	/* Skip the size which is encoded int */
>> +	numa_distancep += sizeof(__be32);
>> +
>> +	/*
>> +	 * First fill the distance information from other node to this node.
>> +	 */
>> +	other_nid_index = 0;
>> +	for (i = 0; i < numa_index; i++) {
>> +		numa_distance = numa_distancep[i];
>> +		other_nid = numa_id_index_table[other_nid_index++];
>> +		numa_distance_table[other_nid][nid] = numa_distance;
>> +	}
>> +
>> +	other_nid_index = 0;
>> +	for (; i < max_numa_index; i++) {
>
> Again, splitting this loop and carrying i over seems a confusing way
> to code this.  It's basically two loops of N, one writing a row of the
> distance matrix, one writing a column (other_nid will even go through
> the same values in each loop).

Fixed

>
>> +		numa_distance = numa_distancep[i];
>> +		other_nid = numa_id_index_table[other_nid_index++];
>> +		numa_distance_table[nid][other_nid] = numa_distance;
>> +	}
>> +}
>> +
>> +/*
>> + * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
>> + * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
>> + */
>> +static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
>> +{
>> +	const __u8 *numa_dist_table;
>> +	const __be32 *numa_lookup_index;
>> +	int numa_dist_table_length;
>> +	int max_numa_index, distance_index;
>> +	int i, curr_row = 0, curr_column = 0;
>> +
>> +	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
>> +	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
>
> max_numa_index here has a different meaning to max_numa_index in the
> previous function, which is pointlessly confusing.
>
>> +	/* first element of the array is the size and is encode-int */
>> +	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
>> +	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
>> +	/* Skip the size which is encoded int */
>> +	numa_dist_table += sizeof(__be32);
>> +
>> +	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d \n",
>> +		 numa_dist_table_length, max_numa_index);
>> +
>> +	for (i = 0; i < max_numa_index; i++)
>> +		/* +1 skip the max_numa_index in the property */
>> +		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
>> +
>> +
>> +	VM_WARN_ON(numa_dist_table_length != max_numa_index * max_numa_index);
>
> Again, you don't actually bail out in this case.  And if it has to
> have this value, what's the point of encoding it into the property.
>
>> +	for (distance_index = 0; distance_index < numa_dist_table_length; distance_index++) {
>> +		int nodeA = numa_id_index_table[curr_row];
>> +		int nodeB = numa_id_index_table[curr_column++];
>
> You've already (sort of) verified that the distance table has size
> N^2, in which case you can just to a simple two dimensional loop
> rather than having to do ugly calculations of row and column.

Fixed all the above and updated as below.
	if (numa_dist_table_length != max_numa_index * max_numa_index) {
		WARN(1, "Wrong NUMA distance information\n");
		/* consider everybody else just remote. */

		for (i = 0;  i < max_numa_index; i++) {
			for (j = 0; j < max_numa_index; j++) {
				int nodeA = numa_id_index_table[i];
				int nodeB = numa_id_index_table[j];

				if (nodeA == nodeB)
					numa_distance_table[nodeA][nodeB] = LOCAL_DISTANCE;
				else
					numa_distance_table[nodeA][nodeB] = REMOTE_DISTANCE;
			}
		}
		/* don't read past the property with a bogus length */
		return;
	}

	distance_index = 0;
	for (i = 0;  i < max_numa_index; i++) {
		for (j = 0; j < max_numa_index; j++) {
			int nodeA = numa_id_index_table[i];
			int nodeB = numa_id_index_table[j];

			numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index++];

			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
		}
	}
}


-aneesh

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity
  2021-06-17 16:51 ` [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity Aneesh Kumar K.V
  2021-06-21 22:02   ` Daniel Henrique Barboza
  2021-06-24  3:08   ` David Gibson
@ 2021-06-24 10:33   ` Laurent Dufour
  2021-06-24 10:55     ` Aneesh Kumar K.V
  2 siblings, 1 reply; 20+ messages in thread
From: Laurent Dufour @ 2021-06-24 10:33 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linuxppc-dev, mpe
  Cc: Nathan Lynch, nvdimm, Daniel Henrique Barboza, dan.j.williams,
	David Gibson

Hi Aneesh,

A little bit of wordsmithing below...

Le 17/06/2021 à 18:51, Aneesh Kumar K.V a écrit :
> PAPR interface currently supports two different ways of communicating resource
> grouping details to the OS. These are referred to as Form 0 and Form 1
> associativity grouping. Form 0 is the older format and is now considered
> deprecated. This patch adds another resource grouping named FORM2.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>   Documentation/powerpc/associativity.rst   | 135 ++++++++++++++++++++
>   arch/powerpc/include/asm/firmware.h       |   3 +-
>   arch/powerpc/include/asm/prom.h           |   1 +
>   arch/powerpc/kernel/prom_init.c           |   3 +-
>   arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
>   arch/powerpc/platforms/pseries/firmware.c |   1 +
>   6 files changed, 286 insertions(+), 6 deletions(-)
>   create mode 100644 Documentation/powerpc/associativity.rst
> 
> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
> new file mode 100644
> index 000000000000..93be604ac54d
> --- /dev/null
> +++ b/Documentation/powerpc/associativity.rst
> @@ -0,0 +1,135 @@
> +============================
> +NUMA resource associativity
> +=============================
> +
> +Associativity represents the groupings of the various platform resources into
> +domains of substantially similar mean performance relative to resources outside
> +of that domain. Resources subsets of a given domain that exhibit better
> +performance relative to each other than relative to other resources subsets
> +are represented as being members of a sub-grouping domain. This performance
> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
> +From the platform view, these groups are also referred to as domains.
> +
> +PAPR interface currently supports different ways of communicating these resource
> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
> +associativity grouping. Form 0 is the older format and is now considered deprecated.
> +
> +Hypervisor indicates the type/form of associativity used via "ibm,arcitecture-vec-5 property".
                                                            architecture ^

> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
> +
> +Form 0
> +-----
> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
> +
> +Form 1
> +-----
> +With Form 1 a combination of ibm,associativity-reference-points and ibm,associativity
> +device tree properties are used to determine the NUMA distance between resource groups/domains.
> +
> +The “ibm,associativity” property contains one or more lists of numbers (domainID)
> +representing the resource’s platform grouping domains.
> +
> +The “ibm,associativity-reference-points” property contains one or more list of numbers
> +(domainID index) that represents the 1 based ordinal in the associativity lists.
> +The list of domainID index represnets increasing hierachy of resource grouping.
                         represents ^

> +
> +ex:
> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
> +
> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
> +Linux kernel computes NUMA distance between two domains by recursively comparing
> +if they belong to the same higher-level domains. For mismatch at every higher
> +level of the resource group, the kernel doubles the NUMA distance between the
> +comparing domains.
> +
> +Form 2
> +-------
> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
> +domain numbering. With numa distance computation now detached from the index value of
> +"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
> +same domainID index representing resource groups of different performance/latency characteristics.
> +
> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
> +"ibm,architecture-vec-5" property.
> +
> +"ibm,numa-lookup-index-table" property contains one or more list numbers representing
> +the domainIDs present in the system. The offset of the domainID in this property is considered
> +the domainID index.
> +
> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
> +N domainID encoded as with encode-int
> +
> +For ex:
> +ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index for domainID 8 is 1.
> +
> +"ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
> +distance between resource groups/domains present in the system.
> +
> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
> +
> +For ex:
> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
> +
> +  | 0    8   40
> +--|------------
> +  |
> +0 | 10   20  80
> +  |
> +8 | 20   10  160
> +  |
> +40| 80   160  10
> +
> +
> +"ibm,associativity" property for resources in node 0, 8 and 40
> +
> +{ 3, 6, 7, 0 }
> +{ 3, 6, 9, 8 }
> +{ 3, 6, 7, 40}
> +
> +With "ibm,associativity-reference-points"  { 0x3 }
> +
> +Each resource (drcIndex) now also supports additional optional device tree properties.
> +These properties are marked optional because the platform can choose not to export
> +them and provide the system topology details using the earlier defined device tree
> +properties alone. The optional device tree properties are used when adding new resources
> +(DLPAR) and when the platform didn't provide the topology details of the domain which
> +contains the newly added resource during boot.
> +
> +"ibm,numa-lookup-index" property contains a number representing the domainID index to be used
> +when building the NUMA distance of the numa node to which this resource belongs. This can
> +be looked at as the index at which this new domainID would have appeared in
> +"ibm,numa-lookup-index-table" if the domain was present during boot. The domainID
> +of the new resource can be obtained from the existing "ibm,associativity" property. This
> +can be used to build distance information of a newly onlined NUMA node via DLPAR operation.
> +The value is 1 based array index value.
> +
> +prop-encoded-array: An integer encoded as with encode-int specifying the domainID index
> +
> +"ibm,numa-distance" property contains one or more list of numbers presenting the NUMA distance
> +from this resource domain to other resources.
> +
> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
> +
> +For ex:
> +ibm,associativity     = { 4, 5, 10, 50}

Is missing the first byte of the property (length) or an associativity number?

> +ibm,numa-lookup-index = { 4 }
> +ibm,numa-distance   =  {8, 160, 255, 80, 10, 160, 255, 80, 10}
> +
> +resulting in a new toplogy as below.
> +  | 0    8   40   50
> +--|------------------
> +  |
> +0 | 10   20  80   160
> +  |
> +8 | 20   10  160  255
> +  |
> +40| 80   160  10  80
> +  |
> +50| 160  255  80  10
> +
> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
> index 60b631161360..97a3bd9ffeb9 100644
> --- a/arch/powerpc/include/asm/firmware.h
> +++ b/arch/powerpc/include/asm/firmware.h
> @@ -53,6 +53,7 @@
>   #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
>   #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
>   #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
> +#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
>   
>   #ifndef __ASSEMBLY__
>   
> @@ -73,7 +74,7 @@ enum {
>   		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
>   		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
>   		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
> -		FW_FEATURE_RPT_INVALIDATE,
> +		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
>   	FW_FEATURE_PSERIES_ALWAYS = 0,
>   	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
>   	FW_FEATURE_POWERNV_ALWAYS = 0,
> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
> index df9fec9d232c..5c80152e8f18 100644
> --- a/arch/powerpc/include/asm/prom.h
> +++ b/arch/powerpc/include/asm/prom.h
> @@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
>   #define OV5_XCMO		0x0440	/* Page Coalescing */
>   #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
>   #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
> +#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
>   #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
>   #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
>   #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
> index 64b9593038a7..496fdac54c29 100644
> --- a/arch/powerpc/kernel/prom_init.c
> +++ b/arch/powerpc/kernel/prom_init.c
> @@ -1070,7 +1070,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
>   #else
>   		0,
>   #endif
> -		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
> +		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
> +		OV5_FEAT(OV5_FORM2_AFFINITY),
>   		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
>   		.micro_checkpoint = 0,
>   		.reserved0 = 0,
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index d32729f235b8..5a7d94960fb7 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
>   
>   #define FORM0_AFFINITY 0
>   #define FORM1_AFFINITY 1
> +#define FORM2_AFFINITY 2
>   static int affinity_form;
>   
>   #define MAX_DISTANCE_REF_POINTS 4
>   static int max_associativity_domain_index;
>   static const __be32 *distance_ref_points;
>   static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
> +static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
> +	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
> +};
> +static int numa_id_index_table[MAX_NUMNODES];
>   
>   /*
>    * Allocate node_to_cpumask_map based on number of available nodes
> @@ -166,6 +171,27 @@ static void unmap_cpu_from_node(unsigned long cpu)
>   }
>   #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>   
> +/*
> + * With FORM2 if we are not using logical domain ids, we will find
> + * both primary and seconday domains with same value. Hence always
> + * start comparison from secondary domains
> + */
> +static int __cpu_form2_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> +{
> +	int dist = 0;
> +

extra blank line.

> +	int i, index;
> +
> +	for (i = 1; i < max_associativity_domain_index; i++) {
> +		index = be32_to_cpu(distance_ref_points[i]);
> +		if (cpu1_assoc[index] == cpu2_assoc[index])
> +			break;
> +		dist++;
> +	}
> +
> +	return dist;
> +}
> +
>   static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>   {
>   	int dist = 0;
> @@ -178,7 +204,6 @@ static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>   			break;
>   		dist++;
>   	}
> -
>   	return dist;
>   }
>   
> @@ -186,8 +211,9 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>   {
>   	/* We should not get called with FORM0 */
>   	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
> -
> -	return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
> +	if (affinity_form == FORM1_AFFINITY)
> +		return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
> +	return __cpu_form2_distance(cpu1_assoc, cpu2_assoc);
>   }
>   
>   /* must hold reference to node during call */
> @@ -201,7 +227,9 @@ int __node_distance(int a, int b)
>   	int i;
>   	int distance = LOCAL_DISTANCE;
>   
> -	if (affinity_form == FORM0_AFFINITY)
> +	if (affinity_form == FORM2_AFFINITY)
> +		return numa_distance_table[a][b];
> +	else if (affinity_form == FORM0_AFFINITY)
>   		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
>   
>   	for (i = 0; i < max_associativity_domain_index; i++) {
> @@ -303,15 +331,116 @@ static void initialize_form1_numa_distance(struct device_node *node)
>   
>   /*
>    * Used to update distance information w.r.t newly added node.
> + * ibm,numa-lookup-index -> 4
> + * ibm,numa-distance -> {5, 20, 40, 60, 80, 10 }
>    */
>   void update_numa_distance(struct device_node *node)
>   {
> +	int i, nid, other_nid, other_nid_index = 0;
> +	const __be32 *numa_indexp;
> +	const __u8  *numa_distancep;
> +	int numa_index, max_numa_index, numa_distance;
> +
>   	if (affinity_form == FORM0_AFFINITY)
>   		return;
>   	else if (affinity_form == FORM1_AFFINITY) {
>   		initialize_form1_numa_distance(node);
>   		return;
>   	}
> +	/* FORM2 affinity  */
> +
> +	nid = of_node_to_nid_single(node);
> +	if (nid == NUMA_NO_NODE)
> +		return;
> +
> +	/* Already initialized */
> +	if (numa_distance_table[nid][nid] != -1)
> +		return;
> +	/*
> +	 * update node distance if not already populated.
> +	 */
> +	numa_distancep = of_get_property(node, "ibm,numa-distance", NULL);
> +	if (!numa_distancep)
> +		return;
> +
> +	numa_indexp = of_get_property(node, "ibm,numa-lookup-index", NULL);
> +	if (!numa_indexp)
> +		return;
> +
> +	numa_index = of_read_number(numa_indexp, 1);
> +	/*
> +	 * update the numa_id_index_table. Device tree look at index table as
> +	 * 1 based array indexing.
> +	 */
> +	numa_id_index_table[numa_index - 1] = nid;
> +
> +	max_numa_index = of_read_number((const __be32 *)numa_distancep, 1);
> +	VM_WARN_ON(max_numa_index != 2 * numa_index);

Could you explain shortly in a comment the meaning of this VM_WARN_ON check?

> +	/* Skip the size which is encoded int */
> +	numa_distancep += sizeof(__be32);
> +
> +	/*
> +	 * First fill the distance information from other node to this node.
> +	 */
> +	other_nid_index = 0;
> +	for (i = 0; i < numa_index; i++) {
> +		numa_distance = numa_distancep[i];
> +		other_nid = numa_id_index_table[other_nid_index++];
> +		numa_distance_table[other_nid][nid] = numa_distance;
> +	}
> +
> +	other_nid_index = 0;
> +	for (; i < max_numa_index; i++) {
> +		numa_distance = numa_distancep[i];
> +		other_nid = numa_id_index_table[other_nid_index++];
> +		numa_distance_table[nid][other_nid] = numa_distance;
> +	}
> +}
> +
> +/*
> + * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
> + * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
> + */
> +static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
> +{
> +	const __u8 *numa_dist_table;
> +	const __be32 *numa_lookup_index;
> +	int numa_dist_table_length;
> +	int max_numa_index, distance_index;
> +	int i, curr_row = 0, curr_column = 0;
> +
> +	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
> +	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
> +
> +	/* first element of the array is the size and is encode-int */
> +	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
> +	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
> +	/* Skip the size which is encoded int */
> +	numa_dist_table += sizeof(__be32);
> +
> +	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d \n",
> +		 numa_dist_table_length, max_numa_index);
> +
> +	for (i = 0; i < max_numa_index; i++)
> +		/* +1 skip the max_numa_index in the property */
> +		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
> +
> +
> +	VM_WARN_ON(numa_dist_table_length != max_numa_index * max_numa_index);
> +
> +	for (distance_index = 0; distance_index < numa_dist_table_length; distance_index++) {
> +		int nodeA = numa_id_index_table[curr_row];
> +		int nodeB = numa_id_index_table[curr_column++];
> +
> +		numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index];
> +
> +		pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
> +		if (curr_column >= max_numa_index) {
> +			curr_row++;
> +			/* reset the column */
> +			curr_column = 0;
> +		}
> +	}
>   }
>   
>   static int __init find_primary_domain_index(void)
> @@ -324,6 +453,9 @@ static int __init find_primary_domain_index(void)
>   	 */
>   	if (firmware_has_feature(FW_FEATURE_OPAL)) {
>   		affinity_form = FORM1_AFFINITY;
> +	} else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) {
> +		dbg("Using form 2 affinity\n");
> +		affinity_form = FORM2_AFFINITY;
>   	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
>   		dbg("Using form 1 affinity\n");
>   		affinity_form = FORM1_AFFINITY;
> @@ -368,8 +500,17 @@ static int __init find_primary_domain_index(void)
>   
>   		index = of_read_number(&distance_ref_points[1], 1);
>   	} else {
> +		/*
> +		 * Both FORM1 and FORM2 affinity find the primary domain details
> +		 * at the same offset.
> +		 */
>   		index = of_read_number(distance_ref_points, 1);
>   	}
> +	/*
> +	 * If it is FORM2 also initialize the distance table here.
> +	 */
> +	if (affinity_form == FORM2_AFFINITY)
> +		initialize_form2_numa_distance_lookup_table(root);
>   
>   	/*
>   	 * Warn and cap if the hardware supports more than
> diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
> index 5d4c2bc20bba..f162156b7b68 100644
> --- a/arch/powerpc/platforms/pseries/firmware.c
> +++ b/arch/powerpc/platforms/pseries/firmware.c
> @@ -123,6 +123,7 @@ vec5_fw_features_table[] = {
>   	{FW_FEATURE_PRRN,		OV5_PRRN},
>   	{FW_FEATURE_DRMEM_V2,		OV5_DRMEM_V2},
>   	{FW_FEATURE_DRC_INFO,		OV5_DRC_INFO},
> +	{FW_FEATURE_FORM2_AFFINITY,	OV5_FORM2_AFFINITY},
>   };
>   
>   static void __init fw_vec5_feature_init(const char *vec5, unsigned long len)
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity
  2021-06-24 10:33   ` Laurent Dufour
@ 2021-06-24 10:55     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 20+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-24 10:55 UTC (permalink / raw)
  To: Laurent Dufour, linuxppc-dev, mpe
  Cc: Nathan Lynch, nvdimm, Daniel Henrique Barboza, dan.j.williams,
	David Gibson

On 6/24/21 4:03 PM, Laurent Dufour wrote:
> Hi Aneesh,
> 
> A little bit of wordsmithing below...
> 
> Le 17/06/2021 à 18:51, Aneesh Kumar K.V a écrit :
>> PAPR interface currently supports two different ways of communicating 
>> resource
>> grouping details to the OS. These are referred to as Form 0 and Form 1
>> associativity grouping. Form 0 is the older format and is now considered
>> deprecated. This patch adds another resource grouping named FORM2.
>>
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   Documentation/powerpc/associativity.rst   | 135 ++++++++++++++++++++
>>   arch/powerpc/include/asm/firmware.h       |   3 +-
>>   arch/powerpc/include/asm/prom.h           |   1 +
>>   arch/powerpc/kernel/prom_init.c           |   3 +-
>>   arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
>>   arch/powerpc/platforms/pseries/firmware.c |   1 +
>>   6 files changed, 286 insertions(+), 6 deletions(-)
>>   create mode 100644 Documentation/powerpc/associativity.rst
>>
>> diff --git a/Documentation/powerpc/associativity.rst 
>> b/Documentation/powerpc/associativity.rst
>> new file mode 100644
>> index 000000000000..93be604ac54d
>> --- /dev/null
>> +++ b/Documentation/powerpc/associativity.rst
>> @@ -0,0 +1,135 @@
>> +============================
>> +NUMA resource associativity
>> +=============================
>> +
>> +Associativity represents the groupings of the various platform 
>> resources into
>> +domains of substantially similar mean performance relative to 
>> resources outside
>> +of that domain. Resources subsets of a given domain that exhibit better
>> +performance relative to each other than relative to other resources 
>> subsets
>> +are represented as being members of a sub-grouping domain. This 
>> performance
>> +characteristic is presented in terms of NUMA node distance within the 
>> Linux kernel.
>> +From the platform view, these groups are also referred to as domains.
>> +
>> +PAPR interface currently supports different ways of communicating 
>> these resource
>> +grouping details to the OS. These are referred to as Form 0, Form 1 
>> and Form2
>> +associativity grouping. Form 0 is the older format and is now 
>> considered deprecated.
>> +
>> +Hypervisor indicates the type/form of associativity used via 
>> "ibm,arcitecture-vec-5 property".
>                                                             architecture ^
> 

fixed

>> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates 
>> usage of Form 0 or Form 1.
>> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 
>> associativity
>> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>> +
>> +Form 0
>> +-----
>> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
>> +
>> +Form 1
>> +-----
>> +With Form 1 a combination of ibm,associativity-reference-points and 
>> ibm,associativity
>> +device tree properties are used to determine the NUMA distance 
>> between resource groups/domains.
>> +
>> +The “ibm,associativity” property contains one or more lists of 
>> numbers (domainID)
>> +representing the resource’s platform grouping domains.
>> +
>> +The “ibm,associativity-reference-points” property contains one or 
>> more list of numbers
>> +(domainID index) that represents the 1 based ordinal in the 
>> associativity lists.
>> +The list of domainID index represnets increasing hierachy of resource 
>> grouping.
>                          represents ^
> 

fixed

>> +
>> +ex:
>> +{ primary domainID index, secondary domainID index, tertiary domainID 
>> index.. }
>> +
>> +Linux kernel uses the domainID at the primary domainID index as the 
>> NUMA node id.
>> +Linux kernel computes NUMA distance between two domains by 
>> recursively comparing
>> +if they belong to the same higher-level domains. For mismatch at 
>> every higher
>> +level of the resource group, the kernel doubles the NUMA distance 
>> between the
>> +comparing domains.
>> +
>> +Form 2
>> +-------
>> +Form 2 associativity format adds separate device tree properties 
>> representing NUMA node distance
>> +thereby making the node distance computation flexible. Form 2 also 
>> allows flexible primary
>> +domain numbering. With numa distance computation now detached from 
>> the index value of
>> +"ibm,associativity" property, Form 2 allows a large number of primary 
>> domain ids at the
>> +same domainID index representing resource groups of different 
>> performance/latency characteristics.
>> +
>> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of 
>> byte 5 in the
>> +"ibm,architecture-vec-5" property.
>> +
>> +"ibm,numa-lookup-index-table" property contains one or more list 
>> numbers representing
>> +the domainIDs present in the system. The offset of the domainID in 
>> this property is considered
>> +the domainID index.
>> +
>> +prop-encoded-array: The number N of the domainIDs encoded as with 
>> encode-int, followed by
>> +N domainID encoded as with encode-int
>> +
>> +For ex:
>> +ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index 
>> for domainID 8 is 1.
>> +
>> +"ibm,numa-distance-table" property contains one or more list of 
>> numbers representing the NUMA
>> +distance between resource groups/domains present in the system.
>> +
>> +prop-encoded-array: The number N of the distance values encoded as 
>> with encode-int, followed by
>> +N distance values encoded as with encode-bytes. The max distance 
>> value we could encode is 255.
>> +
>> +For ex:
>> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
>> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
>> +
>> +  | 0    8   40
>> +--|------------
>> +  |
>> +0 | 10   20  80
>> +  |
>> +8 | 20   10  160
>> +  |
>> +40| 80   160  10
>> +
>> +
>> +"ibm,associativity" property for resources in node 0, 8 and 40
>> +
>> +{ 3, 6, 7, 0 }
>> +{ 3, 6, 9, 8 }
>> +{ 3, 6, 7, 40}
>> +
>> +With "ibm,associativity-reference-points"  { 0x3 }
>> +
>> +Each resource (drcIndex) now also supports additional optional device 
>> tree properties.
>> +These properties are marked optional because the platform can choose 
>> not to export
>> +them and provide the system topology details using the earlier 
>> defined device tree
>> +properties alone. The optional device tree properties are used when 
>> adding new resources
>> +(DLPAR) and when the platform didn't provide the topology details of 
>> the domain which
>> +contains the newly added resource during boot.
>> +
>> +"ibm,numa-lookup-index" property contains a number representing the 
>> domainID index to be used
>> +when building the NUMA distance of the numa node to which this 
>> resource belongs. This can
>> +be looked at as the index at which this new domainID would have 
>> appeared in
>> +"ibm,numa-lookup-index-table" if the domain was present during boot. 
>> The domainID
>> +of the new resource can be obtained from the existing 
>> "ibm,associativity" property. This
>> +can be used to build distance information of a newly onlined NUMA 
>> node via DLPAR operation.
>> +The value is 1 based array index value.
>> +
>> +prop-encoded-array: An integer encoded as with encode-int specifying 
>> the domainID index
>> +
>> +"ibm,numa-distance" property contains one or more list of numbers 
>> presenting the NUMA distance
>> +from this resource domain to other resources.
>> +
>> +prop-encoded-array: The number N of the distance values encoded as 
>> with encode-int, followed by
>> +N distance values encoded as with encode-bytes. The max distance 
>> value we could encode is 255.
>> +
>> +For ex:
>> +ibm,associativity     = { 4, 5, 10, 50}
> 
> Is missing the first byte of the property (length) or an associativity 
> number?
> 

That should be {3, 5, 10, 50}; fixed.

>> +ibm,numa-lookup-index = { 4 }
>> +ibm,numa-distance   =  {8, 160, 255, 80, 10, 160, 255, 80, 10}
>> +
>> +resulting in a new topology as below.
>> +  | 0    8   40   50
>> +--|------------------
>> +  |
>> +0 | 10   20  80   160
>> +  |
>> +8 | 20   10  160  255
>> +  |
>> +40| 80   160  10  80
>> +  |
>> +50| 160  255  80  10
>> +
>> diff --git a/arch/powerpc/include/asm/firmware.h 
>> b/arch/powerpc/include/asm/firmware.h
>> index 60b631161360..97a3bd9ffeb9 100644
>> --- a/arch/powerpc/include/asm/firmware.h
>> +++ b/arch/powerpc/include/asm/firmware.h
>

...

>> +    numa_distancep = of_get_property(node, "ibm,numa-distance", NULL);
>> +    if (!numa_distancep)
>> +        return;
>> +
>> +    numa_indexp = of_get_property(node, "ibm,numa-lookup-index", NULL);
>> +    if (!numa_indexp)
>> +        return;
>> +
>> +    numa_index = of_read_number(numa_indexp, 1);
>> +    /*
>> +     * update the numa_id_index_table. Device tree look at index 
>> table as
>> +     * 1 based array indexing.
>> +     */
>> +    numa_id_index_table[numa_index - 1] = nid;
>> +
>> +    max_numa_index = of_read_number((const __be32 *)numa_distancep, 1);
>> +    VM_WARN_ON(max_numa_index != 2 * numa_index);
> 
> Could you explain shortly in a comment the meaning of this VM_WARN_ON 
> check?
> 

Based on the other review feedback this is dropped. We now derive the domain
distance offset from the number of elements in "ibm,numa-distance".
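
A minimal sketch of that derivation (illustrative only; the variable names are
the ones used in the reworked snippet later in this thread):

	/*
	 * "ibm,numa-distance" = { N, d0 ... dN-1 }; N is twice the new
	 * total number of domains, so the new domain's distance offset
	 * is simply N / 2.
	 */
	max_dist_element = of_read_number((const __be32 *)numa_distancep, 1);
	numa_id_index = max_dist_element >> 1;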

>> +    /* Skip the size which is encoded int */
>> +    numa_distancep += sizeof(__be32);
>> +
>> +    /*
>> +     * First fill the distance information from other node to this node.
>> +     */
>> +    other_nid_index = 0;
>> +    for (i = 0; i < numa_index; i++) {
>> +        numa_distance = numa_distancep[i];
>> +        other_nid = numa_id_index_table[other_nid_index++];
>> +        numa_distance_table[other_nid][nid] = numa_distance;
>> +    }
>> +
>> +    other_nid_index = 0;
>> +    for (; i < max_numa_index; i++) {
>> +        numa_distance = numa_distancep[i];
>> +        other_nid = numa_id_index_table[other_nid_index++];
>> +        numa_distance_table[nid][other_nid] = numa_distance;
>> +    }
>> +}
>> +

Thanks for reviewing the patch.

-aneesh

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity
  2021-06-24  8:20     ` Aneesh Kumar K.V
@ 2021-06-28  6:18       ` David Gibson
  2021-06-28 15:04         ` Aneesh Kumar K.V
  0 siblings, 1 reply; 20+ messages in thread
From: David Gibson @ 2021-06-28  6:18 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linuxppc-dev, mpe, Nathan Lynch, Daniel Henrique Barboza, nvdimm,
	dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 32246 bytes --]

On Thu, Jun 24, 2021 at 01:50:34PM +0530, Aneesh Kumar K.V wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
> 
> > On Thu, Jun 17, 2021 at 10:21:05PM +0530, Aneesh Kumar K.V wrote:
> >> PAPR interface currently supports two different ways of communicating resource
> >> grouping details to the OS. These are referred to as Form 0 and Form 1
> >> associativity grouping. Form 0 is the older format and is now considered
> >> deprecated. This patch adds another resource grouping named FORM2.
> >> 
> >> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> >> ---
> >>  Documentation/powerpc/associativity.rst   | 135 ++++++++++++++++++++
> >>  arch/powerpc/include/asm/firmware.h       |   3 +-
> >>  arch/powerpc/include/asm/prom.h           |   1 +
> >>  arch/powerpc/kernel/prom_init.c           |   3 +-
> >>  arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
> >>  arch/powerpc/platforms/pseries/firmware.c |   1 +
> >>  6 files changed, 286 insertions(+), 6 deletions(-)
> >>  create mode 100644 Documentation/powerpc/associativity.rst
> >> 
> >> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
> >> new file mode 100644
> >> index 000000000000..93be604ac54d
> >> --- /dev/null
> >> +++ b/Documentation/powerpc/associativity.rst
> >> @@ -0,0 +1,135 @@
> >> +============================
> >> +NUMA resource associativity
> >> +=============================
> >> +
> >> +Associativity represents the groupings of the various platform resources into
> >> +domains of substantially similar mean performance relative to resources outside
> >> +of that domain. Resources subsets of a given domain that exhibit better
> >> +performance relative to each other than relative to other resources subsets
> >> +are represented as being members of a sub-grouping domain. This performance
> >> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
> >> +From the platform view, these groups are also referred to as domains.
> >> +
> >> +PAPR interface currently supports different ways of communicating these resource
> >> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
> >> +associativity grouping. Form 0 is the older format and is now considered deprecated.
> >> +
> >> +Hypervisor indicates the type/form of associativity used via "ibm,arcitecture-vec-5 property".
> >> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
> >> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
> >> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
> >> +
> >> +Form 0
> >> +-----
> >> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
> >> +
> >> +Form 1
> >> +-----
> >> +With Form 1 a combination of ibm,associativity-reference-points and ibm,associativity
> >> +device tree properties are used to determine the NUMA distance between resource groups/domains.
> >> +
> >> +The “ibm,associativity” property contains one or more lists of numbers (domainID)
> >> +representing the resource’s platform grouping domains.
> >> +
> >> +The “ibm,associativity-reference-points” property contains one or more list of numbers
> >> +(domainID index) that represents the 1 based ordinal in the associativity lists.
> >> +The list of domainID index represnets increasing hierachy of
> >> resource grouping.
> >
> > Typo "represnets".  Also s/hierachy/hierarchy/
> >
> >> +
> >> +ex:
> >> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
> >
> >> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
> >> +Linux kernel computes NUMA distance between two domains by recursively comparing
> >> +if they belong to the same higher-level domains. For mismatch at every higher
> >> +level of the resource group, the kernel doubles the NUMA distance between the
> >> +comparing domains.
> >
> > The Form1 description is still kinda confusing, but I don't really
> > care.  Form1 *is* confusing, it's Form2 that I hope will be clearer.
> >
> >> +
> >> +Form 2
> >> +-------
> >> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
> >> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
> >> +domain numbering. With numa distance computation now detached from the index value of
> >> +"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
> >> +same domainID index representing resource groups of different performance/latency characteristics.
> >
> > So, see you've removed the special handling of secondary IDs for pmem
> > - big improvement, thanks.  IIUC, in this revised version, for Form2
> > there's really no reason for ibm,associativity-reference-points to
> > have >1 entry.  Is that right?
> >
> > In Form2 everything revolves around the primary domain ID - so much
> > that I suggest we come up with a short name for it.  How about just
> > "node id" since that's how Linux uses it.
> 
> We don't really refer to the primary domain ID in the rest of the FORM2
> details. We do refer to the entries as domainID.

Right, which is unnecessarily confusing, because primary domain ID is
going to be the only one that matters in practice.

> Is renaming domainID to NUMA
> node id better?
> 
> >
> >> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
> >> +"ibm,architecture-vec-5" property.
> >> +
> >> +"ibm,numa-lookup-index-table" property contains one or more list numbers representing
> >> +the domainIDs present in the system. The offset of the domainID in this property is considered
> >> +the domainID index.
> >
> > The lookup-index-table is all about compactifying the representation
> > of the distance matrix.  Because node ids are sparse, the matrix of
> > distances is also effectively sparse.  You don't want to have a huge
> > matrix in the DT with mostly meaningless entries, so you use the
> > lookup table to compact the node ids.
> >
> > It's a bit confusing though, because you now have two representations
> > of exactly the same information the "full" node id (== primary
> > domainID) and the "short" node id (==domainID lookup table offset).
> >
> > I can see three ways to clarify this:
> >
> >   A) Give the short node id an explicit name rather than difficult to
> >      parse "domainID index" or "domainID offset" which gets easily
> >      muddled with other uses of index and offset.  Use that name
> >      throughout.
> 
> I dropped the domainID index and started referring to that as domain
> distance offset.

Better, since it's not ambiguous.  Still kinda long and awkward.

> >   B) Eliminate the short ID entirely, and use an explicit sparse
> >      matrix representation for the distance table e.g. a list of
> >      (nodeA, nodeB, distance) tuples
> >
> >      That does have the disadvantage that it's not structurally
> >      guaranteed that you have entries for every pair of "active" node
> >      ids.
> 
> Other hypervisor would want to have a large possible domainID value and
> mostly want to publish the full topology details during boot. Using the
> above format makes a 16 node config have large "ibm,numa-distance-table"
> property.

It'd be about 3kiB, yes?  That's fairly large, but not ludicrously
large, so does that matter?
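
For reference, the arithmetic behind that estimate (assuming three encode-int
cells per (nodeA, nodeB, distance) tuple and entries for all 16 x 16 pairs):
16 * 16 = 256 tuples, and 256 * 3 * 4 bytes = 3072 bytes, i.e. roughly 3 KiB.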

> >   C) Eliminate the "long ID" entirely.  In the Form2 world, is there
> >      actually any point allowing sparse node ids?  In this case you'd
> >      have the node id read from associativity and use that directly to
> >      index the distance table
> 
> Yes, other hypervisor would like to partition the domain ID space for
> different resources.

Is that really a hard requirement on the dt format though?  The
hypervisor could always have its own internal lookup table between
long and short forms.

> >> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
> >> +N domainID encoded as with encode-int
> >> +
> >> +For ex:
> >> +ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index
> >> for domainID 8 is 1.
> >
> > As noted "domainID index" is confusing here, because it's different
> > from the "domainID index" values in reference-points.
> >
> >> +
> >> +"ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
> >> +distance between resource groups/domains present in the system.
> >> +
> >> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> >> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
> >
> > The N here always has to be the square of the N in the
> > lookup-index-table, yes?  In which case do we actually need to include
> > it?
> 
> Most PAPR device tree properties follow the pattern {N, .... N elements}.
> This follows the same pattern.

Many do, yes, but in many of those cases reading the N is actually
useful.  It's impossible to interpret this table if you don't already
know the number of nodes and if the length of the distance doesn't
match that, there's no reasonable fallback interpretation.

> >> +For ex:
> >> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
> >> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
> >> +
> >> +  | 0    8   40
> >> +--|------------
> >> +  |
> >> +0 | 10   20  80
> >> +  |
> >> +8 | 20   10  160
> >> +  |
> >> +40| 80   160  10
> >> +
> >> +
> >> +"ibm,associativity" property for resources in node 0, 8 and 40
> >> +
> >> +{ 3, 6, 7, 0 }
> >> +{ 3, 6, 9, 8 }
> >> +{ 3, 6, 7, 40}
> >> +
> >> +With "ibm,associativity-reference-points"  { 0x3 }
> >> +
> >> +Each resource (drcIndex) now also supports additional optional device tree properties.
> >> +These properties are marked optional because the platform can choose not to export
> >> +them and provide the system topology details using the earlier defined device tree
> >> +properties alone. The optional device tree properties are used when adding new resources
> >> +(DLPAR) and when the platform didn't provide the topology details of the domain which
> >> +contains the newly added resource during boot.
> >
> > In what circumstances would you envisage a hot-add creating a new NUMA
> > node, as opposed to adding resources to an existing (possibly empty)
> > node?  Do you need any provision for removing NUMA nodes again?
> 
> Both Qemu and PowerVM don't intend to use this at this point. Both will
> provide the full possible NUMA node details during boot. But I was not
> sure whether we really need to restrict the possibility of a new
> resource being added. This can be true in the case of new memory devices
> that get hotplugged in, where the latency of the device is such that we
> never expressed it in the distance table during boot.

Hm, ok.  Can we just leave it out of the spec for now and add it in
when/if a hypervisor needs it?

> >> +"ibm,numa-lookup-index" property contains a number representing the domainID index to be used
> >> +when building the NUMA distance of the numa node to which this resource belongs. This can
> >> +be looked at as the index at which this new domainID would have appeared in
> >> +"ibm,numa-lookup-index-table" if the domain was present during boot. The domainID
> >
> > Again, confusing use of "index" where it's used both for offsets in
> > ibm,associativity properties and for offsets in ibm,numa-lookup-index-table
> 
> I guess we can drop the ibm,numa-lookup-index and state that the
> number of distance elements in "ibm,numa-distance" suggests the index at
> which this NUMA node should appear for computing the distance details.

I don't really follow what you're saying there.

> >> +of the new resource can be obtained from the existing "ibm,associativity" property. This
> >> +can be used to build distance information of a newly onlined NUMA node via DLPAR operation.
> >> +The value is 1 based array index value.
> >> +
> >> +prop-encoded-array: An integer encoded as with encode-int specifying the domainID index
> >> +
> >> +"ibm,numa-distance" property contains one or more list of numbers presenting the NUMA distance
> >> +from this resource domain to other resources.
> >> +
> >> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> >> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
> >
> > Again, does N have to equal 2 * (existing number of nodes + 1)?  If so
> > you should say so, if not you should specify how incomplete
> > information is interpreted.
> >
> >> +For ex:
> >> +ibm,associativity     = { 4, 5, 10, 50}
> >> +ibm,numa-lookup-index = { 4 }
> >> +ibm,numa-distance   =  {8, 160, 255, 80, 10, 160, 255, 80, 10}
> >> +
> >> +resulting in a new topology as below.
> >> +  | 0    8   40   50
> >> +--|------------------
> >> +  |
> >> +0 | 10   20  80   160
> >> +  |
> >> +8 | 20   10  160  255
> >> +  |
> >> +40| 80   160  10  80
> >> +  |
> >> +50| 160  255  80  10
> >> +
> 
> Here is the relevant updated part of the doc.
> 
> Form 2 associativity format adds separate device tree properties representing NUMA node distance
> thereby making the node distance computation flexible. Form 2 also allows flexible primary
> domain numbering. With numa distance computation now detached from the index value in
> "ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
> ids at the same domainID index representing resource groups of different performance/latency
> characteristics.
> 
> Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
> "ibm,architecture-vec-5" property.
> 
> "ibm,numa-lookup-index-table" property contains one or more list numbers representing

s/one or more list numbers/a list of one or more numbers/

> the domainIDs present in the system. The offset of the domainID in this property is
> used as an index while computing numa distance information via "ibm,numa-distance-table".
> 
> prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
> N domainID encoded as with encode-int
> 
> For ex:
> "ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. Offset of domainID 8 (2) is used when
> computing the distance of domain 8 from other domains present in the system. For the rest
> of this document this offset will be referred to as domain distance offset.
> 
> "ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
> distance between resource groups/domains present in the system.
> 
> prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> N distance values encoded as with encode-bytes. The max distance value we could encode is 255.

If you insist on retaining the N here, at least explicitly state that
it must be equal to m^2, where m is the number of nodes in numa-lookup-index-table.

> For ex:
> ibm,numa-lookup-index-table =  {3, 0, 8, 40}
> ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
> 
>   | 0    8   40
> --|------------
>   |
> 0 | 10   20  80
>   |
> 8 | 20   10  160
>   |
> 40| 80   160  10
> 
> A possible "ibm,associativity" property for resources in node 0, 8 and 40
> 
> { 3, 6, 7, 0 }
> { 3, 6, 9, 8 }
> { 3, 6, 7, 40}
> 
> With "ibm,associativity-reference-points"  { 0x3 }
> 
> "ibm,lookup-index-table" helps in having a compact representation of distance matrix.
> Since domainID can be sparse, the matrix of distances can also be effectively sparse.
> With "ibm,lookup-index-table" we are able to achieve a compact representation of
> distance information.
> 
> Each resource (drcIndex) now also supports additional optional device tree properties.
> These properties are marked optional because the platform can choose not to export
> them and provide the system topology details using the earlier defined device tree
> properties alone. The optional device tree properties are used when adding new resources
> (DLPAR) and when the platform didn't provide the topology details of the domain which
> contains the newly added resource during boot.
> 
> "ibm,numa-distance" property contains one or more list of numbers presenting the NUMA distance
> from this resource domain to other resources.
> 
> prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> N distance values encoded as with encode-bytes. The max distance value we could encode is 255.

> If the system currently has 3 domains and a DLPAR operation is adding one additional
> domain, "ibm,numa-distance" property should have 2 * (3 + 1) = 8 elements as shown below.
> In other words the domain distance offset value for the newly added domain is number of
> distance value elements in the "ibm,numa-distance" property divided by 2.

You need to spell out the distinction between the two halves of the
property - which one adds a row to numa-distance-table, and which one
adds a column.
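
As a worked illustration of that split, using the node 50 example quoted above
and the parsing order of the v4 code (first half fills the new node's column,
second half its row; the exact wording is of course up to the spec):

ibm,numa-distance = {8, 160, 255, 80, 10, 160, 255, 80, 10}

	first half:  table[0][50] = 160, table[8][50] = 255,
		     table[40][50] = 80, table[50][50] = 10   (new column)
	second half: table[50][0] = 160, table[50][8] = 255,
		     table[50][40] = 80, table[50][50] = 10   (new row)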

> >> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
> >> index 60b631161360..97a3bd9ffeb9 100644
> >> --- a/arch/powerpc/include/asm/firmware.h
> >> +++ b/arch/powerpc/include/asm/firmware.h
> >> @@ -53,6 +53,7 @@
> >>  #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
> >>  #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
> >>  #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
> >> +#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
> >>  
> >>  #ifndef __ASSEMBLY__
> >>  
> >> @@ -73,7 +74,7 @@ enum {
> >>  		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
> >>  		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
> >>  		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
> >> -		FW_FEATURE_RPT_INVALIDATE,
> >> +		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
> >>  	FW_FEATURE_PSERIES_ALWAYS = 0,
> >>  	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
> >>  	FW_FEATURE_POWERNV_ALWAYS = 0,
> >> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
> >> index df9fec9d232c..5c80152e8f18 100644
> >> --- a/arch/powerpc/include/asm/prom.h
> >> +++ b/arch/powerpc/include/asm/prom.h
> >> @@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
> >>  #define OV5_XCMO		0x0440	/* Page Coalescing */
> >>  #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
> >>  #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
> >> +#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
> >>  #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
> >>  #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
> >>  #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
> >> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
> >> index 64b9593038a7..496fdac54c29 100644
> >> --- a/arch/powerpc/kernel/prom_init.c
> >> +++ b/arch/powerpc/kernel/prom_init.c
> >> @@ -1070,7 +1070,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
> >>  #else
> >>  		0,
> >>  #endif
> >> -		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
> >> +		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
> >> +		OV5_FEAT(OV5_FORM2_AFFINITY),
> >>  		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
> >>  		.micro_checkpoint = 0,
> >>  		.reserved0 = 0,
> >> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> >> index d32729f235b8..5a7d94960fb7 100644
> >> --- a/arch/powerpc/mm/numa.c
> >> +++ b/arch/powerpc/mm/numa.c
> >> @@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
> >>  
> >>  #define FORM0_AFFINITY 0
> >>  #define FORM1_AFFINITY 1
> >> +#define FORM2_AFFINITY 2
> >>  static int affinity_form;
> >>  
> >>  #define MAX_DISTANCE_REF_POINTS 4
> >>  static int max_associativity_domain_index;
> >>  static const __be32 *distance_ref_points;
> >>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
> >> +static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
> >> +	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
> >> +};
> >> +static int numa_id_index_table[MAX_NUMNODES];
> >>  
> >>  /*
> >>   * Allocate node_to_cpumask_map based on number of available nodes
> >> @@ -166,6 +171,27 @@ static void unmap_cpu_from_node(unsigned long cpu)
> >>  }
> >>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
> >>  
> >> +/*
> >> + * With FORM2 if we are not using logical domain ids, we will find
> >> + * both primary and secondary domains with same value. Hence always
> >> + * start comparison from secondary domains
> >
> >
> > IIUC, in this new draft, secondary domains are no longer a meaningful
> > thing in Form2, so this comment and code seem outdated.
> 
> This needed fixup. With FORM2 our associativity array can be just one
> element. I fixed up as below.
> 
> static int __cpu_form2_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> {
> 	int node1, node2;
> 	int dist = 0;
> 
> 	node1 = associativity_to_nid(cpu1_assoc);
> 	node2 = associativity_to_nid(cpu2_assoc);
> 
> 	dist = numa_distance_table[node1][node2];
> 	if (dist == LOCAL_DISTANCE)
> 		return 0;
> 	else if (dist == REMOTE_DISTANCE)
> 		return 1;
> 	else
> 		return 2;
> }
> 
> also renamed cpu_distance to cpu_relative_distance.
> 
> 
> >> + */
> >> +static int __cpu_form2_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> >> +{
> >> +	int dist = 0;
> >> +
> >> +	int i, index;
> >> +
> >> +	for (i = 1; i < max_associativity_domain_index; i++) {
> >> +		index = be32_to_cpu(distance_ref_points[i]);
> >> +		if (cpu1_assoc[index] == cpu2_assoc[index])
> >> +			break;
> >> +		dist++;
> >> +	}
> >> +
> >> +	return dist;
> >> +}
> >> +
> >>  static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> >>  {
> >>  	int dist = 0;
> >> @@ -178,7 +204,6 @@ static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> >>  			break;
> >>  		dist++;
> >>  	}
> >> -
> >>  	return dist;
> >>  }
> >>  
> >> @@ -186,8 +211,9 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> >>  {
> >>  	/* We should not get called with FORM0 */
> >>  	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
> >> -
> >> -	return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
> >> +	if (affinity_form == FORM1_AFFINITY)
> >> +		return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
> >> +	return __cpu_form2_distance(cpu1_assoc, cpu2_assoc);
> >>  }
> >>  
> >>  /* must hold reference to node during call */
> >> @@ -201,7 +227,9 @@ int __node_distance(int a, int b)
> >>  	int i;
> >>  	int distance = LOCAL_DISTANCE;
> >>  
> >> -	if (affinity_form == FORM0_AFFINITY)
> >> +	if (affinity_form == FORM2_AFFINITY)
> >> +		return numa_distance_table[a][b];
> >> +	else if (affinity_form == FORM0_AFFINITY)
> >>  		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
> >>  
> >>  	for (i = 0; i < max_associativity_domain_index; i++) {
> >> @@ -303,15 +331,116 @@ static void initialize_form1_numa_distance(struct device_node *node)
> >>  
> >>  /*
> >>   * Used to update distance information w.r.t newly added node.
> >> + * ibm,numa-lookup-index -> 4
> >> + * ibm,numa-distance -> {5, 20, 40, 60, 80, 10 }
> >>   */
> >>  void update_numa_distance(struct device_node *node)
> >>  {
> >> +	int i, nid, other_nid, other_nid_index = 0;
> >> +	const __be32 *numa_indexp;
> >> +	const __u8  *numa_distancep;
> >> +	int numa_index, max_numa_index, numa_distance;
> >> +
> >>  	if (affinity_form == FORM0_AFFINITY)
> >>  		return;
> >>  	else if (affinity_form == FORM1_AFFINITY) {
> >>  		initialize_form1_numa_distance(node);
> >>  		return;
> >>  	}
> >> +	/* FORM2 affinity  */
> >> +
> >> +	nid = of_node_to_nid_single(node);
> >> +	if (nid == NUMA_NO_NODE)
> >> +		return;
> >> +
> >> +	/* Already initialized */
> >> +	if (numa_distance_table[nid][nid] != -1)
> >
> > IIUC, this should exactly correspond to the case where the new
> > resource lies in an existing NUMA node, yes?
> 
> yes.
> 
> >
> >> +		return;
> >> +	/*
> >> +	 * update node distance if not already populated.
> >> +	 */
> >> +	numa_distancep = of_get_property(node, "ibm,numa-distance", NULL);
> >> +	if (!numa_distancep)
> >> +		return;
> >> +
> >> +	numa_indexp = of_get_property(node, "ibm,numa-lookup-index", NULL);
> >> +	if (!numa_indexp)
> >> +		return;
> >> +
> >> +	numa_index = of_read_number(numa_indexp, 1);
> >> +	/*
> >> +	 * update the numa_id_index_table. Device tree look at index table as
> >> +	 * 1 based array indexing.
> >> +	 */
> >> +	numa_id_index_table[numa_index - 1] = nid;
> >> +
> >> +	max_numa_index = of_read_number((const __be32 *)numa_distancep, 1);
> >> +	VM_WARN_ON(max_numa_index != 2 * numa_index);
> >
> > You WARN_ON(), but you don't actually bail out to avoid indexing past
> > the end of the property below.
> >
> > AFAICT none of this really works unless numa_index == (previous number
> > of numa nodes + 1), which makes all the use of these different
> > variables kind of confusing.  If you had a variable that you just set
> > equal to the new number of numa nodes (previous number plus the one
> > being added), then I think you can replace all numa_index and
> > max_numa_index references with that and it will be clearer.
> 
> I rewrote that as below.
> 
> 	numa_distancep = of_get_property(node, "ibm,numa-distance", NULL);
> 	if (!numa_distancep)
> 		return;
> 
> 	max_dist_element = of_read_number((const __be32 *)numa_distancep, 1);
> 
> 	numa_id_index = max_dist_element >> 1;

You already know the existing number of NUMA nodes, so it seems
clearer to me to validate the new property against that, rather than
trying to parse it out of the new property.
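
Roughly something like this (a sketch only; nr_known_domains is a placeholder
for however the kernel tracks the number of domains already seen):

	max_dist_element = of_read_number((const __be32 *)numa_distancep, 1);
	/* two distances per domain: to the new node and from it */
	if (max_dist_element != 2 * (nr_known_domains + 1))
		return;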

> 	/*
> 	 * update the numa_id_index_table. Device tree look at index table as
> 	 * 1 based array indexing.
> 	 */
> 	if (numa_id_index_table[numa_id_index - 1] != -1) {
> 		WARN(1, "Wrong NUMA distance information\n");
> 		return;
> 	}

I believe
	if (WARN_ON(...))
		return;

is idiomatic.

> 	numa_id_index_table[numa_id_index - 1] = nid;
> 
> 	/* Skip the size which is encoded int */
> 	numa_distancep += sizeof(__be32);

Is numa_distancep a u32 pointer, or a char pointer?  If it's a u32
pointer then adding sizeof(__be32) rather than 1 is incorrect here.
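
To make the pointer-arithmetic point concrete (an illustrative snippet, not
the patch itself):

	const void   *prop = of_get_property(node, "ibm,numa-distance", NULL);
	const __u8   *p8   = prop;	/* byte-wise pointer */
	const __be32 *p32  = prop;	/* cell-wise pointer */

	p8  += sizeof(__be32);		/* advances 4 bytes: skips the encode-int count */
	p32 += sizeof(__be32);		/* advances 4 cells, i.e. 16 bytes */
	p32 += 1;			/* advances one cell, i.e. 4 bytes */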

> 	/*
> 	 * First fill the distance information from other node to this node.
> 	 */
> 	distance_index = 0;
> 	for (i = 0; i < numa_id_index; i++) {
> 		numa_distance = numa_distancep[distance_index++];

.. but if it's a char pointer then this looks wrong.

> 		other_nid = numa_id_index_table[i];
> 		if (other_nid == NUMA_NO_NODE)
> 			continue;
> 		numa_distance_table[other_nid][nid] = numa_distance;
> 	}
> 
> 	for (i = 0; i < numa_id_index; i++) {
> 		numa_distance = numa_distancep[distance_index++];

I'd just compute the offset from i here, rather than having another
variable to track.
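
i.e. something like (sketch):

	numa_distance = numa_distancep[numa_id_index + i];

for the second loop, with the first loop simply using numa_distancep[i].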

> 		other_nid = numa_id_index_table[i];
> 		if (other_nid == NUMA_NO_NODE)
> 			continue;
> 		numa_distance_table[nid][other_nid] = numa_distance;
> 	}
> }
> 
> 
> >
> >
> >> +	/* Skip the size which is encoded int */
> >> +	numa_distancep += sizeof(__be32);
> >> +
> >> +	/*
> >> +	 * First fill the distance information from other node to this node.
> >> +	 */
> >> +	other_nid_index = 0;
> >> +	for (i = 0; i < numa_index; i++) {
> >> +		numa_distance = numa_distancep[i];
> >> +		other_nid = numa_id_index_table[other_nid_index++];
> >> +		numa_distance_table[other_nid][nid] = numa_distance;
> >> +	}
> >> +
> >> +	other_nid_index = 0;
> >> +	for (; i < max_numa_index; i++) {
> >
> > Again, splitting this loop and carrying i over seems a confusing way
> > to code this.  It's basically two loops of N, one writing a row of the
> > distance matrix, one writing a column (other_nid will even go through
> > the same values in each loop).
> 
> Fixed
> 
> >
> >> +		numa_distance = numa_distancep[i];
> >> +		other_nid = numa_id_index_table[other_nid_index++];
> >> +		numa_distance_table[nid][other_nid] = numa_distance;
> >> +	}
> >> +}
> >> +
> >> +/*
> >> + * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
> >> + * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
> >> + */
> >> +static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
> >> +{
> >> +	const __u8 *numa_dist_table;
> >> +	const __be32 *numa_lookup_index;
> >> +	int numa_dist_table_length;
> >> +	int max_numa_index, distance_index;
> >> +	int i, curr_row = 0, curr_column = 0;
> >> +
> >> +	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
> >> +	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
> >
> > max_numa_index here has a different meaning to max_numa_index in the
> > previous function, which is pointlessly confusing.
> >
> >> +	/* first element of the array is the size and is encode-int */
> >> +	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
> >> +	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
> >> +	/* Skip the size which is encoded int */
> >> +	numa_dist_table += sizeof(__be32);
> >> +
> >> +	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d \n",
> >> +		 numa_dist_table_length, max_numa_index);
> >> +
> >> +	for (i = 0; i < max_numa_index; i++)
> >> +		/* +1 skip the max_numa_index in the property */
> >> +		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
> >> +
> >> +
> >> +	VM_WARN_ON(numa_dist_table_length != max_numa_index * max_numa_index);
> >
> > Again, you don't actually bail out in this case.  And if it has to
> > have this value, what's the point of encoding it into the property.
> >
> >> +	for (distance_index = 0; distance_index < numa_dist_table_length; distance_index++) {
> >> +		int nodeA = numa_id_index_table[curr_row];
> >> +		int nodeB = numa_id_index_table[curr_column++];
> >
> > You've already (sort of) verified that the distance table has size
> > N^2, in which case you can just to a simple two dimensional loop
> > rather than having to do ugly calculations of row and column.
> 
> Fixed all the above and updated as below.
> 	if (numa_dist_table_length != max_numa_index * max_numa_index) {
> 		WARN(1, "Wrong NUMA distnce information\n");
                                    ^^^^^^^

> 		/* consider everybody else just remote. */
> 
> 		for (i = 0;  i < max_numa_index; i++) {
> 			for (j = 0; j < max_numa_index; j++) {
> 				int nodeA = numa_id_index_table[i];
> 				int nodeB = numa_id_index_table[j];
> 
> 				if (nodeA == nodeB)
> 					numa_distance_table[nodeA][nodeB] = LOCAL_DISTANCE;
> 				else
> 					numa_distance_table[nodeA][nodeB] = REMOTE_DISTANCE;
> 			}
> 		}
> 		/* the table below can't be trusted, keep the fallback values */
> 		return;
> 	}
> 
> 	distance_index = 0;
> 	for (i = 0;  i < max_numa_index; i++) {
> 		for (j = 0; j < max_numa_index; j++) {
> 			int nodeA = numa_id_index_table[i];
> 			int nodeB = numa_id_index_table[j];
> 
> 			numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index++];
> 
> 			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
> 		}
> 	}
> }
> 
> 
> -aneesh
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity
  2021-06-28  6:18       ` David Gibson
@ 2021-06-28 15:04         ` Aneesh Kumar K.V
  0 siblings, 0 replies; 20+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-28 15:04 UTC (permalink / raw)
  To: David Gibson
  Cc: linuxppc-dev, mpe, Nathan Lynch, Daniel Henrique Barboza, nvdimm,
	dan.j.williams

David Gibson <david@gibson.dropbear.id.au> writes:

> On Thu, Jun 24, 2021 at 01:50:34PM +0530, Aneesh Kumar K.V wrote:
>> David Gibson <david@gibson.dropbear.id.au> writes:
>> 
>> > On Thu, Jun 17, 2021 at 10:21:05PM +0530, Aneesh Kumar K.V wrote:
>> >> PAPR interface currently supports two different ways of communicating resource
>> >> grouping details to the OS. These are referred to as Form 0 and Form 1
>> >> associativity grouping. Form 0 is the older format and is now considered
>> >> deprecated. This patch adds another resource grouping named FORM2.
>> >> 
>> >> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> >> ---
>> >>  Documentation/powerpc/associativity.rst   | 135 ++++++++++++++++++++
>> >>  arch/powerpc/include/asm/firmware.h       |   3 +-
>> >>  arch/powerpc/include/asm/prom.h           |   1 +
>> >>  arch/powerpc/kernel/prom_init.c           |   3 +-
>> >>  arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
>> >>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>> >>  6 files changed, 286 insertions(+), 6 deletions(-)
>> >>  create mode 100644 Documentation/powerpc/associativity.rst
>> >> 
>> >> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
>> >> new file mode 100644
>> >> index 000000000000..93be604ac54d
>> >> --- /dev/null
>> >> +++ b/Documentation/powerpc/associativity.rst
>> >> @@ -0,0 +1,135 @@
>> >> +============================
>> >> +NUMA resource associativity
>> >> +=============================
>> >> +
>> >> +Associativity represents the groupings of the various platform resources into
>> >> +domains of substantially similar mean performance relative to resources outside
>> >> +of that domain. Resources subsets of a given domain that exhibit better
>> >> +performance relative to each other than relative to other resources subsets
>> >> +are represented as being members of a sub-grouping domain. This performance
>> >> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
>> >> +From the platform view, these groups are also referred to as domains.
>> >> +
>> >> +PAPR interface currently supports different ways of communicating these resource
>> >> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
>> >> +associativity grouping. Form 0 is the older format and is now considered deprecated.
>> >> +
>> >> +Hypervisor indicates the type/form of associativity used via "ibm,arcitecture-vec-5 property".
>> >> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
>> >> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
>> >> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>> >> +
>> >> +Form 0
>> >> +-----
>> >> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
>> >> +
>> >> +Form 1
>> >> +-----
>> >> +With Form 1 a combination of ibm,associativity-reference-points and ibm,associativity
>> >> +device tree properties are used to determine the NUMA distance between resource groups/domains.
>> >> +
>> >> +The “ibm,associativity” property contains one or more lists of numbers (domainID)
>> >> +representing the resource’s platform grouping domains.
>> >> +
>> >> +The “ibm,associativity-reference-points” property contains one or more list of numbers
>> >> +(domainID index) that represents the 1 based ordinal in the associativity lists.
>> >> +The list of domainID index represnets increasing hierachy of
>> >> resource grouping.
>> >
>> > Typo "represnets".  Also s/hierachy/hierarchy/
>> >
>> >> +
>> >> +ex:
>> >> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
>> >
>> >> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
>> >> +Linux kernel computes NUMA distance between two domains by recursively comparing
>> >> +if they belong to the same higher-level domains. For mismatch at every higher
>> >> +level of the resource group, the kernel doubles the NUMA distance between the
>> >> +comparing domains.
>> >
>> > The Form1 description is still kinda confusing, but I don't really
>> > care.  Form1 *is* confusing, it's Form2 that I hope will be clearer.
>> >
>> >> +
>> >> +Form 2
>> >> +-------
>> >> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
>> >> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
>> >> +domain numbering. With numa distance computation now detached from the index value of
>> >> +"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
>> >> +same domainID index representing resource groups of different performance/latency characteristics.
>> >
>> > So, see you've removed the special handling of secondary IDs for pmem
>> > - big improvement, thanks.  IIUC, in this revised version, for Form2
>> > there's really no reason for ibm,associativity-reference-points to
>> > have >1 entry.  Is that right?
>> >
>> > In Form2 everything revolves around the primary domain ID - so much
>> > that I suggest we come up with a short name for it.  How about just
>> > "node id" since that's how Linux uses it.
>> 
>> We don't really refer to the primary domain ID in the rest of the FORM2
>> details. We do refer to the entries as domainID.
>
> Right, which is unnecessarily confusing, because primary domain ID is
> going to be the only one that matters in practice.
>
>> Is renaming domainID to NUMA
>> node id better?
>> 
>> >
>> >> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
>> >> +"ibm,architecture-vec-5" property.
>> >> +
>> >> +"ibm,numa-lookup-index-table" property contains one or more list numbers representing
>> >> +the domainIDs present in the system. The offset of the domainID in this property is considered
>> >> +the domainID index.
>> >
>> > The lookup-index-table is all about compactifying the representation
>> > of the distance matrix.  Because node ids are sparse, the matrix of
>> > distances is also effectively sparse.  You don't want to have a huge
>> > matrix in the DT with mostly meaningless entries, so you use the
>> > lookup table to compact the node ids.
>> >
>> > It's a bit confusing though, because you now have two representations
>> > of exactly the same information the "full" node id (== primary
>> > domainID) and the "short" node id (==domainID lookup table offset).
>> >
>> > I can see three ways to clarify this:
>> >
>> >   A) Give the short node id an explicit name rather than difficult to
>> >      parse "domainID index" or "domainID offset" which gets easily
>> >      muddled with other uses of index and offset.  Use that name
>> >      throughout.
>> 
>> I dropped the domainID index and started referring to that as the domain
>> distance offset.
>
> Better, since it's not ambiguous.  Still kinda long and awkward.
>
>> >   B) Eliminate the short ID entirely, and use an explicit sparse
>> >      matrix representation for the distance table e.g. a list of
>> >      (nodeA, nodeB, distance) tuples
>> >
>> >      That does have the disadvantage that it's not structurally
>> >      guaranteed that you have entries for every pair of "active" node
>> >      ids.
>> 
>> Other hypervisors would want to have a large possible domainID value and
>> mostly want to publish the full topology details during boot. Using the
>> above format makes a 16-node config have a large "ibm,numa-distance-table"
>> property.
>
> It'd be about 3kiB, yes?  That's fairly large, but not ludicrously
> large, so does that matter?

There is also the possibility that the current 16-node maximum will be
raised to a higher value in the future.
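For a rough sense of scale (my back-of-the-envelope numbers, assuming
4-byte encode-int cells): a full 16x16 matrix expressed as
(nodeA, nodeB, distance) tuples would be 16 * 16 * 3 * 4 = 3072 bytes,
i.e. about the 3kiB you mention, and it grows quadratically if that node
limit is ever raised.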

>
>> >   C) Eliminate the "long ID" entirely.  In the Form2 world, is there
>> >      actually any point allowing sparse node ids?  In this case you'd
>> >      have the node id read from associativity and use that directly to
>> >      index the distance table
>> 
>> Yes, other hypervisors would like to partition the domain ID space for
>> different resources.
>
> Is that really a hard requirement on the dt format though?  The
> hypervisor could always have its own internal lookup table between
> long and short forms.

There is also the desire to keep the new details close to the values the
hypervisor already uses for domainIDs with FORM1. From the PAPR spec
point of view, it is desirable to avoid adding another level of
indirection at the hypervisor level.

>
>> >> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
>> >> +N domainID encoded as with encode-int
>> >> +
>> >> +For ex:
>> >> +ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index
>> >> for domainID 8 is 1.
>> >
>> > As noted "domainID index" is confusing here, because it's different
>> > from the "domainID index" values in reference-points.
>> >
>> >> +
>> >> +"ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
>> >> +distance between resource groups/domains present in the system.
>> >> +
>> >> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>> >> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>> >
>> > The N here always has to be the square of the N in the
>> > lookup-index-table, yes?  In which case do we actually need to include
>> > it?
>> 
>> Most PAPR device tree properties follow the pattern {N, .... N
>> elements}. This follows the same pattern.
>
> Many do, yes, but in many of those cases reading the N is actually
> useful.  It's impossible to interpret this table if you don't already
> know the number of nodes and if the length of the distance doesn't
> match that, there's no reasonable fallback interpretation.
>
>> >> +For ex:
>> >> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
>> >> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
>> >> +
>> >> +  | 0    8   40
>> >> +--|------------
>> >> +  |
>> >> +0 | 10   20  80
>> >> +  |
>> >> +8 | 20   10  160
>> >> +  |
>> >> +40| 80   160  10
>> >> +
>> >> +
>> >> +"ibm,associativity" property for resources in node 0, 8 and 40
>> >> +
>> >> +{ 3, 6, 7, 0 }
>> >> +{ 3, 6, 9, 8 }
>> >> +{ 3, 6, 7, 40}
>> >> +
>> >> +With "ibm,associativity-reference-points"  { 0x3 }
>> >> +
>> >> +Each resource (drcIndex) now also supports additional optional device tree properties.
>> >> +These properties are marked optional because the platform can choose not to export
>> >> +them and provide the system topology details using the earlier defined device tree
>> >> +properties alone. The optional device tree properties are used when adding new resources
>> >> +(DLPAR) and when the platform didn't provide the topology details of the domain which
>> >> +contains the newly added resource during boot.
>> >
>> > In what circumstances would you envisage a hot-add creating a new NUMA
>> > node, as opposed to adding resources to an existing (possibly empty)
>> > node?  Do you need any provision for removing NUMA nodes again?
>> 
>> Both Qemu and PowerVM don't intend to use this at this point. Both will
>> provide the full possible NUMA node details during boot. But I was not
>> sure whether we really need to rule out the possibility of a new
>> resource being added. This can happen with new memory devices that get
>> hotplugged where the latency of the device is such that we never
>> expressed it in the distance table during boot.
>
> Hm, ok.  Can we just leave it out of the spec for now and add it in
> when/if a hypervisor needs it?

Sure. I will drop the DLPAR-specific changes.

>
>> >> +"ibm,numa-lookup-index" property contains a number representing the domainID index to be used
>> >> +when building the NUMA distance of the numa node to which this resource belongs. This can
>> >> +be looked at as the index at which this new domainID would have appeared in
>> >> +"ibm,numa-lookup-index-table" if the domain was present during boot. The domainID
>> >
>> > Again, confusing use of "index" where it's used both for offsets in
>> > ibm,associativity properties and for offsets in ibm,numa-lookup-index-table
>> 
>> I guess we can drop ibm,numa-lookup-index and state that the number of
>> distance elements in "ibm,numa-distance" implies the index at which this
>> NUMA node should appear when computing the distance details.
>
> I don't really follow what you're saying there.
>
>> >> +of the new resource can be obtained from the existing "ibm,associativity" property. This
>> >> +can be used to build distance information of a newly onlined NUMA node via DLPAR operation.
>> >> +The value is 1 based array index value.
>> >> +
>> >> +prop-encoded-array: An integer encoded as with encode-int specifying the domainID index
>> >> +
>> >> +"ibm,numa-distance" property contains one or more list of numbers presenting the NUMA distance
>> >> +from this resource domain to other resources.
>> >> +
>> >> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>> >> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>> >
>> > Again, does N have to equal 2 * (existing number of nodes + 1)?  If so
>> > you should say so, if not you should specify how incomplete
>> > information is interpreted.
>> >
>> >> +For ex:
>> >> +ibm,associativity     = { 4, 5, 10, 50}
>> >> +ibm,numa-lookup-index = { 4 }
>> >> +ibm,numa-distance   =  {8, 160, 255, 80, 10, 160, 255, 80, 10}
>> >> +
>> >> +resulting in a new topology as below.
>> >> +  | 0    8   40   50
>> >> +--|------------------
>> >> +  |
>> >> +0 | 10   20  80   160
>> >> +  |
>> >> +8 | 20   10  160  255
>> >> +  |
>> >> +40| 80   160  10  80
>> >> +  |
>> >> +50| 160  255  80  10
>> >> +
>> 
>> Here is the relevant updated part of the doc.
>> 
>> Form 2 associativity format adds separate device tree properties representing NUMA node distance
>> thereby making the node distance computation flexible. Form 2 also allows flexible primary
>> domain numbering. With numa distance computation now detached from the index value in
>> "ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
>> ids at the same domainID index representing resource groups of different performance/latency
>> characteristics.
>> 
>> Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
>> "ibm,architecture-vec-5" property.
>> 
>> "ibm,numa-lookup-index-table" property contains one or more list numbers representing
>
> s/one or more list numbers/a list of one or more numbers/

That was picked from the PAPR reference. I updated the properties mentioned
in this file to say "a list of one or more numbers".

>
>> the domainIDs present in the system. The offset of the domainID in this property is
>> used as an index while computing numa distance information via "ibm,numa-distance-table".
>> 
>> prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
>> N domainID encoded as with encode-int
>> 
>> For ex:
>> "ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. Offset of domainID 8 (2) is used when
>> computing the distance of domain 8 from other domains present in the system. For the rest of
>> this document, this offset will be referred to as the domain distance offset.
>> 
>> "ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
>> distance between resource groups/domains present in the system.
>> 
>> prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>> N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>
> If you insist on retaining the N here, at least explicitly state that
> it must be equal to m^2, where m is the number of nodes in numa-lookup-index-table.

updated
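To make the constraint concrete, something along these lines (just an
illustrative sketch, not part of the patch; the helper name is made up)
is what a consumer of the two properties would check:

static bool form2_distance_table_len_ok(const __be32 *lookup_index_prop,
					const __u8 *distance_prop)
{
	/* first cell of "ibm,numa-lookup-index-table": number of domainIDs */
	int nr_domains = of_read_number(lookup_index_prop, 1);
	/* first cell of "ibm,numa-distance-table": number of distance values */
	int nr_distances = of_read_number((const __be32 *)distance_prop, 1);

	/* N must be the square of the lookup-index-table entry count */
	return nr_distances == nr_domains * nr_domains;
}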

>
>> For ex:
>> ibm,numa-lookup-index-table =  {3, 0, 8, 40}
>> ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
>> 
>>   | 0    8   40
>> --|------------
>>   |
>> 0 | 10   20  80
>>   |
>> 8 | 20   10  160
>>   |
>> 40| 80   160  10
>> 
>> A possible "ibm,associativity" property for resources in node 0, 8 and 40
>> 
>> { 3, 6, 7, 0 }
>> { 3, 6, 9, 8 }
>> { 3, 6, 7, 40}
>> 
>> With "ibm,associativity-reference-points"  { 0x3 }
>> 
>> "ibm,lookup-index-table" helps in having a compact representation of distance matrix.
>> Since domainID can be sparse, the matrix of distances can also be effectively sparse.
>> With "ibm,lookup-index-table" we are able to achieve a compact representation of
>> distance information.
>> 
>> Each resource (drcIndex) now also supports additional optional device tree properties.
>> These properties are marked optional because the platform can choose not to export
>> them and provide the system topology details using the earlier defined device tree
>> properties alone. The optional device tree properties are used when adding new resources
>> (DLPAR) and when the platform didn't provide the topology details of the domain which
>> contains the newly added resource during boot.
>> 
>> "ibm,numa-distance" property contains one or more list of numbers presenting the NUMA distance
>> from this resource domain to other resources.
>> 
>> prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>> N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>
>> If the system currently has 3 domains and a DLPAR operation is adding one additional
>> domain, "ibm,numa-distance" property should have 2 * (3 + 1) = 8 elements as shown below.
>> In other words, the domain distance offset value for the newly added domain is the number of
>> distance value elements in the "ibm,numa-distance" property divided by 2.
>
> You need to spell out the distinction between the two halves of the
> property - which one adds a row to numa-distance-table, and which one
> adds a column.

updated.
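Concretely, going by the update_numa_distance() code later in this mail:
for the example above with existing domains 0, 8, 40 and new domain 50,
ibm,numa-distance = {8, 160, 255, 80, 10, 160, 255, 80, 10} carries 8
values after the count. The first four (160, 255, 80, 10) are the
distances from 0, 8, 40 and 50 to the new domain (they fill the new
column of the table), and the last four (160, 255, 80, 10) are the
distances from the new domain to 0, 8, 40 and 50 (they fill the new
row).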

>
>> >> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
>> >> index 60b631161360..97a3bd9ffeb9 100644
>> >> --- a/arch/powerpc/include/asm/firmware.h
>> >> +++ b/arch/powerpc/include/asm/firmware.h
>> >> @@ -53,6 +53,7 @@
>> >>  #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
>> >>  #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
>> >>  #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
>> >> +#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
>> >>  
>> >>  #ifndef __ASSEMBLY__
>> >>  
>> >> @@ -73,7 +74,7 @@ enum {
>> >>  		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
>> >>  		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
>> >>  		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
>> >> -		FW_FEATURE_RPT_INVALIDATE,
>> >> +		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
>> >>  	FW_FEATURE_PSERIES_ALWAYS = 0,
>> >>  	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
>> >>  	FW_FEATURE_POWERNV_ALWAYS = 0,
>> >> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
>> >> index df9fec9d232c..5c80152e8f18 100644
>> >> --- a/arch/powerpc/include/asm/prom.h
>> >> +++ b/arch/powerpc/include/asm/prom.h
>> >> @@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
>> >>  #define OV5_XCMO		0x0440	/* Page Coalescing */
>> >>  #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
>> >>  #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
>> >> +#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
>> >>  #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
>> >>  #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
>> >>  #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
>> >> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
>> >> index 64b9593038a7..496fdac54c29 100644
>> >> --- a/arch/powerpc/kernel/prom_init.c
>> >> +++ b/arch/powerpc/kernel/prom_init.c
>> >> @@ -1070,7 +1070,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
>> >>  #else
>> >>  		0,
>> >>  #endif
>> >> -		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
>> >> +		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
>> >> +		OV5_FEAT(OV5_FORM2_AFFINITY),
>> >>  		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
>> >>  		.micro_checkpoint = 0,
>> >>  		.reserved0 = 0,
>> >> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> >> index d32729f235b8..5a7d94960fb7 100644
>> >> --- a/arch/powerpc/mm/numa.c
>> >> +++ b/arch/powerpc/mm/numa.c
>> >> @@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
>> >>  
>> >>  #define FORM0_AFFINITY 0
>> >>  #define FORM1_AFFINITY 1
>> >> +#define FORM2_AFFINITY 2
>> >>  static int affinity_form;
>> >>  
>> >>  #define MAX_DISTANCE_REF_POINTS 4
>> >>  static int max_associativity_domain_index;
>> >>  static const __be32 *distance_ref_points;
>> >>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
>> >> +static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
>> >> +	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
>> >> +};
>> >> +static int numa_id_index_table[MAX_NUMNODES];
>> >>  
>> >>  /*
>> >>   * Allocate node_to_cpumask_map based on number of available nodes
>> >> @@ -166,6 +171,27 @@ static void unmap_cpu_from_node(unsigned long cpu)
>> >>  }
>> >>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>> >>  
>> >> +/*
>> >> + * With FORM2 if we are not using logical domain ids, we will find
>> >> + * both primary and seconday domains with same value. Hence always
>> >> + * start comparison from secondary domains
>> >
>> >
>> > IIUC, in this new draft, secondary domains are no longer a meaningful
>> > thing in Form2, so this comment and code seem outdated.
>> 
>> This needed fixup. With FORM2 our associativity array can be just one
>> element. I fixed up as below.
>> 
>> static int __cpu_form2_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> {
>> 	int node1, node2;
>> 	int dist = 0;
>> 
>> 	node1 = associativity_to_nid(cpu1_assoc);
>> 	node2 = associativity_to_nid(cpu2_assoc);
>> 
>> 	dist = numa_distance_table[node1][node2];
>> 	if (dist == LOCAL_DISTANCE)
>> 		return 0;
>> 	else if (dist == REMOTE_DISTANCE)
>> 		return 1;
>> 	else
>> 		return 2;
>> }
>> 
>> also renamed cpu_distance to cpu_relative_distance.
>> 
>> 
>> >> + */
>> >> +static int __cpu_form2_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> >> +{
>> >> +	int dist = 0;
>> >> +
>> >> +	int i, index;
>> >> +
>> >> +	for (i = 1; i < max_associativity_domain_index; i++) {
>> >> +		index = be32_to_cpu(distance_ref_points[i]);
>> >> +		if (cpu1_assoc[index] == cpu2_assoc[index])
>> >> +			break;
>> >> +		dist++;
>> >> +	}
>> >> +
>> >> +	return dist;
>> >> +}
>> >> +
>> >>  static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> >>  {
>> >>  	int dist = 0;
>> >> @@ -178,7 +204,6 @@ static int __cpu_form1_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> >>  			break;
>> >>  		dist++;
>> >>  	}
>> >> -
>> >>  	return dist;
>> >>  }
>> >>  
>> >> @@ -186,8 +211,9 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> >>  {
>> >>  	/* We should not get called with FORM0 */
>> >>  	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
>> >> -
>> >> -	return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
>> >> +	if (affinity_form == FORM1_AFFINITY)
>> >> +		return __cpu_form1_distance(cpu1_assoc, cpu2_assoc);
>> >> +	return __cpu_form2_distance(cpu1_assoc, cpu2_assoc);
>> >>  }
>> >>  
>> >>  /* must hold reference to node during call */
>> >> @@ -201,7 +227,9 @@ int __node_distance(int a, int b)
>> >>  	int i;
>> >>  	int distance = LOCAL_DISTANCE;
>> >>  
>> >> -	if (affinity_form == FORM0_AFFINITY)
>> >> +	if (affinity_form == FORM2_AFFINITY)
>> >> +		return numa_distance_table[a][b];
>> >> +	else if (affinity_form == FORM0_AFFINITY)
>> >>  		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
>> >>  
>> >>  	for (i = 0; i < max_associativity_domain_index; i++) {
>> >> @@ -303,15 +331,116 @@ static void initialize_form1_numa_distance(struct device_node *node)
>> >>  
>> >>  /*
>> >>   * Used to update distance information w.r.t newly added node.
>> >> + * ibm,numa-lookup-index -> 4
>> >> + * ibm,numa-distance -> {5, 20, 40, 60, 80, 10 }
>> >>   */
>> >>  void update_numa_distance(struct device_node *node)
>> >>  {
>> >> +	int i, nid, other_nid, other_nid_index = 0;
>> >> +	const __be32 *numa_indexp;
>> >> +	const __u8  *numa_distancep;
>> >> +	int numa_index, max_numa_index, numa_distance;
>> >> +
>> >>  	if (affinity_form == FORM0_AFFINITY)
>> >>  		return;
>> >>  	else if (affinity_form == FORM1_AFFINITY) {
>> >>  		initialize_form1_numa_distance(node);
>> >>  		return;
>> >>  	}
>> >> +	/* FORM2 affinity  */
>> >> +
>> >> +	nid = of_node_to_nid_single(node);
>> >> +	if (nid == NUMA_NO_NODE)
>> >> +		return;
>> >> +
>> >> +	/* Already initialized */
>> >> +	if (numa_distance_table[nid][nid] != -1)
>> >
>> > IIUC, this should exactly correspond to the case where the new
>> > resource lies in an existing NUMA node, yes?
>> 
>> yes.
>> 
>> >
>> >> +		return;
>> >> +	/*
>> >> +	 * update node distance if not already populated.
>> >> +	 */
>> >> +	numa_distancep = of_get_property(node, "ibm,numa-distance", NULL);
>> >> +	if (!numa_distancep)
>> >> +		return;
>> >> +
>> >> +	numa_indexp = of_get_property(node, "ibm,numa-lookup-index", NULL);
>> >> +	if (!numa_indexp)
>> >> +		return;
>> >> +
>> >> +	numa_index = of_read_number(numa_indexp, 1);
>> >> +	/*
>> >> +	 * update the numa_id_index_table. Device tree look at index table as
>> >> +	 * 1 based array indexing.
>> >> +	 */
>> >> +	numa_id_index_table[numa_index - 1] = nid;
>> >> +
>> >> +	max_numa_index = of_read_number((const __be32 *)numa_distancep, 1);
>> >> +	VM_WARN_ON(max_numa_index != 2 * numa_index);
>> >
>> > You WARN_ON(), but you don't actually bail out to avoid indexing past
>> > the end of the property below.
>> >
>> > AFAICT none of this really works unless numa_index == (previous number
>> > of numa nodes + 1), which makes all the use of these different
>> > variables kind of confusing.  If you had a variable that you just set
>> > equal to the new number of numa nodes (previous number plus the one
>> > being added), then I think you can replace all numa_index and
>> > max_numa_index references with that and it will be clearer.
>> 
>> I rewrote that as below.
>> 
>> 	numa_distancep = of_get_property(node, "ibm,numa-distance", NULL);
>> 	if (!numa_distancep)
>> 		return;
>> 
>> 	max_dist_element = of_read_number((const __be32 *)numa_distancep, 1);
>> 
>> 	numa_id_index = max_dist_element >> 1;
>
> You already know the existing number of NUMA nodes, so it seems
> clearer to me to validate the new property against that, rather than
> trying to parse it out of the new property.
>
>> 	/*
>> 	 * Update numa_id_index_table. The device tree uses 1-based array
>> 	 * indexing for the index table.
>> 	 */
>> 	if (numa_id_index_table[numa_id_index - 1] != -1) {
>> 		WARN(1, "Wrong NUMA distance information\n");
>> 		return;
>> 	}
>
> I believe
> 	if (WARN_ON(...))
> 		return;
>
> is idiomatic.
>
>> 	numa_id_index_table[numa_id_index - 1] = nid;
>> 
>> 	/* Skip the size which is encoded int */
>> 	numa_distancep += sizeof(__be32);
>
> Is numa_distancep a u32 pointer, or a char pointer?  If it's a u32
> pointer then adding sizeof(__be32) rather than 1 is incorrect here.
>

const __u8  *numa_distancep;

>> 	/*
>> 	 * First fill the distance information from other node to this node.
>> 	 */
>> 	distance_index = 0;
>> 	for (i = 0; i < numa_id_index; i++) {
>> 		numa_distance = numa_distancep[distance_index++];
>
> .. but if it's a char pointer then this looks wrong.

We are reading the 8-bit distance values here.
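To spell out the pointer arithmetic being discussed, a stripped-down
sketch (illustration only, the helper name is made up; it only relies on
of_read_number() from <linux/of.h>):

static void parse_form2_distance(const __u8 *prop, u8 *dist, int max_entries)
{
	int i, count;

	/* the property starts with an encode-int element count */
	count = of_read_number((const __be32 *)prop, 1);
	/* skip the 4-byte count; prop is a byte cursor from here on */
	prop += sizeof(__be32);

	/* each distance value is a single byte (max 255) */
	for (i = 0; i < count && i < max_entries; i++)
		dist[i] = prop[i];
}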

>
>> 		other_nid = numa_id_index_table[i];
>> 		if (other_nid == NUMA_NO_NODE)
>> 			continue;
>> 		numa_distance_table[other_nid][nid] = numa_distance;
>> 	}
>> 
>> 	for (i = 0; i < numa_id_index; i++) {
>> 		numa_distance = numa_distancep[distance_index++];
>
> I'd just compute the offset from i here, rather than having another
> variable to track.
>
>> 		other_nid = numa_id_index_table[i];
>> 		if (other_nid == NUMA_NO_NODE)
>> 			continue;
>> 		numa_distance_table[nid][other_nid] = numa_distance;
>> 	}
>> }
>> 
>> 
>> >
>> >
>> >> +	/* Skip the size which is encoded int */
>> >> +	numa_distancep += sizeof(__be32);
>> >> +
>> >> +	/*
>> >> +	 * First fill the distance information from other node to this node.
>> >> +	 */
>> >> +	other_nid_index = 0;
>> >> +	for (i = 0; i < numa_index; i++) {
>> >> +		numa_distance = numa_distancep[i];
>> >> +		other_nid = numa_id_index_table[other_nid_index++];
>> >> +		numa_distance_table[other_nid][nid] = numa_distance;
>> >> +	}
>> >> +
>> >> +	other_nid_index = 0;
>> >> +	for (; i < max_numa_index; i++) {
>> >
>> > Again, splitting this loop and carrying i over seems a confusing way
>> > to code this.  It's basically two loops of N, one writing a row of the
>> > distance matrix, one writing a column (other_nid will even go through
>> > the same values in each loop).
>> 
>> Fixed
>> 
>> >
>> >> +		numa_distance = numa_distancep[i];
>> >> +		other_nid = numa_id_index_table[other_nid_index++];
>> >> +		numa_distance_table[nid][other_nid] = numa_distance;
>> >> +	}
>> >> +}
>> >> +
>> >> +/*
>> >> + * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
>> >> + * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
>> >> + */
>> >> +static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
>> >> +{
>> >> +	const __u8 *numa_dist_table;
>> >> +	const __be32 *numa_lookup_index;
>> >> +	int numa_dist_table_length;
>> >> +	int max_numa_index, distance_index;
>> >> +	int i, curr_row = 0, curr_column = 0;
>> >> +
>> >> +	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
>> >> +	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
>> >
>> > max_numa_index here has a different meaning to max_numa_index in the
>> > previous function, which is pointlessly confusing.
>> >
>> >> +	/* first element of the array is the size and is encode-int */
>> >> +	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
>> >> +	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
>> >> +	/* Skip the size which is encoded int */
>> >> +	numa_dist_table += sizeof(__be32);
>> >> +
>> >> +	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d \n",
>> >> +		 numa_dist_table_length, max_numa_index);
>> >> +
>> >> +	for (i = 0; i < max_numa_index; i++)
>> >> +		/* +1 skip the max_numa_index in the property */
>> >> +		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
>> >> +
>> >> +
>> >> +	VM_WARN_ON(numa_dist_table_length != max_numa_index * max_numa_index);
>> >
>> > Again, you don't actually bail out in this case.  And if it has to
>> > have this value, what's the point of encoding it into the property.
>> >
>> >> +	for (distance_index = 0; distance_index < numa_dist_table_length; distance_index++) {
>> >> +		int nodeA = numa_id_index_table[curr_row];
>> >> +		int nodeB = numa_id_index_table[curr_column++];
>> >
>> > You've already (sort of) verified that the distance table has size
>> > N^2, in which case you can just to a simple two dimensional loop
>> > rather than having to do ugly calculations of row and column.
>> 
>> Fixed all the above and updated as below.
>> 	if (numa_dist_table_length != max_numa_index * max_numa_index) {
>> 		WARN(1, "Wrong NUMA distnce information\n");
>                                     ^^^^^^^
>

updated.

>> 		/* consider everybody else just remote. */
>> 
>> 		for (i = 0;  i < max_numa_index; i++) {
>> 			for (j = 0; j < max_numa_index; j++) {
>> 				int nodeA = numa_id_index_table[i];
>> 				int nodeB = numa_id_index_table[j];
>> 
>> 				if (nodeA == nodeB)
>> 					numa_distance_table[nodeA][nodeB] = LOCAL_DISTANCE;
>> 				else
>> 					numa_distance_table[nodeA][nodeB] = REMOTE_DISTANCE;
>> 			}
>> 		}
>> 		return;
>> 	}
>> 
>> 	distance_index = 0;
>> 	for (i = 0;  i < max_numa_index; i++) {
>> 		for (j = 0; j < max_numa_index; j++) {
>> 			int nodeA = numa_id_index_table[i];
>> 			int nodeB = numa_id_index_table[j];
>> 
>> 			numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index++];
>> 
>> 			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
>> 		}
>> 	}
>> }
>> 
>> 
>> -aneesh
>> 
>
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-06-28 15:53 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-17 16:50 [PATCH v4 0/7] Add support for FORM2 associativity Aneesh Kumar K.V
2021-06-17 16:50 ` [PATCH v4 1/7] powerpc/pseries: rename min_common_depth to primary_domain_index Aneesh Kumar K.V
2021-06-24  1:46   ` David Gibson
2021-06-17 16:51 ` [PATCH v4 2/7] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index Aneesh Kumar K.V
2021-06-24  3:10   ` David Gibson
2021-06-17 16:51 ` [PATCH v4 3/7] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY Aneesh Kumar K.V
2021-06-17 16:51 ` [PATCH v4 4/7] powerpc/pseries: Consolidate DLPAR NUMA distance update Aneesh Kumar K.V
2021-06-24  3:11   ` David Gibson
2021-06-17 16:51 ` [PATCH v4 5/7] powerpc/pseries: Consolidate NUMA distance update during boot Aneesh Kumar K.V
2021-06-17 16:51 ` [PATCH v4 6/7] powerpc/pseries: Add a helper for form1 cpu distance Aneesh Kumar K.V
2021-06-17 16:51 ` [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity Aneesh Kumar K.V
2021-06-21 22:02   ` Daniel Henrique Barboza
2021-06-22 12:07     ` Aneesh Kumar K.V
2021-06-22 16:04       ` Daniel Henrique Barboza
2021-06-24  3:08   ` David Gibson
2021-06-24  8:20     ` Aneesh Kumar K.V
2021-06-28  6:18       ` David Gibson
2021-06-28 15:04         ` Aneesh Kumar K.V
2021-06-24 10:33   ` Laurent Dufour
2021-06-24 10:55     ` Aneesh Kumar K.V
