* [PATCH v5 0/6] Add support for FORM2 associativity
@ 2021-06-28 15:11 Aneesh Kumar K.V
  2021-06-28 15:11 ` [PATCH v5 1/6] powerpc/pseries: rename min_common_depth to primary_domain_index Aneesh Kumar K.V
                   ` (6 more replies)
  0 siblings, 7 replies; 30+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-28 15:11 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, Aneesh Kumar K.V, Daniel Henrique Barboza, David Gibson

Form2 associativity adds a much more flexible NUMA topology layout
than what is provided by Form1. More details can be found in patch 6 of this series.
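
For a rough illustration of the difference (a simplified sketch modelled on
the kernel's existing Form1 __node_distance() behaviour; form1_like_distance()
is a made-up name, not code from this series): Form1 derives the distance
between two nodes purely from how many associativity reference-point levels
differ, doubling at each mismatched level, while Form2 reads the distance
directly from a hypervisor-provided table.

    /* Simplified sketch only: Form1 can express only 10, 20, 40, 80, ... */
    static int form1_like_distance(const int *assoc_a, const int *assoc_b,
                                   int nr_levels)
    {
            int i, distance = 10;           /* LOCAL_DISTANCE */

            for (i = 0; i < nr_levels; i++) {
                    if (assoc_a[i] == assoc_b[i])
                            break;          /* first shared domain level */
                    distance *= 2;          /* double per mismatched level */
            }
            return distance;
    }

    /*
     * Form2 instead is a plain table lookup, which is how asymmetric
     * values like the 11 vs 44 in the numactl output below become
     * possible:
     *
     *         distance = numa_distance_table[nodeA][nodeB];
     */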

$ numactl -H
...
node distances:
node   0   1   2   3 
  0:  10  11  222  33 
  1:  44  10  55  66 
  2:  77  88  10  99 
  3:  101  121  132  10 
$

After DAX kmem memory add
# numactl -H
available: 5 nodes (0-4)
...
node distances:
node   0   1   2   3   4 
  0:  10  11  222  33  240 
  1:  44  10  55  66  255 
  2:  77  88  10  99  255 
  3:  101  121  132  10  230 
  4:  255  255  255  230  10 


PAPR SCM now uses the NUMA distance details to find the numa_node and target_node
for the device (in the ndctl output below, target_node 4 is the node the device
memory is added to by DAX kmem, while numa_node 3 is the closest already-online node).

kvaneesh@ubuntu-guest:~$ ndctl  list -N -v 
[
  {
    "dev":"namespace0.0",
    "mode":"devdax",
    "map":"dev",
    "size":1071644672,
    "uuid":"d333d867-3f57-44c8-b386-d4d3abdc2bf2",
    "raw_uuid":"915361ad-fe6a-42dd-848f-d6dc9f5af362",
    "daxregion":{
      "id":0,
      "size":1071644672,
      "devices":[
        {
          "chardev":"dax0.0",
          "size":1071644672,
          "target_node":4,
          "mode":"devdax"
        }
      ]
    },
    "align":2097152,
    "numa_node":3
  }
]
kvaneesh@ubuntu-guest:~$ 


The above output is with the following Qemu command line:

-numa node,nodeid=4 \
-numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=222 -numa dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=240 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \
-numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \
-numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=230 \
-numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=230 \
-object memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE}  \
-device nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4
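
For illustration only (this is a sketch following the FORM2 property layout
described in patch 6/6 and its documentation, not a dump of the actual
QEMU-generated device tree), the corresponding FORM2 properties for the
5-node topology above would look roughly like:

    ibm,numa-lookup-index-table = { 5, 0, 1, 2, 3, 4 }
    ibm,numa-distance-table     = { 25,
                                    10,  11,  222, 33,  240,
                                    44,  10,  55,  66,  255,
                                    77,  88,  10,  99,  255,
                                    101, 121, 132, 10,  230,
                                    255, 255, 255, 230, 10 }

The first cell of each property is the element count (encode-int), followed
by the domainIDs and by the N x N distance values (encode-bytes) in
lookup-index order.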

Qemu changes can be found at https://lore.kernel.org/qemu-devel/20210616011944.2996399-1-danielhb413@gmail.com/

Changes from v4:
* Drop the DLPAR related device tree property for now because both Qemu and PowerVM
  will provide the distance details of all possible NUMA nodes during boot.
* Rework numa distance code based on review feedback.

Changes from v3:
* Drop PAPR SCM specific changes and depend completely on NUMA distance information.

Changes from v2:
* Add nvdimm list to Cc:
* update PATCH 8 commit message.

Changes from v1:
* Update FORM2 documentation.
* rename max_domain_index to max_associativity_domain_index


Aneesh Kumar K.V (6):
  powerpc/pseries: rename min_common_depth to primary_domain_index
  powerpc/pseries: rename distance_ref_points_depth to
    max_associativity_domain_index
  powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
  powerpc/pseries: Consolidate different NUMA distance update code paths
  powerpc/pseries: Add a helper for form1 cpu distance
  powerpc/pseries: Add support for FORM2 associativity

 Documentation/powerpc/associativity.rst       | 103 +++++
 arch/powerpc/include/asm/firmware.h           |   7 +-
 arch/powerpc/include/asm/prom.h               |   3 +-
 arch/powerpc/include/asm/topology.h           |   4 +-
 arch/powerpc/kernel/prom_init.c               |   3 +-
 arch/powerpc/mm/numa.c                        | 415 +++++++++++++-----
 arch/powerpc/platforms/pseries/firmware.c     |   3 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c        |   2 +
 arch/powerpc/platforms/pseries/lpar.c         |   4 +-
 arch/powerpc/platforms/pseries/pseries.h      |   1 +
 11 files changed, 432 insertions(+), 115 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

-- 
2.31.1



* [PATCH v5 1/6] powerpc/pseries: rename min_common_depth to primary_domain_index
  2021-06-28 15:11 [PATCH v5 0/6] Add support for FORM2 associativity Aneesh Kumar K.V
@ 2021-06-28 15:11 ` Aneesh Kumar K.V
  2021-07-22  1:59   ` David Gibson
  2021-06-28 15:11 ` [PATCH v5 2/6] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index Aneesh Kumar K.V
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-28 15:11 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, Aneesh Kumar K.V, Daniel Henrique Barboza, David Gibson

No functional change in this patch.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/numa.c | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..8365b298ec48 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -51,7 +51,7 @@ EXPORT_SYMBOL(numa_cpu_lookup_table);
 EXPORT_SYMBOL(node_to_cpumask_map);
 EXPORT_SYMBOL(node_data);
 
-static int min_common_depth;
+static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
@@ -232,8 +232,8 @@ static int associativity_to_nid(const __be32 *associativity)
 	if (!numa_enabled)
 		goto out;
 
-	if (of_read_number(associativity, 1) >= min_common_depth)
-		nid = of_read_number(&associativity[min_common_depth], 1);
+	if (of_read_number(associativity, 1) >= primary_domain_index)
+		nid = of_read_number(&associativity[primary_domain_index], 1);
 
 	/* POWER4 LPAR uses 0xffff as invalid node */
 	if (nid == 0xffff || nid >= nr_node_ids)
@@ -284,9 +284,9 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
-static int __init find_min_common_depth(void)
+static int __init find_primary_domain_index(void)
 {
-	int depth;
+	int index;
 	struct device_node *root;
 
 	if (firmware_has_feature(FW_FEATURE_OPAL))
@@ -326,7 +326,7 @@ static int __init find_min_common_depth(void)
 	}
 
 	if (form1_affinity) {
-		depth = of_read_number(distance_ref_points, 1);
+		index = of_read_number(distance_ref_points, 1);
 	} else {
 		if (distance_ref_points_depth < 2) {
 			printk(KERN_WARNING "NUMA: "
@@ -334,7 +334,7 @@ static int __init find_min_common_depth(void)
 			goto err;
 		}
 
-		depth = of_read_number(&distance_ref_points[1], 1);
+		index = of_read_number(&distance_ref_points[1], 1);
 	}
 
 	/*
@@ -348,7 +348,7 @@ static int __init find_min_common_depth(void)
 	}
 
 	of_node_put(root);
-	return depth;
+	return index;
 
 err:
 	of_node_put(root);
@@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 	int nid = default_nid;
 	int rc, index;
 
-	if ((min_common_depth < 0) || !numa_enabled)
+	if ((primary_domain_index < 0) || !numa_enabled)
 		return default_nid;
 
 	rc = of_get_assoc_arrays(&aa);
 	if (rc)
 		return default_nid;
 
-	if (min_common_depth <= aa.array_sz &&
+	if (primary_domain_index <= aa.array_sz &&
 	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
-		index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
+		index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
 		nid = of_read_number(&aa.arrays[index], 1);
 
 		if (nid == 0xffff || nid >= nr_node_ids)
@@ -708,18 +708,18 @@ static int __init parse_numa_properties(void)
 		return -1;
 	}
 
-	min_common_depth = find_min_common_depth();
+	primary_domain_index = find_primary_domain_index();
 
-	if (min_common_depth < 0) {
+	if (primary_domain_index < 0) {
 		/*
-		 * if we fail to parse min_common_depth from device tree
+		 * if we fail to parse primary_domain_index from device tree
 		 * mark the numa disabled, boot with numa disabled.
 		 */
 		numa_enabled = false;
-		return min_common_depth;
+		return primary_domain_index;
 	}
 
-	dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
+	dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
 
 	/*
 	 * Even though we connect cpus to numa domains later in SMP
@@ -919,14 +919,14 @@ static void __init find_possible_nodes(void)
 			goto out;
 	}
 
-	max_nodes = of_read_number(&domains[min_common_depth], 1);
+	max_nodes = of_read_number(&domains[primary_domain_index], 1);
 	for (i = 0; i < max_nodes; i++) {
 		if (!node_possible(i))
 			node_set(i, node_possible_map);
 	}
 
 	prop_length /= sizeof(int);
-	if (prop_length > min_common_depth + 2)
+	if (prop_length > primary_domain_index + 2)
 		coregroup_enabled = 1;
 
 out:
@@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu)
 		goto out;
 
 	index = of_read_number(associativity, 1);
-	if (index > min_common_depth + 1)
+	if (index > primary_domain_index + 1)
 		return of_read_number(&associativity[index - 1], 1);
 
 out:
-- 
2.31.1



* [PATCH v5 2/6] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index
  2021-06-28 15:11 [PATCH v5 0/6] Add support for FORM2 associativity Aneesh Kumar K.V
  2021-06-28 15:11 ` [PATCH v5 1/6] powerpc/pseries: rename min_common_depth to primary_domain_index Aneesh Kumar K.V
@ 2021-06-28 15:11 ` Aneesh Kumar K.V
  2021-07-22  0:59   ` David Gibson
  2021-06-28 15:11 ` [PATCH v5 3/6] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY Aneesh Kumar K.V
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-28 15:11 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, Aneesh Kumar K.V, Daniel Henrique Barboza, David Gibson

No functional change in this patch

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/numa.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8365b298ec48..132813dd1a6c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -56,7 +56,7 @@ static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
 #define MAX_DISTANCE_REF_POINTS 4
-static int distance_ref_points_depth;
+static int max_associativity_domain_index;
 static const __be32 *distance_ref_points;
 static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
 
@@ -169,7 +169,7 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 
 	int i, index;
 
-	for (i = 0; i < distance_ref_points_depth; i++) {
+	for (i = 0; i < max_associativity_domain_index; i++) {
 		index = be32_to_cpu(distance_ref_points[i]);
 		if (cpu1_assoc[index] == cpu2_assoc[index])
 			break;
@@ -193,7 +193,7 @@ int __node_distance(int a, int b)
 	if (!form1_affinity)
 		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
-	for (i = 0; i < distance_ref_points_depth; i++) {
+	for (i = 0; i < max_associativity_domain_index; i++) {
 		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
 			break;
 
@@ -213,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
 	if (!form1_affinity)
 		return;
 
-	for (i = 0; i < distance_ref_points_depth; i++) {
+	for (i = 0; i < max_associativity_domain_index; i++) {
 		const __be32 *entry;
 
 		entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
@@ -240,7 +240,7 @@ static int associativity_to_nid(const __be32 *associativity)
 		nid = NUMA_NO_NODE;
 
 	if (nid > 0 &&
-		of_read_number(associativity, 1) >= distance_ref_points_depth) {
+		of_read_number(associativity, 1) >= max_associativity_domain_index) {
 		/*
 		 * Skip the length field and send start of associativity array
 		 */
@@ -310,14 +310,14 @@ static int __init find_primary_domain_index(void)
 	 */
 	distance_ref_points = of_get_property(root,
 					"ibm,associativity-reference-points",
-					&distance_ref_points_depth);
+					&max_associativity_domain_index);
 
 	if (!distance_ref_points) {
 		dbg("NUMA: ibm,associativity-reference-points not found.\n");
 		goto err;
 	}
 
-	distance_ref_points_depth /= sizeof(int);
+	max_associativity_domain_index /= sizeof(int);
 
 	if (firmware_has_feature(FW_FEATURE_OPAL) ||
 	    firmware_has_feature(FW_FEATURE_TYPE1_AFFINITY)) {
@@ -328,7 +328,7 @@ static int __init find_primary_domain_index(void)
 	if (form1_affinity) {
 		index = of_read_number(distance_ref_points, 1);
 	} else {
-		if (distance_ref_points_depth < 2) {
+		if (max_associativity_domain_index < 2) {
 			printk(KERN_WARNING "NUMA: "
 				"short ibm,associativity-reference-points\n");
 			goto err;
@@ -341,10 +341,10 @@ static int __init find_primary_domain_index(void)
 	 * Warn and cap if the hardware supports more than
 	 * MAX_DISTANCE_REF_POINTS domains.
 	 */
-	if (distance_ref_points_depth > MAX_DISTANCE_REF_POINTS) {
+	if (max_associativity_domain_index > MAX_DISTANCE_REF_POINTS) {
 		printk(KERN_WARNING "NUMA: distance array capped at "
 			"%d entries\n", MAX_DISTANCE_REF_POINTS);
-		distance_ref_points_depth = MAX_DISTANCE_REF_POINTS;
+		max_associativity_domain_index = MAX_DISTANCE_REF_POINTS;
 	}
 
 	of_node_put(root);
-- 
2.31.1



* [PATCH v5 3/6] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
  2021-06-28 15:11 [PATCH v5 0/6] Add support for FORM2 associativity Aneesh Kumar K.V
  2021-06-28 15:11 ` [PATCH v5 1/6] powerpc/pseries: rename min_common_depth to primary_domain_index Aneesh Kumar K.V
  2021-06-28 15:11 ` [PATCH v5 2/6] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index Aneesh Kumar K.V
@ 2021-06-28 15:11 ` Aneesh Kumar K.V
  2021-06-28 15:11 ` [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths Aneesh Kumar K.V
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-28 15:11 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, Aneesh Kumar K.V, Daniel Henrique Barboza, David Gibson

Also make related code cleanup that will allow adding FORM2_AFFINITY in
later patches. No functional change in this patch.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/firmware.h       |  4 +--
 arch/powerpc/include/asm/prom.h           |  2 +-
 arch/powerpc/kernel/prom_init.c           |  2 +-
 arch/powerpc/mm/numa.c                    | 35 ++++++++++++++---------
 arch/powerpc/platforms/pseries/firmware.c |  2 +-
 5 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 7604673787d6..60b631161360 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -44,7 +44,7 @@
 #define FW_FEATURE_OPAL		ASM_CONST(0x0000000010000000)
 #define FW_FEATURE_SET_MODE	ASM_CONST(0x0000000040000000)
 #define FW_FEATURE_BEST_ENERGY	ASM_CONST(0x0000000080000000)
-#define FW_FEATURE_TYPE1_AFFINITY ASM_CONST(0x0000000100000000)
+#define FW_FEATURE_FORM1_AFFINITY ASM_CONST(0x0000000100000000)
 #define FW_FEATURE_PRRN		ASM_CONST(0x0000000200000000)
 #define FW_FEATURE_DRMEM_V2	ASM_CONST(0x0000000400000000)
 #define FW_FEATURE_DRC_INFO	ASM_CONST(0x0000000800000000)
@@ -69,7 +69,7 @@ enum {
 		FW_FEATURE_SPLPAR | FW_FEATURE_LPAR |
 		FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
 		FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
-		FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
+		FW_FEATURE_FORM1_AFFINITY | FW_FEATURE_PRRN |
 		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
 		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
 		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 324a13351749..df9fec9d232c 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -147,7 +147,7 @@ extern int of_read_drc_info_cell(struct property **prop,
 #define OV5_MSI			0x0201	/* PCIe/MSI support */
 #define OV5_CMO			0x0480	/* Cooperative Memory Overcommitment */
 #define OV5_XCMO		0x0440	/* Page Coalescing */
-#define OV5_TYPE1_AFFINITY	0x0580	/* Type 1 NUMA affinity */
+#define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
 #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
 #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
 #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 523b31685c4c..5d9ea059594f 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1069,7 +1069,7 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
 #else
 		0,
 #endif
-		.associativity = OV5_FEAT(OV5_TYPE1_AFFINITY) | OV5_FEAT(OV5_PRRN),
+		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
 		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
 		.micro_checkpoint = 0,
 		.reserved0 = 0,
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 132813dd1a6c..0ec16999beef 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -53,7 +53,10 @@ EXPORT_SYMBOL(node_data);
 
 static int primary_domain_index;
 static int n_mem_addr_cells, n_mem_size_cells;
-static int form1_affinity;
+
+#define FORM0_AFFINITY 0
+#define FORM1_AFFINITY 1
+static int affinity_form;
 
 #define MAX_DISTANCE_REF_POINTS 4
 static int max_associativity_domain_index;
@@ -190,7 +193,7 @@ int __node_distance(int a, int b)
 	int i;
 	int distance = LOCAL_DISTANCE;
 
-	if (!form1_affinity)
+	if (affinity_form == FORM0_AFFINITY)
 		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
 	for (i = 0; i < max_associativity_domain_index; i++) {
@@ -210,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
 {
 	int i;
 
-	if (!form1_affinity)
+	if (affinity_form != FORM1_AFFINITY)
 		return;
 
 	for (i = 0; i < max_associativity_domain_index; i++) {
@@ -289,6 +292,17 @@ static int __init find_primary_domain_index(void)
 	int index;
 	struct device_node *root;
 
+	/*
+	 * Check for which form of affinity.
+	 */
+	if (firmware_has_feature(FW_FEATURE_OPAL)) {
+		affinity_form = FORM1_AFFINITY;
+	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
+		dbg("Using form 1 affinity\n");
+		affinity_form = FORM1_AFFINITY;
+	} else
+		affinity_form = FORM0_AFFINITY;
+
 	if (firmware_has_feature(FW_FEATURE_OPAL))
 		root = of_find_node_by_path("/ibm,opal");
 	else
@@ -318,23 +332,16 @@ static int __init find_primary_domain_index(void)
 	}
 
 	max_associativity_domain_index /= sizeof(int);
-
-	if (firmware_has_feature(FW_FEATURE_OPAL) ||
-	    firmware_has_feature(FW_FEATURE_TYPE1_AFFINITY)) {
-		dbg("Using form 1 affinity\n");
-		form1_affinity = 1;
-	}
-
-	if (form1_affinity) {
-		index = of_read_number(distance_ref_points, 1);
-	} else {
+	if (affinity_form == FORM0_AFFINITY) {
 		if (max_associativity_domain_index < 2) {
 			printk(KERN_WARNING "NUMA: "
-				"short ibm,associativity-reference-points\n");
+			       "short ibm,associativity-reference-points\n");
 			goto err;
 		}
 
 		index = of_read_number(&distance_ref_points[1], 1);
+	} else {
+		index = of_read_number(distance_ref_points, 1);
 	}
 
 	/*
diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
index 4c7b7f5a2ebc..5d4c2bc20bba 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -119,7 +119,7 @@ struct vec5_fw_feature {
 
 static __initdata struct vec5_fw_feature
 vec5_fw_features_table[] = {
-	{FW_FEATURE_TYPE1_AFFINITY,	OV5_TYPE1_AFFINITY},
+	{FW_FEATURE_FORM1_AFFINITY,	OV5_FORM1_AFFINITY},
 	{FW_FEATURE_PRRN,		OV5_PRRN},
 	{FW_FEATURE_DRMEM_V2,		OV5_DRMEM_V2},
 	{FW_FEATURE_DRC_INFO,		OV5_DRC_INFO},
-- 
2.31.1



* [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths
  2021-06-28 15:11 [PATCH v5 0/6] Add support for FORM2 associativity Aneesh Kumar K.V
                   ` (2 preceding siblings ...)
  2021-06-28 15:11 ` [PATCH v5 3/6] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY Aneesh Kumar K.V
@ 2021-06-28 15:11 ` Aneesh Kumar K.V
  2021-06-28 20:21     ` kernel test robot
                     ` (2 more replies)
  2021-06-28 15:11 ` [PATCH v5 5/6] powerpc/pseries: Add a helper for form1 cpu distance Aneesh Kumar K.V
                   ` (2 subsequent siblings)
  6 siblings, 3 replies; 30+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-28 15:11 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, Aneesh Kumar K.V, Daniel Henrique Barboza, David Gibson

The associativity details of the newly added resource are collected from
the hypervisor via the "ibm,configure-connector" RTAS call. Update the NUMA
distance details of the newly added NUMA node after the above call.

Instead of updating the NUMA distance every time we look up a node id
from the associativity property, add helpers that can be used
during boot and do this only once. Also remove the distance
update from the node id lookup helpers.
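
A rough sketch of the resulting call flow, as read from the diff below
(all function names are ones introduced or used by this patch):

    boot: parse_numa_properties()
        __vphn_get_associativity() / of_get_associativity()
            -> associativity_to_nid()
            -> __initialize_form1_numa_distance()
        numa_setup_drmem_lmb() -> get_nid_and_numa_distance()

    hotplug: dlpar_cpu_add() / update_lmb_associativity_index()
        -> update_numa_distance()   (new helper, declared in pseries.h)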

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/numa.c                        | 173 +++++++++++++-----
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
 .../platforms/pseries/hotplug-memory.c        |   2 +
 arch/powerpc/platforms/pseries/pseries.h      |   1 +
 4 files changed, 132 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0ec16999beef..7b142f79d600 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -208,22 +208,6 @@ int __node_distance(int a, int b)
 }
 EXPORT_SYMBOL(__node_distance);
 
-static void initialize_distance_lookup_table(int nid,
-		const __be32 *associativity)
-{
-	int i;
-
-	if (affinity_form != FORM1_AFFINITY)
-		return;
-
-	for (i = 0; i < max_associativity_domain_index; i++) {
-		const __be32 *entry;
-
-		entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
-		distance_lookup_table[nid][i] = of_read_number(entry, 1);
-	}
-}
-
 /*
  * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
  * info is found.
@@ -241,15 +225,6 @@ static int associativity_to_nid(const __be32 *associativity)
 	/* POWER4 LPAR uses 0xffff as invalid node */
 	if (nid == 0xffff || nid >= nr_node_ids)
 		nid = NUMA_NO_NODE;
-
-	if (nid > 0 &&
-		of_read_number(associativity, 1) >= max_associativity_domain_index) {
-		/*
-		 * Skip the length field and send start of associativity array
-		 */
-		initialize_distance_lookup_table(nid, associativity + 1);
-	}
-
 out:
 	return nid;
 }
@@ -287,6 +262,49 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
+static void __initialize_form1_numa_distance(const __be32 *associativity)
+{
+	int i, nid;
+
+	if (affinity_form != FORM1_AFFINITY)
+		return;
+
+	if (of_read_number(associativity, 1) >= primary_domain_index) {
+		nid = of_read_number(&associativity[primary_domain_index], 1);
+
+		for (i = 0; i < max_associativity_domain_index; i++) {
+			const __be32 *entry;
+
+			entry = &associativity[be32_to_cpu(distance_ref_points[i])];
+			distance_lookup_table[nid][i] = of_read_number(entry, 1);
+		}
+	}
+}
+
+static void initialize_form1_numa_distance(struct device_node *node)
+{
+	const __be32 *associativity;
+
+	associativity = of_get_associativity(node);
+	if (!associativity)
+		return;
+
+	__initialize_form1_numa_distance(associativity);
+}
+
+/*
+ * Used to update distance information w.r.t newly added node.
+ */
+void update_numa_distance(struct device_node *node)
+{
+	if (affinity_form == FORM0_AFFINITY)
+		return;
+	else if (affinity_form == FORM1_AFFINITY) {
+		initialize_form1_numa_distance(node);
+		return;
+	}
+}
+
 static int __init find_primary_domain_index(void)
 {
 	int index;
@@ -433,6 +451,48 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
 	return 0;
 }
 
+static int get_nid_and_numa_distance(struct drmem_lmb *lmb)
+{
+	struct assoc_arrays aa = { .arrays = NULL };
+	int default_nid = NUMA_NO_NODE;
+	int nid = default_nid;
+	int rc, index;
+
+	if ((primary_domain_index < 0) || !numa_enabled)
+		return default_nid;
+
+	rc = of_get_assoc_arrays(&aa);
+	if (rc)
+		return default_nid;
+
+	if (primary_domain_index <= aa.array_sz &&
+	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
+		index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
+		nid = of_read_number(&aa.arrays[index], 1);
+
+		if (nid == 0xffff || nid >= nr_node_ids)
+			nid = default_nid;
+		if (nid > 0 && affinity_form == FORM1_AFFINITY) {
+			int i;
+			const __be32 *associativity;
+
+			index = lmb->aa_index * aa.array_sz;
+			associativity = &aa.arrays[index];
+			/*
+			 * lookup array associativity entries have different format
+			 * There is no length of the array as the first element.
+			 */
+			for (i = 0; i < max_associativity_domain_index; i++) {
+				const __be32 *entry;
+
+				entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
+				distance_lookup_table[nid][i] = of_read_number(entry, 1);
+			}
+		}
+	}
+	return nid;
+}
+
 /*
  * This is like of_node_to_nid_single() for memory represented in the
  * ibm,dynamic-reconfiguration-memory node.
@@ -458,21 +518,14 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 
 		if (nid == 0xffff || nid >= nr_node_ids)
 			nid = default_nid;
-
-		if (nid > 0) {
-			index = lmb->aa_index * aa.array_sz;
-			initialize_distance_lookup_table(nid,
-							&aa.arrays[index]);
-		}
 	}
-
 	return nid;
 }
 
 #ifdef CONFIG_PPC_SPLPAR
-static int vphn_get_nid(long lcpu)
+
+static int __vphn_get_associativity(long lcpu, __be32 *associativity)
 {
-	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
 	long rc, hwid;
 
 	/*
@@ -492,10 +545,22 @@ static int vphn_get_nid(long lcpu)
 
 		rc = hcall_vphn(hwid, VPHN_FLAG_VCPU, associativity);
 		if (rc == H_SUCCESS)
-			return associativity_to_nid(associativity);
+			return 0;
 	}
 
+	return -1;
+}
+
+static int vphn_get_nid(long lcpu)
+{
+	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+
+
+	if (!__vphn_get_associativity(lcpu, associativity))
+		return associativity_to_nid(associativity);
+
 	return NUMA_NO_NODE;
+
 }
 #else
 static int vphn_get_nid(long unused)
@@ -692,7 +757,7 @@ static int __init numa_setup_drmem_lmb(struct drmem_lmb *lmb,
 			size = read_n_cells(n_mem_size_cells, usm);
 		}
 
-		nid = of_drconf_to_nid_single(lmb);
+		nid = get_nid_and_numa_distance(lmb);
 		fake_numa_create_new_node(((base + size) >> PAGE_SHIFT),
 					  &nid);
 		node_set_online(nid);
@@ -709,6 +774,7 @@ static int __init parse_numa_properties(void)
 	struct device_node *memory;
 	int default_nid = 0;
 	unsigned long i;
+	const __be32 *associativity;
 
 	if (numa_enabled == 0) {
 		printk(KERN_WARNING "NUMA disabled by user\n");
@@ -734,18 +800,30 @@ static int __init parse_numa_properties(void)
 	 * each node to be onlined must have NODE_DATA etc backing it.
 	 */
 	for_each_present_cpu(i) {
+		__be32 vphn_assoc[VPHN_ASSOC_BUFSIZE];
 		struct device_node *cpu;
-		int nid = vphn_get_nid(i);
+		int nid = NUMA_NO_NODE;
 
-		/*
-		 * Don't fall back to default_nid yet -- we will plug
-		 * cpus into nodes once the memory scan has discovered
-		 * the topology.
-		 */
-		if (nid == NUMA_NO_NODE) {
+		memset(vphn_assoc, 0, VPHN_ASSOC_BUFSIZE * sizeof(__be32));
+
+		if (__vphn_get_associativity(i, vphn_assoc) == 0) {
+			nid = associativity_to_nid(vphn_assoc);
+			__initialize_form1_numa_distance(vphn_assoc);
+		} else {
+
+			/*
+			 * Don't fall back to default_nid yet -- we will plug
+			 * cpus into nodes once the memory scan has discovered
+			 * the topology.
+			 */
 			cpu = of_get_cpu_node(i, NULL);
 			BUG_ON(!cpu);
-			nid = of_node_to_nid_single(cpu);
+
+			associativity = of_get_associativity(cpu);
+			if (associativity) {
+				nid = associativity_to_nid(associativity);
+				__initialize_form1_numa_distance(associativity);
+			}
 			of_node_put(cpu);
 		}
 
@@ -781,8 +859,11 @@ static int __init parse_numa_properties(void)
 		 * have associativity properties.  If none, then
 		 * everything goes to default_nid.
 		 */
-		nid = of_node_to_nid_single(memory);
-		if (nid < 0)
+		associativity = of_get_associativity(memory);
+		if (associativity) {
+			nid = associativity_to_nid(associativity);
+			__initialize_form1_numa_distance(associativity);
+		} else
 			nid = default_nid;
 
 		fake_numa_create_new_node(((start + size) >> PAGE_SHIFT), &nid);
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 7e970f81d8ff..778b6ab35f0d 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -498,6 +498,8 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
 		return saved_rc;
 	}
 
+	update_numa_distance(dn);
+
 	rc = dlpar_online_cpu(dn);
 	if (rc) {
 		saved_rc = rc;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 36f66556a7c6..40d350f31a34 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -180,6 +180,8 @@ static int update_lmb_associativity_index(struct drmem_lmb *lmb)
 		return -ENODEV;
 	}
 
+	update_numa_distance(lmb_node);
+
 	dr_node = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
 	if (!dr_node) {
 		dlpar_free_cc_nodes(lmb_node);
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 1f051a786fb3..663a0859cf13 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -113,4 +113,5 @@ extern u32 pseries_security_flavor;
 void pseries_setup_security_mitigations(void);
 void pseries_lpar_read_hblkrm_characteristics(void);
 
+void update_numa_distance(struct device_node *node);
 #endif /* _PSERIES_PSERIES_H */
-- 
2.31.1



* [PATCH v5 5/6] powerpc/pseries: Add a helper for form1 cpu distance
  2021-06-28 15:11 [PATCH v5 0/6] Add support for FORM2 associativity Aneesh Kumar K.V
                   ` (3 preceding siblings ...)
  2021-06-28 15:11 ` [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths Aneesh Kumar K.V
@ 2021-06-28 15:11 ` Aneesh Kumar K.V
  2021-07-22  1:42   ` David Gibson
  2021-06-28 15:11 ` [PATCH v5 6/6] powerpc/pseries: Add support for FORM2 associativity Aneesh Kumar K.V
  2021-07-13 14:27 ` [PATCH v5 0/6] " Daniel Henrique Barboza
  6 siblings, 1 reply; 30+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-28 15:11 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, Aneesh Kumar K.V, Daniel Henrique Barboza, David Gibson

This helper is only used with the dispatch trace log collection.
A later patch will add Form2 affinity support and this change helps
in keeping that simpler. Also add a comment explaining that we don't expect
the code to be called with FORM0 affinity.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/topology.h   |  4 ++--
 arch/powerpc/mm/numa.c                | 10 +++++++++-
 arch/powerpc/platforms/pseries/lpar.c |  4 ++--
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index e4db64c0e184..ac8b5ed79832 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -36,7 +36,7 @@ static inline int pcibus_to_node(struct pci_bus *bus)
 				 cpu_all_mask :				\
 				 cpumask_of_node(pcibus_to_node(bus)))
 
-extern int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
+int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
 extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)
 
@@ -83,7 +83,7 @@ static inline void sysfs_remove_device_from_node(struct device *dev,
 
 static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) {}
 
-static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static inline int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
 	return 0;
 }
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 7b142f79d600..c6293037a103 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -166,7 +166,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
-int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
 	int dist = 0;
 
@@ -182,6 +182,14 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 	return dist;
 }
 
+int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+{
+	/* We should not get called with FORM0 */
+	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
+
+	return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
+}
+
 /* must hold reference to node during call */
 static const __be32 *of_get_associativity(struct device_node *dev)
 {
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index dab356e3ff87..afefbdfe768d 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -261,7 +261,7 @@ static int cpu_relative_dispatch_distance(int last_disp_cpu, int cur_disp_cpu)
 	if (!last_disp_cpu_assoc || !cur_disp_cpu_assoc)
 		return -EIO;
 
-	return cpu_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
+	return cpu_relative_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
 }
 
 static int cpu_home_node_dispatch_distance(int disp_cpu)
@@ -281,7 +281,7 @@ static int cpu_home_node_dispatch_distance(int disp_cpu)
 	if (!disp_cpu_assoc || !vcpu_assoc)
 		return -EIO;
 
-	return cpu_distance(disp_cpu_assoc, vcpu_assoc);
+	return cpu_relative_distance(disp_cpu_assoc, vcpu_assoc);
 }
 
 static void update_vcpu_disp_stat(int disp_cpu)
-- 
2.31.1



* [PATCH v5 6/6] powerpc/pseries: Add support for FORM2 associativity
  2021-06-28 15:11 [PATCH v5 0/6] Add support for FORM2 associativity Aneesh Kumar K.V
                   ` (4 preceding siblings ...)
  2021-06-28 15:11 ` [PATCH v5 5/6] powerpc/pseries: Add a helper for form1 cpu distance Aneesh Kumar K.V
@ 2021-06-28 15:11 ` Aneesh Kumar K.V
  2021-07-22  2:28   ` David Gibson
  2021-07-13 14:27 ` [PATCH v5 0/6] " Daniel Henrique Barboza
  6 siblings, 1 reply; 30+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-28 15:11 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: Nathan Lynch, Aneesh Kumar K.V, Daniel Henrique Barboza, David Gibson

The PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds support for another resource grouping format named FORM2.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 Documentation/powerpc/associativity.rst   | 103 ++++++++++++++
 arch/powerpc/include/asm/firmware.h       |   3 +-
 arch/powerpc/include/asm/prom.h           |   1 +
 arch/powerpc/kernel/prom_init.c           |   3 +-
 arch/powerpc/mm/numa.c                    | 157 ++++++++++++++++++----
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 6 files changed, 242 insertions(+), 26 deletions(-)
 create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
new file mode 100644
index 000000000000..31cc7da2c7a6
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,103 @@
+===========================
+NUMA resource associativity
+===========================
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resource subsets of a given domain that exhibit better
+performance relative to each other than relative to other resource subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux kernel.
+From the platform view, these groups are also referred to as domains.
+
+The PAPR interface currently supports different ways of communicating these resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form 2
+associativity grouping. Form 0 is the older format and is now considered deprecated.
+
+The hypervisor indicates the type/form of associativity used via the "ibm,architecture-vec-5" property.
+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+------
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+------
+With Form 1, a combination of the ibm,associativity-reference-points and ibm,associativity
+device tree properties is used to determine the NUMA distance between resource groups/domains.
+
+The “ibm,associativity” property contains a list of one or more numbers (domainID)
+representing the resource’s platform grouping domains.
+
+The “ibm,associativity-reference-points” property contains a list of one or more numbers
+(domainID index) that represents the 1 based ordinal in the associativity lists.
+The list of domainID indexes represents an increasing hierarchy of resource grouping.
+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+The Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
+The Linux kernel computes the NUMA distance between two domains by recursively comparing
+whether they belong to the same higher-level domains. For a mismatch at every higher
+level of the resource group, the kernel doubles the NUMA distance between the
+domains being compared.
+
+Form 2
+-------
+The Form 2 associativity format adds separate device tree properties representing NUMA node distance,
+thereby making the node distance computation flexible. Form 2 also allows flexible primary
+domain numbering. With NUMA distance computation now detached from the index value in the
+"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
+ids at the same domainID index, representing resource groups of different performance/latency
+characteristics.
+
+Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
+"ibm,architecture-vec-5" property.
+
+"ibm,numa-lookup-index-table" property contains a list of one or more numbers representing
+the domainIDs present in the system. The offset of the domainID in this property is
+used as an index while computing numa distance information via "ibm,numa-distance-table".
+
+prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
+N domainIDs encoded as with encode-int.
+
+For ex:
+"ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. The offset of domainID 8 (2) is used when
+computing the distance of domain 8 from other domains present in the system. For the rest of
+this document, this offset will be referred to as domain distance offset.
+
+"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA
+distance between resource groups/domains present in the system.
+
+prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
+N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
+The number N must be equal to the square of m where m is the number of domainIDs in the
+numa-lookup-index-table.
+
+For ex:
+ibm,numa-lookup-index-table =  {3, 0, 8, 40}
+ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
+
+  | 0    8   40
+--|------------
+  |
+0 | 10   20  80
+  |
+8 | 20   10  160
+  |
+40| 80   160  10
+
+A possible "ibm,associativity" property for resources in node 0, 8 and 40
+
+{ 3, 6, 7, 0 }
+{ 3, 6, 9, 8 }
+{ 3, 6, 7, 40}
+
+With "ibm,associativity-reference-points"  { 0x3 }
+
+"ibm,lookup-index-table" helps in having a compact representation of distance matrix.
+Since domainID can be sparse, the matrix of distances can also be effectively sparse.
+With "ibm,lookup-index-table" we can achieve a compact representation of
+distance information.
diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 60b631161360..97a3bd9ffeb9 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -53,6 +53,7 @@
 #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
 #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
 #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
+#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
 
 #ifndef __ASSEMBLY__
 
@@ -73,7 +74,7 @@ enum {
 		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
 		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
 		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
-		FW_FEATURE_RPT_INVALIDATE,
+		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
 	FW_FEATURE_PSERIES_ALWAYS = 0,
 	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
 	FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index df9fec9d232c..5c80152e8f18 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
 #define OV5_XCMO		0x0440	/* Page Coalescing */
 #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
 #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
+#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
 #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
 #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
 #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 5d9ea059594f..c483df6c9393 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1069,7 +1069,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
 #else
 		0,
 #endif
-		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
+		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
+		OV5_FEAT(OV5_FORM2_AFFINITY),
 		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
 		.micro_checkpoint = 0,
 		.reserved0 = 0,
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c6293037a103..c68846fc9550 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
 
 #define FORM0_AFFINITY 0
 #define FORM1_AFFINITY 1
+#define FORM2_AFFINITY 2
 static int affinity_form;
 
 #define MAX_DISTANCE_REF_POINTS 4
 static int max_associativity_domain_index;
 static const __be32 *distance_ref_points;
 static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
+static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
+	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
+};
+static int numa_id_index_table[MAX_NUMNODES] = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
 
 /*
  * Allocate node_to_cpumask_map based on number of available nodes
@@ -166,6 +171,44 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
+/*
+ * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
+ * info is found.
+ */
+static int associativity_to_nid(const __be32 *associativity)
+{
+	int nid = NUMA_NO_NODE;
+
+	if (!numa_enabled)
+		goto out;
+
+	if (of_read_number(associativity, 1) >= primary_domain_index)
+		nid = of_read_number(&associativity[primary_domain_index], 1);
+
+	/* POWER4 LPAR uses 0xffff as invalid node */
+	if (nid == 0xffff || nid >= nr_node_ids)
+		nid = NUMA_NO_NODE;
+out:
+	return nid;
+}
+
+static int __cpu_form2_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+{
+	int dist;
+	int node1, node2;
+
+	node1 = associativity_to_nid(cpu1_assoc);
+	node2 = associativity_to_nid(cpu2_assoc);
+
+	dist = numa_distance_table[node1][node2];
+	if (dist <= LOCAL_DISTANCE)
+		return 0;
+	else if (dist <= REMOTE_DISTANCE)
+		return 1;
+	else
+		return 2;
+}
+
 static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
 	int dist = 0;
@@ -186,8 +229,9 @@ int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
 	/* We should not get called with FORM0 */
 	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
-
-	return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
+	if (affinity_form == FORM1_AFFINITY)
+		return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
+	return __cpu_form2_relative_distance(cpu1_assoc, cpu2_assoc);
 }
 
 /* must hold reference to node during call */
@@ -201,7 +245,9 @@ int __node_distance(int a, int b)
 	int i;
 	int distance = LOCAL_DISTANCE;
 
-	if (affinity_form == FORM0_AFFINITY)
+	if (affinity_form == FORM2_AFFINITY)
+		return numa_distance_table[a][b];
+	else if (affinity_form == FORM0_AFFINITY)
 		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
 	for (i = 0; i < max_associativity_domain_index; i++) {
@@ -216,27 +262,6 @@ int __node_distance(int a, int b)
 }
 EXPORT_SYMBOL(__node_distance);
 
-/*
- * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
- * info is found.
- */
-static int associativity_to_nid(const __be32 *associativity)
-{
-	int nid = NUMA_NO_NODE;
-
-	if (!numa_enabled)
-		goto out;
-
-	if (of_read_number(associativity, 1) >= primary_domain_index)
-		nid = of_read_number(&associativity[primary_domain_index], 1);
-
-	/* POWER4 LPAR uses 0xffff as invalid node */
-	if (nid == 0xffff || nid >= nr_node_ids)
-		nid = NUMA_NO_NODE;
-out:
-	return nid;
-}
-
 /* Returns the nid associated with the given device tree node,
  * or -1 if not found.
  */
@@ -305,12 +330,84 @@ static void initialize_form1_numa_distance(struct device_node *node)
  */
 void update_numa_distance(struct device_node *node)
 {
+	int nid;
+
 	if (affinity_form == FORM0_AFFINITY)
 		return;
 	else if (affinity_form == FORM1_AFFINITY) {
 		initialize_form1_numa_distance(node);
 		return;
 	}
+
+	/* FORM2 affinity  */
+	nid = of_node_to_nid_single(node);
+	if (nid == NUMA_NO_NODE)
+		return;
+
+	/*
+	 * With FORM2 we expect NUMA distance of all possible NUMA
+	 * nodes to be provided during boot.
+	 */
+	WARN(numa_distance_table[nid][nid] == -1,
+	     "NUMA distance details for node %d not provided\n", nid);
+}
+
+/*
+ * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
+ * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
+ */
+static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
+{
+	int i, j;
+	const __u8 *numa_dist_table;
+	const __be32 *numa_lookup_index;
+	int numa_dist_table_length;
+	int max_numa_index, distance_index;
+
+	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
+	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
+
+	/* first element of the array is the size and is encode-int */
+	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
+	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
+	/* Skip the size which is encoded int */
+	numa_dist_table += sizeof(__be32);
+
+	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d\n",
+		 numa_dist_table_length, max_numa_index);
+
+	for (i = 0; i < max_numa_index; i++)
+		/* +1 skip the max_numa_index in the property */
+		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
+
+
+	if (numa_dist_table_length != max_numa_index * max_numa_index) {
+
+		WARN(1, "Wrong NUMA distance information\n");
+		/* consider everybody else just remote. */
+		for (i = 0;  i < max_numa_index; i++) {
+			for (j = 0; j < max_numa_index; j++) {
+				int nodeA = numa_id_index_table[i];
+				int nodeB = numa_id_index_table[j];
+
+				if (nodeA == nodeB)
+					numa_distance_table[nodeA][nodeB] = LOCAL_DISTANCE;
+				else
+					numa_distance_table[nodeA][nodeB] = REMOTE_DISTANCE;
+			}
+		}
+	}
+
+	distance_index = 0;
+	for (i = 0;  i < max_numa_index; i++) {
+		for (j = 0; j < max_numa_index; j++) {
+			int nodeA = numa_id_index_table[i];
+			int nodeB = numa_id_index_table[j];
+
+			numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index++];
+			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
+		}
+	}
 }
 
 static int __init find_primary_domain_index(void)
@@ -323,6 +420,9 @@ static int __init find_primary_domain_index(void)
 	 */
 	if (firmware_has_feature(FW_FEATURE_OPAL)) {
 		affinity_form = FORM1_AFFINITY;
+	} else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) {
+		dbg("Using form 2 affinity\n");
+		affinity_form = FORM2_AFFINITY;
 	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
 		dbg("Using form 1 affinity\n");
 		affinity_form = FORM1_AFFINITY;
@@ -367,8 +467,17 @@ static int __init find_primary_domain_index(void)
 
 		index = of_read_number(&distance_ref_points[1], 1);
 	} else {
+		/*
+		 * Both FORM1 and FORM2 affinity find the primary domain details
+		 * at the same offset.
+		 */
 		index = of_read_number(distance_ref_points, 1);
 	}
+	/*
+	 * If it is FORM2 also initialize the distance table here.
+	 */
+	if (affinity_form == FORM2_AFFINITY)
+		initialize_form2_numa_distance_lookup_table(root);
 
 	/*
 	 * Warn and cap if the hardware supports more than
diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
index 5d4c2bc20bba..f162156b7b68 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -123,6 +123,7 @@ vec5_fw_features_table[] = {
 	{FW_FEATURE_PRRN,		OV5_PRRN},
 	{FW_FEATURE_DRMEM_V2,		OV5_DRMEM_V2},
 	{FW_FEATURE_DRC_INFO,		OV5_DRC_INFO},
+	{FW_FEATURE_FORM2_AFFINITY,	OV5_FORM2_AFFINITY},
 };
 
 static void __init fw_vec5_feature_init(const char *vec5, unsigned long len)
-- 
2.31.1



* Re: [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths
  2021-06-28 15:11 ` [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths Aneesh Kumar K.V
@ 2021-06-28 20:21     ` kernel test robot
  2021-06-28 20:40     ` kernel test robot
  2021-07-22  1:40   ` David Gibson
  2 siblings, 0 replies; 30+ messages in thread
From: kernel test robot @ 2021-06-28 20:21 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linuxppc-dev, mpe
  Cc: Nathan Lynch, Aneesh Kumar K.V, Daniel Henrique Barboza,
	kbuild-all, David Gibson

[-- Attachment #1: Type: text/plain, Size: 2187 bytes --]

Hi "Aneesh,

I love your patch! Perhaps something to improve:

[auto build test WARNING on powerpc/next]
[also build test WARNING on v5.13 next-20210628]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Aneesh-Kumar-K-V/Add-support-for-FORM2-associativity/20210628-231546
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-allyesconfig (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/fcbc8b19e99b1cf44fde904817f19616c6baecdb
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Aneesh-Kumar-K-V/Add-support-for-FORM2-associativity/20210628-231546
        git checkout fcbc8b19e99b1cf44fde904817f19616c6baecdb
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> arch/powerpc/mm/numa.c:298:6: warning: no previous prototype for 'update_numa_distance' [-Wmissing-prototypes]
     298 | void update_numa_distance(struct device_node *node)
         |      ^~~~~~~~~~~~~~~~~~~~


vim +/update_numa_distance +298 arch/powerpc/mm/numa.c

   294	
   295	/*
   296	 * Used to update distance information w.r.t newly added node.
   297	 */
 > 298	void update_numa_distance(struct device_node *node)
   299	{
   300		if (affinity_form == FORM0_AFFINITY)
   301			return;
   302		else if (affinity_form == FORM1_AFFINITY) {
   303			initialize_form1_numa_distance(node);
   304			return;
   305		}
   306	}
   307	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 73412 bytes --]



* Re: [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths
  2021-06-28 15:11 ` [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths Aneesh Kumar K.V
@ 2021-06-28 20:40     ` kernel test robot
  2021-06-28 20:40     ` kernel test robot
  2021-07-22  1:40   ` David Gibson
  2 siblings, 0 replies; 30+ messages in thread
From: kernel test robot @ 2021-06-28 20:40 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linuxppc-dev, mpe
  Cc: Nathan Lynch, Aneesh Kumar K.V, Daniel Henrique Barboza,
	kbuild-all, David Gibson

[-- Attachment #1: Type: text/plain, Size: 6287 bytes --]

Hi "Aneesh,

I love your patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on v5.13 next-20210628]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Aneesh-Kumar-K-V/Add-support-for-FORM2-associativity/20210628-231546
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-randconfig-r024-20210628 (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/fcbc8b19e99b1cf44fde904817f19616c6baecdb
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Aneesh-Kumar-K-V/Add-support-for-FORM2-associativity/20210628-231546
        git checkout fcbc8b19e99b1cf44fde904817f19616c6baecdb
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc 

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arch/powerpc/mm/numa.c:298:6: warning: no previous prototype for 'update_numa_distance' [-Wmissing-prototypes]
     298 | void update_numa_distance(struct device_node *node)
         |      ^~~~~~~~~~~~~~~~~~~~
   arch/powerpc/mm/numa.c: In function 'parse_numa_properties':
>> arch/powerpc/mm/numa.c:809:7: error: implicit declaration of function '__vphn_get_associativity'; did you mean 'of_get_associativity'? [-Werror=implicit-function-declaration]
     809 |   if (__vphn_get_associativity(i, vphn_assoc) == 0) {
         |       ^~~~~~~~~~~~~~~~~~~~~~~~
         |       of_get_associativity
   cc1: some warnings being treated as errors


vim +809 arch/powerpc/mm/numa.c

   771	
   772	static int __init parse_numa_properties(void)
   773	{
   774		struct device_node *memory;
   775		int default_nid = 0;
   776		unsigned long i;
   777		const __be32 *associativity;
   778	
   779		if (numa_enabled == 0) {
   780			printk(KERN_WARNING "NUMA disabled by user\n");
   781			return -1;
   782		}
   783	
   784		primary_domain_index = find_primary_domain_index();
   785	
   786		if (primary_domain_index < 0) {
   787			/*
   788			 * if we fail to parse primary_domain_index from device tree
   789			 * mark the numa disabled, boot with numa disabled.
   790			 */
   791			numa_enabled = false;
   792			return primary_domain_index;
   793		}
   794	
   795		dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
   796	
   797		/*
   798		 * Even though we connect cpus to numa domains later in SMP
   799		 * init, we need to know the node ids now. This is because
   800		 * each node to be onlined must have NODE_DATA etc backing it.
   801		 */
   802		for_each_present_cpu(i) {
   803			__be32 vphn_assoc[VPHN_ASSOC_BUFSIZE];
   804			struct device_node *cpu;
   805			int nid = NUMA_NO_NODE;
   806	
   807			memset(vphn_assoc, 0, VPHN_ASSOC_BUFSIZE * sizeof(__be32));
   808	
 > 809			if (__vphn_get_associativity(i, vphn_assoc) == 0) {
   810				nid = associativity_to_nid(vphn_assoc);
   811				__initialize_form1_numa_distance(vphn_assoc);
   812			} else {
   813	
   814				/*
   815				 * Don't fall back to default_nid yet -- we will plug
   816				 * cpus into nodes once the memory scan has discovered
   817				 * the topology.
   818				 */
   819				cpu = of_get_cpu_node(i, NULL);
   820				BUG_ON(!cpu);
   821	
   822				associativity = of_get_associativity(cpu);
   823				if (associativity) {
   824					nid = associativity_to_nid(associativity);
   825					__initialize_form1_numa_distance(associativity);
   826				}
   827				of_node_put(cpu);
   828			}
   829	
   830			node_set_online(nid);
   831		}
   832	
   833		get_n_mem_cells(&n_mem_addr_cells, &n_mem_size_cells);
   834	
   835		for_each_node_by_type(memory, "memory") {
   836			unsigned long start;
   837			unsigned long size;
   838			int nid;
   839			int ranges;
   840			const __be32 *memcell_buf;
   841			unsigned int len;
   842	
   843			memcell_buf = of_get_property(memory,
   844				"linux,usable-memory", &len);
   845			if (!memcell_buf || len <= 0)
   846				memcell_buf = of_get_property(memory, "reg", &len);
   847			if (!memcell_buf || len <= 0)
   848				continue;
   849	
   850			/* ranges in cell */
   851			ranges = (len >> 2) / (n_mem_addr_cells + n_mem_size_cells);
   852	new_range:
   853			/* these are order-sensitive, and modify the buffer pointer */
   854			start = read_n_cells(n_mem_addr_cells, &memcell_buf);
   855			size = read_n_cells(n_mem_size_cells, &memcell_buf);
   856	
   857			/*
   858			 * Assumption: either all memory nodes or none will
   859			 * have associativity properties.  If none, then
   860			 * everything goes to default_nid.
   861			 */
   862			associativity = of_get_associativity(memory);
   863			if (associativity) {
   864				nid = associativity_to_nid(associativity);
   865				__initialize_form1_numa_distance(associativity);
   866			} else
   867				nid = default_nid;
   868	
   869			fake_numa_create_new_node(((start + size) >> PAGE_SHIFT), &nid);
   870			node_set_online(nid);
   871	
   872			size = numa_enforce_memory_limit(start, size);
   873			if (size)
   874				memblock_set_node(start, size, &memblock.memory, nid);
   875	
   876			if (--ranges)
   877				goto new_range;
   878		}
   879	
   880		/*
   881		 * Now do the same thing for each MEMBLOCK listed in the
   882		 * ibm,dynamic-memory property in the
   883		 * ibm,dynamic-reconfiguration-memory node.
   884		 */
   885		memory = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
   886		if (memory) {
   887			walk_drmem_lmbs(memory, NULL, numa_setup_drmem_lmb);
   888			of_node_put(memory);
   889		}
   890	
   891		return 0;
   892	}
   893	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 43062 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths
@ 2021-06-28 20:40     ` kernel test robot
  0 siblings, 0 replies; 30+ messages in thread
From: kernel test robot @ 2021-06-28 20:40 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 6458 bytes --]

Hi "Aneesh,

I love your patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on v5.13 next-20210628]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Aneesh-Kumar-K-V/Add-support-for-FORM2-associativity/20210628-231546
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-randconfig-r024-20210628 (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/fcbc8b19e99b1cf44fde904817f19616c6baecdb
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Aneesh-Kumar-K-V/Add-support-for-FORM2-associativity/20210628-231546
        git checkout fcbc8b19e99b1cf44fde904817f19616c6baecdb
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc 

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arch/powerpc/mm/numa.c:298:6: warning: no previous prototype for 'update_numa_distance' [-Wmissing-prototypes]
     298 | void update_numa_distance(struct device_node *node)
         |      ^~~~~~~~~~~~~~~~~~~~
   arch/powerpc/mm/numa.c: In function 'parse_numa_properties':
>> arch/powerpc/mm/numa.c:809:7: error: implicit declaration of function '__vphn_get_associativity'; did you mean 'of_get_associativity'? [-Werror=implicit-function-declaration]
     809 |   if (__vphn_get_associativity(i, vphn_assoc) == 0) {
         |       ^~~~~~~~~~~~~~~~~~~~~~~~
         |       of_get_associativity
   cc1: some warnings being treated as errors


vim +809 arch/powerpc/mm/numa.c

   771	
   772	static int __init parse_numa_properties(void)
   773	{
   774		struct device_node *memory;
   775		int default_nid = 0;
   776		unsigned long i;
   777		const __be32 *associativity;
   778	
   779		if (numa_enabled == 0) {
   780			printk(KERN_WARNING "NUMA disabled by user\n");
   781			return -1;
   782		}
   783	
   784		primary_domain_index = find_primary_domain_index();
   785	
   786		if (primary_domain_index < 0) {
   787			/*
   788			 * if we fail to parse primary_domain_index from device tree
   789			 * mark the numa disabled, boot with numa disabled.
   790			 */
   791			numa_enabled = false;
   792			return primary_domain_index;
   793		}
   794	
   795		dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
   796	
   797		/*
   798		 * Even though we connect cpus to numa domains later in SMP
   799		 * init, we need to know the node ids now. This is because
   800		 * each node to be onlined must have NODE_DATA etc backing it.
   801		 */
   802		for_each_present_cpu(i) {
   803			__be32 vphn_assoc[VPHN_ASSOC_BUFSIZE];
   804			struct device_node *cpu;
   805			int nid = NUMA_NO_NODE;
   806	
   807			memset(vphn_assoc, 0, VPHN_ASSOC_BUFSIZE * sizeof(__be32));
   808	
 > 809			if (__vphn_get_associativity(i, vphn_assoc) == 0) {
   810				nid = associativity_to_nid(vphn_assoc);
   811				__initialize_form1_numa_distance(vphn_assoc);
   812			} else {
   813	
   814				/*
   815				 * Don't fall back to default_nid yet -- we will plug
   816				 * cpus into nodes once the memory scan has discovered
   817				 * the topology.
   818				 */
   819				cpu = of_get_cpu_node(i, NULL);
   820				BUG_ON(!cpu);
   821	
   822				associativity = of_get_associativity(cpu);
   823				if (associativity) {
   824					nid = associativity_to_nid(associativity);
   825					__initialize_form1_numa_distance(associativity);
   826				}
   827				of_node_put(cpu);
   828			}
   829	
   830			node_set_online(nid);
   831		}
   832	
   833		get_n_mem_cells(&n_mem_addr_cells, &n_mem_size_cells);
   834	
   835		for_each_node_by_type(memory, "memory") {
   836			unsigned long start;
   837			unsigned long size;
   838			int nid;
   839			int ranges;
   840			const __be32 *memcell_buf;
   841			unsigned int len;
   842	
   843			memcell_buf = of_get_property(memory,
   844				"linux,usable-memory", &len);
   845			if (!memcell_buf || len <= 0)
   846				memcell_buf = of_get_property(memory, "reg", &len);
   847			if (!memcell_buf || len <= 0)
   848				continue;
   849	
   850			/* ranges in cell */
   851			ranges = (len >> 2) / (n_mem_addr_cells + n_mem_size_cells);
   852	new_range:
   853			/* these are order-sensitive, and modify the buffer pointer */
   854			start = read_n_cells(n_mem_addr_cells, &memcell_buf);
   855			size = read_n_cells(n_mem_size_cells, &memcell_buf);
   856	
   857			/*
   858			 * Assumption: either all memory nodes or none will
   859			 * have associativity properties.  If none, then
   860			 * everything goes to default_nid.
   861			 */
   862			associativity = of_get_associativity(memory);
   863			if (associativity) {
   864				nid = associativity_to_nid(associativity);
   865				__initialize_form1_numa_distance(associativity);
   866			} else
   867				nid = default_nid;
   868	
   869			fake_numa_create_new_node(((start + size) >> PAGE_SHIFT), &nid);
   870			node_set_online(nid);
   871	
   872			size = numa_enforce_memory_limit(start, size);
   873			if (size)
   874				memblock_set_node(start, size, &memblock.memory, nid);
   875	
   876			if (--ranges)
   877				goto new_range;
   878		}
   879	
   880		/*
   881		 * Now do the same thing for each MEMBLOCK listed in the
   882		 * ibm,dynamic-memory property in the
   883		 * ibm,dynamic-reconfiguration-memory node.
   884		 */
   885		memory = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
   886		if (memory) {
   887			walk_drmem_lmbs(memory, NULL, numa_setup_drmem_lmb);
   888			of_node_put(memory);
   889		}
   890	
   891		return 0;
   892	}
   893	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 43062 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 0/6] Add support for FORM2 associativity
  2021-06-28 15:11 [PATCH v5 0/6] Add support for FORM2 associativity Aneesh Kumar K.V
                   ` (5 preceding siblings ...)
  2021-06-28 15:11 ` [PATCH v5 6/6] powerpc/pseries: Add support for FORM2 associativity Aneesh Kumar K.V
@ 2021-07-13 14:27 ` Daniel Henrique Barboza
  2021-07-13 14:30   ` Aneesh Kumar K.V
  6 siblings, 1 reply; 30+ messages in thread
From: Daniel Henrique Barboza @ 2021-07-13 14:27 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linuxppc-dev, mpe; +Cc: Nathan Lynch, David Gibson

Aneesh,

This series compiles with a configuration made with "pseries_le_defconfig"
but fails to build with a config based on an existing RHEL8 config.

The reason, hinted at in the robot replies to patch 4, is that you defined
"__vphn_get_associativity" inside an #ifdef CONFIG_PPC_SPLPAR guard but didn't
provide a definition for configs without that option, and you ended up using
the function outside the guard.

This fixes the compilation but I'm not sure if this is what you intended
for this function:


diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c68846fc9550..6e8551d16b7a 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -680,6 +680,11 @@ static int vphn_get_nid(long lcpu)
  
  }
  #else
+static int __vphn_get_associativity(long lcpu, __be32 *associativity)
+{
+       return -1;
+}
+
  static int vphn_get_nid(long unused)
  {
         return NUMA_NO_NODE;
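
With that stub in place, the !CONFIG_PPC_SPLPAR side would read roughly as the
sketch below (abbreviated; it is only meant to show the pairing of the two
fallbacks, not the full file):

#else /* !CONFIG_PPC_SPLPAR */
static int __vphn_get_associativity(long lcpu, __be32 *associativity)
{
	/* No VPHN hcall available; callers fall back to the device tree path */
	return -1;
}

static int vphn_get_nid(long unused)
{
	return NUMA_NO_NODE;
}
#endif /* CONFIG_PPC_SPLPAR */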


I'll post a new version of the QEMU FORM2 changes using these patches as is (with
the above fixup), but I guess you'll want to post a v6.



Thanks,



Daniel




On 6/28/21 12:11 PM, Aneesh Kumar K.V wrote:
> Form2 associativity adds a much more flexible NUMA topology layout
> than what is provided by Form1. More details can be found in patch 7.
> 
> $ numactl -H
> ...
> node distances:
> node   0   1   2   3
>    0:  10  11  222  33
>    1:  44  10  55  66
>    2:  77  88  10  99
>    3:  101  121  132  10
> $
> 
> After DAX kmem memory add
> # numactl -H
> available: 5 nodes (0-4)
> ...
> node distances:
> node   0   1   2   3   4
>    0:  10  11  222  33  240
>    1:  44  10  55  66  255
>    2:  77  88  10  99  255
>    3:  101  121  132  10  230
>    4:  255  255  255  230  10
> 
> 
> PAPR SCM now use the numa distance details to find the numa_node and target_node
> for the device.
> 
> kvaneesh@ubuntu-guest:~$ ndctl  list -N -v
> [
>    {
>      "dev":"namespace0.0",
>      "mode":"devdax",
>      "map":"dev",
>      "size":1071644672,
>      "uuid":"d333d867-3f57-44c8-b386-d4d3abdc2bf2",
>      "raw_uuid":"915361ad-fe6a-42dd-848f-d6dc9f5af362",
>      "daxregion":{
>        "id":0,
>        "size":1071644672,
>        "devices":[
>          {
>            "chardev":"dax0.0",
>            "size":1071644672,
>            "target_node":4,
>            "mode":"devdax"
>          }
>        ]
>      },
>      "align":2097152,
>      "numa_node":3
>    }
> ]
> kvaneesh@ubuntu-guest:~$
> 
> 
> The above output is with a Qemu command line
> 
> -numa node,nodeid=4 \
> -numa dist,src=0,dst=1,val=11 -numa dist,src=0,dst=2,val=222 -numa dist,src=0,dst=3,val=33 -numa dist,src=0,dst=4,val=240 \
> -numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=55 -numa dist,src=1,dst=3,val=66 -numa dist,src=1,dst=4,val=255 \
> -numa dist,src=2,dst=0,val=77 -numa dist,src=2,dst=1,val=88 -numa dist,src=2,dst=3,val=99 -numa dist,src=2,dst=4,val=255 \
> -numa dist,src=3,dst=0,val=101 -numa dist,src=3,dst=1,val=121 -numa dist,src=3,dst=2,val=132 -numa dist,src=3,dst=4,val=230 \
> -numa dist,src=4,dst=0,val=255 -numa dist,src=4,dst=1,val=255 -numa dist,src=4,dst=2,val=255 -numa dist,src=4,dst=3,val=230 \
> -object memory-backend-file,id=memnvdimm1,prealloc=yes,mem-path=$PMEM_DISK,share=yes,size=${PMEM_SIZE}  \
> -device nvdimm,label-size=128K,memdev=memnvdimm1,id=nvdimm1,slot=4,uuid=72511b67-0b3b-42fd-8d1d-5be3cae8bcaa,node=4
> 
> Qemu changes can be found at https://lore.kernel.org/qemu-devel/20210616011944.2996399-1-danielhb413@gmail.com/
> 
> Changes from v4:
> * Drop DLPAR related device tree property for now because both Qemu nor PowerVM
>    will provide the distance details of all possible NUMA nodes during boot.
> * Rework numa distance code based on review feedback.
> 
> Changes from v3:
> * Drop PAPR SCM specific changes and depend completely on NUMA distance information.
> 
> Changes from v2:
> * Add nvdimm list to Cc:
> * update PATCH 8 commit message.
> 
> Changes from v1:
> * Update FORM2 documentation.
> * rename max_domain_index to max_associativity_domain_index
> 
> 
> Aneesh Kumar K.V (6):
>    powerpc/pseries: rename min_common_depth to primary_domain_index
>    powerpc/pseries: rename distance_ref_points_depth to
>      max_associativity_domain_index
>    powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY
>    powerpc/pseries: Consolidate different NUMA distance update code paths
>    powerpc/pseries: Add a helper for form1 cpu distance
>    powerpc/pseries: Add support for FORM2 associativity
> 
>   Documentation/powerpc/associativity.rst       | 103 +++++
>   arch/powerpc/include/asm/firmware.h           |   7 +-
>   arch/powerpc/include/asm/prom.h               |   3 +-
>   arch/powerpc/include/asm/topology.h           |   4 +-
>   arch/powerpc/kernel/prom_init.c               |   3 +-
>   arch/powerpc/mm/numa.c                        | 415 +++++++++++++-----
>   arch/powerpc/platforms/pseries/firmware.c     |   3 +-
>   arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
>   .../platforms/pseries/hotplug-memory.c        |   2 +
>   arch/powerpc/platforms/pseries/lpar.c         |   4 +-
>   arch/powerpc/platforms/pseries/pseries.h      |   1 +
>   11 files changed, 432 insertions(+), 115 deletions(-)
>   create mode 100644 Documentation/powerpc/associativity.rst
> 

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 0/6] Add support for FORM2 associativity
  2021-07-13 14:27 ` [PATCH v5 0/6] " Daniel Henrique Barboza
@ 2021-07-13 14:30   ` Aneesh Kumar K.V
  0 siblings, 0 replies; 30+ messages in thread
From: Aneesh Kumar K.V @ 2021-07-13 14:30 UTC (permalink / raw)
  To: Daniel Henrique Barboza, linuxppc-dev, mpe; +Cc: Nathan Lynch, David Gibson

On 7/13/21 7:57 PM, Daniel Henrique Barboza wrote:
> Aneesh,
> 
> This series compiles with a configuration made with "pseries_le_defconfig"
> but fails with a config based on an existing RHEL8 config.
> 
> The reason, which is hinted in the robot replies in patch 4, is that you 
> defined
> a "__vphn_get_associativity" inside a #ifdef CONFIG_PPC_SPLPAR guard but 
> didn't
> define how the function would behave without the config, and you ended up
> using the function elsewhere.
> 
> This fixes the compilation but I'm not sure if this is what you intended
> for this function:
> 
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index c68846fc9550..6e8551d16b7a 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -680,6 +680,11 @@ static int vphn_get_nid(long lcpu)
> 
>   }
>   #else
> +static int __vphn_get_associativity(long lcpu, __be32 *associativity)
> +{
> +       return -1;
> +}
> +
>   static int vphn_get_nid(long unused)
>   {
>          return NUMA_NO_NODE;
> 
> 
> I'll post a new version of the QEMU FORM2 changes using these patches as 
> is (with
> the above fixup), but I guess you'll want to post a v6.
> 

The kernel test robot did report that earlier, and I have it fixed in my
local tree. I haven't posted v6 yet because I want to close out the review
of the approach with the v5 patchset.

-aneesh

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 2/6] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index
  2021-06-28 15:11 ` [PATCH v5 2/6] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index Aneesh Kumar K.V
@ 2021-07-22  0:59   ` David Gibson
  2021-07-22  1:19     ` David Gibson
  0 siblings, 1 reply; 30+ messages in thread
From: David Gibson @ 2021-07-22  0:59 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 4037 bytes --]

On Mon, Jun 28, 2021 at 08:41:13PM +0530, Aneesh Kumar K.V wrote:
> No functional change in this patch
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/mm/numa.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 8365b298ec48..132813dd1a6c 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -56,7 +56,7 @@ static int n_mem_addr_cells, n_mem_size_cells;
>  static int form1_affinity;
>  
>  #define MAX_DISTANCE_REF_POINTS 4
> -static int distance_ref_points_depth;
> +static int max_associativity_domain_index;
>  static const __be32 *distance_ref_points;
>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
>  
> @@ -169,7 +169,7 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>  
>  	int i, index;
>  
> -	for (i = 0; i < distance_ref_points_depth; i++) {
> +	for (i = 0; i < max_associativity_domain_index; i++) {
>  		index = be32_to_cpu(distance_ref_points[i]);
>  		if (cpu1_assoc[index] == cpu2_assoc[index])
>  			break;
> @@ -193,7 +193,7 @@ int __node_distance(int a, int b)
>  	if (!form1_affinity)
>  		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
>  
> -	for (i = 0; i < distance_ref_points_depth; i++) {
> +	for (i = 0; i < max_associativity_domain_index; i++) {
>  		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
>  			break;
>  
> @@ -213,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
>  	if (!form1_affinity)
>  		return;
>  
> -	for (i = 0; i < distance_ref_points_depth; i++) {
> +	for (i = 0; i < max_associativity_domain_index; i++) {
>  		const __be32 *entry;
>  
>  		entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
> @@ -240,7 +240,7 @@ static int associativity_to_nid(const __be32 *associativity)
>  		nid = NUMA_NO_NODE;
>  
>  	if (nid > 0 &&
> -		of_read_number(associativity, 1) >= distance_ref_points_depth) {
> +		of_read_number(associativity, 1) >= max_associativity_domain_index) {
>  		/*
>  		 * Skip the length field and send start of associativity array
>  		 */
> @@ -310,14 +310,14 @@ static int __init find_primary_domain_index(void)
>  	 */
>  	distance_ref_points = of_get_property(root,
>  					"ibm,associativity-reference-points",
> -					&distance_ref_points_depth);
> +					&max_associativity_domain_index);
>  
>  	if (!distance_ref_points) {
>  		dbg("NUMA: ibm,associativity-reference-points not found.\n");
>  		goto err;
>  	}
>  
> -	distance_ref_points_depth /= sizeof(int);
> +	max_associativity_domain_index /= sizeof(int);
>  
>  	if (firmware_has_feature(FW_FEATURE_OPAL) ||
>  	    firmware_has_feature(FW_FEATURE_TYPE1_AFFINITY)) {
> @@ -328,7 +328,7 @@ static int __init find_primary_domain_index(void)
>  	if (form1_affinity) {
>  		index = of_read_number(distance_ref_points, 1);
>  	} else {
> -		if (distance_ref_points_depth < 2) {
> +		if (max_associativity_domain_index < 2) {
>  			printk(KERN_WARNING "NUMA: "
>  				"short ibm,associativity-reference-points\n");
>  			goto err;
> @@ -341,10 +341,10 @@ static int __init find_primary_domain_index(void)
>  	 * Warn and cap if the hardware supports more than
>  	 * MAX_DISTANCE_REF_POINTS domains.
>  	 */
> -	if (distance_ref_points_depth > MAX_DISTANCE_REF_POINTS) {
> +	if (max_associativity_domain_index > MAX_DISTANCE_REF_POINTS) {
>  		printk(KERN_WARNING "NUMA: distance array capped at "
>  			"%d entries\n", MAX_DISTANCE_REF_POINTS);
> -		distance_ref_points_depth = MAX_DISTANCE_REF_POINTS;
> +		max_associativity_domain_index = MAX_DISTANCE_REF_POINTS;
>  	}
>  
>  	of_node_put(root);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 2/6] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index
  2021-07-22  0:59   ` David Gibson
@ 2021-07-22  1:19     ` David Gibson
  0 siblings, 0 replies; 30+ messages in thread
From: David Gibson @ 2021-07-22  1:19 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 4789 bytes --]

On Thu, Jul 22, 2021 at 10:59:09AM +1000, David Gibson wrote:
> On Mon, Jun 28, 2021 at 08:41:13PM +0530, Aneesh Kumar K.V wrote:
> > No functional change in this patch
> > 
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

No... wait, I take that back.  This change makes the code *more*
confusing.

"distance_ref_points_depth" is accurate - it's the length of the
distance_ref_points array.

"max_associativity_domain_index" is not.  That implies it's the
maximum value that a domain index can have - which it isn't.  You
could have 15 entries in every associativity array, but if only 2 of
them are referenced in distance_ref_points, then
"max_associativity_domain_index" would only be 2.

> 
> > ---
> >  arch/powerpc/mm/numa.c | 20 ++++++++++----------
> >  1 file changed, 10 insertions(+), 10 deletions(-)
> > 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 8365b298ec48..132813dd1a6c 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -56,7 +56,7 @@ static int n_mem_addr_cells, n_mem_size_cells;
> >  static int form1_affinity;
> >  
> >  #define MAX_DISTANCE_REF_POINTS 4
> > -static int distance_ref_points_depth;
> > +static int max_associativity_domain_index;
> >  static const __be32 *distance_ref_points;
> >  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
> >  
> > @@ -169,7 +169,7 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> >  
> >  	int i, index;
> >  
> > -	for (i = 0; i < distance_ref_points_depth; i++) {
> > +	for (i = 0; i < max_associativity_domain_index; i++) {
> >  		index = be32_to_cpu(distance_ref_points[i]);
> >  		if (cpu1_assoc[index] == cpu2_assoc[index])
> >  			break;
> > @@ -193,7 +193,7 @@ int __node_distance(int a, int b)
> >  	if (!form1_affinity)
> >  		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
> >  
> > -	for (i = 0; i < distance_ref_points_depth; i++) {
> > +	for (i = 0; i < max_associativity_domain_index; i++) {
> >  		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
> >  			break;
> >  
> > @@ -213,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
> >  	if (!form1_affinity)
> >  		return;
> >  
> > -	for (i = 0; i < distance_ref_points_depth; i++) {
> > +	for (i = 0; i < max_associativity_domain_index; i++) {
> >  		const __be32 *entry;
> >  
> >  		entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
> > @@ -240,7 +240,7 @@ static int associativity_to_nid(const __be32 *associativity)
> >  		nid = NUMA_NO_NODE;
> >  
> >  	if (nid > 0 &&
> > -		of_read_number(associativity, 1) >= distance_ref_points_depth) {
> > +		of_read_number(associativity, 1) >= max_associativity_domain_index) {
> >  		/*
> >  		 * Skip the length field and send start of associativity array
> >  		 */
> > @@ -310,14 +310,14 @@ static int __init find_primary_domain_index(void)
> >  	 */
> >  	distance_ref_points = of_get_property(root,
> >  					"ibm,associativity-reference-points",
> > -					&distance_ref_points_depth);
> > +					&max_associativity_domain_index);
> >  
> >  	if (!distance_ref_points) {
> >  		dbg("NUMA: ibm,associativity-reference-points not found.\n");
> >  		goto err;
> >  	}
> >  
> > -	distance_ref_points_depth /= sizeof(int);
> > +	max_associativity_domain_index /= sizeof(int);
> >  
> >  	if (firmware_has_feature(FW_FEATURE_OPAL) ||
> >  	    firmware_has_feature(FW_FEATURE_TYPE1_AFFINITY)) {
> > @@ -328,7 +328,7 @@ static int __init find_primary_domain_index(void)
> >  	if (form1_affinity) {
> >  		index = of_read_number(distance_ref_points, 1);
> >  	} else {
> > -		if (distance_ref_points_depth < 2) {
> > +		if (max_associativity_domain_index < 2) {
> >  			printk(KERN_WARNING "NUMA: "
> >  				"short ibm,associativity-reference-points\n");
> >  			goto err;
> > @@ -341,10 +341,10 @@ static int __init find_primary_domain_index(void)
> >  	 * Warn and cap if the hardware supports more than
> >  	 * MAX_DISTANCE_REF_POINTS domains.
> >  	 */
> > -	if (distance_ref_points_depth > MAX_DISTANCE_REF_POINTS) {
> > +	if (max_associativity_domain_index > MAX_DISTANCE_REF_POINTS) {
> >  		printk(KERN_WARNING "NUMA: distance array capped at "
> >  			"%d entries\n", MAX_DISTANCE_REF_POINTS);
> > -		distance_ref_points_depth = MAX_DISTANCE_REF_POINTS;
> > +		max_associativity_domain_index = MAX_DISTANCE_REF_POINTS;
> >  	}
> >  
> >  	of_node_put(root);
> 



-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths
  2021-06-28 15:11 ` [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths Aneesh Kumar K.V
  2021-06-28 20:21     ` kernel test robot
  2021-06-28 20:40     ` kernel test robot
@ 2021-07-22  1:40   ` David Gibson
  2021-07-22  7:07     ` Aneesh Kumar K.V
  2 siblings, 1 reply; 30+ messages in thread
From: David Gibson @ 2021-07-22  1:40 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 11408 bytes --]

On Mon, Jun 28, 2021 at 08:41:15PM +0530, Aneesh Kumar K.V wrote:
> The associativity details of the newly added resources are collected from
> the hypervisor via "ibm,configure-connector" rtas call. Update the numa
> distance details of the newly added numa node after the above call.
> 
> Instead of updating NUMA distance every time we lookup a node id
> from the associativity property, add helpers that can be used
> during boot which does this only once. Also remove the distance
> update from node id lookup helpers.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/mm/numa.c                        | 173 +++++++++++++-----
>  arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
>  .../platforms/pseries/hotplug-memory.c        |   2 +
>  arch/powerpc/platforms/pseries/pseries.h      |   1 +
>  4 files changed, 132 insertions(+), 46 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0ec16999beef..7b142f79d600 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -208,22 +208,6 @@ int __node_distance(int a, int b)
>  }
>  EXPORT_SYMBOL(__node_distance);
>  
> -static void initialize_distance_lookup_table(int nid,
> -		const __be32 *associativity)
> -{
> -	int i;
> -
> -	if (affinity_form != FORM1_AFFINITY)
> -		return;
> -
> -	for (i = 0; i < max_associativity_domain_index; i++) {
> -		const __be32 *entry;
> -
> -		entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
> -		distance_lookup_table[nid][i] = of_read_number(entry, 1);
> -	}
> -}
> -
>  /*
>   * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
>   * info is found.
> @@ -241,15 +225,6 @@ static int associativity_to_nid(const __be32 *associativity)
>  	/* POWER4 LPAR uses 0xffff as invalid node */
>  	if (nid == 0xffff || nid >= nr_node_ids)
>  		nid = NUMA_NO_NODE;
> -
> -	if (nid > 0 &&
> -		of_read_number(associativity, 1) >= max_associativity_domain_index) {
> -		/*
> -		 * Skip the length field and send start of associativity array
> -		 */
> -		initialize_distance_lookup_table(nid, associativity + 1);
> -	}
> -
>  out:
>  	return nid;
>  }
> @@ -287,6 +262,49 @@ int of_node_to_nid(struct device_node *device)
>  }
>  EXPORT_SYMBOL(of_node_to_nid);
>  
> +static void __initialize_form1_numa_distance(const __be32 *associativity)
> +{
> +	int i, nid;
> +
> +	if (affinity_form != FORM1_AFFINITY)

Since this shouldn't be called on a !form1 system, this could be a WARN_ON().

> +		return;
> +
> +	if (of_read_number(associativity, 1) >= primary_domain_index) {
> +		nid = of_read_number(&associativity[primary_domain_index], 1);

This computes the nid from the assoc array independently of
associativity_to_nid(), which doesn't seem like a good idea.  Wouldn't
it be better to call associativity_to_nid(), then make the next bit
conditional on nid != NUMA_NO_NODE?
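
Roughly something like this (a sketch of that suggestion only, reusing the
names from the hunk above; untested):

static void __initialize_form1_numa_distance(const __be32 *associativity)
{
	int i, nid;

	if (affinity_form != FORM1_AFFINITY)
		return;

	/* Let the common helper do the bounds and invalid-node checks */
	nid = associativity_to_nid(associativity);
	if (nid == NUMA_NO_NODE)
		return;

	for (i = 0; i < max_associativity_domain_index; i++) {
		const __be32 *entry;

		entry = &associativity[be32_to_cpu(distance_ref_points[i])];
		distance_lookup_table[nid][i] = of_read_number(entry, 1);
	}
}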

> +
> +		for (i = 0; i < max_associativity_domain_index; i++) {
> +			const __be32 *entry;
> +
> +			entry = &associativity[be32_to_cpu(distance_ref_points[i])];
> +			distance_lookup_table[nid][i] = of_read_number(entry, 1);
> +		}
> +	}
> +}
> +
> +static void initialize_form1_numa_distance(struct device_node *node)
> +{
> +	const __be32 *associativity;
> +
> +	associativity = of_get_associativity(node);
> +	if (!associativity)
> +		return;
> +
> +	__initialize_form1_numa_distance(associativity);
> +}
> +
> +/*
> + * Used to update distance information w.r.t newly added node.
> + */
> +void update_numa_distance(struct device_node *node)
> +{
> +	if (affinity_form == FORM0_AFFINITY)
> +		return;
> +	else if (affinity_form == FORM1_AFFINITY) {
> +		initialize_form1_numa_distance(node);
> +		return;
> +	}
> +}
> +
>  static int __init find_primary_domain_index(void)
>  {
>  	int index;
> @@ -433,6 +451,48 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
>  	return 0;
>  }
>  
> +static int get_nid_and_numa_distance(struct drmem_lmb *lmb)
> +{
> +	struct assoc_arrays aa = { .arrays = NULL };
> +	int default_nid = NUMA_NO_NODE;
> +	int nid = default_nid;
> +	int rc, index;
> +
> +	if ((primary_domain_index < 0) || !numa_enabled)

Under what circumstances could you get primary_domain_index < 0?

> +		return default_nid;
> +
> +	rc = of_get_assoc_arrays(&aa);
> +	if (rc)
> +		return default_nid;
> +
> +	if (primary_domain_index <= aa.array_sz &&
> +	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
> +		index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;

Does anywhere verify that primary_domain_index <= aa.array_sz?

> +		nid = of_read_number(&aa.arrays[index], 1);
> +
> +		if (nid == 0xffff || nid >= nr_node_ids)
> +			nid = default_nid;
> +		if (nid > 0 && affinity_form == FORM1_AFFINITY) {
> +			int i;
> +			const __be32 *associativity;
> +
> +			index = lmb->aa_index * aa.array_sz;
> +			associativity = &aa.arrays[index];
> +			/*
> +			 * lookup array associativity entries have different format
> +			 * There is no length of the array as the first element.

The difference is very small, and this is not a hot path.  Couldn't
you reduce a chunk of code by prepending aa.array_sz, then re-using
__initialize_form1_numa_distance()?  Or even make
__initialize_form1_numa_distance() take the length as a parameter?

> +			 */
> +			for (i = 0; i < max_associativity_domain_index; i++) {
> +				const __be32 *entry;
> +
> +				entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];

Does anywhere verify that distance_ref_points[i] <= aa.array_sz for
every i?

> +				distance_lookup_table[nid][i] = of_read_number(entry, 1);
> +			}
> +		}
> +	}
> +	return nid;
> +}
> +
>  /*
>   * This is like of_node_to_nid_single() for memory represented in the
>   * ibm,dynamic-reconfiguration-memory node.
> @@ -458,21 +518,14 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
>  
>  		if (nid == 0xffff || nid >= nr_node_ids)
>  			nid = default_nid;
> -
> -		if (nid > 0) {
> -			index = lmb->aa_index * aa.array_sz;
> -			initialize_distance_lookup_table(nid,
> -							&aa.arrays[index]);
> -		}
>  	}
> -
>  	return nid;
>  }
>  
>  #ifdef CONFIG_PPC_SPLPAR
> -static int vphn_get_nid(long lcpu)
> +
> +static int __vphn_get_associativity(long lcpu, __be32 *associativity)
>  {
> -	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
>  	long rc, hwid;
>  
>  	/*
> @@ -492,10 +545,22 @@ static int vphn_get_nid(long lcpu)
>  
>  		rc = hcall_vphn(hwid, VPHN_FLAG_VCPU, associativity);
>  		if (rc == H_SUCCESS)
> -			return associativity_to_nid(associativity);
> +			return 0;
>  	}
>  
> +	return -1;
> +}
> +
> +static int vphn_get_nid(long lcpu)
> +{
> +	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
> +
> +
> +	if (!__vphn_get_associativity(lcpu, associativity))
> +		return associativity_to_nid(associativity);
> +
>  	return NUMA_NO_NODE;
> +
>  }
>  #else
>  static int vphn_get_nid(long unused)
> @@ -692,7 +757,7 @@ static int __init numa_setup_drmem_lmb(struct drmem_lmb *lmb,
>  			size = read_n_cells(n_mem_size_cells, usm);
>  		}
>  
> -		nid = of_drconf_to_nid_single(lmb);
> +		nid = get_nid_and_numa_distance(lmb);
>  		fake_numa_create_new_node(((base + size) >> PAGE_SHIFT),
>  					  &nid);
>  		node_set_online(nid);
> @@ -709,6 +774,7 @@ static int __init parse_numa_properties(void)
>  	struct device_node *memory;
>  	int default_nid = 0;
>  	unsigned long i;
> +	const __be32 *associativity;
>  
>  	if (numa_enabled == 0) {
>  		printk(KERN_WARNING "NUMA disabled by user\n");
> @@ -734,18 +800,30 @@ static int __init parse_numa_properties(void)
>  	 * each node to be onlined must have NODE_DATA etc backing it.
>  	 */
>  	for_each_present_cpu(i) {
> +		__be32 vphn_assoc[VPHN_ASSOC_BUFSIZE];
>  		struct device_node *cpu;
> -		int nid = vphn_get_nid(i);
> +		int nid = NUMA_NO_NODE;
>  
> -		/*
> -		 * Don't fall back to default_nid yet -- we will plug
> -		 * cpus into nodes once the memory scan has discovered
> -		 * the topology.
> -		 */
> -		if (nid == NUMA_NO_NODE) {
> +		memset(vphn_assoc, 0, VPHN_ASSOC_BUFSIZE * sizeof(__be32));

What's the memset() for?  AFAICT you only look at vphn_assoc in the
branch where __vphn_get_associativity() succeeds.
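
If the zeroing matters at all, the buffer could simply be zero-initialised at
its declaration, the way the vphn_get_nid() hunk above already does it:

	__be32 vphn_assoc[VPHN_ASSOC_BUFSIZE] = {0};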

> +
> +		if (__vphn_get_associativity(i, vphn_assoc) == 0) {
> +			nid = associativity_to_nid(vphn_assoc);
> +			__initialize_form1_numa_distance(vphn_assoc);
> +		} else {
> +
> +			/*
> +			 * Don't fall back to default_nid yet -- we will plug
> +			 * cpus into nodes once the memory scan has discovered
> +			 * the topology.
> +			 */
>  			cpu = of_get_cpu_node(i, NULL);
>  			BUG_ON(!cpu);
> -			nid = of_node_to_nid_single(cpu);
> +
> +			associativity = of_get_associativity(cpu);
> +			if (associativity) {
> +				nid = associativity_to_nid(associativity);
> +				__initialize_form1_numa_distance(associativity);
> +			}
>  			of_node_put(cpu);
>  		}
>  
> @@ -781,8 +859,11 @@ static int __init parse_numa_properties(void)
>  		 * have associativity properties.  If none, then
>  		 * everything goes to default_nid.
>  		 */
> -		nid = of_node_to_nid_single(memory);
> -		if (nid < 0)
> +		associativity = of_get_associativity(memory);
> +		if (associativity) {
> +			nid = associativity_to_nid(associativity);
> +			__initialize_form1_numa_distance(associativity);
> +		} else
>  			nid = default_nid;
>  
>  		fake_numa_create_new_node(((start + size) >> PAGE_SHIFT), &nid);
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 7e970f81d8ff..778b6ab35f0d 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -498,6 +498,8 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
>  		return saved_rc;
>  	}
>  
> +	update_numa_distance(dn);
> +
>  	rc = dlpar_online_cpu(dn);
>  	if (rc) {
>  		saved_rc = rc;
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index 36f66556a7c6..40d350f31a34 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -180,6 +180,8 @@ static int update_lmb_associativity_index(struct drmem_lmb *lmb)
>  		return -ENODEV;
>  	}
>  
> +	update_numa_distance(lmb_node);
> +
>  	dr_node = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
>  	if (!dr_node) {
>  		dlpar_free_cc_nodes(lmb_node);
> diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
> index 1f051a786fb3..663a0859cf13 100644
> --- a/arch/powerpc/platforms/pseries/pseries.h
> +++ b/arch/powerpc/platforms/pseries/pseries.h
> @@ -113,4 +113,5 @@ extern u32 pseries_security_flavor;
>  void pseries_setup_security_mitigations(void);
>  void pseries_lpar_read_hblkrm_characteristics(void);
>  
> +void update_numa_distance(struct device_node *node);
>  #endif /* _PSERIES_PSERIES_H */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 5/6] powerpc/pseries: Add a helper for form1 cpu distance
  2021-06-28 15:11 ` [PATCH v5 5/6] powerpc/pseries: Add a helper for form1 cpu distance Aneesh Kumar K.V
@ 2021-07-22  1:42   ` David Gibson
  2021-07-22  7:09     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 30+ messages in thread
From: David Gibson @ 2021-07-22  1:42 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 3860 bytes --]

On Mon, Jun 28, 2021 at 08:41:16PM +0530, Aneesh Kumar K.V wrote:
> This helper is only used with the dispatch trace log collection.
> A later patch will add Form2 affinity support and this change helps
> in keeping that simpler. Also add a comment explaining we don't expect
> the code to be called with FORM0
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

What makes it a "relative_distance" rather than just a "distance"?

> ---
>  arch/powerpc/include/asm/topology.h   |  4 ++--
>  arch/powerpc/mm/numa.c                | 10 +++++++++-
>  arch/powerpc/platforms/pseries/lpar.c |  4 ++--
>  3 files changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index e4db64c0e184..ac8b5ed79832 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -36,7 +36,7 @@ static inline int pcibus_to_node(struct pci_bus *bus)
>  				 cpu_all_mask :				\
>  				 cpumask_of_node(pcibus_to_node(bus)))
>  
> -extern int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
> +int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
>  extern int __node_distance(int, int);
>  #define node_distance(a, b) __node_distance(a, b)
>  
> @@ -83,7 +83,7 @@ static inline void sysfs_remove_device_from_node(struct device *dev,
>  
>  static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) {}
>  
> -static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> +static inline int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>  {
>  	return 0;
>  }
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 7b142f79d600..c6293037a103 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -166,7 +166,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
>  }
>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>  
> -int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> +static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>  {
>  	int dist = 0;
>  
> @@ -182,6 +182,14 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>  	return dist;
>  }
>  
> +int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> +{
> +	/* We should not get called with FORM0 */
> +	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
> +
> +	return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
> +}
> +
>  /* must hold reference to node during call */
>  static const __be32 *of_get_associativity(struct device_node *dev)
>  {
> diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
> index dab356e3ff87..afefbdfe768d 100644
> --- a/arch/powerpc/platforms/pseries/lpar.c
> +++ b/arch/powerpc/platforms/pseries/lpar.c
> @@ -261,7 +261,7 @@ static int cpu_relative_dispatch_distance(int last_disp_cpu, int cur_disp_cpu)
>  	if (!last_disp_cpu_assoc || !cur_disp_cpu_assoc)
>  		return -EIO;
>  
> -	return cpu_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
> +	return cpu_relative_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
>  }
>  
>  static int cpu_home_node_dispatch_distance(int disp_cpu)
> @@ -281,7 +281,7 @@ static int cpu_home_node_dispatch_distance(int disp_cpu)
>  	if (!disp_cpu_assoc || !vcpu_assoc)
>  		return -EIO;
>  
> -	return cpu_distance(disp_cpu_assoc, vcpu_assoc);
> +	return cpu_relative_distance(disp_cpu_assoc, vcpu_assoc);
>  }
>  
>  static void update_vcpu_disp_stat(int disp_cpu)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 1/6] powerpc/pseries: rename min_common_depth to primary_domain_index
  2021-06-28 15:11 ` [PATCH v5 1/6] powerpc/pseries: rename min_common_depth to primary_domain_index Aneesh Kumar K.V
@ 2021-07-22  1:59   ` David Gibson
  2021-07-22  2:36     ` David Gibson
  0 siblings, 1 reply; 30+ messages in thread
From: David Gibson @ 2021-07-22  1:59 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 5234 bytes --]

On Mon, Jun 28, 2021 at 08:41:12PM +0530, Aneesh Kumar K.V wrote:
> No functional change in this patch.

The new name does not match how you describe "primary domain index" in
the documentation from patch 6/6.  There it comes from the values in
associativity-reference-points, but here it simply comes from the
lengths of all the associativity properties.

> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/mm/numa.c | 38 +++++++++++++++++++-------------------
>  1 file changed, 19 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index f2bf98bdcea2..8365b298ec48 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -51,7 +51,7 @@ EXPORT_SYMBOL(numa_cpu_lookup_table);
>  EXPORT_SYMBOL(node_to_cpumask_map);
>  EXPORT_SYMBOL(node_data);
>  
> -static int min_common_depth;
> +static int primary_domain_index;
>  static int n_mem_addr_cells, n_mem_size_cells;
>  static int form1_affinity;
>  
> @@ -232,8 +232,8 @@ static int associativity_to_nid(const __be32 *associativity)
>  	if (!numa_enabled)
>  		goto out;
>  
> -	if (of_read_number(associativity, 1) >= min_common_depth)
> -		nid = of_read_number(&associativity[min_common_depth], 1);
> +	if (of_read_number(associativity, 1) >= primary_domain_index)
> +		nid = of_read_number(&associativity[primary_domain_index], 1);
>  
>  	/* POWER4 LPAR uses 0xffff as invalid node */
>  	if (nid == 0xffff || nid >= nr_node_ids)
> @@ -284,9 +284,9 @@ int of_node_to_nid(struct device_node *device)
>  }
>  EXPORT_SYMBOL(of_node_to_nid);
>  
> -static int __init find_min_common_depth(void)
> +static int __init find_primary_domain_index(void)
>  {
> -	int depth;
> +	int index;
>  	struct device_node *root;
>  
>  	if (firmware_has_feature(FW_FEATURE_OPAL))
> @@ -326,7 +326,7 @@ static int __init find_min_common_depth(void)
>  	}
>  
>  	if (form1_affinity) {
> -		depth = of_read_number(distance_ref_points, 1);
> +		index = of_read_number(distance_ref_points, 1);
>  	} else {
>  		if (distance_ref_points_depth < 2) {
>  			printk(KERN_WARNING "NUMA: "
> @@ -334,7 +334,7 @@ static int __init find_min_common_depth(void)
>  			goto err;
>  		}
>  
> -		depth = of_read_number(&distance_ref_points[1], 1);
> +		index = of_read_number(&distance_ref_points[1], 1);
>  	}
>  
>  	/*
> @@ -348,7 +348,7 @@ static int __init find_min_common_depth(void)
>  	}
>  
>  	of_node_put(root);
> -	return depth;
> +	return index;
>  
>  err:
>  	of_node_put(root);
> @@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
>  	int nid = default_nid;
>  	int rc, index;
>  
> -	if ((min_common_depth < 0) || !numa_enabled)
> +	if ((primary_domain_index < 0) || !numa_enabled)
>  		return default_nid;
>  
>  	rc = of_get_assoc_arrays(&aa);
>  	if (rc)
>  		return default_nid;
>  
> -	if (min_common_depth <= aa.array_sz &&
> +	if (primary_domain_index <= aa.array_sz &&
>  	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
> -		index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
> +		index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
>  		nid = of_read_number(&aa.arrays[index], 1);
>  
>  		if (nid == 0xffff || nid >= nr_node_ids)
> @@ -708,18 +708,18 @@ static int __init parse_numa_properties(void)
>  		return -1;
>  	}
>  
> -	min_common_depth = find_min_common_depth();
> +	primary_domain_index = find_primary_domain_index();
>  
> -	if (min_common_depth < 0) {
> +	if (primary_domain_index < 0) {
>  		/*
> -		 * if we fail to parse min_common_depth from device tree
> +		 * if we fail to parse primary_domain_index from device tree
>  		 * mark the numa disabled, boot with numa disabled.
>  		 */
>  		numa_enabled = false;
> -		return min_common_depth;
> +		return primary_domain_index;
>  	}
>  
> -	dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
> +	dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
>  
>  	/*
>  	 * Even though we connect cpus to numa domains later in SMP
> @@ -919,14 +919,14 @@ static void __init find_possible_nodes(void)
>  			goto out;
>  	}
>  
> -	max_nodes = of_read_number(&domains[min_common_depth], 1);
> +	max_nodes = of_read_number(&domains[primary_domain_index], 1);
>  	for (i = 0; i < max_nodes; i++) {
>  		if (!node_possible(i))
>  			node_set(i, node_possible_map);
>  	}
>  
>  	prop_length /= sizeof(int);
> -	if (prop_length > min_common_depth + 2)
> +	if (prop_length > primary_domain_index + 2)
>  		coregroup_enabled = 1;
>  
>  out:
> @@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu)
>  		goto out;
>  
>  	index = of_read_number(associativity, 1);
> -	if (index > min_common_depth + 1)
> +	if (index > primary_domain_index + 1)
>  		return of_read_number(&associativity[index - 1], 1);
>  
>  out:

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 6/6] powerpc/pseries: Add support for FORM2 associativity
  2021-06-28 15:11 ` [PATCH v5 6/6] powerpc/pseries: Add support for FORM2 associativity Aneesh Kumar K.V
@ 2021-07-22  2:28   ` David Gibson
  2021-07-22  7:34     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 30+ messages in thread
From: David Gibson @ 2021-07-22  2:28 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 18059 bytes --]

On Mon, Jun 28, 2021 at 08:41:17PM +0530, Aneesh Kumar K.V wrote:
> PAPR interface currently supports two different ways of communicating resource
> grouping details to the OS. These are referred to as Form 0 and Form 1
> associativity grouping. Form 0 is the older format and is now considered
> deprecated. This patch adds another resource grouping named FORM2.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  Documentation/powerpc/associativity.rst   | 103 ++++++++++++++
>  arch/powerpc/include/asm/firmware.h       |   3 +-
>  arch/powerpc/include/asm/prom.h           |   1 +
>  arch/powerpc/kernel/prom_init.c           |   3 +-
>  arch/powerpc/mm/numa.c                    | 157 ++++++++++++++++++----
>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>  6 files changed, 242 insertions(+), 26 deletions(-)
>  create mode 100644 Documentation/powerpc/associativity.rst
> 
> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
> new file mode 100644
> index 000000000000..31cc7da2c7a6
> --- /dev/null
> +++ b/Documentation/powerpc/associativity.rst
> @@ -0,0 +1,103 @@
> +============================
> +NUMA resource associativity
> +=============================
> +
> +Associativity represents the groupings of the various platform resources into
> +domains of substantially similar mean performance relative to resources outside
> +of that domain. Resource subsets of a given domain that exhibit better
> +performance relative to each other than relative to other resource subsets
> +are represented as being members of a sub-grouping domain. This performance
> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
> +From the platform view, these groups are also referred to as domains.

Pretty hard to decipher, but that's typical for PAPR.

> +PAPR interface currently supports different ways of communicating these resource
> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
> +associativity grouping. Form 0 is the older format and is now considered deprecated.

Nit: s/older/oldest/ since there are now >2 forms.

> +Hypervisor indicates the type/form of associativity used via "ibm,architecture-vec-5 property".
> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
> +
> +Form 0
> +-----
> +Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
> +
> +Form 1
> +-----
> +With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity
> +device tree properties are used to determine the NUMA distance between resource groups/domains.
> +
> +The “ibm,associativity” property contains a list of one or more numbers (domainID)
> +representing the resource’s platform grouping domains.
> +
> +The “ibm,associativity-reference-points” property contains a list of one or more numbers
> +(domainID index) that represents the 1 based ordinal in the associativity lists.
> +The list of domainID indexes represents an increasing hierarchy of resource grouping.
> +
> +ex:
> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
> +
> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
> +Linux kernel computes NUMA distance between two domains by recursively comparing
> +if they belong to the same higher-level domains. For mismatch at every higher
> +level of the resource group, the kernel doubles the NUMA distance between the
> +comparing domains.
> +
> +Form 2
> +-------
> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
> +domain numbering. With numa distance computation now detached from the index value in
> +"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
> +ids at the same domainID index representing resource groups of different performance/latency
> +characteristics.
> +
> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
> +"ibm,architecture-vec-5" property.
> +
> +"ibm,numa-lookup-index-table" property contains a list of one or more numbers representing
> +the domainIDs present in the system. The offset of the domainID in this property is
> +used as an index while computing numa distance information via "ibm,numa-distance-table".
> +
> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
> +N domainID encoded as with encode-int
> +
> +For ex:
> +"ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. The offset of domainID 8 (2) is used when
> +computing the distance of domain 8 from other domains present in the system. For the rest of
> +this document, this offset will be referred to as domain distance offset.
> +
> +"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA
> +distance between resource groups/domains present in the system.
> +
> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
> +The number N must be equal to the square of m where m is the number of domainIDs in the
> +numa-lookup-index-table.
> +
> +For ex:
> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}

This representation doesn't make it clear that the 9 is a u32, but the
rest are u8s.

> +
> +  | 0    8   40
> +--|------------
> +  |
> +0 | 10   20  80
> +  |
> +8 | 20   10  160
> +  |
> +40| 80   160  10
> +
> +A possible "ibm,associativity" property for resources in node 0, 8 and 40
> +
> +{ 3, 6, 7, 0 }
> +{ 3, 6, 9, 8 }
> +{ 3, 6, 7, 40}
> +
> +With "ibm,associativity-reference-points"  { 0x3 }

You haven't actually described how ibm,associativity-reference-points
operates in Form2.

> +"ibm,lookup-index-table" helps in having a compact representation of distance matrix.
> +Since domainID can be sparse, the matrix of distances can also be effectively sparse.
> +With "ibm,lookup-index-table" we can achieve a compact representation of
> +distance information.
> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
> index 60b631161360..97a3bd9ffeb9 100644
> --- a/arch/powerpc/include/asm/firmware.h
> +++ b/arch/powerpc/include/asm/firmware.h
> @@ -53,6 +53,7 @@
>  #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
>  #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
>  #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
> +#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
>  
>  #ifndef __ASSEMBLY__
>  
> @@ -73,7 +74,7 @@ enum {
>  		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
>  		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
>  		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
> -		FW_FEATURE_RPT_INVALIDATE,
> +		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
>  	FW_FEATURE_PSERIES_ALWAYS = 0,
>  	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
>  	FW_FEATURE_POWERNV_ALWAYS = 0,
> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
> index df9fec9d232c..5c80152e8f18 100644
> --- a/arch/powerpc/include/asm/prom.h
> +++ b/arch/powerpc/include/asm/prom.h
> @@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
>  #define OV5_XCMO		0x0440	/* Page Coalescing */
>  #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
>  #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
> +#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
>  #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
>  #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
>  #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
> index 5d9ea059594f..c483df6c9393 100644
> --- a/arch/powerpc/kernel/prom_init.c
> +++ b/arch/powerpc/kernel/prom_init.c
> @@ -1069,7 +1069,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
>  #else
>  		0,
>  #endif
> -		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
> +		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
> +		OV5_FEAT(OV5_FORM2_AFFINITY),
>  		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
>  		.micro_checkpoint = 0,
>  		.reserved0 = 0,
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index c6293037a103..c68846fc9550 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
>  
>  #define FORM0_AFFINITY 0
>  #define FORM1_AFFINITY 1
> +#define FORM2_AFFINITY 2
>  static int affinity_form;
>  
>  #define MAX_DISTANCE_REF_POINTS 4
>  static int max_associativity_domain_index;
>  static const __be32 *distance_ref_points;
>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
> +static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
> +	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
> +};
> +static int numa_id_index_table[MAX_NUMNODES] = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
>  
>  /*
>   * Allocate node_to_cpumask_map based on number of available nodes
> @@ -166,6 +171,44 @@ static void unmap_cpu_from_node(unsigned long cpu)
>  }
>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>  
> +/*
> + * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
> + * info is found.
> + */
> +static int associativity_to_nid(const __be32 *associativity)
> +{
> +	int nid = NUMA_NO_NODE;
> +
> +	if (!numa_enabled)
> +		goto out;
> +
> +	if (of_read_number(associativity, 1) >= primary_domain_index)
> +		nid = of_read_number(&associativity[primary_domain_index], 1);
> +
> +	/* POWER4 LPAR uses 0xffff as invalid node */
> +	if (nid == 0xffff || nid >= nr_node_ids)
> +		nid = NUMA_NO_NODE;
> +out:
> +	return nid;
> +}
> +
> +static int __cpu_form2_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> +{
> +	int dist;
> +	int node1, node2;
> +
> +	node1 = associativity_to_nid(cpu1_assoc);
> +	node2 = associativity_to_nid(cpu2_assoc);
> +
> +	dist = numa_distance_table[node1][node2];
> +	if (dist <= LOCAL_DISTANCE)
> +		return 0;
> +	else if (dist <= REMOTE_DISTANCE)
> +		return 1;
> +	else
> +		return 2;

Squashing the full range of distances into just 0, 1 or 2 seems odd.
But then, this whole cpu_distance() thing being distinct from
node_distance() seems odd.
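
To spell out what that squashing does: with LOCAL_DISTANCE == 10 and
REMOTE_DISTANCE == 20, the example matrix from the documentation above
would map as (my reading, untested):

	dist <= 10 (LOCAL_DISTANCE)   -> 0  (same node)
	dist <= 20 (REMOTE_DISTANCE)  -> 1
	dist  > 20 (80, 160, ...)     -> 2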

> +}
> +
>  static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>  {
>  	int dist = 0;
> @@ -186,8 +229,9 @@ int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>  {
>  	/* We should not get called with FORM0 */
>  	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
> -
> -	return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
> +	if (affinity_form == FORM1_AFFINITY)
> +		return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
> +	return __cpu_form2_relative_distance(cpu1_assoc, cpu2_assoc);
>  }
>  
>  /* must hold reference to node during call */
> @@ -201,7 +245,9 @@ int __node_distance(int a, int b)
>  	int i;
>  	int distance = LOCAL_DISTANCE;
>  
> -	if (affinity_form == FORM0_AFFINITY)
> +	if (affinity_form == FORM2_AFFINITY)
> +		return numa_distance_table[a][b];
> +	else if (affinity_form == FORM0_AFFINITY)
>  		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
>  
>  	for (i = 0; i < max_associativity_domain_index; i++) {

Hmm.. couldn't we simplify this whole __node_distance function, if we
just update numa_distance_table[][] appropriately for Form0 and Form1
as well?

> @@ -216,27 +262,6 @@ int __node_distance(int a, int b)
>  }
>  EXPORT_SYMBOL(__node_distance);
>  
> -/*
> - * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
> - * info is found.
> - */
> -static int associativity_to_nid(const __be32 *associativity)
> -{
> -	int nid = NUMA_NO_NODE;
> -
> -	if (!numa_enabled)
> -		goto out;
> -
> -	if (of_read_number(associativity, 1) >= primary_domain_index)
> -		nid = of_read_number(&associativity[primary_domain_index], 1);
> -
> -	/* POWER4 LPAR uses 0xffff as invalid node */
> -	if (nid == 0xffff || nid >= nr_node_ids)
> -		nid = NUMA_NO_NODE;
> -out:
> -	return nid;
> -}
> -
>  /* Returns the nid associated with the given device tree node,
>   * or -1 if not found.
>   */
> @@ -305,12 +330,84 @@ static void initialize_form1_numa_distance(struct device_node *node)
>   */
>  void update_numa_distance(struct device_node *node)
>  {
> +	int nid;
> +
>  	if (affinity_form == FORM0_AFFINITY)
>  		return;
>  	else if (affinity_form == FORM1_AFFINITY) {
>  		initialize_form1_numa_distance(node);
>  		return;
>  	}
> +
> +	/* FORM2 affinity  */
> +	nid = of_node_to_nid_single(node);
> +	if (nid == NUMA_NO_NODE)
> +		return;
> +
> +	/*
> +	 * With FORM2 we expect NUMA distance of all possible NUMA
> +	 * nodes to be provided during boot.
> +	 */
> +	WARN(numa_distance_table[nid][nid] == -1,
> +	     "NUMA distance details for node %d not provided\n", nid);
> +}
> +
> +/*
> + * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
> + * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
> + */
> +static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
> +{
> +	int i, j;
> +	const __u8 *numa_dist_table;
> +	const __be32 *numa_lookup_index;
> +	int numa_dist_table_length;
> +	int max_numa_index, distance_index;
> +
> +	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
> +	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
> +
> +	/* first element of the array is the size and is encode-int */
> +	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
> +	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
> +	/* Skip the size which is encoded int */
> +	numa_dist_table += sizeof(__be32);
> +
> +	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d\n",
> +		 numa_dist_table_length, max_numa_index);
> +
> +	for (i = 0; i < max_numa_index; i++)
> +		/* +1 skip the max_numa_index in the property */
> +		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
> +
> +
> +	if (numa_dist_table_length != max_numa_index * max_numa_index) {
> +
> +		WARN(1, "Wrong NUMA distance information\n");
> +		/* consider everybody else just remote. */
> +		for (i = 0;  i < max_numa_index; i++) {
> +			for (j = 0; j < max_numa_index; j++) {
> +				int nodeA = numa_id_index_table[i];
> +				int nodeB = numa_id_index_table[j];
> +
> +				if (nodeA == nodeB)
> +					numa_distance_table[nodeA][nodeB] = LOCAL_DISTANCE;
> +				else
> +					numa_distance_table[nodeA][nodeB] = REMOTE_DISTANCE;
> +			}
> +		}
> +	}
> +
> +	distance_index = 0;
> +	for (i = 0;  i < max_numa_index; i++) {
> +		for (j = 0; j < max_numa_index; j++) {
> +			int nodeA = numa_id_index_table[i];
> +			int nodeB = numa_id_index_table[j];
> +
> +			numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index++];
> +			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
> +		}
> +	}
>  }
>  
>  static int __init find_primary_domain_index(void)
> @@ -323,6 +420,9 @@ static int __init find_primary_domain_index(void)
>  	 */
>  	if (firmware_has_feature(FW_FEATURE_OPAL)) {
>  		affinity_form = FORM1_AFFINITY;
> +	} else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) {
> +		dbg("Using form 2 affinity\n");
> +		affinity_form = FORM2_AFFINITY;
>  	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
>  		dbg("Using form 1 affinity\n");
>  		affinity_form = FORM1_AFFINITY;
> @@ -367,8 +467,17 @@ static int __init find_primary_domain_index(void)
>  
>  		index = of_read_number(&distance_ref_points[1], 1);
>  	} else {
> +		/*
> +		 * Both FORM1 and FORM2 affinity find the primary domain details
> +		 * at the same offset.
> +		 */
>  		index = of_read_number(distance_ref_points, 1);
>  	}
> +	/*
> +	 * If it is FORM2 also initialize the distance table here.
> +	 */
> +	if (affinity_form == FORM2_AFFINITY)
> +		initialize_form2_numa_distance_lookup_table(root);

Ew.  Calling a function called "find_primary_domain_index" to also
initialize the main distance table is needlessly counterintuitive.
Move this call to parse_numa_properties().
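
Roughly what I have in mind (untested sketch, reusing the names from this
patch):

	/* in parse_numa_properties(), after find_primary_domain_index() */
	if (affinity_form == FORM2_AFFINITY) {
		struct device_node *root = of_find_node_by_path("/rtas");

		if (!root)
			root = of_find_node_by_path("/");
		initialize_form2_numa_distance_lookup_table(root);
		of_node_put(root);
	}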
>  
>  	/*
>  	 * Warn and cap if the hardware supports more than
> diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
> index 5d4c2bc20bba..f162156b7b68 100644
> --- a/arch/powerpc/platforms/pseries/firmware.c
> +++ b/arch/powerpc/platforms/pseries/firmware.c
> @@ -123,6 +123,7 @@ vec5_fw_features_table[] = {
>  	{FW_FEATURE_PRRN,		OV5_PRRN},
>  	{FW_FEATURE_DRMEM_V2,		OV5_DRMEM_V2},
>  	{FW_FEATURE_DRC_INFO,		OV5_DRC_INFO},
> +	{FW_FEATURE_FORM2_AFFINITY,	OV5_FORM2_AFFINITY},
>  };
>  
>  static void __init fw_vec5_feature_init(const char *vec5, unsigned long len)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 1/6] powerpc/pseries: rename min_common_depth to primary_domain_index
  2021-07-22  1:59   ` David Gibson
@ 2021-07-22  2:36     ` David Gibson
  2021-07-22  5:17       ` Aneesh Kumar K.V
  0 siblings, 1 reply; 30+ messages in thread
From: David Gibson @ 2021-07-22  2:36 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 4309 bytes --]

On Thu, Jul 22, 2021 at 11:59:15AM +1000, David Gibson wrote:
> On Mon, Jun 28, 2021 at 08:41:12PM +0530, Aneesh Kumar K.V wrote:
> > No functional change in this patch.
> 
> The new name does not match how you describe "primary domain index" in
> the documentation from patch 6/6.  There it comes from the values in
> associativity-reference-points, but here it simply comes from the
> lengths of all the associativity properties.

No, sorry, I misread this code... misled by the old name, so it's a
good thing you're changing it.

But.. I'm still not sure the new name is accurate, either...

[snip]
> >  	if (form1_affinity) {
> > -		depth = of_read_number(distance_ref_points, 1);
> > +		index = of_read_number(distance_ref_points, 1);

AFAICT distance_ref_points hasn't been altered from the
of_get_property() at this point, so isn't this setting depth / index
to the number of entries in ref-points, rather than the value of the
first entry (which is what primary domain index is supposed to be).

> >  	} else {
> >  		if (distance_ref_points_depth < 2) {
> >  			printk(KERN_WARNING "NUMA: "
> > @@ -334,7 +334,7 @@ static int __init find_min_common_depth(void)
> >  			goto err;
> >  		}
> >  
> > -		depth = of_read_number(&distance_ref_points[1], 1);
> > +		index = of_read_number(&distance_ref_points[1], 1);
> >  	}
> >  
> >  	/*
> > @@ -348,7 +348,7 @@ static int __init find_min_common_depth(void)
> >  	}
> >  
> >  	of_node_put(root);
> > -	return depth;
> > +	return index;
> >  
> >  err:
> >  	of_node_put(root);
> > @@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
> >  	int nid = default_nid;
> >  	int rc, index;
> >  
> > -	if ((min_common_depth < 0) || !numa_enabled)
> > +	if ((primary_domain_index < 0) || !numa_enabled)
> >  		return default_nid;
> >  
> >  	rc = of_get_assoc_arrays(&aa);
> >  	if (rc)
> >  		return default_nid;
> >  
> > -	if (min_common_depth <= aa.array_sz &&
> > +	if (primary_domain_index <= aa.array_sz &&
> >  	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
> > -		index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
> > +		index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
> >  		nid = of_read_number(&aa.arrays[index], 1);
> >  
> >  		if (nid == 0xffff || nid >= nr_node_ids)
> > @@ -708,18 +708,18 @@ static int __init parse_numa_properties(void)
> >  		return -1;
> >  	}
> >  
> > -	min_common_depth = find_min_common_depth();
> > +	primary_domain_index = find_primary_domain_index();
> >  
> > -	if (min_common_depth < 0) {
> > +	if (primary_domain_index < 0) {
> >  		/*
> > -		 * if we fail to parse min_common_depth from device tree
> > +		 * if we fail to parse primary_domain_index from device tree
> >  		 * mark the numa disabled, boot with numa disabled.
> >  		 */
> >  		numa_enabled = false;
> > -		return min_common_depth;
> > +		return primary_domain_index;
> >  	}
> >  
> > -	dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
> > +	dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
> >  
> >  	/*
> >  	 * Even though we connect cpus to numa domains later in SMP
> > @@ -919,14 +919,14 @@ static void __init find_possible_nodes(void)
> >  			goto out;
> >  	}
> >  
> > -	max_nodes = of_read_number(&domains[min_common_depth], 1);
> > +	max_nodes = of_read_number(&domains[primary_domain_index], 1);
> >  	for (i = 0; i < max_nodes; i++) {
> >  		if (!node_possible(i))
> >  			node_set(i, node_possible_map);
> >  	}
> >  
> >  	prop_length /= sizeof(int);
> > -	if (prop_length > min_common_depth + 2)
> > +	if (prop_length > primary_domain_index + 2)
> >  		coregroup_enabled = 1;
> >  
> >  out:
> > @@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu)
> >  		goto out;
> >  
> >  	index = of_read_number(associativity, 1);
> > -	if (index > min_common_depth + 1)
> > +	if (index > primary_domain_index + 1)
> >  		return of_read_number(&associativity[index - 1], 1);
> >  
> >  out:
> 



-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 1/6] powerpc/pseries: rename min_common_depth to primary_domain_index
  2021-07-22  2:36     ` David Gibson
@ 2021-07-22  5:17       ` Aneesh Kumar K.V
  2021-07-26  2:28         ` David Gibson
  0 siblings, 1 reply; 30+ messages in thread
From: Aneesh Kumar K.V @ 2021-07-22  5:17 UTC (permalink / raw)
  To: David Gibson; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev

On 7/22/21 8:06 AM, David Gibson wrote:
> On Thu, Jul 22, 2021 at 11:59:15AM +1000, David Gibson wrote:
>> On Mon, Jun 28, 2021 at 08:41:12PM +0530, Aneesh Kumar K.V wrote:
>>> No functional change in this patch.
>>
>> The new name does not match how you describe "primary domain index" in
>> the documentation from patch 6/6.  There it comes from the values in
>> associativity-reference-points, but here it simply comes from the
>> lengths of all the associativity properties.
> 
> No, sorry, I misread this code... misled by the old name, so it's a
> good thing you're changing it.
> 
> But.. I'm still not sure the new name is accurate, either...
> 
> [snip]
>>>   	if (form1_affinity) {
>>> -		depth = of_read_number(distance_ref_points, 1);
>>> +		index = of_read_number(distance_ref_points, 1);
> 
> AFAICT distance_ref_points hasn't been altered from the
> of_get_property() at this point, so isn't this setting depth / index
> to the number of entries in ref-points, rather than the value of the
> first entry (which is what primary domain index is supposed to be).
> 

The ibm,associativity-reference-points property format is as below.

# lsprop  ibm,associativity-reference-points
ibm,associativity-reference-points
                  00000004 00000002

It doesn't have the number of elements as the first item.

For FORM1, the 1st element is the NUMA boundary index/primary_domain_index.
For FORM0, the 2nd element is the NUMA boundary index/primary_domain_index.
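
So, simplified, the lookup boils down to something like this (sketch only,
not the exact code in this series):

	distance_ref_points = of_get_property(root,
			"ibm,associativity-reference-points", NULL);

	if (form1_affinity)
		/* FORM1: 1st cell is the primary domain index (0x4 above) */
		index = of_read_number(distance_ref_points, 1);
	else
		/* FORM0: 2nd cell is the primary domain index (0x2 above) */
		index = of_read_number(&distance_ref_points[1], 1);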


>>>   	} else {
>>>   		if (distance_ref_points_depth < 2) {
>>>   			printk(KERN_WARNING "NUMA: "
>>> @@ -334,7 +334,7 @@ static int __init find_min_common_depth(void)
>>>   			goto err;
>>>   		}
>>>   
>>> -		depth = of_read_number(&distance_ref_points[1], 1);
>>> +		index = of_read_number(&distance_ref_points[1], 1);
>>>   	}
>>>   
>>>   	/*
>>> @@ -348,7 +348,7 @@ static int __init find_min_common_depth(void)
>>>   	}
>>>   
>>>   	of_node_put(root);
>>> -	return depth;
>>> +	return index;
>>>   
>>>   err:
>>>   	of_node_put(root);
>>> @@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
>>>   	int nid = default_nid;
>>>   	int rc, index;
>>>   
>>> -	if ((min_common_depth < 0) || !numa_enabled)
>>> +	if ((primary_domain_index < 0) || !numa_enabled)
>>>   		return default_nid;
>>>   
>>>   	rc = of_get_assoc_arrays(&aa);
>>>   	if (rc)
>>>   		return default_nid;
>>>   
>>> -	if (min_common_depth <= aa.array_sz &&
>>> +	if (primary_domain_index <= aa.array_sz &&
>>>   	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
>>> -		index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
>>> +		index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
>>>   		nid = of_read_number(&aa.arrays[index], 1);
>>>   
>>>   		if (nid == 0xffff || nid >= nr_node_ids)
>>> @@ -708,18 +708,18 @@ static int __init parse_numa_properties(void)
>>>   		return -1;
>>>   	}
>>>   
>>> -	min_common_depth = find_min_common_depth();
>>> +	primary_domain_index = find_primary_domain_index();
>>>   
>>> -	if (min_common_depth < 0) {
>>> +	if (primary_domain_index < 0) {
>>>   		/*
>>> -		 * if we fail to parse min_common_depth from device tree
>>> +		 * if we fail to parse primary_domain_index from device tree
>>>   		 * mark the numa disabled, boot with numa disabled.
>>>   		 */
>>>   		numa_enabled = false;
>>> -		return min_common_depth;
>>> +		return primary_domain_index;
>>>   	}
>>>   
>>> -	dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
>>> +	dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
>>>   
>>>   	/*
>>>   	 * Even though we connect cpus to numa domains later in SMP
>>> @@ -919,14 +919,14 @@ static void __init find_possible_nodes(void)
>>>   			goto out;
>>>   	}
>>>   
>>> -	max_nodes = of_read_number(&domains[min_common_depth], 1);
>>> +	max_nodes = of_read_number(&domains[primary_domain_index], 1);
>>>   	for (i = 0; i < max_nodes; i++) {
>>>   		if (!node_possible(i))
>>>   			node_set(i, node_possible_map);
>>>   	}
>>>   
>>>   	prop_length /= sizeof(int);
>>> -	if (prop_length > min_common_depth + 2)
>>> +	if (prop_length > primary_domain_index + 2)
>>>   		coregroup_enabled = 1;
>>>   
>>>   out:
>>> @@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu)
>>>   		goto out;
>>>   
>>>   	index = of_read_number(associativity, 1);
>>> -	if (index > min_common_depth + 1)
>>> +	if (index > primary_domain_index + 1)
>>>   		return of_read_number(&associativity[index - 1], 1);
>>>   
>>>   out:
>>
> 
> 
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths
  2021-07-22  1:40   ` David Gibson
@ 2021-07-22  7:07     ` Aneesh Kumar K.V
  2021-07-26  2:37       ` David Gibson
  0 siblings, 1 reply; 30+ messages in thread
From: Aneesh Kumar K.V @ 2021-07-22  7:07 UTC (permalink / raw)
  To: David Gibson; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev

David Gibson <david@gibson.dropbear.id.au> writes:

> On Mon, Jun 28, 2021 at 08:41:15PM +0530, Aneesh Kumar K.V wrote:
> >> The associativity details of the newly added resources are collected from
>> the hypervisor via "ibm,configure-connector" rtas call. Update the numa
>> distance details of the newly added numa node after the above call.
>> 
>> Instead of updating NUMA distance every time we lookup a node id
>> from the associativity property, add helpers that can be used
>> during boot which does this only once. Also remove the distance
>> update from node id lookup helpers.
>> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  arch/powerpc/mm/numa.c                        | 173 +++++++++++++-----
>>  arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
>>  .../platforms/pseries/hotplug-memory.c        |   2 +
>>  arch/powerpc/platforms/pseries/pseries.h      |   1 +
>>  4 files changed, 132 insertions(+), 46 deletions(-)
>> 
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 0ec16999beef..7b142f79d600 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -208,22 +208,6 @@ int __node_distance(int a, int b)
>>  }
>>  EXPORT_SYMBOL(__node_distance);
>>  
>> -static void initialize_distance_lookup_table(int nid,
>> -		const __be32 *associativity)
>> -{
>> -	int i;
>> -
>> -	if (affinity_form != FORM1_AFFINITY)
>> -		return;
>> -
>> -	for (i = 0; i < max_associativity_domain_index; i++) {
>> -		const __be32 *entry;
>> -
>> -		entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
>> -		distance_lookup_table[nid][i] = of_read_number(entry, 1);
>> -	}
>> -}
>> -
>>  /*
>>   * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
>>   * info is found.
>> @@ -241,15 +225,6 @@ static int associativity_to_nid(const __be32 *associativity)
>>  	/* POWER4 LPAR uses 0xffff as invalid node */
>>  	if (nid == 0xffff || nid >= nr_node_ids)
>>  		nid = NUMA_NO_NODE;
>> -
>> -	if (nid > 0 &&
>> -		of_read_number(associativity, 1) >= max_associativity_domain_index) {
>> -		/*
>> -		 * Skip the length field and send start of associativity array
>> -		 */
>> -		initialize_distance_lookup_table(nid, associativity + 1);
>> -	}
>> -
>>  out:
>>  	return nid;
>>  }
>> @@ -287,6 +262,49 @@ int of_node_to_nid(struct device_node *device)
>>  }
>>  EXPORT_SYMBOL(of_node_to_nid);
>>  
>> +static void __initialize_form1_numa_distance(const __be32 *associativity)
>> +{
>> +	int i, nid;
>> +
>> +	if (affinity_form != FORM1_AFFINITY)
>
> Since this shouldn't be called on a !form1 system, this could be a WARN_ON().

With the way we call these functions currently, instead of doing

if (affinity_form == FORM1_AFFINITY)
    __initialize_form1_numa_distance()

we avoid doing the if check in multiple places. For example,
parse_numa_properties() will fetch the associativity array to find the
details of an online node and set it online. We use the same code path to
initialize the distance.

		if (__vphn_get_associativity(i, vphn_assoc) == 0) {
			nid = associativity_to_nid(vphn_assoc);
			__initialize_form1_numa_distance(vphn_assoc);
		} else {

			cpu = of_get_cpu_node(i, NULL);
			BUG_ON(!cpu);

			associativity = of_get_associativity(cpu);
			if (associativity) {
				nid = associativity_to_nid(associativity);
				__initialize_form1_numa_distance(associativity);
			}

We avoid the if (affinity_form == FORM1_AFFINITY) check there by
moving the check inside __initialize_form1_numa_distance().


>
>> +		return;
>> +
>> +	if (of_read_number(associativity, 1) >= primary_domain_index) {
>> +		nid = of_read_number(&associativity[primary_domain_index], 1);
>
> This computes the nid from the assoc array independently of
> associativity_to_nid, which doesn't seem like a good idea.  Wouldn't
> it be better to call assocaitivity_to_nid(), then make the next bit
> conditional on nid !== NUMA_NO_NODE?

@@ -302,9 +302,8 @@ static void __initialize_form1_numa_distance(const __be32 *associativity)
 	if (affinity_form != FORM1_AFFINITY)
 		return;
 
-	if (of_read_number(associativity, 1) >= primary_domain_index) {
-		nid = of_read_number(&associativity[primary_domain_index], 1);
-
+	nid = associativity_to_nid(associativity);
+	if (nid != NUMA_NO_NODE) {
 		for (i = 0; i < distance_ref_points_depth; i++) {
 			const __be32 *entry;
 

>
>> +
>> +		for (i = 0; i < max_associativity_domain_index; i++) {
>> +			const __be32 *entry;
>> +
>> +			entry = &associativity[be32_to_cpu(distance_ref_points[i])];
>> +			distance_lookup_table[nid][i] = of_read_number(entry, 1);
>> +		}
>> +	}
>> +}
>> +
>> +static void initialize_form1_numa_distance(struct device_node *node)
>> +{
>> +	const __be32 *associativity;
>> +
>> +	associativity = of_get_associativity(node);
>> +	if (!associativity)
>> +		return;
>> +
>> +	__initialize_form1_numa_distance(associativity);
>> +}
>> +
>> +/*
>> + * Used to update distance information w.r.t newly added node.
>> + */
>> +void update_numa_distance(struct device_node *node)
>> +{
>> +	if (affinity_form == FORM0_AFFINITY)
>> +		return;
>> +	else if (affinity_form == FORM1_AFFINITY) {
>> +		initialize_form1_numa_distance(node);
>> +		return;
>> +	}
>> +}
>> +
>>  static int __init find_primary_domain_index(void)
>>  {
>>  	int index;
>> @@ -433,6 +451,48 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
>>  	return 0;
>>  }
>>  
>> +static int get_nid_and_numa_distance(struct drmem_lmb *lmb)
>> +{
>> +	struct assoc_arrays aa = { .arrays = NULL };
>> +	int default_nid = NUMA_NO_NODE;
>> +	int nid = default_nid;
>> +	int rc, index;
>> +
>> +	if ((primary_domain_index < 0) || !numa_enabled)
>
> Under what circumstances could you get primary_domain_index < 0?

IIUC that is to handle failure to parse the device tree. See commit
ea9f5b702fe0215188fba2eda117419e4ae90a67.

>
>> +		return default_nid;
>> +
>> +	rc = of_get_assoc_arrays(&aa);
>> +	if (rc)
>> +		return default_nid;
>> +
>> +	if (primary_domain_index <= aa.array_sz &&
>> +	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
>> +		index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
>
> Does anywhere verify that primary_domain_index <= aa.array_sz?

That is the first part of the check?

>
>> +		nid = of_read_number(&aa.arrays[index], 1);
>> +
>> +		if (nid == 0xffff || nid >= nr_node_ids)
>> +			nid = default_nid;
>> +		if (nid > 0 && affinity_form == FORM1_AFFINITY) {
>> +			int i;
>> +			const __be32 *associativity;
>> +
>> +			index = lmb->aa_index * aa.array_sz;
>> +			associativity = &aa.arrays[index];
>> +			/*
>> +			 * lookup array associativity entries have different format
>> +			 * There is no length of the array as the first element.
>
> The difference it very small, and this is not a hot path.  Couldn't
> you reduce a chunk of code by prepending aa.array_sz, then re-using
> __initialize_form1_numa_distance.  Or even making
> __initialize_form1_numa_distance() take the length as a parameter.

The changes are small but confusing w.r.t. how we look at the
associativity-lookup-arrays. The way we interpret the associativity array
and the associativity lookup array using primary_domain_index is different,
hence the '-1' in the node lookup here.

	index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
	nid = of_read_number(&aa.arrays[index], 1);
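
To make that concrete (illustration only, the domainID values below are
made up, assuming primary_domain_index == 4):

	/*
	 * ibm,associativity for a resource:           { 4, 6, 7, 0, 40 }
	 *   first cell is the entry count, so the node id is
	 *   associativity[primary_domain_index] == associativity[4] == 40
	 *
	 * matching associativity-lookup-arrays entry:  { 6, 7, 0, 40 }
	 *   no leading count, so the node id sits at
	 *   primary_domain_index - 1 == 3, hence the '-1' above
	 */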


>
>> +			 */
>> +			for (i = 0; i < max_associativity_domain_index; i++) {
>> +				const __be32 *entry;
>> +
>> +				entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
>
> Does anywhere verify that distance_ref_points[i] <= aa.array_size for
> every i?

We do check for 

	if (primary_domain_index <= aa.array_sz &&

>
>> +				distance_lookup_table[nid][i] = of_read_number(entry, 1);
>> +			}
>> +		}
>> +	}
>> +	return nid;
>> +}
>> +
>>  /*
>>   * This is like of_node_to_nid_single() for memory represented in the
>>   * ibm,dynamic-reconfiguration-memory node.
>> @@ -458,21 +518,14 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
>>  
>>  		if (nid == 0xffff || nid >= nr_node_ids)
>>  			nid = default_nid;
>> -
>> -		if (nid > 0) {
>> -			index = lmb->aa_index * aa.array_sz;
>> -			initialize_distance_lookup_table(nid,
>> -							&aa.arrays[index]);
>> -		}
>>  	}
>> -
>>  	return nid;
>>  }
>>  
>>  #ifdef CONFIG_PPC_SPLPAR
>> -static int vphn_get_nid(long lcpu)
>> +
>> +static int __vphn_get_associativity(long lcpu, __be32 *associativity)
>>  {
>> -	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
>>  	long rc, hwid;
>>  
>>  	/*
>> @@ -492,10 +545,22 @@ static int vphn_get_nid(long lcpu)
>>  
>>  		rc = hcall_vphn(hwid, VPHN_FLAG_VCPU, associativity);
>>  		if (rc == H_SUCCESS)
>> -			return associativity_to_nid(associativity);
>> +			return 0;
>>  	}
>>  
>> +	return -1;
>> +}
>> +
>> +static int vphn_get_nid(long lcpu)
>> +{
>> +	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
>> +
>> +
>> +	if (!__vphn_get_associativity(lcpu, associativity))
>> +		return associativity_to_nid(associativity);
>> +
>>  	return NUMA_NO_NODE;
>> +
>>  }
>>  #else
>>  static int vphn_get_nid(long unused)
>> @@ -692,7 +757,7 @@ static int __init numa_setup_drmem_lmb(struct drmem_lmb *lmb,
>>  			size = read_n_cells(n_mem_size_cells, usm);
>>  		}
>>  
>> -		nid = of_drconf_to_nid_single(lmb);
>> +		nid = get_nid_and_numa_distance(lmb);
>>  		fake_numa_create_new_node(((base + size) >> PAGE_SHIFT),
>>  					  &nid);
>>  		node_set_online(nid);
>> @@ -709,6 +774,7 @@ static int __init parse_numa_properties(void)
>>  	struct device_node *memory;
>>  	int default_nid = 0;
>>  	unsigned long i;
>> +	const __be32 *associativity;
>>  
>>  	if (numa_enabled == 0) {
>>  		printk(KERN_WARNING "NUMA disabled by user\n");
>> @@ -734,18 +800,30 @@ static int __init parse_numa_properties(void)
>>  	 * each node to be onlined must have NODE_DATA etc backing it.
>>  	 */
>>  	for_each_present_cpu(i) {
>> +		__be32 vphn_assoc[VPHN_ASSOC_BUFSIZE];
>>  		struct device_node *cpu;
>> -		int nid = vphn_get_nid(i);
>> +		int nid = NUMA_NO_NODE;
>>  
>> -		/*
>> -		 * Don't fall back to default_nid yet -- we will plug
>> -		 * cpus into nodes once the memory scan has discovered
>> -		 * the topology.
>> -		 */
>> -		if (nid == NUMA_NO_NODE) {
>> +		memset(vphn_assoc, 0, VPHN_ASSOC_BUFSIZE * sizeof(__be32));
>
> What's the memset() for?  AFAICT you only look at vphn_assoc in the
> branch where __vphn_get_associativity() succeeds.

That was done to match the existing code. We do use a zero-filled array
when making that hcall in this code path. I don't see us doing that
everywhere, but I didn't want to change that behaviour in this patch.

-static int vphn_get_nid(long lcpu)
+
+static int __vphn_get_associativity(long lcpu, __be32 *associativity)
 {
 -	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
  	long rc, hwid;

>
>> +
>> +		if (__vphn_get_associativity(i, vphn_assoc) == 0) {
>> +			nid = associativity_to_nid(vphn_assoc);
>> +			__initialize_form1_numa_distance(vphn_assoc);
>> +		} else {
>> +
>> +			/*
>> +			 * Don't fall back to default_nid yet -- we will plug
>> +			 * cpus into nodes once the memory scan has discovered
>> +			 * the topology.
>> +			 */
>>  			cpu = of_get_cpu_node(i, NULL);
>>  			BUG_ON(!cpu);
>> -			nid = of_node_to_nid_single(cpu);
>> +
>> +			associativity = of_get_associativity(cpu);
>> +			if (associativity) {
>> +				nid = associativity_to_nid(associativity);
>> +				__initialize_form1_numa_distance(associativity);
>> +			}
>>  			of_node_put(cpu);
>>  		}
>>  
>> @@ -781,8 +859,11 @@ static int __init parse_numa_properties(void)
>>  		 * have associativity properties.  If none, then
>>  		 * everything goes to default_nid.
>>  		 */
>> -		nid = of_node_to_nid_single(memory);
>> -		if (nid < 0)
>> +		associativity = of_get_associativity(memory);
>> +		if (associativity) {
>> +			nid = associativity_to_nid(associativity);
>> +			__initialize_form1_numa_distance(associativity);
>> +		} else
>>  			nid = default_nid;
>>  
>>  		fake_numa_create_new_node(((start + size) >> PAGE_SHIFT), &nid);
>> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> index 7e970f81d8ff..778b6ab35f0d 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> @@ -498,6 +498,8 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
>>  		return saved_rc;
>>  	}
>>  
>> +	update_numa_distance(dn);
>> +
>>  	rc = dlpar_online_cpu(dn);
>>  	if (rc) {
>>  		saved_rc = rc;
>> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> index 36f66556a7c6..40d350f31a34 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> @@ -180,6 +180,8 @@ static int update_lmb_associativity_index(struct drmem_lmb *lmb)
>>  		return -ENODEV;
>>  	}
>>  
>> +	update_numa_distance(lmb_node);
>> +
>>  	dr_node = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
>>  	if (!dr_node) {
>>  		dlpar_free_cc_nodes(lmb_node);
>> diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
>> index 1f051a786fb3..663a0859cf13 100644
>> --- a/arch/powerpc/platforms/pseries/pseries.h
>> +++ b/arch/powerpc/platforms/pseries/pseries.h
>> @@ -113,4 +113,5 @@ extern u32 pseries_security_flavor;
>>  void pseries_setup_security_mitigations(void);
>>  void pseries_lpar_read_hblkrm_characteristics(void);
>>  
>> +void update_numa_distance(struct device_node *node);
>>  #endif /* _PSERIES_PSERIES_H */
>
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 5/6] powerpc/pseries: Add a helper for form1 cpu distance
  2021-07-22  1:42   ` David Gibson
@ 2021-07-22  7:09     ` Aneesh Kumar K.V
  2021-07-26  2:38       ` David Gibson
  0 siblings, 1 reply; 30+ messages in thread
From: Aneesh Kumar K.V @ 2021-07-22  7:09 UTC (permalink / raw)
  To: David Gibson; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev

David Gibson <david@gibson.dropbear.id.au> writes:

> On Mon, Jun 28, 2021 at 08:41:16PM +0530, Aneesh Kumar K.V wrote:
>> This helper is only used with the dispatch trace log collection.
>> A later patch will add Form2 affinity support and this change helps
>> in keeping that simpler. Also add a comment explaining we don't expect
>> the code to be called with FORM0
>> 
>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>
> What makes it a "relative_distance" rather than just a "distance"?

I added that to indicate that the function is not returning the actual
distance but a number indicative of 'near', 'far', etc. (it actually
returns 0, 1, 2 and so on).

>
>> ---
>>  arch/powerpc/include/asm/topology.h   |  4 ++--
>>  arch/powerpc/mm/numa.c                | 10 +++++++++-
>>  arch/powerpc/platforms/pseries/lpar.c |  4 ++--
>>  3 files changed, 13 insertions(+), 5 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
>> index e4db64c0e184..ac8b5ed79832 100644
>> --- a/arch/powerpc/include/asm/topology.h
>> +++ b/arch/powerpc/include/asm/topology.h
>> @@ -36,7 +36,7 @@ static inline int pcibus_to_node(struct pci_bus *bus)
>>  				 cpu_all_mask :				\
>>  				 cpumask_of_node(pcibus_to_node(bus)))
>>  
>> -extern int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
>> +int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
>>  extern int __node_distance(int, int);
>>  #define node_distance(a, b) __node_distance(a, b)
>>  
>> @@ -83,7 +83,7 @@ static inline void sysfs_remove_device_from_node(struct device *dev,
>>  
>>  static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) {}
>>  
>> -static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> +static inline int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>>  {
>>  	return 0;
>>  }
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 7b142f79d600..c6293037a103 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -166,7 +166,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
>>  }
>>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>>  
>> -int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> +static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>>  {
>>  	int dist = 0;
>>  
>> @@ -182,6 +182,14 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>>  	return dist;
>>  }
>>  
>> +int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> +{
>> +	/* We should not get called with FORM0 */
>> +	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
>> +
>> +	return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
>> +}
>> +
>>  /* must hold reference to node during call */
>>  static const __be32 *of_get_associativity(struct device_node *dev)
>>  {
>> diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
>> index dab356e3ff87..afefbdfe768d 100644
>> --- a/arch/powerpc/platforms/pseries/lpar.c
>> +++ b/arch/powerpc/platforms/pseries/lpar.c
>> @@ -261,7 +261,7 @@ static int cpu_relative_dispatch_distance(int last_disp_cpu, int cur_disp_cpu)
>>  	if (!last_disp_cpu_assoc || !cur_disp_cpu_assoc)
>>  		return -EIO;
>>  
>> -	return cpu_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
>> +	return cpu_relative_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
>>  }
>>  
>>  static int cpu_home_node_dispatch_distance(int disp_cpu)
>> @@ -281,7 +281,7 @@ static int cpu_home_node_dispatch_distance(int disp_cpu)
>>  	if (!disp_cpu_assoc || !vcpu_assoc)
>>  		return -EIO;
>>  
>> -	return cpu_distance(disp_cpu_assoc, vcpu_assoc);
>> +	return cpu_relative_distance(disp_cpu_assoc, vcpu_assoc);
>>  }
>>  
>>  static void update_vcpu_disp_stat(int disp_cpu)
>
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v5 6/6] powerpc/pseries: Add support for FORM2 associativity
  2021-07-22  2:28   ` David Gibson
@ 2021-07-22  7:34     ` Aneesh Kumar K.V
  2021-07-26  2:41       ` David Gibson
  0 siblings, 1 reply; 30+ messages in thread
From: Aneesh Kumar K.V @ 2021-07-22  7:34 UTC (permalink / raw)
  To: David Gibson; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev

David Gibson <david@gibson.dropbear.id.au> writes:

> On Mon, Jun 28, 2021 at 08:41:17PM +0530, Aneesh Kumar K.V wrote:
>> PAPR interface currently supports two different ways of communicating resource
>> grouping details to the OS. These are referred to as Form 0 and Form 1
>> associativity grouping. Form 0 is the older format and is now considered
>> deprecated. This patch adds another resource grouping named FORM2.
>> 
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  Documentation/powerpc/associativity.rst   | 103 ++++++++++++++
>>  arch/powerpc/include/asm/firmware.h       |   3 +-
>>  arch/powerpc/include/asm/prom.h           |   1 +
>>  arch/powerpc/kernel/prom_init.c           |   3 +-
>>  arch/powerpc/mm/numa.c                    | 157 ++++++++++++++++++----
>>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>>  6 files changed, 242 insertions(+), 26 deletions(-)
>>  create mode 100644 Documentation/powerpc/associativity.rst
>> 
>> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
>> new file mode 100644
>> index 000000000000..31cc7da2c7a6
>> --- /dev/null
>> +++ b/Documentation/powerpc/associativity.rst
>> @@ -0,0 +1,103 @@
>> +============================
>> +NUMA resource associativity
>> +=============================
>> +
>> +Associativity represents the groupings of the various platform resources into
>> +domains of substantially similar mean performance relative to resources outside
>> +of that domain. Resources subsets of a given domain that exhibit better
>> +performance relative to each other than relative to other resources subsets
>> +are represented as being members of a sub-grouping domain. This performance
>> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
>> +From the platform view, these groups are also referred to as domains.
>
> Pretty hard to decipher, but that's typical for PAPR.
>
>> +PAPR interface currently supports different ways of communicating these resource
>> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
>> +associativity grouping. Form 0 is the older format and is now considered deprecated.
>
> Nit: s/older/oldest/ since there are now >2 forms.

updated.

>
>> +Hypervisor indicates the type/form of associativity used via "ibm,architecture-vec-5 property".
>> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
>> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
>> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>> +
>> +Form 0
>> +-----
>> +Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
>> +
>> +Form 1
>> +-----
>> +With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity
>> +device tree properties are used to determine the NUMA distance between resource groups/domains.
>> +
>> +The “ibm,associativity” property contains a list of one or more numbers (domainID)
>> +representing the resource’s platform grouping domains.
>> +
>> +The “ibm,associativity-reference-points” property contains a list of one or more numbers
>> +(domainID index) that represents the 1 based ordinal in the associativity lists.
>> +The list of domainID indexes represents an increasing hierarchy of resource grouping.
>> +
>> +ex:
>> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
>> +
>> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
>> +Linux kernel computes NUMA distance between two domains by recursively comparing
>> +if they belong to the same higher-level domains. For mismatch at every higher
>> +level of the resource group, the kernel doubles the NUMA distance between the
>> +comparing domains.
>> +
>> +Form 2
>> +-------
>> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
>> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
>> +domain numbering. With numa distance computation now detached from the index value in
>> +"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
>> +ids at the same domainID index representing resource groups of different performance/latency
>> +characteristics.
>> +
>> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
>> +"ibm,architecture-vec-5" property.
>> +
>> +"ibm,numa-lookup-index-table" property contains a list of one or more numbers representing
>> +the domainIDs present in the system. The offset of the domainID in this property is
>> +used as an index while computing numa distance information via "ibm,numa-distance-table".
>> +
>> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
>> +N domainID encoded as with encode-int
>> +
>> +For ex:
>> +"ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. The offset of domainID 8 (2) is used when
>> +computing the distance of domain 8 from other domains present in the system. For the rest of
>> +this document, this offset will be referred to as domain distance offset.
>> +
>> +"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA
>> +distance between resource groups/domains present in the system.
>> +
>> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>> +The number N must be equal to the square of m where m is the number of domainIDs in the
>> +numa-lookup-index-table.
>> +
>> +For ex:
>> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
>> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
>
> This representation doesn't make it clear that the 9 is a u32, but the
> rest are u8s.

How do you suggest we specify that? I could do 9:u32 10:u8 etc. But
considering the details are explained in the paragraph above, is that
needed? 
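
FWIW, the raw bytes for the example in the document would look roughly
like below (illustration only):

  ibm,numa-distance-table =
    00 00 00 09                  /* N = 9, encode-int (u32) */
    0a 14 50 14 0a a0 50 a0 0a   /* nine u8 distance values:
                                    10, 20, 80, 20, 10, 160, 80, 160, 10 */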

>
>> +
>> +  | 0    8   40
>> +--|------------
>> +  |
>> +0 | 10   20  80
>> +  |
>> +8 | 20   10  160
>> +  |
>> +40| 80   160  10
>> +
>> +A possible "ibm,associativity" property for resources in node 0, 8 and 40
>> +
>> +{ 3, 6, 7, 0 }
>> +{ 3, 6, 9, 8 }
>> +{ 3, 6, 7, 40}
>> +
>> +With "ibm,associativity-reference-points"  { 0x3 }
>
> You haven't actually described how ibm,associativity-reference-points
> operates in Form2.

Nothing changes w.r.t. the definition of associativity-reference-points
with FORM2. It will still continue to show the increasing hierarchy of
resource groups.

>
>> +"ibm,lookup-index-table" helps in having a compact representation of distance matrix.
>> +Since domainID can be sparse, the matrix of distances can also be effectively sparse.
>> +With "ibm,lookup-index-table" we can achieve a compact representation of
>> +distance information.
>> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
>> index 60b631161360..97a3bd9ffeb9 100644
>> --- a/arch/powerpc/include/asm/firmware.h
>> +++ b/arch/powerpc/include/asm/firmware.h
>> @@ -53,6 +53,7 @@
>>  #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
>>  #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
>>  #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
>> +#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
>>  
>>  #ifndef __ASSEMBLY__
>>  
>> @@ -73,7 +74,7 @@ enum {
>>  		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
>>  		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
>>  		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
>> -		FW_FEATURE_RPT_INVALIDATE,
>> +		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
>>  	FW_FEATURE_PSERIES_ALWAYS = 0,
>>  	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
>>  	FW_FEATURE_POWERNV_ALWAYS = 0,
>> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
>> index df9fec9d232c..5c80152e8f18 100644
>> --- a/arch/powerpc/include/asm/prom.h
>> +++ b/arch/powerpc/include/asm/prom.h
>> @@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
>>  #define OV5_XCMO		0x0440	/* Page Coalescing */
>>  #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
>>  #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
>> +#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
>>  #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
>>  #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
>>  #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
>> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
>> index 5d9ea059594f..c483df6c9393 100644
>> --- a/arch/powerpc/kernel/prom_init.c
>> +++ b/arch/powerpc/kernel/prom_init.c
>> @@ -1069,7 +1069,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
>>  #else
>>  		0,
>>  #endif
>> -		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
>> +		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
>> +		OV5_FEAT(OV5_FORM2_AFFINITY),
>>  		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
>>  		.micro_checkpoint = 0,
>>  		.reserved0 = 0,
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index c6293037a103..c68846fc9550 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
>>  
>>  #define FORM0_AFFINITY 0
>>  #define FORM1_AFFINITY 1
>> +#define FORM2_AFFINITY 2
>>  static int affinity_form;
>>  
>>  #define MAX_DISTANCE_REF_POINTS 4
>>  static int max_associativity_domain_index;
>>  static const __be32 *distance_ref_points;
>>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
>> +static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
>> +	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
>> +};
>> +static int numa_id_index_table[MAX_NUMNODES] = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
>>  
>>  /*
>>   * Allocate node_to_cpumask_map based on number of available nodes
>> @@ -166,6 +171,44 @@ static void unmap_cpu_from_node(unsigned long cpu)
>>  }
>>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>>  
>> +/*
>> + * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
>> + * info is found.
>> + */
>> +static int associativity_to_nid(const __be32 *associativity)
>> +{
>> +	int nid = NUMA_NO_NODE;
>> +
>> +	if (!numa_enabled)
>> +		goto out;
>> +
>> +	if (of_read_number(associativity, 1) >= primary_domain_index)
>> +		nid = of_read_number(&associativity[primary_domain_index], 1);
>> +
>> +	/* POWER4 LPAR uses 0xffff as invalid node */
>> +	if (nid == 0xffff || nid >= nr_node_ids)
>> +		nid = NUMA_NO_NODE;
>> +out:
>> +	return nid;
>> +}
>> +
>> +static int __cpu_form2_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> +{
>> +	int dist;
>> +	int node1, node2;
>> +
>> +	node1 = associativity_to_nid(cpu1_assoc);
>> +	node2 = associativity_to_nid(cpu2_assoc);
>> +
>> +	dist = numa_distance_table[node1][node2];
>> +	if (dist <= LOCAL_DISTANCE)
>> +		return 0;
>> +	else if (dist <= REMOTE_DISTANCE)
>> +		return 1;
>> +	else
>> +		return 2;
>
> Squashing the full range of distances into just 0, 1 or 2 seems odd.
> But then, this whole cpu_distance() thing being distinct from
> node_distance() seems odd.
>
>> +}
>> +
>>  static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>>  {
>>  	int dist = 0;
>> @@ -186,8 +229,9 @@ int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>>  {
>>  	/* We should not get called with FORM0 */
>>  	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
>> -
>> -	return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
>> +	if (affinity_form == FORM1_AFFINITY)
>> +		return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
>> +	return __cpu_form2_relative_distance(cpu1_assoc, cpu2_assoc);
>>  }
>>  
>>  /* must hold reference to node during call */
>> @@ -201,7 +245,9 @@ int __node_distance(int a, int b)
>>  	int i;
>>  	int distance = LOCAL_DISTANCE;
>>  
>> -	if (affinity_form == FORM0_AFFINITY)
>> +	if (affinity_form == FORM2_AFFINITY)
>> +		return numa_distance_table[a][b];
>> +	else if (affinity_form == FORM0_AFFINITY)
>>  		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
>>  
>>  	for (i = 0; i < max_associativity_domain_index; i++) {
>
> Hmm.. couldn't we simplify this whole __node_distance function, if we
> just update numa_distance_table[][] appropriately for Form0 and Form1
> as well?

IIUC what you are suggesting is to look at the possibility of using
numa_distance_table[a][b] even for FORM1_AFFINITY? I can do that as part
of a separate patch?
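
Something like the below is what I would try in that separate patch, so
that __node_distance() becomes a plain table lookup for every form (rough
sketch, untested; the helper name is made up):

static void numa_init_distance_table_fallback(void)
{
	int a, b, i, dist;

	for_each_node(a) {
		for_each_node(b) {
			if (affinity_form == FORM0_AFFINITY) {
				numa_distance_table[a][b] = (a == b) ?
					LOCAL_DISTANCE : REMOTE_DISTANCE;
				continue;
			}
			/* FORM1: same doubling rule __node_distance() uses today */
			dist = LOCAL_DISTANCE;
			for (i = 0; i < max_associativity_domain_index; i++) {
				if (distance_lookup_table[a][i] ==
				    distance_lookup_table[b][i])
					break;
				dist *= 2;
			}
			numa_distance_table[a][b] = dist;
		}
	}
}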

>
>> @@ -216,27 +262,6 @@ int __node_distance(int a, int b)
>>  }
>>  EXPORT_SYMBOL(__node_distance);
>>  
>> -/*
>> - * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
>> - * info is found.
>> - */
>> -static int associativity_to_nid(const __be32 *associativity)
>> -{
>> -	int nid = NUMA_NO_NODE;
>> -
>> -	if (!numa_enabled)
>> -		goto out;
>> -
>> -	if (of_read_number(associativity, 1) >= primary_domain_index)
>> -		nid = of_read_number(&associativity[primary_domain_index], 1);
>> -
>> -	/* POWER4 LPAR uses 0xffff as invalid node */
>> -	if (nid == 0xffff || nid >= nr_node_ids)
>> -		nid = NUMA_NO_NODE;
>> -out:
>> -	return nid;
>> -}
>> -
>>  /* Returns the nid associated with the given device tree node,
>>   * or -1 if not found.
>>   */
>> @@ -305,12 +330,84 @@ static void initialize_form1_numa_distance(struct device_node *node)
>>   */
>>  void update_numa_distance(struct device_node *node)
>>  {
>> +	int nid;
>> +
>>  	if (affinity_form == FORM0_AFFINITY)
>>  		return;
>>  	else if (affinity_form == FORM1_AFFINITY) {
>>  		initialize_form1_numa_distance(node);
>>  		return;
>>  	}
>> +
>> +	/* FORM2 affinity  */
>> +	nid = of_node_to_nid_single(node);
>> +	if (nid == NUMA_NO_NODE)
>> +		return;
>> +
>> +	/*
>> +	 * With FORM2 we expect NUMA distance of all possible NUMA
>> +	 * nodes to be provided during boot.
>> +	 */
>> +	WARN(numa_distance_table[nid][nid] == -1,
>> +	     "NUMA distance details for node %d not provided\n", nid);
>> +}
>> +
>> +/*
>> + * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
>> + * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
>> + */
>> +static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
>> +{
>> +	int i, j;
>> +	const __u8 *numa_dist_table;
>> +	const __be32 *numa_lookup_index;
>> +	int numa_dist_table_length;
>> +	int max_numa_index, distance_index;
>> +
>> +	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
>> +	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
>> +
>> +	/* first element of the array is the size and is encode-int */
>> +	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
>> +	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
>> +	/* Skip the size which is encoded int */
>> +	numa_dist_table += sizeof(__be32);
>> +
>> +	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d\n",
>> +		 numa_dist_table_length, max_numa_index);
>> +
>> +	for (i = 0; i < max_numa_index; i++)
>> +		/* +1 skip the max_numa_index in the property */
>> +		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
>> +
>> +
>> +	if (numa_dist_table_length != max_numa_index * max_numa_index) {
>> +
>> +		WARN(1, "Wrong NUMA distance information\n");
>> +		/* consider everybody else just remote. */
>> +		for (i = 0;  i < max_numa_index; i++) {
>> +			for (j = 0; j < max_numa_index; j++) {
>> +				int nodeA = numa_id_index_table[i];
>> +				int nodeB = numa_id_index_table[j];
>> +
>> +				if (nodeA == nodeB)
>> +					numa_distance_table[nodeA][nodeB] = LOCAL_DISTANCE;
>> +				else
>> +					numa_distance_table[nodeA][nodeB] = REMOTE_DISTANCE;
>> +			}
>> +		}
>> +	}
>> +
>> +	distance_index = 0;
>> +	for (i = 0;  i < max_numa_index; i++) {
>> +		for (j = 0; j < max_numa_index; j++) {
>> +			int nodeA = numa_id_index_table[i];
>> +			int nodeB = numa_id_index_table[j];
>> +
>> +			numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index++];
>> +			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
>> +		}
>> +	}
>>  }
>>  
>>  static int __init find_primary_domain_index(void)
>> @@ -323,6 +420,9 @@ static int __init find_primary_domain_index(void)
>>  	 */
>>  	if (firmware_has_feature(FW_FEATURE_OPAL)) {
>>  		affinity_form = FORM1_AFFINITY;
>> +	} else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) {
>> +		dbg("Using form 2 affinity\n");
>> +		affinity_form = FORM2_AFFINITY;
>>  	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
>>  		dbg("Using form 1 affinity\n");
>>  		affinity_form = FORM1_AFFINITY;
>> @@ -367,8 +467,17 @@ static int __init find_primary_domain_index(void)
>>  
>>  		index = of_read_number(&distance_ref_points[1], 1);
>>  	} else {
>> +		/*
>> +		 * Both FORM1 and FORM2 affinity find the primary domain details
>> +		 * at the same offset.
>> +		 */
>>  		index = of_read_number(distance_ref_points, 1);
>>  	}
>> +	/*
>> +	 * If it is FORM2 also initialize the distance table here.
>> +	 */
>> +	if (affinity_form == FORM2_AFFINITY)
>> +		initialize_form2_numa_distance_lookup_table(root);
>
> Ew.  Calling a function called "find_primary_domain_index" to also
> initialize the main distance table is needlessly counterintuitive.
> Move this call to parse_numa_properties().

The reason I ended up doing it here is because 'root' is already fetched
here. But I agree it is confusing. I will move the fetching of root inside
initialize_form2_numa_distance_lookup_table() and move the call out of
find_primary_domain_index().

modified   arch/powerpc/mm/numa.c
@@ -355,14 +355,22 @@ void update_numa_distance(struct device_node *node)
  * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
  * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
  */
-static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
+static void initialize_form2_numa_distance_lookup_table()
 {
 	int i, j;
+	struct device_node *root;
 	const __u8 *numa_dist_table;
 	const __be32 *numa_lookup_index;
 	int numa_dist_table_length;
 	int max_numa_index, distance_index;
 
+	if (firmware_has_feature(FW_FEATURE_OPAL))
+		root = of_find_node_by_path("/ibm,opal");
+	else
+		root = of_find_node_by_path("/rtas");
+	if (!root)
+		root = of_find_node_by_path("/");
+
 	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
 	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
 
@@ -407,6 +415,7 @@ static void initialize_form2_numa_distance_lookup_table(struct device_node *root
 			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
 		}
 	}
+	of_node_put(root);
 }
 
 static int __init find_primary_domain_index(void)
@@ -472,12 +481,6 @@ static int __init find_primary_domain_index(void)
 		 */
 		index = of_read_number(distance_ref_points, 1);
 	}
-	/*
-	 * If it is FORM2 also initialize the distance table here.
-	 */
-	if (affinity_form == FORM2_AFFINITY)
-		initialize_form2_numa_distance_lookup_table(root);
-
 	/*
 	 * Warn and cap if the hardware supports more than
 	 * MAX_DISTANCE_REF_POINTS domains.
@@ -916,6 +919,12 @@ static int __init parse_numa_properties(void)
 
 	dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
 
+	/*
+	 * If it is FORM2 also initialize the distance table here.
+	 */
+	if (affinity_form == FORM2_AFFINITY)
+		initialize_form2_numa_distance_lookup_table();
+
 	/*
 	 * Even though we connect cpus to numa domains later in SMP
 	 * init, we need to know the node ids now. This is because

-aneesh


* Re: [PATCH v5 1/6] powerpc/pseries: rename min_common_depth to primary_domain_index
  2021-07-22  5:17       ` Aneesh Kumar K.V
@ 2021-07-26  2:28         ` David Gibson
  0 siblings, 0 replies; 30+ messages in thread
From: David Gibson @ 2021-07-26  2:28 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev


On Thu, Jul 22, 2021 at 10:47:49AM +0530, Aneesh Kumar K.V wrote:
> On 7/22/21 8:06 AM, David Gibson wrote:
> > On Thu, Jul 22, 2021 at 11:59:15AM +1000, David Gibson wrote:
> > > On Mon, Jun 28, 2021 at 08:41:12PM +0530, Aneesh Kumar K.V wrote:
> > > > No functional change in this patch.
> > > 
> > > The new name does not match how you describe "primary domain index" in
> > > the documentation from patch 6/6.  There it comes from the values in
> > > associativity-reference-points, but here it simply comes from the
> > > lengths of all the associativity properties.
> > 
> > No, sorry, I misread this code... misled by the old name, so it's a
> > good thing you're changing it.
> > 
> > But.. I'm still not sure the new name is accurate, either...
> > 
> > [snip]
> > > >   	if (form1_affinity) {
> > > > -		depth = of_read_number(distance_ref_points, 1);
> > > > +		index = of_read_number(distance_ref_points, 1);
> > 
> > AFACIT distance_ref_points hasn't been altered from the
> > of_get_property() at this point, so isn't this setting depth / index
> > to the number of entries in ref-points, rather than the value of the
> > first entry (which is what primary domain index is supposed to be).
> > 
> 
> ibm,associativity-reference-points property format is as below.
> 
> # lsprop  ibm,associativity-reference-points
> ibm,associativity-reference-points
>                  00000004 00000002
> 
> it doesn't have the number of elements as the first item.
> 
> For FORM1, the 1st element is the NUMA boundary index/primary_domain_index.
> For FORM0, the 2nd element is the NUMA boundary index/primary_domain_index.

Sorry, my bad.  I foolishly expected consistency from PAPR.
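
For reference, a minimal sketch of how each form picks the primary
domain index from that property (using the lsprop example above, and
mirroring what find_primary_domain_index() does in this series):

	/* ibm,associativity-reference-points = <0x4 0x2> */
	int index;

	if (affinity_form == FORM0_AFFINITY)
		/* FORM0: the second entry is the NUMA boundary index */
		index = of_read_number(&distance_ref_points[1], 1);	/* 2 */
	else
		/* FORM1 (and FORM2): the first entry is used */
		index = of_read_number(distance_ref_points, 1);		/* 4 */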

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths
  2021-07-22  7:07     ` Aneesh Kumar K.V
@ 2021-07-26  2:37       ` David Gibson
  2021-07-27  3:32         ` Aneesh Kumar K.V
  0 siblings, 1 reply; 30+ messages in thread
From: David Gibson @ 2021-07-26  2:37 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev


On Thu, Jul 22, 2021 at 12:37:46PM +0530, Aneesh Kumar K.V wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
> 
> > On Mon, Jun 28, 2021 at 08:41:15PM +0530, Aneesh Kumar K.V wrote:
> >> The associativity details of the newly added resourced are collected from
> >> the hypervisor via "ibm,configure-connector" rtas call. Update the numa
> >> distance details of the newly added numa node after the above call.
> >> 
> >> Instead of updating NUMA distance every time we lookup a node id
> >> from the associativity property, add helpers that can be used
> >> during boot which does this only once. Also remove the distance
> >> update from node id lookup helpers.
> >> 
> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> >> ---
> >>  arch/powerpc/mm/numa.c                        | 173 +++++++++++++-----
> >>  arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
> >>  .../platforms/pseries/hotplug-memory.c        |   2 +
> >>  arch/powerpc/platforms/pseries/pseries.h      |   1 +
> >>  4 files changed, 132 insertions(+), 46 deletions(-)
> >> 
> >> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> >> index 0ec16999beef..7b142f79d600 100644
> >> --- a/arch/powerpc/mm/numa.c
> >> +++ b/arch/powerpc/mm/numa.c
> >> @@ -208,22 +208,6 @@ int __node_distance(int a, int b)
> >>  }
> >>  EXPORT_SYMBOL(__node_distance);
> >>  
> >> -static void initialize_distance_lookup_table(int nid,
> >> -		const __be32 *associativity)
> >> -{
> >> -	int i;
> >> -
> >> -	if (affinity_form != FORM1_AFFINITY)
> >> -		return;
> >> -
> >> -	for (i = 0; i < max_associativity_domain_index; i++) {
> >> -		const __be32 *entry;
> >> -
> >> -		entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
> >> -		distance_lookup_table[nid][i] = of_read_number(entry, 1);
> >> -	}
> >> -}
> >> -
> >>  /*
> >>   * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
> >>   * info is found.
> >> @@ -241,15 +225,6 @@ static int associativity_to_nid(const __be32 *associativity)
> >>  	/* POWER4 LPAR uses 0xffff as invalid node */
> >>  	if (nid == 0xffff || nid >= nr_node_ids)
> >>  		nid = NUMA_NO_NODE;
> >> -
> >> -	if (nid > 0 &&
> >> -		of_read_number(associativity, 1) >= max_associativity_domain_index) {
> >> -		/*
> >> -		 * Skip the length field and send start of associativity array
> >> -		 */
> >> -		initialize_distance_lookup_table(nid, associativity + 1);
> >> -	}
> >> -
> >>  out:
> >>  	return nid;
> >>  }
> >> @@ -287,6 +262,49 @@ int of_node_to_nid(struct device_node *device)
> >>  }
> >>  EXPORT_SYMBOL(of_node_to_nid);
> >>  
> >> +static void __initialize_form1_numa_distance(const __be32 *associativity)
> >> +{
> >> +	int i, nid;
> >> +
> >> +	if (affinity_form != FORM1_AFFINITY)
> >
> > Since this shouldn't be called on a !form1 system, this could be a WARN_ON().
> 
> With the way we call the functions currently, instead of doing
> 
> if (affinity_form == FORM1_AFFINITY)
>     __initialize_form1_numa_distance()
> 
> at each call site, we avoid doing the if check in multiple places. For
> example, parse_numa_properties() will fetch the associativity array to find
> the details of an online node and set it online. We use the same code path
> to initialize the distance.
> 
> 		if (__vphn_get_associativity(i, vphn_assoc) == 0) {
> 			nid = associativity_to_nid(vphn_assoc);
> 			__initialize_form1_numa_distance(vphn_assoc);
> 		} else {
> 
> 			cpu = of_get_cpu_node(i, NULL);
> 			BUG_ON(!cpu);
> 
> 			associativity = of_get_associativity(cpu);
> 			if (associativity) {
> 				nid = associativity_to_nid(associativity);
> 				__initialize_form1_numa_distance(associativity);
> 			}
> 
> We avoid the if (affinity_form == FORM1_AFFINITY) check there by
> moving the check inside __initialize_form1_numa_distance().

Oh.. ok.  The only caller I spotted was already doing a test against
affinity_form.

> >> +		return;
> >> +
> >> +	if (of_read_number(associativity, 1) >= primary_domain_index) {
> >> +		nid = of_read_number(&associativity[primary_domain_index], 1);
> >
> > This computes the nid from the assoc array independently of
> > associativity_to_nid, which doesn't seem like a good idea.  Wouldn't
> > it be better to call assocaitivity_to_nid(), then make the next bit
> > conditional on nid !== NUMA_NO_NODE?
> 
> @@ -302,9 +302,8 @@ static void __initialize_form1_numa_distance(const __be32 *associativity)
>  	if (affinity_form != FORM1_AFFINITY)
>  		return;
>  
> -	if (of_read_number(associativity, 1) >= primary_domain_index) {
> -		nid = of_read_number(&associativity[primary_domain_index], 1);
> -
> +	nid = associativity_to_nid(associativity);
> +	if (nid != NUMA_NO_NODE) {
>  		for (i = 0; i < distance_ref_points_depth; i++) {
>  			const __be32 *entry;

Right.

> >
> >> +
> >> +		for (i = 0; i < max_associativity_domain_index; i++) {
> >> +			const __be32 *entry;
> >> +
> >> +			entry = &associativity[be32_to_cpu(distance_ref_points[i])];
> >> +			distance_lookup_table[nid][i] = of_read_number(entry, 1);
> >> +		}
> >> +	}
> >> +}
> >> +
> >> +static void initialize_form1_numa_distance(struct device_node *node)
> >> +{
> >> +	const __be32 *associativity;
> >> +
> >> +	associativity = of_get_associativity(node);
> >> +	if (!associativity)
> >> +		return;
> >> +
> >> +	__initialize_form1_numa_distance(associativity);
> >> +}
> >> +
> >> +/*
> >> + * Used to update distance information w.r.t newly added node.
> >> + */
> >> +void update_numa_distance(struct device_node *node)
> >> +{
> >> +	if (affinity_form == FORM0_AFFINITY)
> >> +		return;
> >> +	else if (affinity_form == FORM1_AFFINITY) {
> >> +		initialize_form1_numa_distance(node);
> >> +		return;
> >> +	}
> >> +}
> >> +
> >>  static int __init find_primary_domain_index(void)
> >>  {
> >>  	int index;
> >> @@ -433,6 +451,48 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
> >>  	return 0;
> >>  }
> >>  
> >> +static int get_nid_and_numa_distance(struct drmem_lmb *lmb)
> >> +{
> >> +	struct assoc_arrays aa = { .arrays = NULL };
> >> +	int default_nid = NUMA_NO_NODE;
> >> +	int nid = default_nid;
> >> +	int rc, index;
> >> +
> >> +	if ((primary_domain_index < 0) || !numa_enabled)
> >
> > Under what circumstances could you get primary_domain_index < 0?
> 
> IIUC that is to handle failure to parse the device tree; see commit
> ea9f5b702fe0215188fba2eda117419e4ae90a67.

Ok.

> >
> >> +		return default_nid;

Returning NUMA_NO_NODE explicitly, rather than an alias to it might be
clearer here, but it's not a big detail.

> >> +
> >> +	rc = of_get_assoc_arrays(&aa);
> >> +	if (rc)
> >> +		return default_nid;
> >> +
> >> +	if (primary_domain_index <= aa.array_sz &&
> >> +	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
> >> +		index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
> >
> > Does anywhere verify that primary_domain_index <= aa.array_sz?
> 
> That is the first part of the check?

Oh, sorry, missed that.  I think I was expecting it to be an early
exit, rather than folded into the rest of this complex condition.

> 
> >
> >> +		nid = of_read_number(&aa.arrays[index], 1);
> >> +
> >> +		if (nid == 0xffff || nid >= nr_node_ids)
> >> +			nid = default_nid;
> >> +		if (nid > 0 && affinity_form == FORM1_AFFINITY) {
> >> +			int i;
> >> +			const __be32 *associativity;
> >> +
> >> +			index = lmb->aa_index * aa.array_sz;
> >> +			associativity = &aa.arrays[index];
> >> +			/*
> >> +			 * lookup array associativity entries have different format
> >> +			 * There is no length of the array as the first element.
> >
> > The difference is very small, and this is not a hot path.  Couldn't
> > you reduce a chunk of code by prepending aa.array_sz, then re-using
> > __initialize_form1_numa_distance.  Or even making
> > __initialize_form1_numa_distance() take the length as a parameter.
> 
> The changes are small but confusing w.r.t how we look at the
> associativity-lookup-arrays. The way we interpret associativity array
> and associativity lookup array using primary_domain_index is different.
> Hence the '-1' in the node lookup here.

They're really not, though.  It's exactly the same interpretation of
the associativity array itself - it's just that one of them has the
array prepended with a (redundant) length.  So you can make
__initialize_form1_numa_distance() work on the "bare" associativity
array, with a given length.  Here you call it with aa.array_sz as the
length, and in the other place you call it with prop[0] as the length.

> 
> 	index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
> 	nid = of_read_number(&aa.arrays[index], 1);
> 
> 
> >
> >> +			 */
> >> +			for (i = 0; i < max_associativity_domain_index; i++) {
> >> +				const __be32 *entry;
> >> +
> >> +				entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
> >
> > Does anywhere verify that distance_ref_points[i] <= aa.array_size for
> > every i?
> 
> We do check for 
> 
> 	if (primary_domain_index <= aa.array_sz &&

Right, but that doesn't check the other distance_ref_points entries.
Not that there's any reason to have extra entries with Form2, but we
still don't want stray array accesses.

> 
> >
> >> +				distance_lookup_table[nid][i] = of_read_number(entry, 1);
> >> +			}
> >> +		}
> >> +	}
> >> +	return nid;
> >> +}
> >> +
> >>  /*
> >>   * This is like of_node_to_nid_single() for memory represented in the
> >>   * ibm,dynamic-reconfiguration-memory node.
> >> @@ -458,21 +518,14 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
> >>  
> >>  		if (nid == 0xffff || nid >= nr_node_ids)
> >>  			nid = default_nid;
> >> -
> >> -		if (nid > 0) {
> >> -			index = lmb->aa_index * aa.array_sz;
> >> -			initialize_distance_lookup_table(nid,
> >> -							&aa.arrays[index]);
> >> -		}
> >>  	}
> >> -
> >>  	return nid;
> >>  }
> >>  
> >>  #ifdef CONFIG_PPC_SPLPAR
> >> -static int vphn_get_nid(long lcpu)
> >> +
> >> +static int __vphn_get_associativity(long lcpu, __be32 *associativity)
> >>  {
> >> -	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
> >>  	long rc, hwid;
> >>  
> >>  	/*
> >> @@ -492,10 +545,22 @@ static int vphn_get_nid(long lcpu)
> >>  
> >>  		rc = hcall_vphn(hwid, VPHN_FLAG_VCPU, associativity);
> >>  		if (rc == H_SUCCESS)
> >> -			return associativity_to_nid(associativity);
> >> +			return 0;
> >>  	}
> >>  
> >> +	return -1;
> >> +}
> >> +
> >> +static int vphn_get_nid(long lcpu)
> >> +{
> >> +	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
> >> +
> >> +
> >> +	if (!__vphn_get_associativity(lcpu, associativity))
> >> +		return associativity_to_nid(associativity);
> >> +
> >>  	return NUMA_NO_NODE;
> >> +
> >>  }
> >>  #else
> >>  static int vphn_get_nid(long unused)
> >> @@ -692,7 +757,7 @@ static int __init numa_setup_drmem_lmb(struct drmem_lmb *lmb,
> >>  			size = read_n_cells(n_mem_size_cells, usm);
> >>  		}
> >>  
> >> -		nid = of_drconf_to_nid_single(lmb);
> >> +		nid = get_nid_and_numa_distance(lmb);
> >>  		fake_numa_create_new_node(((base + size) >> PAGE_SHIFT),
> >>  					  &nid);
> >>  		node_set_online(nid);
> >> @@ -709,6 +774,7 @@ static int __init parse_numa_properties(void)
> >>  	struct device_node *memory;
> >>  	int default_nid = 0;
> >>  	unsigned long i;
> >> +	const __be32 *associativity;
> >>  
> >>  	if (numa_enabled == 0) {
> >>  		printk(KERN_WARNING "NUMA disabled by user\n");
> >> @@ -734,18 +800,30 @@ static int __init parse_numa_properties(void)
> >>  	 * each node to be onlined must have NODE_DATA etc backing it.
> >>  	 */
> >>  	for_each_present_cpu(i) {
> >> +		__be32 vphn_assoc[VPHN_ASSOC_BUFSIZE];
> >>  		struct device_node *cpu;
> >> -		int nid = vphn_get_nid(i);
> >> +		int nid = NUMA_NO_NODE;
> >>  
> >> -		/*
> >> -		 * Don't fall back to default_nid yet -- we will plug
> >> -		 * cpus into nodes once the memory scan has discovered
> >> -		 * the topology.
> >> -		 */
> >> -		if (nid == NUMA_NO_NODE) {
> >> +		memset(vphn_assoc, 0, VPHN_ASSOC_BUFSIZE * sizeof(__be32));
> >
> > What's the memset() for?  AFAICT you only look at vphn_assoc in the
> > branch where __vphn_get_associativity() succeeds.
> 
> That was done to match the existing code. We do use a zero filled array
> when making that hcall in this code path. I don't see us doing that
> everywhere. But didn't want to change that behaviour in this patch.
> 
> -static int vphn_get_nid(long lcpu)
> +
> +static int __vphn_get_associativity(long lcpu, __be32 *associativity)
>  {
>  -	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
>   	long rc, hwid;

Ok, that makes sense.

> 
> >
> >> +
> >> +		if (__vphn_get_associativity(i, vphn_assoc) == 0) {
> >> +			nid = associativity_to_nid(vphn_assoc);
> >> +			__initialize_form1_numa_distance(vphn_assoc);
> >> +		} else {
> >> +
> >> +			/*
> >> +			 * Don't fall back to default_nid yet -- we will plug
> >> +			 * cpus into nodes once the memory scan has discovered
> >> +			 * the topology.
> >> +			 */
> >>  			cpu = of_get_cpu_node(i, NULL);
> >>  			BUG_ON(!cpu);
> >> -			nid = of_node_to_nid_single(cpu);
> >> +
> >> +			associativity = of_get_associativity(cpu);
> >> +			if (associativity) {
> >> +				nid = associativity_to_nid(associativity);
> >> +				__initialize_form1_numa_distance(associativity);
> >> +			}
> >>  			of_node_put(cpu);
> >>  		}
> >>  
> >> @@ -781,8 +859,11 @@ static int __init parse_numa_properties(void)
> >>  		 * have associativity properties.  If none, then
> >>  		 * everything goes to default_nid.
> >>  		 */
> >> -		nid = of_node_to_nid_single(memory);
> >> -		if (nid < 0)
> >> +		associativity = of_get_associativity(memory);
> >> +		if (associativity) {
> >> +			nid = associativity_to_nid(associativity);
> >> +			__initialize_form1_numa_distance(associativity);
> >> +		} else
> >>  			nid = default_nid;
> >>  
> >>  		fake_numa_create_new_node(((start + size) >> PAGE_SHIFT), &nid);
> >> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> >> index 7e970f81d8ff..778b6ab35f0d 100644
> >> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> >> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> >> @@ -498,6 +498,8 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
> >>  		return saved_rc;
> >>  	}
> >>  
> >> +	update_numa_distance(dn);
> >> +
> >>  	rc = dlpar_online_cpu(dn);
> >>  	if (rc) {
> >>  		saved_rc = rc;
> >> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> >> index 36f66556a7c6..40d350f31a34 100644
> >> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> >> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> >> @@ -180,6 +180,8 @@ static int update_lmb_associativity_index(struct drmem_lmb *lmb)
> >>  		return -ENODEV;
> >>  	}
> >>  
> >> +	update_numa_distance(lmb_node);
> >> +
> >>  	dr_node = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
> >>  	if (!dr_node) {
> >>  		dlpar_free_cc_nodes(lmb_node);
> >> diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
> >> index 1f051a786fb3..663a0859cf13 100644
> >> --- a/arch/powerpc/platforms/pseries/pseries.h
> >> +++ b/arch/powerpc/platforms/pseries/pseries.h
> >> @@ -113,4 +113,5 @@ extern u32 pseries_security_flavor;
> >>  void pseries_setup_security_mitigations(void);
> >>  void pseries_lpar_read_hblkrm_characteristics(void);
> >>  
> >> +void update_numa_distance(struct device_node *node);
> >>  #endif /* _PSERIES_PSERIES_H */
> >
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH v5 5/6] powerpc/pseries: Add a helper for form1 cpu distance
  2021-07-22  7:09     ` Aneesh Kumar K.V
@ 2021-07-26  2:38       ` David Gibson
  0 siblings, 0 replies; 30+ messages in thread
From: David Gibson @ 2021-07-26  2:38 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev


On Thu, Jul 22, 2021 at 12:39:27PM +0530, Aneesh Kumar K.V wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
> 
> > On Mon, Jun 28, 2021 at 08:41:16PM +0530, Aneesh Kumar K.V wrote:
> >> This helper is only used with the dispatch trace log collection.
> >> A later patch will add Form2 affinity support and this change helps
> >> in keeping that simpler. Also add a comment explaining we don't expect
> >> the code to be called with FORM0
> >> 
> >> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> >
> > What makes it a "relative_distance" rather than just a "distance"?
> 
> I added that to indicate that the function is not returning the actual
> distance but a number indicative of 'near', 'far' etc. (it actually returns
> 1, 2 etc).

Hm... ok.  To me at least it doesn't really convey that meaning, but
then I'm not sure what would.  Taken literally, "relative distance" means the
distance relative to some other object, but then all the NUMA
distances are that - the distance of one node relative to another.
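
For what it's worth, the Form2 variant later in the series buckets the
raw table distance into just three such values (a minimal illustration,
assuming the usual LOCAL_DISTANCE == 10 and REMOTE_DISTANCE == 20):

/* Illustration only: the "relative" value is a coarse bucket, not the
 * raw node distance taken from ibm,numa-distance-table.
 */
static int form2_relative_bucket(int dist)
{
	if (dist <= LOCAL_DISTANCE)		/* 10 or less: same node */
		return 0;
	if (dist <= REMOTE_DISTANCE)		/* 11..20: nearby node   */
		return 1;
	return 2;				/* larger, e.g. 222: far */
}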

> >> ---
> >>  arch/powerpc/include/asm/topology.h   |  4 ++--
> >>  arch/powerpc/mm/numa.c                | 10 +++++++++-
> >>  arch/powerpc/platforms/pseries/lpar.c |  4 ++--
> >>  3 files changed, 13 insertions(+), 5 deletions(-)
> >> 
> >> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> >> index e4db64c0e184..ac8b5ed79832 100644
> >> --- a/arch/powerpc/include/asm/topology.h
> >> +++ b/arch/powerpc/include/asm/topology.h
> >> @@ -36,7 +36,7 @@ static inline int pcibus_to_node(struct pci_bus *bus)
> >>  				 cpu_all_mask :				\
> >>  				 cpumask_of_node(pcibus_to_node(bus)))
> >>  
> >> -extern int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
> >> +int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
> >>  extern int __node_distance(int, int);
> >>  #define node_distance(a, b) __node_distance(a, b)
> >>  
> >> @@ -83,7 +83,7 @@ static inline void sysfs_remove_device_from_node(struct device *dev,
> >>  
> >>  static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) {}
> >>  
> >> -static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> >> +static inline int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> >>  {
> >>  	return 0;
> >>  }
> >> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> >> index 7b142f79d600..c6293037a103 100644
> >> --- a/arch/powerpc/mm/numa.c
> >> +++ b/arch/powerpc/mm/numa.c
> >> @@ -166,7 +166,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
> >>  }
> >>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
> >>  
> >> -int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> >> +static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> >>  {
> >>  	int dist = 0;
> >>  
> >> @@ -182,6 +182,14 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> >>  	return dist;
> >>  }
> >>  
> >> +int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> >> +{
> >> +	/* We should not get called with FORM0 */
> >> +	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
> >> +
> >> +	return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
> >> +}
> >> +
> >>  /* must hold reference to node during call */
> >>  static const __be32 *of_get_associativity(struct device_node *dev)
> >>  {
> >> diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
> >> index dab356e3ff87..afefbdfe768d 100644
> >> --- a/arch/powerpc/platforms/pseries/lpar.c
> >> +++ b/arch/powerpc/platforms/pseries/lpar.c
> >> @@ -261,7 +261,7 @@ static int cpu_relative_dispatch_distance(int last_disp_cpu, int cur_disp_cpu)
> >>  	if (!last_disp_cpu_assoc || !cur_disp_cpu_assoc)
> >>  		return -EIO;
> >>  
> >> -	return cpu_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
> >> +	return cpu_relative_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
> >>  }
> >>  
> >>  static int cpu_home_node_dispatch_distance(int disp_cpu)
> >> @@ -281,7 +281,7 @@ static int cpu_home_node_dispatch_distance(int disp_cpu)
> >>  	if (!disp_cpu_assoc || !vcpu_assoc)
> >>  		return -EIO;
> >>  
> >> -	return cpu_distance(disp_cpu_assoc, vcpu_assoc);
> >> +	return cpu_relative_distance(disp_cpu_assoc, vcpu_assoc);
> >>  }
> >>  
> >>  static void update_vcpu_disp_stat(int disp_cpu)
> >
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH v5 6/6] powerpc/pseries: Add support for FORM2 associativity
  2021-07-22  7:34     ` Aneesh Kumar K.V
@ 2021-07-26  2:41       ` David Gibson
  0 siblings, 0 replies; 30+ messages in thread
From: David Gibson @ 2021-07-26  2:41 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev


On Thu, Jul 22, 2021 at 01:04:42PM +0530, Aneesh Kumar K.V wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
> 
> > On Mon, Jun 28, 2021 at 08:41:17PM +0530, Aneesh Kumar K.V wrote:
> >> PAPR interface currently supports two different ways of communicating resource
> >> grouping details to the OS. These are referred to as Form 0 and Form 1
> >> associativity grouping. Form 0 is the older format and is now considered
> >> deprecated. This patch adds another resource grouping named FORM2.
> >> 
> >> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> >> ---
> >>  Documentation/powerpc/associativity.rst   | 103 ++++++++++++++
> >>  arch/powerpc/include/asm/firmware.h       |   3 +-
> >>  arch/powerpc/include/asm/prom.h           |   1 +
> >>  arch/powerpc/kernel/prom_init.c           |   3 +-
> >>  arch/powerpc/mm/numa.c                    | 157 ++++++++++++++++++----
> >>  arch/powerpc/platforms/pseries/firmware.c |   1 +
> >>  6 files changed, 242 insertions(+), 26 deletions(-)
> >>  create mode 100644 Documentation/powerpc/associativity.rst
> >> 
> >> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
> >> new file mode 100644
> >> index 000000000000..31cc7da2c7a6
> >> --- /dev/null
> >> +++ b/Documentation/powerpc/associativity.rst
> >> @@ -0,0 +1,103 @@
> >> +============================
> >> +NUMA resource associativity
> >> +=============================
> >> +
> >> +Associativity represents the groupings of the various platform resources into
> >> +domains of substantially similar mean performance relative to resources outside
> >> +of that domain. Resources subsets of a given domain that exhibit better
> >> +performance relative to each other than relative to other resources subsets
> >> +are represented as being members of a sub-grouping domain. This performance
> >> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
> >> +From the platform view, these groups are also referred to as domains.
> >
> > Pretty hard to decipher, but that's typical for PAPR.
> >
> >> +PAPR interface currently supports different ways of communicating these resource
> >> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
> >> +associativity grouping. Form 0 is the older format and is now considered deprecated.
> >
> > Nit: s/older/oldest/ since there are now >2 forms.
> 
> updated.
> 
> >
> >> +Hypervisor indicates the type/form of associativity used via "ibm,architecture-vec-5 property".
> >> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
> >> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
> >> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
> >> +
> >> +Form 0
> >> +-----
> >> +Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
> >> +
> >> +Form 1
> >> +-----
> >> +With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity
> >> +device tree properties are used to determine the NUMA distance between resource groups/domains.
> >> +
> >> +The “ibm,associativity” property contains a list of one or more numbers (domainID)
> >> +representing the resource’s platform grouping domains.
> >> +
> >> +The “ibm,associativity-reference-points” property contains a list of one or more numbers
> >> +(domainID index) that represents the 1 based ordinal in the associativity lists.
> >> +The list of domainID indexes represents an increasing hierarchy of resource grouping.
> >> +
> >> +ex:
> >> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
> >> +
> >> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
> >> +Linux kernel computes NUMA distance between two domains by recursively comparing
> >> +if they belong to the same higher-level domains. For mismatch at every higher
> >> +level of the resource group, the kernel doubles the NUMA distance between the
> >> +comparing domains.
> >> +
> >> +Form 2
> >> +-------
> >> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
> >> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
> >> +domain numbering. With numa distance computation now detached from the index value in
> >> +"ibm,associativity-reference-points" property, Form 2 allows a large number of primary domain
> >> +ids at the same domainID index representing resource groups of different performance/latency
> >> +characteristics.
> >> +
> >> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
> >> +"ibm,architecture-vec-5" property.
> >> +
> >> +"ibm,numa-lookup-index-table" property contains a list of one or more numbers representing
> >> +the domainIDs present in the system. The offset of the domainID in this property is
> >> +used as an index while computing numa distance information via "ibm,numa-distance-table".
> >> +
> >> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
> >> +N domainID encoded as with encode-int
> >> +
> >> +For ex:
> >> +"ibm,numa-lookup-index-table" =  {4, 0, 8, 250, 252}. The offset of domainID 8 (2) is used when
> >> +computing the distance of domain 8 from other domains present in the system. For the rest of
> >> +this document, this offset will be referred to as domain distance offset.
> >> +
> >> +"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA
> >> +distance between resource groups/domains present in the system.
> >> +
> >> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
> >> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
> >> +The number N must be equal to the square of m where m is the number of domainIDs in the
> >> +numa-lookup-index-table.
> >> +
> >> +For ex:
> >> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
> >> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
> >
> > This representation doesn't make it clear that the 9 is a u32, but the
> > rest are u8s.
> 
> How do you suggest we specify that? I could do 9:u32 10:u8 etc. But
> considering the details are explained in the paragraph above, is that
> needed?

Yes, I think it is needed.  The examples are, honestly, a lot easier
to read and follow than the PAPR-ese text, so people are much more
likely to be looking at those than parsing the minutiae of the text.
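
As a concrete illustration, that mixed encoding parses as below
(mirroring the parsing in the patch: a leading encode-int count, then
one byte per distance value; root is assumed to be the node carrying
the FORM2 properties):

	/*
	 * ibm,numa-distance-table = {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
	 * is a 4-byte big-endian count followed by nine single-byte values:
	 *   00 00 00 09  0a 14 50  14 0a a0  50 a0 0a
	 */
	const __u8 *p = of_get_property(root, "ibm,numa-distance-table", NULL);
	int i, n = of_read_number((const __be32 *)p, 1);	/* 9, encode-int */

	p += sizeof(__be32);					/* the rest are u8 */
	for (i = 0; i < n; i++)
		pr_debug("distance[%d] = %d\n", i, p[i]);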

> >> +
> >> +  | 0    8   40
> >> +--|------------
> >> +  |
> >> +0 | 10   20  80
> >> +  |
> >> +8 | 20   10  160
> >> +  |
> >> +40| 80   160  10
> >> +
> >> +A possible "ibm,associativity" property for resources in node 0, 8 and 40
> >> +
> >> +{ 3, 6, 7, 0 }
> >> +{ 3, 6, 9, 8 }
> >> +{ 3, 6, 7, 40}
> >> +
> >> +With "ibm,associativity-reference-points"  { 0x3 }
> >
> > You haven't actually described how ibm,associativity-reference-points
> > operates in Form2.
> 
> Nothing changes w.r.t the definition of associativity-reference-points
> with FORM2. It will still continue to show the increasing hierarchy of
> resource groups.

I guess, except that really none of them matter except the primary any
more.

> 
> >
> >> +"ibm,lookup-index-table" helps in having a compact representation of distance matrix.
> >> +Since domainID can be sparse, the matrix of distances can also be effectively sparse.
> >> +With "ibm,lookup-index-table" we can achieve a compact representation of
> >> +distance information.
> >> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
> >> index 60b631161360..97a3bd9ffeb9 100644
> >> --- a/arch/powerpc/include/asm/firmware.h
> >> +++ b/arch/powerpc/include/asm/firmware.h
> >> @@ -53,6 +53,7 @@
> >>  #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
> >>  #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
> >>  #define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
> >> +#define FW_FEATURE_FORM2_AFFINITY ASM_CONST(0x0000020000000000)
> >>  
> >>  #ifndef __ASSEMBLY__
> >>  
> >> @@ -73,7 +74,7 @@ enum {
> >>  		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
> >>  		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
> >>  		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
> >> -		FW_FEATURE_RPT_INVALIDATE,
> >> +		FW_FEATURE_RPT_INVALIDATE | FW_FEATURE_FORM2_AFFINITY,
> >>  	FW_FEATURE_PSERIES_ALWAYS = 0,
> >>  	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
> >>  	FW_FEATURE_POWERNV_ALWAYS = 0,
> >> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
> >> index df9fec9d232c..5c80152e8f18 100644
> >> --- a/arch/powerpc/include/asm/prom.h
> >> +++ b/arch/powerpc/include/asm/prom.h
> >> @@ -149,6 +149,7 @@ extern int of_read_drc_info_cell(struct property **prop,
> >>  #define OV5_XCMO		0x0440	/* Page Coalescing */
> >>  #define OV5_FORM1_AFFINITY	0x0580	/* FORM1 NUMA affinity */
> >>  #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
> >> +#define OV5_FORM2_AFFINITY	0x0520	/* Form2 NUMA affinity */
> >>  #define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
> >>  #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
> >>  #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
> >> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
> >> index 5d9ea059594f..c483df6c9393 100644
> >> --- a/arch/powerpc/kernel/prom_init.c
> >> +++ b/arch/powerpc/kernel/prom_init.c
> >> @@ -1069,7 +1069,8 @@ static const struct ibm_arch_vec ibm_architecture_vec_template __initconst = {
> >>  #else
> >>  		0,
> >>  #endif
> >> -		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN),
> >> +		.associativity = OV5_FEAT(OV5_FORM1_AFFINITY) | OV5_FEAT(OV5_PRRN) |
> >> +		OV5_FEAT(OV5_FORM2_AFFINITY),
> >>  		.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
> >>  		.micro_checkpoint = 0,
> >>  		.reserved0 = 0,
> >> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> >> index c6293037a103..c68846fc9550 100644
> >> --- a/arch/powerpc/mm/numa.c
> >> +++ b/arch/powerpc/mm/numa.c
> >> @@ -56,12 +56,17 @@ static int n_mem_addr_cells, n_mem_size_cells;
> >>  
> >>  #define FORM0_AFFINITY 0
> >>  #define FORM1_AFFINITY 1
> >> +#define FORM2_AFFINITY 2
> >>  static int affinity_form;
> >>  
> >>  #define MAX_DISTANCE_REF_POINTS 4
> >>  static int max_associativity_domain_index;
> >>  static const __be32 *distance_ref_points;
> >>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
> >> +static int numa_distance_table[MAX_NUMNODES][MAX_NUMNODES] = {
> >> +	[0 ... MAX_NUMNODES - 1] = { [0 ... MAX_NUMNODES - 1] = -1 }
> >> +};
> >> +static int numa_id_index_table[MAX_NUMNODES] = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
> >>  
> >>  /*
> >>   * Allocate node_to_cpumask_map based on number of available nodes
> >> @@ -166,6 +171,44 @@ static void unmap_cpu_from_node(unsigned long cpu)
> >>  }
> >>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
> >>  
> >> +/*
> >> + * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
> >> + * info is found.
> >> + */
> >> +static int associativity_to_nid(const __be32 *associativity)
> >> +{
> >> +	int nid = NUMA_NO_NODE;
> >> +
> >> +	if (!numa_enabled)
> >> +		goto out;
> >> +
> >> +	if (of_read_number(associativity, 1) >= primary_domain_index)
> >> +		nid = of_read_number(&associativity[primary_domain_index], 1);
> >> +
> >> +	/* POWER4 LPAR uses 0xffff as invalid node */
> >> +	if (nid == 0xffff || nid >= nr_node_ids)
> >> +		nid = NUMA_NO_NODE;
> >> +out:
> >> +	return nid;
> >> +}
> >> +
> >> +static int __cpu_form2_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> >> +{
> >> +	int dist;
> >> +	int node1, node2;
> >> +
> >> +	node1 = associativity_to_nid(cpu1_assoc);
> >> +	node2 = associativity_to_nid(cpu2_assoc);
> >> +
> >> +	dist = numa_distance_table[node1][node2];
> >> +	if (dist <= LOCAL_DISTANCE)
> >> +		return 0;
> >> +	else if (dist <= REMOTE_DISTANCE)
> >> +		return 1;
> >> +	else
> >> +		return 2;
> >
> > Squashing the full range of distances into just 0, 1 or 2 seems odd.
> > But then, this whole cpu_distance() thing being distinct from
> > node_distance() seems odd.
> >
> >> +}
> >> +
> >>  static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> >>  {
> >>  	int dist = 0;
> >> @@ -186,8 +229,9 @@ int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> >>  {
> >>  	/* We should not get called with FORM0 */
> >>  	VM_WARN_ON(affinity_form == FORM0_AFFINITY);
> >> -
> >> -	return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
> >> +	if (affinity_form == FORM1_AFFINITY)
> >> +		return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
> >> +	return __cpu_form2_relative_distance(cpu1_assoc, cpu2_assoc);
> >>  }
> >>  
> >>  /* must hold reference to node during call */
> >> @@ -201,7 +245,9 @@ int __node_distance(int a, int b)
> >>  	int i;
> >>  	int distance = LOCAL_DISTANCE;
> >>  
> >> -	if (affinity_form == FORM0_AFFINITY)
> >> +	if (affinity_form == FORM2_AFFINITY)
> >> +		return numa_distance_table[a][b];
> >> +	else if (affinity_form == FORM0_AFFINITY)
> >>  		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
> >>  
> >>  	for (i = 0; i < max_associativity_domain_index; i++) {
> >
> > Hmm.. couldn't we simplify this whole __node_distance function, if we
> > just update numa_distance_table[][] appropriately for Form0 and Form1
> > as well?
> 
> IIUC what you are suggesting is to look at the possibility of using
> numa_distance_table[a][b] even for FORM1_AFFINITY? I can do that as part
> of separate patch?

Ok, that's reasonable.

> >
> >> @@ -216,27 +262,6 @@ int __node_distance(int a, int b)
> >>  }
> >>  EXPORT_SYMBOL(__node_distance);
> >>  
> >> -/*
> >> - * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
> >> - * info is found.
> >> - */
> >> -static int associativity_to_nid(const __be32 *associativity)
> >> -{
> >> -	int nid = NUMA_NO_NODE;
> >> -
> >> -	if (!numa_enabled)
> >> -		goto out;
> >> -
> >> -	if (of_read_number(associativity, 1) >= primary_domain_index)
> >> -		nid = of_read_number(&associativity[primary_domain_index], 1);
> >> -
> >> -	/* POWER4 LPAR uses 0xffff as invalid node */
> >> -	if (nid == 0xffff || nid >= nr_node_ids)
> >> -		nid = NUMA_NO_NODE;
> >> -out:
> >> -	return nid;
> >> -}
> >> -
> >>  /* Returns the nid associated with the given device tree node,
> >>   * or -1 if not found.
> >>   */
> >> @@ -305,12 +330,84 @@ static void initialize_form1_numa_distance(struct device_node *node)
> >>   */
> >>  void update_numa_distance(struct device_node *node)
> >>  {
> >> +	int nid;
> >> +
> >>  	if (affinity_form == FORM0_AFFINITY)
> >>  		return;
> >>  	else if (affinity_form == FORM1_AFFINITY) {
> >>  		initialize_form1_numa_distance(node);
> >>  		return;
> >>  	}
> >> +
> >> +	/* FORM2 affinity  */
> >> +	nid = of_node_to_nid_single(node);
> >> +	if (nid == NUMA_NO_NODE)
> >> +		return;
> >> +
> >> +	/*
> >> +	 * With FORM2 we expect NUMA distance of all possible NUMA
> >> +	 * nodes to be provided during boot.
> >> +	 */
> >> +	WARN(numa_distance_table[nid][nid] == -1,
> >> +	     "NUMA distance details for node %d not provided\n", nid);
> >> +}
> >> +
> >> +/*
> >> + * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
> >> + * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
> >> + */
> >> +static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
> >> +{
> >> +	int i, j;
> >> +	const __u8 *numa_dist_table;
> >> +	const __be32 *numa_lookup_index;
> >> +	int numa_dist_table_length;
> >> +	int max_numa_index, distance_index;
> >> +
> >> +	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
> >> +	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
> >> +
> >> +	/* first element of the array is the size and is encode-int */
> >> +	numa_dist_table = of_get_property(root, "ibm,numa-distance-table", NULL);
> >> +	numa_dist_table_length = of_read_number((const __be32 *)&numa_dist_table[0], 1);
> >> +	/* Skip the size which is encoded int */
> >> +	numa_dist_table += sizeof(__be32);
> >> +
> >> +	pr_debug("numa_dist_table_len = %d, numa_dist_indexes_len = %d\n",
> >> +		 numa_dist_table_length, max_numa_index);
> >> +
> >> +	for (i = 0; i < max_numa_index; i++)
> >> +		/* +1 skip the max_numa_index in the property */
> >> +		numa_id_index_table[i] = of_read_number(&numa_lookup_index[i + 1], 1);
> >> +
> >> +
> >> +	if (numa_dist_table_length != max_numa_index * max_numa_index) {
> >> +
> >> +		WARN(1, "Wrong NUMA distance information\n");
> >> +		/* consider everybody else just remote. */
> >> +		for (i = 0;  i < max_numa_index; i++) {
> >> +			for (j = 0; j < max_numa_index; j++) {
> >> +				int nodeA = numa_id_index_table[i];
> >> +				int nodeB = numa_id_index_table[j];
> >> +
> >> +				if (nodeA == nodeB)
> >> +					numa_distance_table[nodeA][nodeB] = LOCAL_DISTANCE;
> >> +				else
> >> +					numa_distance_table[nodeA][nodeB] = REMOTE_DISTANCE;
> >> +			}
> >> +		}
> >> +	}
> >> +
> >> +	distance_index = 0;
> >> +	for (i = 0;  i < max_numa_index; i++) {
> >> +		for (j = 0; j < max_numa_index; j++) {
> >> +			int nodeA = numa_id_index_table[i];
> >> +			int nodeB = numa_id_index_table[j];
> >> +
> >> +			numa_distance_table[nodeA][nodeB] = numa_dist_table[distance_index++];
> >> +			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
> >> +		}
> >> +	}
> >>  }
> >>  
> >>  static int __init find_primary_domain_index(void)
> >> @@ -323,6 +420,9 @@ static int __init find_primary_domain_index(void)
> >>  	 */
> >>  	if (firmware_has_feature(FW_FEATURE_OPAL)) {
> >>  		affinity_form = FORM1_AFFINITY;
> >> +	} else if (firmware_has_feature(FW_FEATURE_FORM2_AFFINITY)) {
> >> +		dbg("Using form 2 affinity\n");
> >> +		affinity_form = FORM2_AFFINITY;
> >>  	} else if (firmware_has_feature(FW_FEATURE_FORM1_AFFINITY)) {
> >>  		dbg("Using form 1 affinity\n");
> >>  		affinity_form = FORM1_AFFINITY;
> >> @@ -367,8 +467,17 @@ static int __init find_primary_domain_index(void)
> >>  
> >>  		index = of_read_number(&distance_ref_points[1], 1);
> >>  	} else {
> >> +		/*
> >> +		 * Both FORM1 and FORM2 affinity find the primary domain details
> >> +		 * at the same offset.
> >> +		 */
> >>  		index = of_read_number(distance_ref_points, 1);
> >>  	}
> >> +	/*
> >> +	 * If it is FORM2 also initialize the distance table here.
> >> +	 */
> >> +	if (affinity_form == FORM2_AFFINITY)
> >> +		initialize_form2_numa_distance_lookup_table(root);
> >
> > Ew.  Calling a function called "find_primary_domain_index" to also
> > initialize the main distance table is needlessly counterintuitive.
> > Move this call to parse_numa_properties().
> 
> The reason I ended up doing it here is because 'root' is already fetched
> here. But I agree it is confusing. I will move fetching of root inside
> initialize_form2_numa_distance_lookup_table() and move the function
> outside primary_index lookup.

Ok.  This is not a hot path, so looking up root twice isn't really a
big deal anyway.

> 
> modified   arch/powerpc/mm/numa.c
> @@ -355,14 +355,22 @@ void update_numa_distance(struct device_node *node)
>   * ibm,numa-lookup-index-table= {N, domainid1, domainid2, ..... domainidN}
>   * ibm,numa-distance-table = { N, 1, 2, 4, 5, 1, 6, .... N elements}
>   */
> -static void initialize_form2_numa_distance_lookup_table(struct device_node *root)
> +static void initialize_form2_numa_distance_lookup_table()
>  {
>  	int i, j;
> +	struct device_node *root;
>  	const __u8 *numa_dist_table;
>  	const __be32 *numa_lookup_index;
>  	int numa_dist_table_length;
>  	int max_numa_index, distance_index;
>  
> +	if (firmware_has_feature(FW_FEATURE_OPAL))
> +		root = of_find_node_by_path("/ibm,opal");
> +	else
> +		root = of_find_node_by_path("/rtas");
> +	if (!root)
> +		root = of_find_node_by_path("/");
> +
>  	numa_lookup_index = of_get_property(root, "ibm,numa-lookup-index-table", NULL);
>  	max_numa_index = of_read_number(&numa_lookup_index[0], 1);
>  
> @@ -407,6 +415,7 @@ static void initialize_form2_numa_distance_lookup_table(struct device_node *root
>  			pr_debug("dist[%d][%d]=%d ", nodeA, nodeB, numa_distance_table[nodeA][nodeB]);
>  		}
>  	}
> +	of_node_put(root);
>  }
>  
>  static int __init find_primary_domain_index(void)
> @@ -472,12 +481,6 @@ static int __init find_primary_domain_index(void)
>  		 */
>  		index = of_read_number(distance_ref_points, 1);
>  	}
> -	/*
> -	 * If it is FORM2 also initialize the distance table here.
> -	 */
> -	if (affinity_form == FORM2_AFFINITY)
> -		initialize_form2_numa_distance_lookup_table(root);
> -
>  	/*
>  	 * Warn and cap if the hardware supports more than
>  	 * MAX_DISTANCE_REF_POINTS domains.
> @@ -916,6 +919,12 @@ static int __init parse_numa_properties(void)
>  
>  	dbg("NUMA associativity depth for CPU/Memory: %d\n", primary_domain_index);
>  
> +	/*
> +	 * If it is FORM2 also initialize the distance table here.
> +	 */
> +	if (affinity_form == FORM2_AFFINITY)
> +		initialize_form2_numa_distance_lookup_table();
> +
>  	/*
>  	 * Even though we connect cpus to numa domains later in SMP
>  	 * init, we need to know the node ids now. This is because
> 
> -aneesh
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths
  2021-07-26  2:37       ` David Gibson
@ 2021-07-27  3:32         ` Aneesh Kumar K.V
  2021-07-27  5:59           ` David Gibson
  0 siblings, 1 reply; 30+ messages in thread
From: Aneesh Kumar K.V @ 2021-07-27  3:32 UTC (permalink / raw)
  To: David Gibson; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev

David Gibson <david@gibson.dropbear.id.au> writes:

> On Thu, Jul 22, 2021 at 12:37:46PM +0530, Aneesh Kumar K.V wrote:
>> David Gibson <david@gibson.dropbear.id.au> writes:
>> 
>> > On Mon, Jun 28, 2021 at 08:41:15PM +0530, Aneesh Kumar K.V wrote:

....

> 
>> >
>> >> +		nid = of_read_number(&aa.arrays[index], 1);
>> >> +
>> >> +		if (nid == 0xffff || nid >= nr_node_ids)
>> >> +			nid = default_nid;
>> >> +		if (nid > 0 && affinity_form == FORM1_AFFINITY) {
>> >> +			int i;
>> >> +			const __be32 *associativity;
>> >> +
>> >> +			index = lmb->aa_index * aa.array_sz;
>> >> +			associativity = &aa.arrays[index];
>> >> +			/*
>> >> +			 * lookup array associativity entries have different format
>> >> +			 * There is no length of the array as the first element.
>> >
>> > The difference is very small, and this is not a hot path.  Couldn't
>> > you reduce a chunk of code by prepending aa.array_sz, then re-using
>> > __initialize_form1_numa_distance.  Or even making
>> > __initialize_form1_numa_distance() take the length as a parameter.
>> 
>> The changes are small but confusing w.r.t how we look at the
>> associativity-lookup-arrays. The way we interpret associativity array
>> and associativity lookup array using primary_domain_index is different.
>> Hence the '-1' in the node lookup here.
>
> They're really not, though.  It's exactly the same interpretation of
> the associativity array itself - it's just that one of them has the
> array prepended with a (redundant) length.  So you can make
> __initialize_form1_numa_distance() work on the "bare" associativity
> array, with a given length.  Here you call it with aa.array_sz as the
> length, and in the other place you call it with prop[0] as the length.
>
>> 
>> 	index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
>> 	nid = of_read_number(&aa.arrays[index], 1);
>> 
>> 
>> >
>> >> +			 */
>> >> +			for (i = 0; i < max_associativity_domain_index; i++) {
>> >> +				const __be32 *entry;
>> >> +
>> >> +				entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
>> >
>> > Does anywhere verify that distance_ref_points[i] <= aa.array_size for
>> > every i?
>> 
>> We do check for 
>> 
>> 	if (primary_domain_index <= aa.array_sz &&
>
> Right, but that doesn't check the other distance_ref_points entries.
> Not that there's any reason to have extra entries with Form2, but we
> still don't want stray array accesses.

This is how the change looks. I am not convinced this makes it simpler.
I will add that as the last patch and we can drop it if we find it
unhelpful?

modified   arch/powerpc/mm/numa.c
@@ -171,20 +171,31 @@ static void unmap_cpu_from_node(unsigned long cpu)
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
 
-/*
- * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
- * info is found.
- */
-static int associativity_to_nid(const __be32 *associativity)
+static int __associativity_to_nid(const __be32 *associativity,
+				  bool lookup_array_assoc,
+				  int max_array_index)
 {
 	int nid = NUMA_NO_NODE;
+	int index;
 
 	if (!numa_enabled)
 		goto out;
+	/*
+	 * ibm,associativity-lookup-array doesn't have element
+	 * count at the start of the associativity. Hence
+	 * decrement the primary_domain_index when used with
+	 * lookup-array associativity.
+	 */
+	if (lookup_array_assoc)
+		index = primary_domain_index - 1;
+	else {
+		index = primary_domain_index;
+		max_array_index = of_read_number(associativity, 1);
+	}
+	if (index > max_array_index)
+		goto out;
 
-	if (of_read_number(associativity, 1) >= primary_domain_index)
-		nid = of_read_number(&associativity[primary_domain_index], 1);
-
+	nid = of_read_number(&associativity[index], 1);
 	/* POWER4 LPAR uses 0xffff as invalid node */
 	if (nid == 0xffff || nid >= nr_node_ids)
 		nid = NUMA_NO_NODE;
@@ -192,6 +203,15 @@ static int associativity_to_nid(const __be32 *associativity)
 	return nid;
 }
 
+/*
+ * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
+ * info is found.
+ */
+static inline int associativity_to_nid(const __be32 *associativity)
+{
+	return __associativity_to_nid(associativity, false, 0);
+}
+
 static int __cpu_form2_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 {
 	int dist;
@@ -295,19 +315,38 @@ int of_node_to_nid(struct device_node *device)
 }
 EXPORT_SYMBOL(of_node_to_nid);
 
-static void __initialize_form1_numa_distance(const __be32 *associativity)
+static void __initialize_form1_numa_distance(const __be32 *associativity,
+					     bool lookup_array_assoc,
+					     int max_array_index)
 {
 	int i, nid;
+	int index_offset = 0;
 
 	if (affinity_form != FORM1_AFFINITY)
 		return;
+	/*
+	 * ibm,associativity-lookup-array doesn't have element
+	 * count at the start of the associativity. Hence
+	 * decrement the distance_ref_points index when used with
+	 * lookup-array associativity.
+	 */
+	if (lookup_array_assoc)
+		index_offset = 1;
+	else
+		max_array_index = of_read_number(associativity, 1);
 
-	nid = associativity_to_nid(associativity);
+	nid = __associativity_to_nid(associativity, lookup_array_assoc, max_array_index);
 	if (nid != NUMA_NO_NODE) {
 		for (i = 0; i < distance_ref_points_depth; i++) {
 			const __be32 *entry;
+			int index = be32_to_cpu(distance_ref_points[i]) - index_offset;
 
-			entry = &associativity[be32_to_cpu(distance_ref_points[i])];
+			/*
+			 * broken hierarchy, return with broken distance table
+			 */
+			if (index > max_array_index)
+				return;
+			entry = &associativity[index];
 			distance_lookup_table[nid][i] = of_read_number(entry, 1);
 		}
 	}
@@ -321,7 +360,7 @@ static void initialize_form1_numa_distance(struct device_node *node)
 	if (!associativity)
 		return;
 
-	__initialize_form1_numa_distance(associativity);
+	__initialize_form1_numa_distance(associativity, false, 0);
 }
 
 /*
@@ -586,27 +625,14 @@ static int get_nid_and_numa_distance(struct drmem_lmb *lmb)
 
 	if (primary_domain_index <= aa.array_sz &&
 	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
-		index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
-		nid = of_read_number(&aa.arrays[index], 1);
+		const __be32 *associativity;
 
-		if (nid == 0xffff || nid >= nr_node_ids)
-			nid = default_nid;
+		index = lmb->aa_index * aa.array_sz;
+		associativity = &aa.arrays[index];
+		nid = __associativity_to_nid(associativity, true, aa.array_sz - 1);
 		if (nid > 0 && affinity_form == FORM1_AFFINITY) {
-			int i;
-			const __be32 *associativity;
-
-			index = lmb->aa_index * aa.array_sz;
-			associativity = &aa.arrays[index];
-			/*
-			 * lookup array associativity entries have different format
-			 * There is no length of the array as the first element.
-			 */
-			for (i = 0; i < distance_ref_points_depth; i++) {
-				const __be32 *entry;
-
-				entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
-				distance_lookup_table[nid][i] = of_read_number(entry, 1);
-			}
+			__initialize_form1_numa_distance(associativity,
+							 true, aa.array_sz - 1);
 		}
 	}
 	return nid;
@@ -632,9 +658,11 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 
 	if (primary_domain_index <= aa.array_sz &&
 	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
-		index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
-		nid = of_read_number(&aa.arrays[index], 1);
+		const __be32 *associativity;
 
+		index = lmb->aa_index * aa.array_sz;
+		associativity = &aa.arrays[index];
+		nid = __associativity_to_nid(associativity, true, aa.array_sz - 1);
 		if (nid == 0xffff || nid >= nr_node_ids)
 			nid = default_nid;
 	}
@@ -939,7 +967,7 @@ static int __init parse_numa_properties(void)
 
 		if (__vphn_get_associativity(i, vphn_assoc) == 0) {
 			nid = associativity_to_nid(vphn_assoc);
-			__initialize_form1_numa_distance(vphn_assoc);
+			__initialize_form1_numa_distance(vphn_assoc, false, 0);
 		} else {
 
 			/*
@@ -953,7 +981,7 @@ static int __init parse_numa_properties(void)
 			associativity = of_get_associativity(cpu);
 			if (associativity) {
 				nid = associativity_to_nid(associativity);
-				__initialize_form1_numa_distance(associativity);
+				__initialize_form1_numa_distance(associativity, false, 0);
 			}
 			of_node_put(cpu);
 		}
@@ -993,7 +1021,7 @@ static int __init parse_numa_properties(void)
 		associativity = of_get_associativity(memory);
 		if (associativity) {
 			nid = associativity_to_nid(associativity);
-			__initialize_form1_numa_distance(associativity);
+			__initialize_form1_numa_distance(associativity, false, 0);
 		} else
 			nid = default_nid;
 



* Re: [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths
  2021-07-27  3:32         ` Aneesh Kumar K.V
@ 2021-07-27  5:59           ` David Gibson
  0 siblings, 0 replies; 30+ messages in thread
From: David Gibson @ 2021-07-27  5:59 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev


On Tue, Jul 27, 2021 at 09:02:33AM +0530, Aneesh Kumar K.V wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
> 
> > On Thu, Jul 22, 2021 at 12:37:46PM +0530, Aneesh Kumar K.V wrote:
> >> David Gibson <david@gibson.dropbear.id.au> writes:
> >> 
> >> > On Mon, Jun 28, 2021 at 08:41:15PM +0530, Aneesh Kumar K.V wrote:
> 
> ....
> 
> > 
> >> >
> >> >> +		nid = of_read_number(&aa.arrays[index], 1);
> >> >> +
> >> >> +		if (nid == 0xffff || nid >= nr_node_ids)
> >> >> +			nid = default_nid;
> >> >> +		if (nid > 0 && affinity_form == FORM1_AFFINITY) {
> >> >> +			int i;
> >> >> +			const __be32 *associativity;
> >> >> +
> >> >> +			index = lmb->aa_index * aa.array_sz;
> >> >> +			associativity = &aa.arrays[index];
> >> >> +			/*
> >> >> +			 * lookup array associativity entries have different format
> >> >> +			 * There is no length of the array as the first element.
> >> >
> >> > The difference is very small, and this is not a hot path.  Couldn't
> >> > you reduce a chunk of code by prepending aa.array_sz, then re-using
> >> > __initialize_form1_numa_distance.  Or even making
> >> > __initialize_form1_numa_distance() take the length as a parameter.
> >> 
> >> The changes are small but confusing w.r.t how we look at the
> >> associativity-lookup-arrays. The way we interpret associativity array
> >> and associativity lookup array using primary_domain_index is different.
> >> Hence the '-1' in the node lookup here.
> >
> > They're really not, though.  It's exactly the same interpretation of
> > the associativity array itself - it's just that one of them has the
> > array prepended with a (redundant) length.  So you can make
> > __initialize_form1_numa_distance() work on the "bare" associativity
> > array, with a given length.  Here you call it with aa.array_sz as the
> > length, and in the other place you call it with prop[0] as the length.
> >
> >> 
> >> 	index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
> >> 	nid = of_read_number(&aa.arrays[index], 1);
> >> 
> >> 
> >> >
> >> >> +			 */
> >> >> +			for (i = 0; i < max_associativity_domain_index; i++) {
> >> >> +				const __be32 *entry;
> >> >> +
> >> >> +				entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
> >> >
> >> > Does anywhere verify that distance_ref_points[i] <= aa.array_size for
> >> > every i?
> >> 
> >> We do check for 
> >> 
> >> 	if (primary_domain_index <= aa.array_sz &&
> >
> > Right, but that doesn't check the other distance_ref_points entries.
> > Not that there's any reason to have extra entries with Form2, but we
> > still don't want stray array accesses.
> 
> This is how the change looks. I am not convinced it makes things simpler.

It's not, but that's because the lookup_array_assoc flag is not needed...

> I will add it as the last patch, and we can drop it if we find it
> unhelpful?
> 
> modified   arch/powerpc/mm/numa.c
> @@ -171,20 +171,31 @@ static void unmap_cpu_from_node(unsigned long cpu)
>  }
>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>  
> -/*
> - * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
> - * info is found.
> - */
> -static int associativity_to_nid(const __be32 *associativity)
> +static int __associativity_to_nid(const __be32 *associativity,
> +				  bool lookup_array_assoc,
> +				  int max_array_index)
>  {
>  	int nid = NUMA_NO_NODE;
> +	int index;
>  
>  	if (!numa_enabled)
>  		goto out;
> +	/*
> +	 * ibm,associativity-lookup-array doesn't have element
> +	 * count at the start of the associativity. Hence
> +	 * decrement the primary_domain_index when used with
> +	 * lookup-array associativity.
> +	 */
> +	if (lookup_array_assoc)
> +		index = primary_domain_index - 1;
> +	else {
> +		index = primary_domain_index;
> +		max_array_index = of_read_number(associativity, 1);
> +	}
> +	if (index > max_array_index)
> +		goto out;

So, the associativity-array-with-length is exactly a length, followed
by an associativity-array-without-length.  What I was suggesting is
that you make this function take only an
associativity-array-without-length, with the length passed separately.

Where you want to use it on an associativity-array-with-length, stored
in __be32 *awl, you just invoke it as:
	associativity_to_nid(awl + 1, of_read_number(awl, 1));
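Roughly, that shape might look like the following sketch (reusing the
numa_enabled, primary_domain_index and nr_node_ids globals already in
numa.c; this is an illustration of the suggestion, not the final patch):

	/* @assoc is a bare associativity array of @array_sz cells */
	static int __associativity_to_nid(const __be32 *assoc, int array_sz)
	{
		/* primary_domain_index is a 1-based index into the array */
		int index = primary_domain_index - 1;
		int nid;

		if (!numa_enabled || index >= array_sz)
			return NUMA_NO_NODE;

		nid = of_read_number(&assoc[index], 1);
		/* POWER4 LPAR uses 0xffff as invalid node */
		if (nid == 0xffff || nid >= nr_node_ids)
			return NUMA_NO_NODE;
		return nid;
	}

	/* ibm,associativity value: a length cell followed by the bare array */
	static inline int associativity_to_nid(const __be32 *associativity)
	{
		return __associativity_to_nid(associativity + 1,
					      of_read_number(associativity, 1));
	}

The lookup-array callers would then pass &aa.arrays[lmb->aa_index *
aa.array_sz] together with aa.array_sz, and the lookup_array_assoc
flag goes away.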

> ....

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



Thread overview: 30+ messages
2021-06-28 15:11 [PATCH v5 0/6] Add support for FORM2 associativity Aneesh Kumar K.V
2021-06-28 15:11 ` [PATCH v5 1/6] powerpc/pseries: rename min_common_depth to primary_domain_index Aneesh Kumar K.V
2021-07-22  1:59   ` David Gibson
2021-07-22  2:36     ` David Gibson
2021-07-22  5:17       ` Aneesh Kumar K.V
2021-07-26  2:28         ` David Gibson
2021-06-28 15:11 ` [PATCH v5 2/6] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index Aneesh Kumar K.V
2021-07-22  0:59   ` David Gibson
2021-07-22  1:19     ` David Gibson
2021-06-28 15:11 ` [PATCH v5 3/6] powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITY Aneesh Kumar K.V
2021-06-28 15:11 ` [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths Aneesh Kumar K.V
2021-06-28 20:21   ` kernel test robot
2021-06-28 20:21     ` kernel test robot
2021-06-28 20:40   ` kernel test robot
2021-06-28 20:40     ` kernel test robot
2021-07-22  1:40   ` David Gibson
2021-07-22  7:07     ` Aneesh Kumar K.V
2021-07-26  2:37       ` David Gibson
2021-07-27  3:32         ` Aneesh Kumar K.V
2021-07-27  5:59           ` David Gibson
2021-06-28 15:11 ` [PATCH v5 5/6] powerpc/pseries: Add a helper for form1 cpu distance Aneesh Kumar K.V
2021-07-22  1:42   ` David Gibson
2021-07-22  7:09     ` Aneesh Kumar K.V
2021-07-26  2:38       ` David Gibson
2021-06-28 15:11 ` [PATCH v5 6/6] powerpc/pseries: Add support for FORM2 associativity Aneesh Kumar K.V
2021-07-22  2:28   ` David Gibson
2021-07-22  7:34     ` Aneesh Kumar K.V
2021-07-26  2:41       ` David Gibson
2021-07-13 14:27 ` [PATCH v5 0/6] " Daniel Henrique Barboza
2021-07-13 14:30   ` Aneesh Kumar K.V
