linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC  0/5] powerpc:numa Add serial nid support
@ 2015-09-27 18:29 Raghavendra K T
  2015-09-27 18:29 ` [PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table Raghavendra K T
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Raghavendra K T @ 2015-09-27 18:29 UTC (permalink / raw)
  To: benh, paulus, mpe
  Cc: anton, linuxppc-dev, linux-kernel, cl, nacc, gkurz, grant.likely,
	nikunj, raghavendra.kt, khandual

Problem description:
Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
numbered (possibly) as 0,1,16,17. At a lower level, we map the chipid
got from device tree is naturally mapped (directly) to nid.

Potential side effect of that is:

1) There are several places in kernel that assumes serial node numbering.
and memory allocations assume that all the nodes from 0-(highest nid)
exist inturn ending up allocating memory for the nodes that does not exist.

2) For virtualization use cases (such as qemu, libvirt, openstack), mapping
sparse nid of the host system to contiguous nids of guest (numa affinity,
placement) could be a challenge.

Possible Solutions:
1) Handling the memory allocations is kernel case by case: Though in some
cases it is easy to achieve, some cases may be intrusive/not trivial. 
at the end it does not handle side effect (2) above.

2) Map the sparse chipid got from device tree to a serial nid at kernel
level (The idea proposed in this series).
Pro: It is more natural to handle at kernel level than at lower (OPAL) layer.
con: The chipid is in device tree no longer the same as nid in kernel

3) Let the lower layer (OPAL) give the serial node ids after parsing the
chipid and the associativity etc [ either as a separate item in device tree
or by compacting the chipid numbers ]
Pros: kernel, device tree are on same page and less change in kernel
Con: is it the functionality expected in lower layer

As mentioned above, current patch series tries to map chipid from lower layer
to a contiguos nid at kernel level keeping the node distance calculation and
so on intact.

Result:
Before the patch: numactl -H

available: 4 nodes (0-1,16-17)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 31665 MB
node 0 free: 29836 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 32722 MB
node 1 free: 32019 MB
node 16 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 16 size: 32571 MB
node 16 free: 31222 MB
node 17 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 17 size: 0 MB
node 17 free: 0 MB
node distances:
node   0   1  16  17 
  0:  10  20  40  40 
  1:  20  10  40  40 
 16:  40  40  10  20 
 17:  40  40  20  10 

After the patch: numactl -H

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 31665 MB
node 0 free: 30657 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 32722 MB
node 1 free: 32566 MB
node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 2 size: 32571 MB
node 2 free: 32401 MB
node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3 
  0:  10  20  40  40 
  1:  20  10  40  40 
  2:  40  40  10  20 
  3:  40  40  20  10 

(note that numa distances are intact). Apart from this, The following tests
are done with the patched kernel (both baremetal and KVM guest with multiple
nodes) to ensure there is no breakage.

1) offlining and onlining of memory in /sys/devices/system/node/nodeX path

2) offlining and onlining of cpus in /sys/devices/system/cpu/ path

3) Numactl tests from
ftp://oss.sgi.com/www/projects/libnuma/download/numactl-2.0.10.tar.gz

(infact there were more breakage before the patch because of sparse nid
and memoryless node cases of powerpc)

4) Thousands of docker containers were spawned.

Please let me know your comments.

 patch 1-3: cleanup patches
 patch 4: Adds helper function to map nid and chipid
 patch 5: Uses the mapping to get serial nid 

Raghavendra K T (5):
  powerpc:numa Add numa_cpu_lookup function to update lookup table
  powerpc:numa Rename functions referring to nid as chipid
  powerpc:numa create 1:1 mappaing between chipid and nid
  powerpc:numa Add helper functions to maintain chipid to nid mapping
  powerpc:numa Use chipid to nid mapping to get serial numa node ids

 arch/powerpc/include/asm/mmzone.h |   2 +-
 arch/powerpc/kernel/smp.c         |  10 ++--
 arch/powerpc/mm/numa.c            | 121 +++++++++++++++++++++++++++++++-------
 3 files changed, 105 insertions(+), 28 deletions(-)

-- 
1.7.11.7


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH RFC  1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table
  2015-09-27 18:29 [PATCH RFC 0/5] powerpc:numa Add serial nid support Raghavendra K T
@ 2015-09-27 18:29 ` Raghavendra K T
  2015-09-27 18:41   ` Raghavendra K T
  2015-10-06 10:17   ` [RFC, " Michael Ellerman
  2015-09-27 18:29 ` [PATCH RFC 2/5] powerpc:numa Rename functions referring to nid as chipid Raghavendra K T
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: Raghavendra K T @ 2015-09-27 18:29 UTC (permalink / raw)
  To: benh, paulus, mpe
  Cc: anton, linuxppc-dev, linux-kernel, cl, nacc, gkurz, grant.likely,
	nikunj, raghavendra.kt, khandual, Raghavendra K T

We access numa_cpu_lookup_table array directly in all the places
to read/update numa cpu lookup information. Instead use a helper
function to update.

This is helpful in changing the way numa<-->cpu mapping in single
place when needed.

This is a cosmetic change, no change in functionality.

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.inet.ibm.com>
---
 arch/powerpc/include/asm/mmzone.h |  2 +-
 arch/powerpc/kernel/smp.c         | 10 +++++-----
 arch/powerpc/mm/numa.c            | 28 +++++++++++++++++-----------
 3 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
index 7b58917..c24a5f4 100644
--- a/arch/powerpc/include/asm/mmzone.h
+++ b/arch/powerpc/include/asm/mmzone.h
@@ -29,7 +29,7 @@ extern struct pglist_data *node_data[];
  * Following are specific to this numa platform.
  */
 
-extern int numa_cpu_lookup_table[];
+extern int numa_cpu_lookup(int cpu);
 extern cpumask_var_t node_to_cpumask_map[];
 #ifdef CONFIG_MEMORY_HOTPLUG
 extern unsigned long max_pfn;
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ec9ec20..56fbe9e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 		 * numa_node_id() works after this.
 		 */
 		if (cpu_present(cpu)) {
-			set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
+			set_cpu_numa_node(cpu, numa_cpu_lookup(cpu));
 			set_cpu_numa_mem(cpu,
-				local_memory_node(numa_cpu_lookup_table[cpu]));
+				local_memory_node(numa_cpu_lookup(cpu)));
 		}
 	}
 
@@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void)
 #ifdef CONFIG_PPC64
 	paca[boot_cpuid].__current = current;
 #endif
-	set_numa_node(numa_cpu_lookup_table[boot_cpuid]);
+	set_numa_node(numa_cpu_lookup(boot_cpuid));
 	current_set[boot_cpuid] = task_thread_info(current);
 }
 
@@ -718,8 +718,8 @@ void start_secondary(void *unused)
 	}
 	traverse_core_siblings(cpu, true);
 
-	set_numa_node(numa_cpu_lookup_table[cpu]);
-	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
+	set_numa_node(numa_cpu_lookup(cpu));
+	set_numa_mem(local_memory_node(numa_cpu_lookup(cpu)));
 
 	smp_wmb();
 	notify_cpu_starting(cpu);
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8b9502a..d5e6eee 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS];
 cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
 struct pglist_data *node_data[MAX_NUMNODES];
 
-EXPORT_SYMBOL(numa_cpu_lookup_table);
 EXPORT_SYMBOL(node_to_cpumask_map);
 EXPORT_SYMBOL(node_data);
 
@@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn,
 	return 0;
 }
 
-static void reset_numa_cpu_lookup_table(void)
+int numa_cpu_lookup(int cpu)
 {
-	unsigned int cpu;
-
-	for_each_possible_cpu(cpu)
-		numa_cpu_lookup_table[cpu] = -1;
+	return numa_cpu_lookup_table[cpu];
 }
+EXPORT_SYMBOL(numa_cpu_lookup);
 
-static void update_numa_cpu_lookup_table(unsigned int cpu, int node)
+static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node)
 {
 	numa_cpu_lookup_table[cpu] = node;
 }
 
+static void reset_numa_cpu_lookup_table(void)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu)
+		update_numa_cpu_lookup_table(cpu, -1);
+}
+
 static void map_cpu_to_node(int cpu, int node)
 {
 	update_numa_cpu_lookup_table(cpu, node);
@@ -160,7 +165,7 @@ static void map_cpu_to_node(int cpu, int node)
 #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
 static void unmap_cpu_from_node(unsigned long cpu)
 {
-	int node = numa_cpu_lookup_table[cpu];
+	int node = numa_cpu_lookup(cpu);
 
 	dbg("removing cpu %lu from node %d\n", cpu, node);
 
@@ -536,7 +541,8 @@ static int numa_setup_cpu(unsigned long lcpu)
 	 * directly instead of querying the firmware, since it represents
 	 * the most recent mapping notified to us by the platform (eg: VPHN).
 	 */
-	if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) {
+	nid = numa_cpu_lookup(lcpu);
+	if (nid >= 0) {
 		map_cpu_to_node(lcpu, nid);
 		return nid;
 	}
@@ -1413,7 +1419,7 @@ int arch_update_cpu_topology(void)
 		if (new_nid < 0 || !node_online(new_nid))
 			new_nid = first_online_node;
 
-		if (new_nid == numa_cpu_lookup_table[cpu]) {
+		if (new_nid == numa_cpu_lookup(cpu)) {
 			cpumask_andnot(&cpu_associativity_changes_mask,
 					&cpu_associativity_changes_mask,
 					cpu_sibling_mask(cpu));
@@ -1425,7 +1431,7 @@ int arch_update_cpu_topology(void)
 			ud = &updates[i++];
 			ud->cpu = sibling;
 			ud->new_nid = new_nid;
-			ud->old_nid = numa_cpu_lookup_table[sibling];
+			ud->old_nid = numa_cpu_lookup(sibling);
 			cpumask_set_cpu(sibling, &updated_cpus);
 			if (i < weight)
 				ud->next = &updates[i];
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC  2/5] powerpc:numa Rename functions referring to nid as chipid
  2015-09-27 18:29 [PATCH RFC 0/5] powerpc:numa Add serial nid support Raghavendra K T
  2015-09-27 18:29 ` [PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table Raghavendra K T
@ 2015-09-27 18:29 ` Raghavendra K T
  2015-09-28 17:27   ` Nishanth Aravamudan
  2015-09-27 18:29 ` [PATCH RFC 3/5] powerpc:numa create 1:1 mappaing between chipid and nid Raghavendra K T
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Raghavendra K T @ 2015-09-27 18:29 UTC (permalink / raw)
  To: benh, paulus, mpe
  Cc: anton, linuxppc-dev, linux-kernel, cl, nacc, gkurz, grant.likely,
	nikunj, raghavendra.kt, khandual

There is no change in the fuctionality

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 42 +++++++++++++++++++++---------------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d5e6eee..f84ed2f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -235,47 +235,47 @@ static void initialize_distance_lookup_table(int nid,
 	}
 }
 
-/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
+/* Returns chipid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
  * info is found.
  */
-static int associativity_to_nid(const __be32 *associativity)
+static int associativity_to_chipid(const __be32 *associativity)
 {
-	int nid = -1;
+	int chipid = -1;
 
 	if (min_common_depth == -1)
 		goto out;
 
 	if (of_read_number(associativity, 1) >= min_common_depth)
-		nid = of_read_number(&associativity[min_common_depth], 1);
+		chipid = of_read_number(&associativity[min_common_depth], 1);
 
 	/* POWER4 LPAR uses 0xffff as invalid node */
-	if (nid == 0xffff || nid >= MAX_NUMNODES)
-		nid = -1;
+	if (chipid == 0xffff || chipid >= MAX_NUMNODES)
+		chipid = -1;
 
-	if (nid > 0 &&
+	if (chipid > 0 &&
 		of_read_number(associativity, 1) >= distance_ref_points_depth) {
 		/*
 		 * Skip the length field and send start of associativity array
 		 */
-		initialize_distance_lookup_table(nid, associativity + 1);
+		initialize_distance_lookup_table(chipid, associativity + 1);
 	}
 
 out:
-	return nid;
+	return chipid;
 }
 
-/* Returns the nid associated with the given device tree node,
+/* Returns the chipid associated with the given device tree node,
  * or -1 if not found.
  */
-static int of_node_to_nid_single(struct device_node *device)
+static int of_node_to_chipid_single(struct device_node *device)
 {
-	int nid = -1;
+	int chipid = -1;
 	const __be32 *tmp;
 
 	tmp = of_get_associativity(device);
 	if (tmp)
-		nid = associativity_to_nid(tmp);
-	return nid;
+		chipid = associativity_to_chipid(tmp);
+	return chipid;
 }
 
 /* Walk the device tree upwards, looking for an associativity id */
@@ -286,7 +286,7 @@ int of_node_to_nid(struct device_node *device)
 
 	of_node_get(device);
 	while (device) {
-		nid = of_node_to_nid_single(device);
+		nid = of_node_to_chipid_single(device);
 		if (nid != -1)
 			break;
 
@@ -498,7 +498,7 @@ static int of_get_assoc_arrays(struct device_node *memory,
 }
 
 /*
- * This is like of_node_to_nid_single() for memory represented in the
+ * This is like of_node_to_chipid_single() for memory represented in the
  * ibm,dynamic-reconfiguration-memory node.
  */
 static int of_drconf_to_nid_single(struct of_drconf_cell *drmem,
@@ -557,7 +557,7 @@ static int numa_setup_cpu(unsigned long lcpu)
 			goto out;
 	}
 
-	nid = of_node_to_nid_single(cpu);
+	nid = of_node_to_chipid_single(cpu);
 
 out_present:
 	if (nid < 0 || !node_online(nid))
@@ -754,7 +754,7 @@ static int __init parse_numa_properties(void)
 
 		cpu = of_get_cpu_node(i, NULL);
 		BUG_ON(!cpu);
-		nid = of_node_to_nid_single(cpu);
+		nid = of_node_to_chipid_single(cpu);
 		of_node_put(cpu);
 
 		/*
@@ -796,7 +796,7 @@ new_range:
 		 * have associativity properties.  If none, then
 		 * everything goes to default_nid.
 		 */
-		nid = of_node_to_nid_single(memory);
+		nid = of_node_to_chipid_single(memory);
 		if (nid < 0)
 			nid = default_nid;
 
@@ -1119,7 +1119,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr)
 			if ((scn_addr < start) || (scn_addr >= (start + size)))
 				continue;
 
-			nid = of_node_to_nid_single(memory);
+			nid = of_node_to_chipid_single(memory);
 			break;
 		}
 
@@ -1415,7 +1415,7 @@ int arch_update_cpu_topology(void)
 
 		/* Use associativity from first thread for all siblings */
 		vphn_get_associativity(cpu, associativity);
-		new_nid = associativity_to_nid(associativity);
+		new_nid = associativity_to_chipid(associativity);
 		if (new_nid < 0 || !node_online(new_nid))
 			new_nid = first_online_node;
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC  3/5] powerpc:numa create 1:1 mappaing between chipid and nid
  2015-09-27 18:29 [PATCH RFC 0/5] powerpc:numa Add serial nid support Raghavendra K T
  2015-09-27 18:29 ` [PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table Raghavendra K T
  2015-09-27 18:29 ` [PATCH RFC 2/5] powerpc:numa Rename functions referring to nid as chipid Raghavendra K T
@ 2015-09-27 18:29 ` Raghavendra K T
  2015-09-28 17:28   ` Nishanth Aravamudan
  2015-09-28 17:35   ` Nishanth Aravamudan
  2015-09-27 18:29 ` [PATCH RFC 4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping Raghavendra K T
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: Raghavendra K T @ 2015-09-27 18:29 UTC (permalink / raw)
  To: benh, paulus, mpe
  Cc: anton, linuxppc-dev, linux-kernel, cl, nacc, gkurz, grant.likely,
	nikunj, raghavendra.kt, khandual

Once we have made the distinction between nid and chipid
create a 1:1 mapping between them. This makes compacting the
nids easy later.

No functionality change.

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f84ed2f..dd2073b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -264,6 +264,17 @@ out:
 	return chipid;
 }
 
+
+ /* Return the nid from associativity */
+static int associativity_to_nid(const __be32 *associativity)
+{
+	int nid;
+
+	nid = associativity_to_chipid(associativity);
+	return nid;
+}
+
+
 /* Returns the chipid associated with the given device tree node,
  * or -1 if not found.
  */
@@ -278,6 +289,17 @@ static int of_node_to_chipid_single(struct device_node *device)
 	return chipid;
 }
 
+/*
+ * Returns nid from the chipid associated with given tree node
+ */
+static int of_node_to_nid_single(struct device_node *device)
+{
+	int nid;
+
+	nid = of_node_to_chipid_single(device);
+	return nid;
+}
+
 /* Walk the device tree upwards, looking for an associativity id */
 int of_node_to_nid(struct device_node *device)
 {
@@ -286,7 +308,7 @@ int of_node_to_nid(struct device_node *device)
 
 	of_node_get(device);
 	while (device) {
-		nid = of_node_to_chipid_single(device);
+		nid = of_node_to_nid_single(device);
 		if (nid != -1)
 			break;
 
@@ -498,7 +520,7 @@ static int of_get_assoc_arrays(struct device_node *memory,
 }
 
 /*
- * This is like of_node_to_chipid_single() for memory represented in the
+ * This is like of_node_to_nid_single() for memory represented in the
  * ibm,dynamic-reconfiguration-memory node.
  */
 static int of_drconf_to_nid_single(struct of_drconf_cell *drmem,
@@ -557,7 +579,7 @@ static int numa_setup_cpu(unsigned long lcpu)
 			goto out;
 	}
 
-	nid = of_node_to_chipid_single(cpu);
+	nid = of_node_to_nid_single(cpu);
 
 out_present:
 	if (nid < 0 || !node_online(nid))
@@ -754,7 +776,7 @@ static int __init parse_numa_properties(void)
 
 		cpu = of_get_cpu_node(i, NULL);
 		BUG_ON(!cpu);
-		nid = of_node_to_chipid_single(cpu);
+		nid = of_node_to_nid_single(cpu);
 		of_node_put(cpu);
 
 		/*
@@ -796,7 +818,7 @@ new_range:
 		 * have associativity properties.  If none, then
 		 * everything goes to default_nid.
 		 */
-		nid = of_node_to_chipid_single(memory);
+		nid = of_node_to_nid_single(memory);
 		if (nid < 0)
 			nid = default_nid;
 
@@ -1119,7 +1141,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr)
 			if ((scn_addr < start) || (scn_addr >= (start + size)))
 				continue;
 
-			nid = of_node_to_chipid_single(memory);
+			nid = of_node_to_nid_single(memory);
 			break;
 		}
 
@@ -1415,7 +1437,7 @@ int arch_update_cpu_topology(void)
 
 		/* Use associativity from first thread for all siblings */
 		vphn_get_associativity(cpu, associativity);
-		new_nid = associativity_to_chipid(associativity);
+		new_nid = associativity_to_nid(associativity);
 		if (new_nid < 0 || !node_online(new_nid))
 			new_nid = first_online_node;
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC  4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping
  2015-09-27 18:29 [PATCH RFC 0/5] powerpc:numa Add serial nid support Raghavendra K T
                   ` (2 preceding siblings ...)
  2015-09-27 18:29 ` [PATCH RFC 3/5] powerpc:numa create 1:1 mappaing between chipid and nid Raghavendra K T
@ 2015-09-27 18:29 ` Raghavendra K T
  2015-09-28 17:32   ` Nishanth Aravamudan
  2015-09-27 18:29 ` [PATCH RFC 5/5] powerpc:numa Use chipid to nid mapping to get serial numa node ids Raghavendra K T
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Raghavendra K T @ 2015-09-27 18:29 UTC (permalink / raw)
  To: benh, paulus, mpe
  Cc: anton, linuxppc-dev, linux-kernel, cl, nacc, gkurz, grant.likely,
	nikunj, raghavendra.kt, khandual

Create arrays that maps serial nids and sparse chipids.

Note: My original idea had only two arrays of chipid to nid map. Final
code is inspired by driver/acpi/numa.c that maps a proximity node with
a logical node by Takayoshi Kochi <t-kochi@bq.jp.nec.com>, and thus
uses an additional chipid_map nodemask. The mask helps in first unused
nid easily by knowing first unset bit in the mask.

No change in functionality.

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index dd2073b..f015cad 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -63,6 +63,11 @@ static int form1_affinity;
 static int distance_ref_points_depth;
 static const __be32 *distance_ref_points;
 static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
+static nodemask_t chipid_map = NODE_MASK_NONE;
+static int chipid_to_nid_map[MAX_NUMNODES]
+				= { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
+static int nid_to_chipid_map[MAX_NUMNODES]
+				= { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
 
 /*
  * Allocate node_to_cpumask_map based on number of available nodes
@@ -133,6 +138,48 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn,
 	return 0;
 }
 
+int chipid_to_nid(int chipid)
+{
+	if (chipid < 0)
+		return NUMA_NO_NODE;
+	return chipid_to_nid_map[chipid];
+}
+
+int nid_to_chipid(int nid)
+{
+	if (nid < 0)
+		return NUMA_NO_NODE;
+	return nid_to_chipid_map[nid];
+}
+
+static void __map_chipid_to_nid(int chipid, int nid)
+{
+	if (chipid_to_nid_map[chipid] == NUMA_NO_NODE
+	     || nid < chipid_to_nid_map[chipid])
+		chipid_to_nid_map[chipid] = nid;
+	if (nid_to_chipid_map[nid] == NUMA_NO_NODE
+	    || chipid < nid_to_chipid_map[nid])
+		nid_to_chipid_map[nid] = chipid;
+}
+
+int map_chipid_to_nid(int chipid)
+{
+	int nid;
+
+	if (chipid < 0 || chipid >= MAX_NUMNODES)
+		return NUMA_NO_NODE;
+
+	nid = chipid_to_nid_map[chipid];
+	if (nid == NUMA_NO_NODE) {
+		if (nodes_weight(chipid_map) >= MAX_NUMNODES)
+			return NUMA_NO_NODE;
+		nid = first_unset_node(chipid_map);
+		__map_chipid_to_nid(chipid, nid);
+		node_set(nid, chipid_map);
+	}
+	return nid;
+}
+
 int numa_cpu_lookup(int cpu)
 {
 	return numa_cpu_lookup_table[cpu];
@@ -264,7 +311,6 @@ out:
 	return chipid;
 }
 
-
  /* Return the nid from associativity */
 static int associativity_to_nid(const __be32 *associativity)
 {
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC  5/5] powerpc:numa Use chipid to nid mapping to get serial numa node ids
  2015-09-27 18:29 [PATCH RFC 0/5] powerpc:numa Add serial nid support Raghavendra K T
                   ` (3 preceding siblings ...)
  2015-09-27 18:29 ` [PATCH RFC 4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping Raghavendra K T
@ 2015-09-27 18:29 ` Raghavendra K T
  2015-09-28 10:44 ` [PATCH RFC 0/5] powerpc:numa Add serial nid support Denis Kirjanov
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Raghavendra K T @ 2015-09-27 18:29 UTC (permalink / raw)
  To: benh, paulus, mpe
  Cc: anton, linuxppc-dev, linux-kernel, cl, nacc, gkurz, grant.likely,
	nikunj, raghavendra.kt, khandual

Also properly initialize numa distance table for serial nids.

Problem: Powerpc supports sparse nid numbering which could affect
1) memory footprint 2) virtualization use cases

Current solution:
The patch maps sprase chipid got fromn device tree to serail
nids.

Result before:
node   0   1  16  17
  0:  10  20  40  40
  1:  20  10  40  40
 16:  40  40  10  20
 17:  40  40  20  10

After:
node   0   1   2   3
  0:  10  20  40  40
  1:  20  10  40  40
  2:  40  40  10  20
  3:  40  40  20  10

Testing: Scenarios tested on  baremetal and KVM guest with 4 nodes
1) offlining and onlining memory and cpus
2) Running the tests from numactl source.
3) Creating 1000s of docker containers stressing the system

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f015cad..873ac8c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -304,7 +304,8 @@ static int associativity_to_chipid(const __be32 *associativity)
 		/*
 		 * Skip the length field and send start of associativity array
 		 */
-		initialize_distance_lookup_table(chipid, associativity + 1);
+		initialize_distance_lookup_table(chipid_to_nid(chipid),
+						 associativity + 1);
 	}
 
 out:
@@ -314,9 +315,10 @@ out:
  /* Return the nid from associativity */
 static int associativity_to_nid(const __be32 *associativity)
 {
-	int nid;
+	int chipid, nid;
 
-	nid = associativity_to_chipid(associativity);
+	chipid = associativity_to_chipid(associativity);
+	nid = map_chipid_to_nid(chipid);
 	return nid;
 }
 
@@ -340,9 +342,10 @@ static int of_node_to_chipid_single(struct device_node *device)
  */
 static int of_node_to_nid_single(struct device_node *device)
 {
-	int nid;
+	int chipid, nid;
 
-	nid = of_node_to_chipid_single(device);
+	chipid = of_node_to_chipid_single(device);
+	nid = map_chipid_to_nid(chipid);
 	return nid;
 }
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC  1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table
  2015-09-27 18:29 ` [PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table Raghavendra K T
@ 2015-09-27 18:41   ` Raghavendra K T
  2015-10-06 10:17   ` [RFC, " Michael Ellerman
  1 sibling, 0 replies; 26+ messages in thread
From: Raghavendra K T @ 2015-09-27 18:41 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: benh, paulus, mpe, anton, linuxppc-dev, linux-kernel, cl, nacc,
	gkurz, grant.likely, nikunj, khandual

On 09/27/2015 11:59 PM, Raghavendra K T wrote:
> We access numa_cpu_lookup_table array directly in all the places
> to read/update numa cpu lookup information. Instead use a helper
> function to update.
>
> This is helpful in changing the way numa<-->cpu mapping in single
> place when needed.
>
> This is a cosmetic change, no change in functionality.
>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.inet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
Sorry:  changelog edit messed it.. :(

>   arch/powerpc/include/asm/mmzone.h |  2 +-
>   arch/powerpc/kernel/smp.c         | 10 +++++-----
>   arch/powerpc/mm/numa.c            | 28 +++++++++++++++++-----------
>   3 files changed, 23 insertions(+), 17 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
> index 7b58917..c24a5f4 100644
> --- a/arch/powerpc/include/asm/mmzone.h
> +++ b/arch/powerpc/include/asm/mmzone.h
> @@ -29,7 +29,7 @@ extern struct pglist_data *node_data[];
>    * Following are specific to this numa platform.
>    */
>
> -extern int numa_cpu_lookup_table[];
> +extern int numa_cpu_lookup(int cpu);
>   extern cpumask_var_t node_to_cpumask_map[];
>   #ifdef CONFIG_MEMORY_HOTPLUG
>   extern unsigned long max_pfn;
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index ec9ec20..56fbe9e 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>   		 * numa_node_id() works after this.
>   		 */
>   		if (cpu_present(cpu)) {
> -			set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
> +			set_cpu_numa_node(cpu, numa_cpu_lookup(cpu));
>   			set_cpu_numa_mem(cpu,
> -				local_memory_node(numa_cpu_lookup_table[cpu]));
> +				local_memory_node(numa_cpu_lookup(cpu)));
>   		}
>   	}
>
> @@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void)
>   #ifdef CONFIG_PPC64
>   	paca[boot_cpuid].__current = current;
>   #endif
> -	set_numa_node(numa_cpu_lookup_table[boot_cpuid]);
> +	set_numa_node(numa_cpu_lookup(boot_cpuid));
>   	current_set[boot_cpuid] = task_thread_info(current);
>   }
>
> @@ -718,8 +718,8 @@ void start_secondary(void *unused)
>   	}
>   	traverse_core_siblings(cpu, true);
>
> -	set_numa_node(numa_cpu_lookup_table[cpu]);
> -	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
> +	set_numa_node(numa_cpu_lookup(cpu));
> +	set_numa_mem(local_memory_node(numa_cpu_lookup(cpu)));
>
>   	smp_wmb();
>   	notify_cpu_starting(cpu);
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 8b9502a..d5e6eee 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS];
>   cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
>   struct pglist_data *node_data[MAX_NUMNODES];
>
> -EXPORT_SYMBOL(numa_cpu_lookup_table);
>   EXPORT_SYMBOL(node_to_cpumask_map);
>   EXPORT_SYMBOL(node_data);
>
> @@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn,
>   	return 0;
>   }
>
> -static void reset_numa_cpu_lookup_table(void)
> +int numa_cpu_lookup(int cpu)
>   {
> -	unsigned int cpu;
> -
> -	for_each_possible_cpu(cpu)
> -		numa_cpu_lookup_table[cpu] = -1;
> +	return numa_cpu_lookup_table[cpu];
>   }
> +EXPORT_SYMBOL(numa_cpu_lookup);
>
> -static void update_numa_cpu_lookup_table(unsigned int cpu, int node)
> +static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node)
>   {
>   	numa_cpu_lookup_table[cpu] = node;
>   }
>
> +static void reset_numa_cpu_lookup_table(void)
> +{
> +	unsigned int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		update_numa_cpu_lookup_table(cpu, -1);
> +}
> +
>   static void map_cpu_to_node(int cpu, int node)
>   {
>   	update_numa_cpu_lookup_table(cpu, node);
> @@ -160,7 +165,7 @@ static void map_cpu_to_node(int cpu, int node)
>   #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
>   static void unmap_cpu_from_node(unsigned long cpu)
>   {
> -	int node = numa_cpu_lookup_table[cpu];
> +	int node = numa_cpu_lookup(cpu);
>
>   	dbg("removing cpu %lu from node %d\n", cpu, node);
>
> @@ -536,7 +541,8 @@ static int numa_setup_cpu(unsigned long lcpu)
>   	 * directly instead of querying the firmware, since it represents
>   	 * the most recent mapping notified to us by the platform (eg: VPHN).
>   	 */
> -	if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) {
> +	nid = numa_cpu_lookup(lcpu);
> +	if (nid >= 0) {
>   		map_cpu_to_node(lcpu, nid);
>   		return nid;
>   	}
> @@ -1413,7 +1419,7 @@ int arch_update_cpu_topology(void)
>   		if (new_nid < 0 || !node_online(new_nid))
>   			new_nid = first_online_node;
>
> -		if (new_nid == numa_cpu_lookup_table[cpu]) {
> +		if (new_nid == numa_cpu_lookup(cpu)) {
>   			cpumask_andnot(&cpu_associativity_changes_mask,
>   					&cpu_associativity_changes_mask,
>   					cpu_sibling_mask(cpu));
> @@ -1425,7 +1431,7 @@ int arch_update_cpu_topology(void)
>   			ud = &updates[i++];
>   			ud->cpu = sibling;
>   			ud->new_nid = new_nid;
> -			ud->old_nid = numa_cpu_lookup_table[sibling];
> +			ud->old_nid = numa_cpu_lookup(sibling);
>   			cpumask_set_cpu(sibling, &updated_cpus);
>   			if (i < weight)
>   				ud->next = &updates[i];
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
  2015-09-27 18:29 [PATCH RFC 0/5] powerpc:numa Add serial nid support Raghavendra K T
                   ` (4 preceding siblings ...)
  2015-09-27 18:29 ` [PATCH RFC 5/5] powerpc:numa Use chipid to nid mapping to get serial numa node ids Raghavendra K T
@ 2015-09-28 10:44 ` Denis Kirjanov
  2015-09-28 17:04   ` Nishanth Aravamudan
  2015-09-28 17:34 ` Nishanth Aravamudan
  2015-10-06 10:25 ` Michael Ellerman
  7 siblings, 1 reply; 26+ messages in thread
From: Denis Kirjanov @ 2015-09-28 10:44 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: benh, paulus, mpe, nikunj, nacc, linux-kernel, anton,
	grant.likely, cl, khandual, linuxppc-dev, gkurz

On 9/27/15, Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> Problem description:
> Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
> numbered (possibly) as 0,1,16,17. At a lower level, we map the chipid
> got from device tree is naturally mapped (directly) to nid.

Interesting thing to play with, I'll try to test it on my POWER7 box,
but it doesn't have the OPAL layer :(

>
> Potential side effect of that is:
>
> 1) There are several places in kernel that assumes serial node numbering.
> and memory allocations assume that all the nodes from 0-(highest nid)
> exist inturn ending up allocating memory for the nodes that does not exist.
>
> 2) For virtualization use cases (such as qemu, libvirt, openstack), mapping
> sparse nid of the host system to contiguous nids of guest (numa affinity,
> placement) could be a challenge.
>
> Possible Solutions:
> 1) Handling the memory allocations is kernel case by case: Though in some
> cases it is easy to achieve, some cases may be intrusive/not trivial.
> at the end it does not handle side effect (2) above.
>
> 2) Map the sparse chipid got from device tree to a serial nid at kernel
> level (The idea proposed in this series).
> Pro: It is more natural to handle at kernel level than at lower (OPAL)
> layer.
> con: The chipid is in device tree no longer the same as nid in kernel
>
> 3) Let the lower layer (OPAL) give the serial node ids after parsing the
> chipid and the associativity etc [ either as a separate item in device tree
> or by compacting the chipid numbers ]
> Pros: kernel, device tree are on same page and less change in kernel
> Con: is it the functionality expected in lower layer
>
> As mentioned above, current patch series tries to map chipid from lower
> layer
> to a contiguos nid at kernel level keeping the node distance calculation and
> so on intact.
>
> Result:
> Before the patch: numactl -H
>
> available: 4 nodes (0-1,16-17)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> 24 25 26 27 28 29 30 31
> node 0 size: 31665 MB
> node 0 free: 29836 MB
> node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
> 53 54 55 56 57 58 59 60 61 62 63
> node 1 size: 32722 MB
> node 1 free: 32019 MB
> node 16 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
> 85 86 87 88 89 90 91 92 93 94 95
> node 16 size: 32571 MB
> node 16 free: 31222 MB
> node 17 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
> 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
> node 17 size: 0 MB
> node 17 free: 0 MB
> node distances:
> node   0   1  16  17
>   0:  10  20  40  40
>   1:  20  10  40  40
>  16:  40  40  10  20
>  17:  40  40  20  10
>
> After the patch: numactl -H
>
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> 24 25 26 27 28 29 30 31
> node 0 size: 31665 MB
> node 0 free: 30657 MB
> node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
> 53 54 55 56 57 58 59 60 61 62 63
> node 1 size: 32722 MB
> node 1 free: 32566 MB
> node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
> 85 86 87 88 89 90 91 92 93 94 95
> node 2 size: 32571 MB
> node 2 free: 32401 MB
> node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
> 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
> node 3 size: 0 MB
> node 3 free: 0 MB
> node distances:
> node   0   1   2   3
>   0:  10  20  40  40
>   1:  20  10  40  40
>   2:  40  40  10  20
>   3:  40  40  20  10
>
> (note that numa distances are intact). Apart from this, The following tests
> are done with the patched kernel (both baremetal and KVM guest with multiple
> nodes) to ensure there is no breakage.
>
> 1) offlining and onlining of memory in /sys/devices/system/node/nodeX path
>
> 2) offlining and onlining of cpus in /sys/devices/system/cpu/ path
>
> 3) Numactl tests from
> ftp://oss.sgi.com/www/projects/libnuma/download/numactl-2.0.10.tar.gz
>
> (infact there were more breakage before the patch because of sparse nid
> and memoryless node cases of powerpc)
>
> 4) Thousands of docker containers were spawned.
>
> Please let me know your comments.
>
>  patch 1-3: cleanup patches
>  patch 4: Adds helper function to map nid and chipid
>  patch 5: Uses the mapping to get serial nid
>
> Raghavendra K T (5):
>   powerpc:numa Add numa_cpu_lookup function to update lookup table
>   powerpc:numa Rename functions referring to nid as chipid
>   powerpc:numa create 1:1 mappaing between chipid and nid
>   powerpc:numa Add helper functions to maintain chipid to nid mapping
>   powerpc:numa Use chipid to nid mapping to get serial numa node ids
>
>  arch/powerpc/include/asm/mmzone.h |   2 +-
>  arch/powerpc/kernel/smp.c         |  10 ++--
>  arch/powerpc/mm/numa.c            | 121
> +++++++++++++++++++++++++++++++-------
>  3 files changed, 105 insertions(+), 28 deletions(-)
>
> --
> 1.7.11.7
>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
  2015-09-28 10:44 ` [PATCH RFC 0/5] powerpc:numa Add serial nid support Denis Kirjanov
@ 2015-09-28 17:04   ` Nishanth Aravamudan
  2015-09-29 18:20     ` Raghavendra K T
  0 siblings, 1 reply; 26+ messages in thread
From: Nishanth Aravamudan @ 2015-09-28 17:04 UTC (permalink / raw)
  To: Denis Kirjanov
  Cc: Raghavendra K T, benh, paulus, mpe, nikunj, linux-kernel, anton,
	grant.likely, cl, khandual, linuxppc-dev, gkurz

On 28.09.2015 [13:44:42 +0300], Denis Kirjanov wrote:
> On 9/27/15, Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> > Problem description:
> > Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
> > numbered (possibly) as 0,1,16,17. At a lower level, we map the chipid
> > got from device tree is naturally mapped (directly) to nid.
> 
> Interesting thing to play with, I'll try to test it on my POWER7 box,
> but it doesn't have the OPAL layer :(

Note that it's also interesting to try it under PowerVM, with odd NUMA
topologies and report any issues found :)


-Nish


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC  2/5] powerpc:numa Rename functions referring to nid as chipid
  2015-09-27 18:29 ` [PATCH RFC 2/5] powerpc:numa Rename functions referring to nid as chipid Raghavendra K T
@ 2015-09-28 17:27   ` Nishanth Aravamudan
  2015-09-29 18:31     ` Raghavendra K T
  0 siblings, 1 reply; 26+ messages in thread
From: Nishanth Aravamudan @ 2015-09-28 17:27 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: benh, paulus, mpe, anton, linuxppc-dev, linux-kernel, cl, gkurz,
	grant.likely, nikunj, khandual

On 27.09.2015 [23:59:10 +0530], Raghavendra K T wrote:
> There is no change in the fuctionality
> 
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/numa.c | 42 +++++++++++++++++++++---------------------
>  1 file changed, 21 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index d5e6eee..f84ed2f 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -235,47 +235,47 @@ static void initialize_distance_lookup_table(int nid,
>  	}
>  }
> 
> -/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
> +/* Returns chipid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
>   * info is found.
>   */
> -static int associativity_to_nid(const __be32 *associativity)
> +static int associativity_to_chipid(const __be32 *associativity)

This is confusing to me. This function is also used by the DLPAR code
under PowerVM to indicate what node the CPU is on -- not a chip (which I
don't believe is exposed at all under PowerVM).

>  {
> -	int nid = -1;
> +	int chipid = -1;
> 
>  	if (min_common_depth == -1)
>  		goto out;
> 
>  	if (of_read_number(associativity, 1) >= min_common_depth)
> -		nid = of_read_number(&associativity[min_common_depth], 1);
> +		chipid = of_read_number(&associativity[min_common_depth], 1);

Doesn't this, in the PAPR documentation, specifically refer to the NODE
level domain?

>  	/* POWER4 LPAR uses 0xffff as invalid node */
> -	if (nid == 0xffff || nid >= MAX_NUMNODES)
> -		nid = -1;
> +	if (chipid == 0xffff || chipid >= MAX_NUMNODES)
> +		chipid = -1;
> 
> -	if (nid > 0 &&
> +	if (chipid > 0 &&
>  		of_read_number(associativity, 1) >= distance_ref_points_depth) {
>  		/*
>  		 * Skip the length field and send start of associativity array
>  		 */
> -		initialize_distance_lookup_table(nid, associativity + 1);
> +		initialize_distance_lookup_table(chipid, associativity + 1);
>  	}
> 
>  out:
> -	return nid;
> +	return chipid;
>  }
> 
> -/* Returns the nid associated with the given device tree node,
> +/* Returns the chipid associated with the given device tree node,
>   * or -1 if not found.
>   */
> -static int of_node_to_nid_single(struct device_node *device)
> +static int of_node_to_chipid_single(struct device_node *device)
>  {
> -	int nid = -1;
> +	int chipid = -1;
>  	const __be32 *tmp;
> 
>  	tmp = of_get_associativity(device);
>  	if (tmp)
> -		nid = associativity_to_nid(tmp);
> -	return nid;
> +		chipid = associativity_to_chipid(tmp);
> +	return chipid;
>  }
> 
>  /* Walk the device tree upwards, looking for an associativity id */
> @@ -286,7 +286,7 @@ int of_node_to_nid(struct device_node *device)
> 
>  	of_node_get(device);
>  	while (device) {
> -		nid = of_node_to_nid_single(device);
> +		nid = of_node_to_chipid_single(device);
>  		if (nid != -1)
>  			break;
> 
> @@ -498,7 +498,7 @@ static int of_get_assoc_arrays(struct device_node *memory,
>  }
> 
>  /*
> - * This is like of_node_to_nid_single() for memory represented in the
> + * This is like of_node_to_chipid_single() for memory represented in the
>   * ibm,dynamic-reconfiguration-memory node.
>   */
>  static int of_drconf_to_nid_single(struct of_drconf_cell *drmem,
> @@ -557,7 +557,7 @@ static int numa_setup_cpu(unsigned long lcpu)
>  			goto out;
>  	}
> 
> -	nid = of_node_to_nid_single(cpu);
> +	nid = of_node_to_chipid_single(cpu);
> 
>  out_present:
>  	if (nid < 0 || !node_online(nid))
> @@ -754,7 +754,7 @@ static int __init parse_numa_properties(void)
> 
>  		cpu = of_get_cpu_node(i, NULL);
>  		BUG_ON(!cpu);
> -		nid = of_node_to_nid_single(cpu);
> +		nid = of_node_to_chipid_single(cpu);
>  		of_node_put(cpu);
> 
>  		/*
> @@ -796,7 +796,7 @@ new_range:
>  		 * have associativity properties.  If none, then
>  		 * everything goes to default_nid.
>  		 */
> -		nid = of_node_to_nid_single(memory);
> +		nid = of_node_to_chipid_single(memory);
>  		if (nid < 0)
>  			nid = default_nid;
> 
> @@ -1119,7 +1119,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr)
>  			if ((scn_addr < start) || (scn_addr >= (start + size)))
>  				continue;
> 
> -			nid = of_node_to_nid_single(memory);
> +			nid = of_node_to_chipid_single(memory);
>  			break;
>  		}
> 
> @@ -1415,7 +1415,7 @@ int arch_update_cpu_topology(void)
> 
>  		/* Use associativity from first thread for all siblings */
>  		vphn_get_associativity(cpu, associativity);
> -		new_nid = associativity_to_nid(associativity);
> +		new_nid = associativity_to_chipid(associativity);

If you are getting a chipid, shouldn't you be assigning it to a variable
called 'new_chipid'?

>  		if (new_nid < 0 || !node_online(new_nid))
>  			new_nid = first_online_node;
> 
> -- 
> 1.7.11.7
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC  3/5] powerpc:numa create 1:1 mappaing between chipid and nid
  2015-09-27 18:29 ` [PATCH RFC 3/5] powerpc:numa create 1:1 mappaing between chipid and nid Raghavendra K T
@ 2015-09-28 17:28   ` Nishanth Aravamudan
  2015-09-29 18:35     ` Raghavendra K T
  2015-09-28 17:35   ` Nishanth Aravamudan
  1 sibling, 1 reply; 26+ messages in thread
From: Nishanth Aravamudan @ 2015-09-28 17:28 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: benh, paulus, mpe, anton, linuxppc-dev, linux-kernel, cl, gkurz,
	grant.likely, nikunj, khandual

On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote:
> Once we have made the distinction between nid and chipid
> create a 1:1 mapping between them. This makes compacting the
> nids easy later.


Didn't the previous patch just do the opposite of...

> @@ -286,7 +308,7 @@ int of_node_to_nid(struct device_node *device)
> 
>  	of_node_get(device);
>  	while (device) {
> -		nid = of_node_to_chipid_single(device);
> +		nid = of_node_to_nid_single(device);
>  		if (nid != -1)
>  			break;
> 
> @@ -498,7 +520,7 @@ static int of_get_assoc_arrays(struct device_node *memory,
>  }
> 
>  /*
> - * This is like of_node_to_chipid_single() for memory represented in the
> + * This is like of_node_to_nid_single() for memory represented in the
>   * ibm,dynamic-reconfiguration-memory node.
>   */
>  static int of_drconf_to_nid_single(struct of_drconf_cell *drmem,
> @@ -557,7 +579,7 @@ static int numa_setup_cpu(unsigned long lcpu)
>  			goto out;
>  	}
> 
> -	nid = of_node_to_chipid_single(cpu);
> +	nid = of_node_to_nid_single(cpu);
> 
>  out_present:
>  	if (nid < 0 || !node_online(nid))
> @@ -754,7 +776,7 @@ static int __init parse_numa_properties(void)
> 
>  		cpu = of_get_cpu_node(i, NULL);
>  		BUG_ON(!cpu);
> -		nid = of_node_to_chipid_single(cpu);
> +		nid = of_node_to_nid_single(cpu);
>  		of_node_put(cpu);
> 
>  		/*
> @@ -796,7 +818,7 @@ new_range:
>  		 * have associativity properties.  If none, then
>  		 * everything goes to default_nid.
>  		 */
> -		nid = of_node_to_chipid_single(memory);
> +		nid = of_node_to_nid_single(memory);
>  		if (nid < 0)
>  			nid = default_nid;
> 
> @@ -1119,7 +1141,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr)
>  			if ((scn_addr < start) || (scn_addr >= (start + size)))
>  				continue;
> 
> -			nid = of_node_to_chipid_single(memory);
> +			nid = of_node_to_nid_single(memory);
>  			break;
>  		}
> 
> @@ -1415,7 +1437,7 @@ int arch_update_cpu_topology(void)
> 
>  		/* Use associativity from first thread for all siblings */
>  		vphn_get_associativity(cpu, associativity);
> -		new_nid = associativity_to_chipid(associativity);
> +		new_nid = associativity_to_nid(associativity);
>  		if (new_nid < 0 || !node_online(new_nid))
>  			new_nid = first_online_node;
> 

Why is this churn useful?


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC  4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping
  2015-09-27 18:29 ` [PATCH RFC 4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping Raghavendra K T
@ 2015-09-28 17:32   ` Nishanth Aravamudan
  2015-09-29 19:00     ` Raghavendra K T
  0 siblings, 1 reply; 26+ messages in thread
From: Nishanth Aravamudan @ 2015-09-28 17:32 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: benh, paulus, mpe, anton, linuxppc-dev, linux-kernel, cl, gkurz,
	grant.likely, nikunj, khandual

On 27.09.2015 [23:59:12 +0530], Raghavendra K T wrote:
> Create arrays that maps serial nids and sparse chipids.
> 
> Note: My original idea had only two arrays of chipid to nid map. Final
> code is inspired by driver/acpi/numa.c that maps a proximity node with
> a logical node by Takayoshi Kochi <t-kochi@bq.jp.nec.com>, and thus
> uses an additional chipid_map nodemask. The mask helps in first unused
> nid easily by knowing first unset bit in the mask.
> 
> No change in functionality.
> 
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/numa.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 47 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index dd2073b..f015cad 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -63,6 +63,11 @@ static int form1_affinity;
>  static int distance_ref_points_depth;
>  static const __be32 *distance_ref_points;
>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
> +static nodemask_t chipid_map = NODE_MASK_NONE;
> +static int chipid_to_nid_map[MAX_NUMNODES]
> +				= { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };

Hrm, conceptually there are *more* chips than nodes, right? So what
guarantees we won't see > MAX_NUMNODES chips?

> +static int nid_to_chipid_map[MAX_NUMNODES]
> +				= { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
> 
>  /*
>   * Allocate node_to_cpumask_map based on number of available nodes
> @@ -133,6 +138,48 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn,
>  	return 0;
>  }
> 
> +int chipid_to_nid(int chipid)
> +{
> +	if (chipid < 0)
> +		return NUMA_NO_NODE;

Do you really want to support these cases? Or should they be
bugs/warnings indicating that you got an unexpected input? Or at least
WARN_ON_ONCE?

> +	return chipid_to_nid_map[chipid];
> +}
> +
> +int nid_to_chipid(int nid)
> +{
> +	if (nid < 0)
> +		return NUMA_NO_NODE;
> +	return nid_to_chipid_map[nid];
> +}
> +
> +static void __map_chipid_to_nid(int chipid, int nid)
> +{
> +	if (chipid_to_nid_map[chipid] == NUMA_NO_NODE
> +	     || nid < chipid_to_nid_map[chipid])
> +		chipid_to_nid_map[chipid] = nid;
> +	if (nid_to_chipid_map[nid] == NUMA_NO_NODE
> +	    || chipid < nid_to_chipid_map[nid])
> +		nid_to_chipid_map[nid] = chipid;
> +}

chip <-> node mapping is a static (physical) concept, right? Should we
emit some debugging if for some reason we get a runtime call to remap
an already mapped chip to a new node?

> +
> +int map_chipid_to_nid(int chipid)
> +{
> +	int nid;
> +
> +	if (chipid < 0 || chipid >= MAX_NUMNODES)
> +		return NUMA_NO_NODE;
> +
> +	nid = chipid_to_nid_map[chipid];
> +	if (nid == NUMA_NO_NODE) {
> +		if (nodes_weight(chipid_map) >= MAX_NUMNODES)
> +			return NUMA_NO_NODE;

If you create a KVM guest with a bogus topology, doesn't this just start
losing NUMA information for very high-noded guests?

> +		nid = first_unset_node(chipid_map);
> +		__map_chipid_to_nid(chipid, nid);
> +		node_set(nid, chipid_map);
> +	}
> +	return nid;
> +}
> +
>  int numa_cpu_lookup(int cpu)
>  {
>  	return numa_cpu_lookup_table[cpu];
> @@ -264,7 +311,6 @@ out:
>  	return chipid;
>  }
> 
> -

stray change?

>   /* Return the nid from associativity */
>  static int associativity_to_nid(const __be32 *associativity)
>  {
> -- 
> 1.7.11.7
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC  0/5] powerpc:numa Add serial nid support
  2015-09-27 18:29 [PATCH RFC 0/5] powerpc:numa Add serial nid support Raghavendra K T
                   ` (5 preceding siblings ...)
  2015-09-28 10:44 ` [PATCH RFC 0/5] powerpc:numa Add serial nid support Denis Kirjanov
@ 2015-09-28 17:34 ` Nishanth Aravamudan
  2015-09-29 19:10   ` Raghavendra K T
  2015-10-06 10:25 ` Michael Ellerman
  7 siblings, 1 reply; 26+ messages in thread
From: Nishanth Aravamudan @ 2015-09-28 17:34 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: benh, paulus, mpe, anton, linuxppc-dev, linux-kernel, cl, gkurz,
	grant.likely, nikunj, khandual

On 27.09.2015 [23:59:08 +0530], Raghavendra K T wrote:
> Problem description:
> Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
> numbered (possibly) as 0,1,16,17. At a lower level, we map the chipid
> got from device tree is naturally mapped (directly) to nid.

chipid is a OPAL concept, I believe, and not documented in PAPR... How
does this work under PowerVM?

> Potential side effect of that is:
> 
> 1) There are several places in kernel that assumes serial node numbering.
> and memory allocations assume that all the nodes from 0-(highest nid)
> exist inturn ending up allocating memory for the nodes that does not exist.
> 
> 2) For virtualization use cases (such as qemu, libvirt, openstack), mapping
> sparse nid of the host system to contiguous nids of guest (numa affinity,
> placement) could be a challenge.
> 
> Possible Solutions:
> 1) Handling the memory allocations is kernel case by case: Though in some
> cases it is easy to achieve, some cases may be intrusive/not trivial. 
> at the end it does not handle side effect (2) above.
> 
> 2) Map the sparse chipid got from device tree to a serial nid at kernel
> level (The idea proposed in this series).
> Pro: It is more natural to handle at kernel level than at lower (OPAL) layer.
> con: The chipid is in device tree no longer the same as nid in kernel

Is there any debugging/logging? Looks like not -- so how does a sysadmin
map from firmware-provided values to the Linux values? That's going to
make debugging of large systems (PowerVM or otherwise) less than
pleasant, it seems? Possibly you could put something in sysfs?

-Nish


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC  3/5] powerpc:numa create 1:1 mappaing between chipid and nid
  2015-09-27 18:29 ` [PATCH RFC 3/5] powerpc:numa create 1:1 mappaing between chipid and nid Raghavendra K T
  2015-09-28 17:28   ` Nishanth Aravamudan
@ 2015-09-28 17:35   ` Nishanth Aravamudan
  2015-09-29 19:20     ` Raghavendra K T
  1 sibling, 1 reply; 26+ messages in thread
From: Nishanth Aravamudan @ 2015-09-28 17:35 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: benh, paulus, mpe, anton, linuxppc-dev, linux-kernel, cl, gkurz,
	grant.likely, nikunj, khandual

On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote:
> Once we have made the distinction between nid and chipid
> create a 1:1 mapping between them. This makes compacting the
> nids easy later.
> 
> No functionality change.
> 
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/numa.c | 36 +++++++++++++++++++++++++++++-------
>  1 file changed, 29 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index f84ed2f..dd2073b 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -264,6 +264,17 @@ out:
>  	return chipid;
>  }
> 
> +
> + /* Return the nid from associativity */
> +static int associativity_to_nid(const __be32 *associativity)
> +{
> +	int nid;
> +
> +	nid = associativity_to_chipid(associativity);
> +	return nid;
> +}

This is ultimately confusing. You are assigning the semantic return
value of a chipid to a nid -- is it a nid or a chipid? Shouldn't the
variable naming be consistent?


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
  2015-09-28 17:04   ` Nishanth Aravamudan
@ 2015-09-29 18:20     ` Raghavendra K T
  2015-09-29 19:46       ` Denis Kirjanov
  0 siblings, 1 reply; 26+ messages in thread
From: Raghavendra K T @ 2015-09-29 18:20 UTC (permalink / raw)
  To: Nishanth Aravamudan, Denis Kirjanov
  Cc: benh, paulus, mpe, nikunj, linux-kernel, anton, grant.likely, cl,
	khandual, linuxppc-dev, gkurz

On 09/28/2015 10:34 PM, Nishanth Aravamudan wrote:
> On 28.09.2015 [13:44:42 +0300], Denis Kirjanov wrote:
>> On 9/27/15, Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>> Problem description:
>>> Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
>>> numbered (possibly) as 0,1,16,17. At a lower level, we map the chipid
>>> got from device tree is naturally mapped (directly) to nid.
>>
>> Interesting thing to play with, I'll try to test it on my POWER7 box,
>> but it doesn't have the OPAL layer :(

Hi Denis,
Thanks for your interest. I have pushed the patches to

https://github.com/ktraghavendra/linux/tree/serialnuma_v1 if it makes
  patches easy to grab.

>
> Note that it's also interesting to try it under PowerVM, with odd NUMA
> topologies and report any issues found :)
>

Thanks Nish, I 'll also grab a powerVM and test.





^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC  2/5] powerpc:numa Rename functions referring to nid as chipid
  2015-09-28 17:27   ` Nishanth Aravamudan
@ 2015-09-29 18:31     ` Raghavendra K T
  0 siblings, 0 replies; 26+ messages in thread
From: Raghavendra K T @ 2015-09-29 18:31 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: benh, paulus, mpe, anton, linuxppc-dev, linux-kernel, cl, gkurz,
	grant.likely, nikunj, khandual

On 09/28/2015 10:57 PM, Nishanth Aravamudan wrote:
> On 27.09.2015 [23:59:10 +0530], Raghavendra K T wrote:
>> There is no change in the fuctionality
>>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>> ---
>>   arch/powerpc/mm/numa.c | 42 +++++++++++++++++++++---------------------
>>   1 file changed, 21 insertions(+), 21 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index d5e6eee..f84ed2f 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -235,47 +235,47 @@ static void initialize_distance_lookup_table(int nid,
>>   	}
>>   }
>>
>> -/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
>> +/* Returns chipid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
>>    * info is found.
>>    */
>> -static int associativity_to_nid(const __be32 *associativity)
>> +static int associativity_to_chipid(const __be32 *associativity)
>
> This is confusing to me. This function is also used by the DLPAR code
> under PowerVM to indicate what node the CPU is on -- not a chip (which I
> don't believe is exposed at all under PowerVM).
>

Good point.

should I retain the name nid?
or any suggestions? instead of chipid -> nid which fits both the cases.
or should I rename like nid->vnid  something?
[...]
>> @@ -1415,7 +1415,7 @@ int arch_update_cpu_topology(void)
>>
>>   		/* Use associativity from first thread for all siblings */
>>   		vphn_get_associativity(cpu, associativity);
>> -		new_nid = associativity_to_nid(associativity);
>> +		new_nid = associativity_to_chipid(associativity);
>
> If you are getting a chipid, shouldn't you be assigning it to a variable
> called 'new_chipid'?

yes perhaps.
my splitting idea was
1. change nid name in functions to chipid (without changing nid
variable calling that function)
2. rename variables to chipid and assign nid=chipid (1:1 mapping)
3. now let nid = mapped chipid

But I see that it isn't consistent in some places. do you think
merging step 1 and step 2 is okay?


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC  3/5] powerpc:numa create 1:1 mappaing between chipid and nid
  2015-09-28 17:28   ` Nishanth Aravamudan
@ 2015-09-29 18:35     ` Raghavendra K T
  0 siblings, 0 replies; 26+ messages in thread
From: Raghavendra K T @ 2015-09-29 18:35 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: benh, paulus, mpe, anton, linuxppc-dev, linux-kernel, cl, gkurz,
	grant.likely, nikunj, khandual

On 09/28/2015 10:58 PM, Nishanth Aravamudan wrote:
> On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote:
>> Once we have made the distinction between nid and chipid
>> create a 1:1 mapping between them. This makes compacting the
>> nids easy later.
>
>
> Didn't the previous patch just do the opposite of...
>

As per my thoughts it was:
1. rename functions to say loud that it is chipid (and not nid)
2. and then assign nid = chipid so that we are clear that
we made nid:chipid 1:1 mapping and compact nids later..

But again may be I should combine patch 2 and 3.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC  4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping
  2015-09-28 17:32   ` Nishanth Aravamudan
@ 2015-09-29 19:00     ` Raghavendra K T
  0 siblings, 0 replies; 26+ messages in thread
From: Raghavendra K T @ 2015-09-29 19:00 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: benh, paulus, mpe, anton, linuxppc-dev, linux-kernel, cl, gkurz,
	grant.likely, nikunj, khandual

On 09/28/2015 11:02 PM, Nishanth Aravamudan wrote:
> On 27.09.2015 [23:59:12 +0530], Raghavendra K T wrote:
>> Create arrays that maps serial nids and sparse chipids.
>>
>> Note: My original idea had only two arrays of chipid to nid map. Final
>> code is inspired by driver/acpi/numa.c that maps a proximity node with
>> a logical node by Takayoshi Kochi <t-kochi@bq.jp.nec.com>, and thus
>> uses an additional chipid_map nodemask. The mask helps in first unused
>> nid easily by knowing first unset bit in the mask.
>>
>> No change in functionality.
>>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>> ---
>>   arch/powerpc/mm/numa.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 47 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index dd2073b..f015cad 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -63,6 +63,11 @@ static int form1_affinity;
>>   static int distance_ref_points_depth;
>>   static const __be32 *distance_ref_points;
>>   static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
>> +static nodemask_t chipid_map = NODE_MASK_NONE;
>> +static int chipid_to_nid_map[MAX_NUMNODES]
>> +				= { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
>
> Hrm, conceptually there are *more* chips than nodes, right? So what
> guarantees we won't see > MAX_NUMNODES chips?

You are correct that nid <= chipids.
and #nids = #chipids when all possible slots are populated. Considering
we assume that maximum chip slots are no more than MAX_NUMNODES,


how about having

#define MAX_CHIPNODES MAX_NUMNODES
and
chipid_to_nid_map[MAX_CHIPNODES] = { [0 ... MAX_CHIPNODES - 1] = ..

>
>> +static int nid_to_chipid_map[MAX_NUMNODES]
>> +				= { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
>>
>>   /*
>>    * Allocate node_to_cpumask_map based on number of available nodes
>> @@ -133,6 +138,48 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn,
>>   	return 0;
>>   }
>>
>> +int chipid_to_nid(int chipid)
>> +{
>> +	if (chipid < 0)
>> +		return NUMA_NO_NODE;
>
> Do you really want to support these cases? Or should they be
> bugs/warnings indicating that you got an unexpected input? Or at least
> WARN_ON_ONCE?
>

Right. Querying for nid of an invalid chipid should be atleast
WARN_ON_ONCE(). But 'll check once if there is any valid scenario
before the change.

>> +	return chipid_to_nid_map[chipid];
>> +}
>> +
>> +int nid_to_chipid(int nid)
>> +{
>> +	if (nid < 0)
>> +		return NUMA_NO_NODE;
>> +	return nid_to_chipid_map[nid];
>> +}
>> +
>> +static void __map_chipid_to_nid(int chipid, int nid)
>> +{
>> +	if (chipid_to_nid_map[chipid] == NUMA_NO_NODE
>> +	     || nid < chipid_to_nid_map[chipid])
>> +		chipid_to_nid_map[chipid] = nid;
>> +	if (nid_to_chipid_map[nid] == NUMA_NO_NODE
>> +	    || chipid < nid_to_chipid_map[nid])
>> +		nid_to_chipid_map[nid] = chipid;
>> +}
>
> chip <-> node mapping is a static (physical) concept, right? Should we
> emit some debugging if for some reason we get a runtime call to remap
> an already mapped chip to a new node?
>

Good point. Already mapped chipid to a different nid is unexpected
whereas mapping chipid to same nid is expected.(because mapping comes
from cpus belonging to same node).

WARN_ON() should suffice here?


>> +
>> +int map_chipid_to_nid(int chipid)
>> +{
>> +	int nid;
>> +
>> +	if (chipid < 0 || chipid >= MAX_NUMNODES)
>> +		return NUMA_NO_NODE;
>> +
>> +	nid = chipid_to_nid_map[chipid];
>> +	if (nid == NUMA_NO_NODE) {
>> +		if (nodes_weight(chipid_map) >= MAX_NUMNODES)
>> +			return NUMA_NO_NODE;
>
> If you create a KVM guest with a bogus topology, doesn't this just start
> losing NUMA information for very high-noded guests?
>

'll try to see if it is possible to hit this case, ideally we should
not allow more than MAX_NUMNODES for chipids and we should abort early.

>> +		nid = first_unset_node(chipid_map);
>> +		__map_chipid_to_nid(chipid, nid);
>> +		node_set(nid, chipid_map);
>> +	}
>> +	return nid;
>> +}
>> +
>>   int numa_cpu_lookup(int cpu)
>>   {
>>   	return numa_cpu_lookup_table[cpu];
>> @@ -264,7 +311,6 @@ out:
>>   	return chipid;
>>   }
>>
>> -
>
> stray change?
>

yep, will correct that.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC  0/5] powerpc:numa Add serial nid support
  2015-09-28 17:34 ` Nishanth Aravamudan
@ 2015-09-29 19:10   ` Raghavendra K T
  0 siblings, 0 replies; 26+ messages in thread
From: Raghavendra K T @ 2015-09-29 19:10 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: benh, paulus, mpe, anton, linuxppc-dev, linux-kernel, cl, gkurz,
	grant.likely, nikunj, khandual

On 09/28/2015 11:04 PM, Nishanth Aravamudan wrote:
> On 27.09.2015 [23:59:08 +0530], Raghavendra K T wrote:
[...]
>>
>> 2) Map the sparse chipid got from device tree to a serial nid at kernel
>> level (The idea proposed in this series).
>> Pro: It is more natural to handle at kernel level than at lower (OPAL) layer.
>> con: The chipid is in device tree no longer the same as nid in kernel
>
> Is there any debugging/logging? Looks like not -- so how does a sysadmin
> map from firmware-provided values to the Linux values? That's going to
> make debugging of large systems (PowerVM or otherwise) less than
> pleasant, it seems? Possibly you could put something in sysfs?

I see 2 things could be done here:

1) while doing dump_numa_cpu_topology() we can dump nid_to_chipid()
as additional information.

2) sysfs->
Does /sys/devices/system/node/nodeX/*chipid* looks good. May be we
should add only for powerpc or otherwise we need to have chipid = nid
populated for other archs. [ I think this change may be done slowly ]






^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC  3/5] powerpc:numa create 1:1 mappaing between chipid and nid
  2015-09-28 17:35   ` Nishanth Aravamudan
@ 2015-09-29 19:20     ` Raghavendra K T
  0 siblings, 0 replies; 26+ messages in thread
From: Raghavendra K T @ 2015-09-29 19:20 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: benh, paulus, mpe, anton, linuxppc-dev, linux-kernel, cl, gkurz,
	grant.likely, nikunj, khandual

On 09/28/2015 11:05 PM, Nishanth Aravamudan wrote:
> On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote:
>> Once we have made the distinction between nid and chipid
>> create a 1:1 mapping between them. This makes compacting the
>> nids easy later.
>>
>> No functionality change.
>>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>> ---
>>   arch/powerpc/mm/numa.c | 36 +++++++++++++++++++++++++++++-------
>>   1 file changed, 29 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index f84ed2f..dd2073b 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -264,6 +264,17 @@ out:
>>   	return chipid;
>>   }
>>
>> +
>> + /* Return the nid from associativity */
>> +static int associativity_to_nid(const __be32 *associativity)
>> +{
>> +	int nid;
>> +
>> +	nid = associativity_to_chipid(associativity);
>> +	return nid;
>> +}
>
> This is ultimately confusing. You are assigning the semantic return
> value of a chipid to a nid -- is it a nid or a chipid? Shouldn't the
> variable naming be consistent?
>

:( yes. will come up with some consistent naming.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
  2015-09-29 18:20     ` Raghavendra K T
@ 2015-09-29 19:46       ` Denis Kirjanov
  2015-09-30  6:16         ` Raghavendra K T
  0 siblings, 1 reply; 26+ messages in thread
From: Denis Kirjanov @ 2015-09-29 19:46 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Nishanth Aravamudan, benh, paulus, mpe, nikunj, linux-kernel,
	anton, grant.likely, cl, khandual, linuxppc-dev, gkurz

On 9/29/15, Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> On 09/28/2015 10:34 PM, Nishanth Aravamudan wrote:
>> On 28.09.2015 [13:44:42 +0300], Denis Kirjanov wrote:
>>> On 9/27/15, Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>>> Problem description:
>>>> Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
>>>> numbered (possibly) as 0,1,16,17. At a lower level, we map the chipid
>>>> got from device tree is naturally mapped (directly) to nid.
>>>
>>> Interesting thing to play with, I'll try to test it on my POWER7 box,
>>> but it doesn't have the OPAL layer :(
>
> Hi Denis,
> Thanks for your interest. I have pushed the patches to
>
> https://github.com/ktraghavendra/linux/tree/serialnuma_v1 if it makes
>   patches easy to grab.

Thanks!
One sad thing is that I can't test the actual node id mapping now
since currently I have an access to machine with only one memory node
:/ Can we fake it through qemu?

>
>>
>> Note that it's also interesting to try it under PowerVM, with odd NUMA
>> topologies and report any issues found :)
>>
>
> Thanks Nish, I 'll also grab a powerVM and test.
>
>
>
>
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
  2015-09-29 19:46       ` Denis Kirjanov
@ 2015-09-30  6:16         ` Raghavendra K T
  0 siblings, 0 replies; 26+ messages in thread
From: Raghavendra K T @ 2015-09-30  6:16 UTC (permalink / raw)
  To: Denis Kirjanov
  Cc: Nishanth Aravamudan, benh, paulus, mpe, nikunj, linux-kernel,
	anton, grant.likely, cl, khandual, linuxppc-dev, gkurz

On 09/30/2015 01:16 AM, Denis Kirjanov wrote:
> On 9/29/15, Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>> On 09/28/2015 10:34 PM, Nishanth Aravamudan wrote:
>>> On 28.09.2015 [13:44:42 +0300], Denis Kirjanov wrote:
>>>> On 9/27/15, Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>>>> Problem description:
>>>>> Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
>>>>> numbered (possibly) as 0,1,16,17. At a lower level, we map the chipid
>>>>> got from device tree is naturally mapped (directly) to nid.
>>>>
>>>> Interesting thing to play with, I'll try to test it on my POWER7 box,
>>>> but it doesn't have the OPAL layer :(
>>
>> Hi Denis,
>> Thanks for your interest. I have pushed the patches to
>>
>> https://github.com/ktraghavendra/linux/tree/serialnuma_v1 if it makes
>>    patches easy to grab.
>
> Thanks!
> One sad thing is that I can't test the actual node id mapping now
> since currently I have an access to machine with only one memory node
> :/ Can we fake it through qemu?
>

faking the sparse numa ids is possible with Nish's patch for qemu:

https://lists.gnu.org/archive/html/qemu-devel/2014-06/msg05826.html


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC, 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table
  2015-09-27 18:29 ` [PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table Raghavendra K T
  2015-09-27 18:41   ` Raghavendra K T
@ 2015-10-06 10:17   ` Michael Ellerman
  2015-10-06 10:33     ` Raghavendra K T
  1 sibling, 1 reply; 26+ messages in thread
From: Michael Ellerman @ 2015-10-06 10:17 UTC (permalink / raw)
  To: Raghavendra K T, benh, paulus
  Cc: nikunj, nacc, raghavendra.kt, linux-kernel, Raghavendra K T,
	anton, grant.likely, cl, khandual, linuxppc-dev, gkurz

On Sun, 2015-27-09 at 18:29:09 UTC, Raghavendra K T wrote:
> We access numa_cpu_lookup_table array directly in all the places
> to read/update numa cpu lookup information. Instead use a helper
> function to update.
> 
> This is helpful in changing the way numa<-->cpu mapping in single
> place when needed.
> 
> This is a cosmetic change, no change in functionality.
> 
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.inet.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/mmzone.h |  2 +-
>  arch/powerpc/kernel/smp.c         | 10 +++++-----
>  arch/powerpc/mm/numa.c            | 28 +++++++++++++++++-----------
>  3 files changed, 23 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
> index 7b58917..c24a5f4 100644
> --- a/arch/powerpc/include/asm/mmzone.h
> +++ b/arch/powerpc/include/asm/mmzone.h
> @@ -29,7 +29,7 @@ extern struct pglist_data *node_data[];
>   * Following are specific to this numa platform.
>   */
>  
> -extern int numa_cpu_lookup_table[];
> +extern int numa_cpu_lookup(int cpu);

Can you rename it better :)

Something like cpu_to_nid().

Although maybe nid is wrong given the rest of the series.

> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 8b9502a..d5e6eee 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS];
>  cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
>  struct pglist_data *node_data[MAX_NUMNODES];
>  
> -EXPORT_SYMBOL(numa_cpu_lookup_table);
>  EXPORT_SYMBOL(node_to_cpumask_map);
>  EXPORT_SYMBOL(node_data);
>  
> @@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn,
>  	return 0;
>  }
>  
> -static void reset_numa_cpu_lookup_table(void)
> +int numa_cpu_lookup(int cpu)
>  {
> -	unsigned int cpu;
> -
> -	for_each_possible_cpu(cpu)
> -		numa_cpu_lookup_table[cpu] = -1;
> +	return numa_cpu_lookup_table[cpu];
>  }
> +EXPORT_SYMBOL(numa_cpu_lookup);

I don't see you changing any modular code that uses this, or any macros that
might be used by modules, so I don't see why this needs to be exported?

I think you just added it because num_cpu_lookup_table was exported?

cheers

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC  0/5] powerpc:numa Add serial nid support
  2015-09-27 18:29 [PATCH RFC 0/5] powerpc:numa Add serial nid support Raghavendra K T
                   ` (6 preceding siblings ...)
  2015-09-28 17:34 ` Nishanth Aravamudan
@ 2015-10-06 10:25 ` Michael Ellerman
  2015-10-06 11:15   ` Raghavendra K T
  7 siblings, 1 reply; 26+ messages in thread
From: Michael Ellerman @ 2015-10-06 10:25 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: benh, paulus, anton, linuxppc-dev, linux-kernel, cl, nacc, gkurz,
	grant.likely, nikunj, khandual

On Sun, 2015-09-27 at 23:59 +0530, Raghavendra K T wrote:
> Problem description:
> Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
> numbered (possibly) as 0,1,16,17. At a lower level, we map the chipid
> got from device tree is naturally mapped (directly) to nid.
> 
> Potential side effect of that is:
> 
> 1) There are several places in kernel that assumes serial node numbering.
> and memory allocations assume that all the nodes from 0-(highest nid)
> exist inturn ending up allocating memory for the nodes that does not exist.

Is it several? Or lots?

If it's several, ie. more than two but not lots, then we should probably just
fix those places. Or is that /really/ hard for some reason?

Do we ever get whole nodes hotplugged in under PowerVM? I don't think so, but I
don't remember for sure.

> 2) For virtualization use cases (such as qemu, libvirt, openstack), mapping
> sparse nid of the host system to contiguous nids of guest (numa affinity,
> placement) could be a challenge.

Can you elaborate? That's a bit vague.

> Possible Solutions:
> 1) Handling the memory allocations is kernel case by case: Though in some
> cases it is easy to achieve, some cases may be intrusive/not trivial. 
> at the end it does not handle side effect (2) above.
> 
> 2) Map the sparse chipid got from device tree to a serial nid at kernel
> level (The idea proposed in this series).
> Pro: It is more natural to handle at kernel level than at lower (OPAL) layer.
> con: The chipid is in device tree no longer the same as nid in kernel
> 
> 3) Let the lower layer (OPAL) give the serial node ids after parsing the
> chipid and the associativity etc [ either as a separate item in device tree
> or by compacting the chipid numbers ]
> Pros: kernel, device tree are on same page and less change in kernel
> Con: is it the functionality expected in lower layer

...

> 3) Numactl tests from
> ftp://oss.sgi.com/www/projects/libnuma/download/numactl-2.0.10.tar.gz
> 
> (infact there were more breakage before the patch because of sparse nid
> and memoryless node cases of powerpc)

This is probably the best argument for your series. ie. userspace is dumb and
fixing every broken app that assumes linear node numbering is not feasible.


So on the whole I think the concept is good. This series though is a bit
confusing because of all the renaming etc. etc. Nish made lots of good comments
so I'll wait for a v2 based on those.

cheers




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC, 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table
  2015-10-06 10:17   ` [RFC, " Michael Ellerman
@ 2015-10-06 10:33     ` Raghavendra K T
  0 siblings, 0 replies; 26+ messages in thread
From: Raghavendra K T @ 2015-10-06 10:33 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: benh, paulus, nikunj, nacc, linux-kernel, anton, grant.likely,
	cl, khandual, linuxppc-dev, gkurz

On 10/06/2015 03:47 PM, Michael Ellerman wrote:
> On Sun, 2015-27-09 at 18:29:09 UTC, Raghavendra K T wrote:
>> We access numa_cpu_lookup_table array directly in all the places
>> to read/update numa cpu lookup information. Instead use a helper
>> function to update.
>>
>> This is helpful in changing the way numa<-->cpu mapping in single
>> place when needed.
>>
>> This is a cosmetic change, no change in functionality.
>>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.inet.ibm.com>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>> ---
>>   arch/powerpc/include/asm/mmzone.h |  2 +-
>>   arch/powerpc/kernel/smp.c         | 10 +++++-----
>>   arch/powerpc/mm/numa.c            | 28 +++++++++++++++++-----------
>>   3 files changed, 23 insertions(+), 17 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
>> index 7b58917..c24a5f4 100644
>> --- a/arch/powerpc/include/asm/mmzone.h
>> +++ b/arch/powerpc/include/asm/mmzone.h
>> @@ -29,7 +29,7 @@ extern struct pglist_data *node_data[];
>>    * Following are specific to this numa platform.
>>    */
>>
>> -extern int numa_cpu_lookup_table[];
>> +extern int numa_cpu_lookup(int cpu);
>
> Can you rename it better :)
>
> Something like cpu_to_nid().

Good name. sure.

>
> Although maybe nid is wrong given the rest of the series.

May be not. The current plan is to rename (after discussing with Nish)

chipid to pnid (physical nid)
and nid to vnid (virtual nid)

within powerpc numa.c
[reasoning chipid is applicable only to OPAL, since we want to handle
powerkvm, powervm and baremetal we need a generic name ]

But 'nid' naming will be retained which is applicable for generic
kernel interactions.

>
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 8b9502a..d5e6eee 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS];
>>   cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
>>   struct pglist_data *node_data[MAX_NUMNODES];
>>
>> -EXPORT_SYMBOL(numa_cpu_lookup_table);
>>   EXPORT_SYMBOL(node_to_cpumask_map);
>>   EXPORT_SYMBOL(node_data);
>>
>> @@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn,
>>   	return 0;
>>   }
>>
>> -static void reset_numa_cpu_lookup_table(void)
>> +int numa_cpu_lookup(int cpu)
>>   {
>> -	unsigned int cpu;
>> -
>> -	for_each_possible_cpu(cpu)
>> -		numa_cpu_lookup_table[cpu] = -1;
>> +	return numa_cpu_lookup_table[cpu];
>>   }
>> +EXPORT_SYMBOL(numa_cpu_lookup);
>
> I don't see you changing any modular code that uses this, or any macros that
> might be used by modules, so I don't see why this needs to be exported?
>
> I think you just added it because num_cpu_lookup_table was exported?
>

arch/powerpc/kernel/smp.c uses it.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC  0/5] powerpc:numa Add serial nid support
  2015-10-06 10:25 ` Michael Ellerman
@ 2015-10-06 11:15   ` Raghavendra K T
  0 siblings, 0 replies; 26+ messages in thread
From: Raghavendra K T @ 2015-10-06 11:15 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: benh, paulus, anton, linuxppc-dev, linux-kernel, cl, nacc, gkurz,
	grant.likely, nikunj, khandual

On 10/06/2015 03:55 PM, Michael Ellerman wrote:
> On Sun, 2015-09-27 at 23:59 +0530, Raghavendra K T wrote:
>> Problem description:
>> Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
>> numbered (possibly) as 0,1,16,17. At a lower level, we map the chipid
>> got from device tree is naturally mapped (directly) to nid.
>>
>> Potential side effect of that is:
>>
>> 1) There are several places in kernel that assumes serial node numbering.
>> and memory allocations assume that all the nodes from 0-(highest nid)
>> exist inturn ending up allocating memory for the nodes that does not exist.
>
> Is it several? Or lots?
>
> If it's several, ie. more than two but not lots, then we should probably just
> fix those places. Or is that /really/ hard for some reason?
>

It is several and I did attempt to fix them. But the rest of the places
(like memcg, work queue, scheduler and so on) are tricky to fix because
the memory allocations are glued with other things.
and similar fix may be expected in future too..


> Do we ever get whole nodes hotplugged in under PowerVM? I don't think so, but I
> don't remember for sure.
>

Even on powervm we do have discontiguous  numa nodes. [Adding more to 
it, we could even end up creating a dummy node 0 just to make kernel
happy]
for e.g.,
available: 2 nodes (0,7)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 7 cpus: 0 1 2 3 4 5 6 7
node 7 size: 10240 MB
node 7 free: 8174 MB
node distances:
node   0   7
   0:  10  40
   7:  40  10

note that node zero neither has any cpu nor memory.

>> 2) For virtualization use cases (such as qemu, libvirt, openstack), mapping
>> sparse nid of the host system to contiguous nids of guest (numa affinity,
>> placement) could be a challenge.
>
> Can you elaborate? That's a bit vague.

one e.g., i can think of: (though libvirt/openstack people will know 
more about it) suppose one wishes to have half of the vcpus bind to one
physical node and rest of the vcpus to second numa node, we cant say
whether second node is 1,8, or 16. and same libvirtxml on a two node
system may not be valid for another two numa node system.
[ i believe it may cause some migration problem too ].

>
>> Possible Solutions:
>> 1) Handling the memory allocations is kernel case by case: Though in some
>> cases it is easy to achieve, some cases may be intrusive/not trivial.
>> at the end it does not handle side effect (2) above.
>>
>> 2) Map the sparse chipid got from device tree to a serial nid at kernel
>> level (The idea proposed in this series).
>> Pro: It is more natural to handle at kernel level than at lower (OPAL) layer.
>> con: The chipid is in device tree no longer the same as nid in kernel
>>
>> 3) Let the lower layer (OPAL) give the serial node ids after parsing the
>> chipid and the associativity etc [ either as a separate item in device tree
>> or by compacting the chipid numbers ]
>> Pros: kernel, device tree are on same page and less change in kernel
>> Con: is it the functionality expected in lower layer
>
> ...
>
>> 3) Numactl tests from
>> ftp://oss.sgi.com/www/projects/libnuma/download/numactl-2.0.10.tar.gz
>>
>> (infact there were more breakage before the patch because of sparse nid
>> and memoryless node cases of powerpc)
>
> This is probably the best argument for your series. ie. userspace is dumb and
> fixing every broken app that assumes linear node numbering is not feasible.
>
>
> So on the whole I think the concept is good. This series though is a bit
> confusing because of all the renaming etc. etc. Nish made lots of good comments
> so I'll wait for a v2 based on those.
>

Yes, will be sending V2 soon extending my patch to fix powervm case too.


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2015-10-06 11:14 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-27 18:29 [PATCH RFC 0/5] powerpc:numa Add serial nid support Raghavendra K T
2015-09-27 18:29 ` [PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table Raghavendra K T
2015-09-27 18:41   ` Raghavendra K T
2015-10-06 10:17   ` [RFC, " Michael Ellerman
2015-10-06 10:33     ` Raghavendra K T
2015-09-27 18:29 ` [PATCH RFC 2/5] powerpc:numa Rename functions referring to nid as chipid Raghavendra K T
2015-09-28 17:27   ` Nishanth Aravamudan
2015-09-29 18:31     ` Raghavendra K T
2015-09-27 18:29 ` [PATCH RFC 3/5] powerpc:numa create 1:1 mappaing between chipid and nid Raghavendra K T
2015-09-28 17:28   ` Nishanth Aravamudan
2015-09-29 18:35     ` Raghavendra K T
2015-09-28 17:35   ` Nishanth Aravamudan
2015-09-29 19:20     ` Raghavendra K T
2015-09-27 18:29 ` [PATCH RFC 4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping Raghavendra K T
2015-09-28 17:32   ` Nishanth Aravamudan
2015-09-29 19:00     ` Raghavendra K T
2015-09-27 18:29 ` [PATCH RFC 5/5] powerpc:numa Use chipid to nid mapping to get serial numa node ids Raghavendra K T
2015-09-28 10:44 ` [PATCH RFC 0/5] powerpc:numa Add serial nid support Denis Kirjanov
2015-09-28 17:04   ` Nishanth Aravamudan
2015-09-29 18:20     ` Raghavendra K T
2015-09-29 19:46       ` Denis Kirjanov
2015-09-30  6:16         ` Raghavendra K T
2015-09-28 17:34 ` Nishanth Aravamudan
2015-09-29 19:10   ` Raghavendra K T
2015-10-06 10:25 ` Michael Ellerman
2015-10-06 11:15   ` Raghavendra K T

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).