* [PATCH 0/6] make memblock allocator utilize the node's fallback info
@ 2019-02-24 12:34 Pingfan Liu
  2019-02-24 12:34 ` [PATCH 1/6] mm/numa: extract the code of building node fall back list Pingfan Liu
                   ` (6 more replies)
  0 siblings, 7 replies; 29+ messages in thread
From: Pingfan Liu @ 2019-02-24 12:34 UTC (permalink / raw)
  To: x86, linux-mm
  Cc: Pingfan Liu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Vlastimil Babka, Mike Rapoport,
	Andrew Morton, Mel Gorman, Joonsoo Kim, Andy Lutomirski,
	Andi Kleen, Petr Tesarik, Michal Hocko, Stephen Rothwell,
	Jonathan Corbet, Nicholas Piggin, Daniel Vacek, linux-kernel

There are NUMA machines with memory-less nodes. At present the page allocator builds the
full fallback info via build_zonelists(), but the memblock allocator does not utilize this
info: for a memory-less node it simply falls back to node 0 instead of using the nearest
node. Unfortunately, the percpu section, which is accessed frequently after bootup, is
allocated by memblock.

This series aims to improve the performance of the percpu section on memory-less nodes by
feeding the node fallback info to the memblock allocator on x86, like we already do for the
page allocator. On other archs, an independent effort is required to set up the node to
cpumask map early enough.
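
For orientation, the boot-time ordering this series relies on, as a simplified sketch
(see the individual patches for the real code):

  init_cpu_to_node();            /* patches 4-5: set up cpu->node and node->cpumask maps */
  memblock_build_node_order();   /* patch 6: build per-node fallback lists for memblock */
  /*
   * From here on, a memblock allocation requested for a memory-less node
   * (e.g. the percpu area of a CPU sitting on such a node) walks
   * node_fallback[nid] (patch 2) and lands on the nearest node with
   * memory, instead of falling back to node 0.
   */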


CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Vlastimil Babka <vbabka@suse.cz>
CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Mel Gorman <mgorman@suse.de>
CC: Joonsoo Kim <iamjoonsoo.kim@lge.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Andi Kleen <ak@linux.intel.com>
CC: Petr Tesarik <ptesarik@suse.cz>
CC: Michal Hocko <mhocko@suse.com>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Nicholas Piggin <npiggin@gmail.com>
CC: Daniel Vacek <neelx@redhat.com>
CC: linux-kernel@vger.kernel.org

Pingfan Liu (6):
  mm/numa: extract the code of building node fall back list
  mm/memblock: make full utilization of numa info
  x86/numa: define numa_init_array() conditional on CONFIG_NUMA
  x86/numa: concentrate the code of setting cpu to node map
  x86/numa: push forward the setup of node to cpumask map
  x86/numa: build node fallback info after setting up node to cpumask
    map

 arch/x86/include/asm/topology.h |  4 ---
 arch/x86/kernel/setup.c         |  2 ++
 arch/x86/kernel/setup_percpu.c  |  3 --
 arch/x86/mm/numa.c              | 40 +++++++++++-------------
 include/linux/memblock.h        |  3 ++
 mm/memblock.c                   | 68 ++++++++++++++++++++++++++++++++++++++---
 mm/page_alloc.c                 | 48 +++++++++++++++++------------
 7 files changed, 114 insertions(+), 54 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/6] mm/numa: extract the code of building node fall back list
  2019-02-24 12:34 [PATCH 0/6] make memblock allocator utilize the node's fallback info Pingfan Liu
@ 2019-02-24 12:34 ` Pingfan Liu
  2019-02-24 12:34 ` [PATCH 2/6] mm/memblock: make full utilization of numa info Pingfan Liu
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 29+ messages in thread
From: Pingfan Liu @ 2019-02-24 12:34 UTC (permalink / raw)
  To: x86, linux-mm
  Cc: Pingfan Liu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Vlastimil Babka, Mike Rapoport,
	Andrew Morton, Mel Gorman, Joonsoo Kim, Andy Lutomirski,
	Andi Kleen, Petr Tesarik, Michal Hocko, Stephen Rothwell,
	Jonathan Corbet, Nicholas Piggin, Daniel Vacek, linux-kernel

A coming patch will make the memblock allocator also utilize the node fallback list info.
Hence extract the related code so that it can be reused.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Vlastimil Babka <vbabka@suse.cz>
CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Mel Gorman <mgorman@suse.de>
CC: Joonsoo Kim <iamjoonsoo.kim@lge.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Andi Kleen <ak@linux.intel.com>
CC: Petr Tesarik <ptesarik@suse.cz>
CC: Michal Hocko <mhocko@suse.com>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Nicholas Piggin <npiggin@gmail.com>
CC: Daniel Vacek <neelx@redhat.com>
CC: linux-kernel@vger.kernel.org
---
 mm/page_alloc.c | 48 +++++++++++++++++++++++++++++-------------------
 1 file changed, 29 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 35fdde0..a6967a1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5380,6 +5380,32 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
 	zonerefs->zone_idx = 0;
 }
 
+int build_node_order(int *node_oder_array, int sz,
+	int local_node, nodemask_t *used_mask)
+{
+	int node, nr_nodes = 0;
+	int prev_node = local_node;
+	int load = nr_online_nodes;
+
+
+	while ((node = find_next_best_node(local_node, used_mask)) >= 0
+		&& nr_nodes < sz) {
+		/*
+		 * We don't want to pressure a particular node.
+		 * So adding penalty to the first node in same
+		 * distance group to make it round-robin.
+		 */
+		if (node_distance(local_node, node) !=
+		    node_distance(local_node, prev_node))
+			node_load[node] = load;
+
+		node_oder_array[nr_nodes++] = node;
+		prev_node = node;
+		load--;
+	}
+	return nr_nodes;
+}
+
 /*
  * Build zonelists ordered by zone and nodes within zones.
  * This results in conserving DMA zone[s] until all Normal memory is
@@ -5390,32 +5416,16 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
 static void build_zonelists(pg_data_t *pgdat)
 {
 	static int node_order[MAX_NUMNODES];
-	int node, load, nr_nodes = 0;
+	int local_node, nr_nodes = 0;
 	nodemask_t used_mask;
-	int local_node, prev_node;
 
 	/* NUMA-aware ordering of nodes */
 	local_node = pgdat->node_id;
-	load = nr_online_nodes;
-	prev_node = local_node;
 	nodes_clear(used_mask);
 
 	memset(node_order, 0, sizeof(node_order));
-	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
-		/*
-		 * We don't want to pressure a particular node.
-		 * So adding penalty to the first node in same
-		 * distance group to make it round-robin.
-		 */
-		if (node_distance(local_node, node) !=
-		    node_distance(local_node, prev_node))
-			node_load[node] = load;
-
-		node_order[nr_nodes++] = node;
-		prev_node = node;
-		load--;
-	}
-
+	nr_nodes = build_node_order(node_order, MAX_NUMNODES,
+		local_node, &used_mask);
 	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
 	build_thisnode_zonelists(pgdat);
 }
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/6] mm/memblock: make full utilization of numa info
  2019-02-24 12:34 [PATCH 0/6] make memblock allocator utilize the node's fallback info Pingfan Liu
  2019-02-24 12:34 ` [PATCH 1/6] mm/numa: extract the code of building node fall back list Pingfan Liu
@ 2019-02-24 12:34 ` Pingfan Liu
  2019-02-25  7:07   ` kbuild test robot
                     ` (3 more replies)
  2019-02-24 12:34 ` [PATCH 3/6] x86/numa: define numa_init_array() conditional on CONFIG_NUMA Pingfan Liu
                   ` (4 subsequent siblings)
  6 siblings, 4 replies; 29+ messages in thread
From: Pingfan Liu @ 2019-02-24 12:34 UTC (permalink / raw)
  To: x86, linux-mm
  Cc: Pingfan Liu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Vlastimil Babka, Mike Rapoport,
	Andrew Morton, Mel Gorman, Joonsoo Kim, Andy Lutomirski,
	Andi Kleen, Petr Tesarik, Michal Hocko, Stephen Rothwell,
	Jonathan Corbet, Nicholas Piggin, Daniel Vacek, linux-kernel

There are NUMA machines with memory-less nodes. When allocating memory for a
memory-less node, the memblock allocator falls back to node 0 without fully
utilizing the nearest node. This hurts performance, especially for the percpu
section. Fix this defect by building the full node fallback info for the
memblock allocator, like what we have done for the page allocator.
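
As an illustration (hypothetical topology, made up for this example): on a 4-node
machine where node 2 has CPUs but no memory and its nearest memory nodes are, say,
1, then 3, then 0, the table built by this patch would end up as

  node_fallback[2] = { 1, 3, 0, NUMA_NO_NODE };

so an allocation requested for nid 2 first tries node 2 itself (which fails, as
there is no memory there) and then nodes 1, 3 and 0 in that order.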

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Vlastimil Babka <vbabka@suse.cz>
CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Mel Gorman <mgorman@suse.de>
CC: Joonsoo Kim <iamjoonsoo.kim@lge.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Andi Kleen <ak@linux.intel.com>
CC: Petr Tesarik <ptesarik@suse.cz>
CC: Michal Hocko <mhocko@suse.com>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Nicholas Piggin <npiggin@gmail.com>
CC: Daniel Vacek <neelx@redhat.com>
CC: linux-kernel@vger.kernel.org
---
 include/linux/memblock.h |  3 +++
 mm/memblock.c            | 68 ++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 64c41cf..ee999c5 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -342,6 +342,9 @@ void *memblock_alloc_try_nid_nopanic(phys_addr_t size, phys_addr_t align,
 void *memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align,
 			     phys_addr_t min_addr, phys_addr_t max_addr,
 			     int nid);
+extern int build_node_order(int *node_oder_array, int sz,
+	int local_node, nodemask_t *used_mask);
+void memblock_build_node_order(void);
 
 static inline void * __init memblock_alloc(phys_addr_t size,  phys_addr_t align)
 {
diff --git a/mm/memblock.c b/mm/memblock.c
index 022d4cb..cf78850 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1338,6 +1338,47 @@ phys_addr_t __init memblock_phys_alloc_try_nid(phys_addr_t size, phys_addr_t ali
 	return memblock_alloc_base(size, align, MEMBLOCK_ALLOC_ACCESSIBLE);
 }
 
+static int **node_fallback __initdata;
+
+/*
+ * build_node_order() relies on cpumask_of_node(), hence arch should set up
+ * cpumask before calling this func.
+ */
+void __init memblock_build_node_order(void)
+{
+	int nid, i;
+	nodemask_t used_mask;
+
+	node_fallback = memblock_alloc(MAX_NUMNODES * sizeof(int *),
+		sizeof(int *));
+	for_each_online_node(nid) {
+		node_fallback[nid] = memblock_alloc(
+			num_online_nodes() * sizeof(int), sizeof(int));
+		for (i = 0; i < num_online_nodes(); i++)
+			node_fallback[nid][i] = NUMA_NO_NODE;
+	}
+
+	for_each_online_node(nid) {
+		nodes_clear(used_mask);
+		node_set(nid, used_mask);
+		build_node_order(node_fallback[nid], num_online_nodes(),
+			nid, &used_mask);
+	}
+}
+
+static void __init memblock_free_node_order(void)
+{
+	int nid;
+
+	if (!node_fallback)
+		return;
+	for_each_online_node(nid)
+		memblock_free(__pa(node_fallback[nid]),
+			num_online_nodes() * sizeof(int));
+	memblock_free(__pa(node_fallback), MAX_NUMNODES * sizeof(int *));
+	node_fallback = NULL;
+}
+
 /**
  * memblock_alloc_internal - allocate boot memory block
  * @size: size of memory block to be allocated in bytes
@@ -1370,6 +1411,7 @@ static void * __init memblock_alloc_internal(
 {
 	phys_addr_t alloc;
 	void *ptr;
+	int node;
 	enum memblock_flags flags = choose_memblock_flags();
 
 	if (WARN_ONCE(nid == MAX_NUMNODES, "Usage of MAX_NUMNODES is deprecated. Use NUMA_NO_NODE instead\n"))
@@ -1397,11 +1439,26 @@ static void * __init memblock_alloc_internal(
 		goto done;
 
 	if (nid != NUMA_NO_NODE) {
-		alloc = memblock_find_in_range_node(size, align, min_addr,
-						    max_addr, NUMA_NO_NODE,
-						    flags);
-		if (alloc && !memblock_reserve(alloc, size))
-			goto done;
+		if (!node_fallback) {
+			alloc = memblock_find_in_range_node(size, align,
+					min_addr, max_addr,
+					NUMA_NO_NODE, flags);
+			if (alloc && !memblock_reserve(alloc, size))
+				goto done;
+		} else {
+			int i;
+			for (i = 0; i < num_online_nodes(); i++) {
+				node = node_fallback[nid][i];
+				/* fallback list has all memory nodes */
+				if (node == NUMA_NO_NODE)
+					break;
+				alloc = memblock_find_in_range_node(size,
+						align, min_addr, max_addr,
+						node, flags);
+				if (alloc && !memblock_reserve(alloc, size))
+					goto done;
+			}
+		}
 	}
 
 	if (min_addr) {
@@ -1969,6 +2026,7 @@ unsigned long __init memblock_free_all(void)
 
 	reset_all_zones_managed_pages();
 
+	memblock_free_node_order();
 	pages = free_low_memory_core_early();
 	totalram_pages_add(pages);
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 3/6] x86/numa: define numa_init_array() conditional on CONFIG_NUMA
  2019-02-24 12:34 [PATCH 0/6] make memblock allocator utilize the node's fallback info Pingfan Liu
  2019-02-24 12:34 ` [PATCH 1/6] mm/numa: extract the code of building node fall back list Pingfan Liu
  2019-02-24 12:34 ` [PATCH 2/6] mm/memblock: make full utilization of numa info Pingfan Liu
@ 2019-02-24 12:34 ` Pingfan Liu
  2019-02-25 15:23   ` Dave Hansen
  2019-02-24 12:34 ` [PATCH 4/6] x86/numa: concentrate the code of setting cpu to node map Pingfan Liu
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Pingfan Liu @ 2019-02-24 12:34 UTC (permalink / raw)
  To: x86, linux-mm
  Cc: Pingfan Liu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Vlastimil Babka, Mike Rapoport,
	Andrew Morton, Mel Gorman, Joonsoo Kim, Andy Lutomirski,
	Andi Kleen, Petr Tesarik, Michal Hocko, Stephen Rothwell,
	Jonathan Corbet, Nicholas Piggin, Daniel Vacek, linux-kernel

For non-NUMA, numa_init_array() turns out to be a no-op. Provide separate
definitions for the non-NUMA and NUMA cases, so that later they can be folded
into their counterpart, init_cpu_to_node().

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Vlastimil Babka <vbabka@suse.cz>
CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Mel Gorman <mgorman@suse.de>
CC: Joonsoo Kim <iamjoonsoo.kim@lge.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Andi Kleen <ak@linux.intel.com>
CC: Petr Tesarik <ptesarik@suse.cz>
CC: Michal Hocko <mhocko@suse.com>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Nicholas Piggin <npiggin@gmail.com>
CC: Daniel Vacek <neelx@redhat.com>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/mm/numa.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f54..bfe6732 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -599,6 +599,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 	return 0;
 }
 
+#ifdef CONFIG_NUMA
 /*
  * There are unfortunately some poorly designed mainboards around that
  * only connect memory to a single CPU. This breaks the 1:1 cpu->node
@@ -618,6 +619,9 @@ static void __init numa_init_array(void)
 		rr = next_node_in(rr, node_online_map);
 	}
 }
+#else
+static void __init numa_init_array(void) {}
+#endif
 
 static int __init numa_init(int (*init_func)(void))
 {
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 4/6] x86/numa: concentrate the code of setting cpu to node map
  2019-02-24 12:34 [PATCH 0/6] make memblock allocator utilize the node's fallback info Pingfan Liu
                   ` (2 preceding siblings ...)
  2019-02-24 12:34 ` [PATCH 3/6] x86/numa: define numa_init_array() conditional on CONFIG_NUMA Pingfan Liu
@ 2019-02-24 12:34 ` Pingfan Liu
  2019-02-25 15:25   ` Dave Hansen
  2019-02-24 12:34 ` [PATCH 5/6] x86/numa: push forward the setup of node to cpumask map Pingfan Liu
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Pingfan Liu @ 2019-02-24 12:34 UTC (permalink / raw)
  To: x86, linux-mm
  Cc: Pingfan Liu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Vlastimil Babka, Mike Rapoport,
	Andrew Morton, Mel Gorman, Joonsoo Kim, Andy Lutomirski,
	Andi Kleen, Petr Tesarik, Michal Hocko, Stephen Rothwell,
	Jonathan Corbet, Nicholas Piggin, Daniel Vacek, linux-kernel

Both numa_init_array() and init_cpu_to_node() aim at setting up the cpu to
node map, so combine them. A coming patch will set up the node to cpumask map
in the combined function.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Vlastimil Babka <vbabka@suse.cz>
CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Mel Gorman <mgorman@suse.de>
CC: Joonsoo Kim <iamjoonsoo.kim@lge.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Andi Kleen <ak@linux.intel.com>
CC: Petr Tesarik <ptesarik@suse.cz>
CC: Michal Hocko <mhocko@suse.com>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Nicholas Piggin <npiggin@gmail.com>
CC: Daniel Vacek <neelx@redhat.com>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/mm/numa.c | 39 +++++++++++++--------------------------
 1 file changed, 13 insertions(+), 26 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index bfe6732..c8dd7af 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -599,30 +599,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 	return 0;
 }
 
-#ifdef CONFIG_NUMA
-/*
- * There are unfortunately some poorly designed mainboards around that
- * only connect memory to a single CPU. This breaks the 1:1 cpu->node
- * mapping. To avoid this fill in the mapping for all possible CPUs,
- * as the number of CPUs is not known yet. We round robin the existing
- * nodes.
- */
-static void __init numa_init_array(void)
-{
-	int rr, i;
-
-	rr = first_node(node_online_map);
-	for (i = 0; i < nr_cpu_ids; i++) {
-		if (early_cpu_to_node(i) != NUMA_NO_NODE)
-			continue;
-		numa_set_node(i, rr);
-		rr = next_node_in(rr, node_online_map);
-	}
-}
-#else
-static void __init numa_init_array(void) {}
-#endif
-
 static int __init numa_init(int (*init_func)(void))
 {
 	int i;
@@ -675,7 +651,6 @@ static int __init numa_init(int (*init_func)(void))
 		if (!node_online(nid))
 			numa_clear_node(i);
 	}
-	numa_init_array();
 
 	return 0;
 }
@@ -758,14 +733,26 @@ void __init init_cpu_to_node(void)
 {
 	int cpu;
 	u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
+	int rr;
 
 	BUG_ON(cpu_to_apicid == NULL);
+	rr = first_node(node_online_map);
 
 	for_each_possible_cpu(cpu) {
 		int node = numa_cpu_node(cpu);
 
-		if (node == NUMA_NO_NODE)
+		/*
+		 * There are unfortunately some poorly designed mainboards
+		 * around that only connect memory to a single CPU. This
+		 * breaks the 1:1 cpu->node mapping. To avoid this fill in
+		 * the mapping for all possible CPUs, as the number of CPUs
+		 * is not known yet. We round robin the existing nodes.
+		 */
+		if (node == NUMA_NO_NODE) {
+			numa_set_node(cpu, rr);
+			rr = next_node_in(rr, node_online_map);
 			continue;
+		}
 
 		if (!node_online(node))
 			init_memory_less_node(node);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 5/6] x86/numa: push forward the setup of node to cpumask map
  2019-02-24 12:34 [PATCH 0/6] make memblock allocator utilize the node's fallback info Pingfan Liu
                   ` (3 preceding siblings ...)
  2019-02-24 12:34 ` [PATCH 4/6] x86/numa: concentrate the code of setting cpu to node map Pingfan Liu
@ 2019-02-24 12:34 ` Pingfan Liu
  2019-02-25 15:30   ` Dave Hansen
  2019-02-24 12:34 ` [PATCH 6/6] x86/numa: build node fallback info after setting up " Pingfan Liu
  2019-02-25 16:03 ` [PATCH 0/6] make memblock allocator utilize the node's fallback info Michal Hocko
  6 siblings, 1 reply; 29+ messages in thread
From: Pingfan Liu @ 2019-02-24 12:34 UTC (permalink / raw)
  To: x86, linux-mm
  Cc: Pingfan Liu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Vlastimil Babka, Mike Rapoport,
	Andrew Morton, Mel Gorman, Joonsoo Kim, Andy Lutomirski,
	Andi Kleen, Petr Tesarik, Michal Hocko, Stephen Rothwell,
	Jonathan Corbet, Nicholas Piggin, Daniel Vacek, linux-kernel

At present the node to cpumask map is not set up until the secondary cpus boot
up, which is too late for building the node fallback list at the early boot
stage. Since init_cpu_to_node() already owns the cpu to node map, it is a good
place to set up the node to cpumask map too. So do it by calling
numa_add_cpu(cpu) in init_cpu_to_node().

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Vlastimil Babka <vbabka@suse.cz>
CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Mel Gorman <mgorman@suse.de>
CC: Joonsoo Kim <iamjoonsoo.kim@lge.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Andi Kleen <ak@linux.intel.com>
CC: Petr Tesarik <ptesarik@suse.cz>
CC: Michal Hocko <mhocko@suse.com>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Nicholas Piggin <npiggin@gmail.com>
CC: Daniel Vacek <neelx@redhat.com>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/topology.h | 4 ----
 arch/x86/kernel/setup_percpu.c  | 3 ---
 arch/x86/mm/numa.c              | 5 ++++-
 3 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 453cf38..fad77c7 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -73,8 +73,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
 }
 #endif
 
-extern void setup_node_to_cpumask_map(void);
-
 #define pcibus_to_node(bus) __pcibus_to_node(bus)
 
 extern int __node_distance(int, int);
@@ -96,8 +94,6 @@ static inline int early_cpu_to_node(int cpu)
 	return 0;
 }
 
-static inline void setup_node_to_cpumask_map(void) { }
-
 #endif
 
 #include <asm-generic/topology.h>
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index e8796fc..206fa43 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -283,9 +283,6 @@ void __init setup_per_cpu_areas(void)
 	early_per_cpu_ptr(x86_cpu_to_node_map) = NULL;
 #endif
 
-	/* Setup node to cpumask map */
-	setup_node_to_cpumask_map();
-
 	/* Setup cpu initialized, callin, callout masks */
 	setup_cpu_local_masks();
 
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index c8dd7af..8d73e2273 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -110,7 +110,7 @@ void numa_clear_node(int cpu)
  * Note: cpumask_of_node() is not valid until after this is done.
  * (Use CONFIG_DEBUG_PER_CPU_MAPS to check this.)
  */
-void __init setup_node_to_cpumask_map(void)
+static void __init setup_node_to_cpumask_map(void)
 {
 	unsigned int node;
 
@@ -738,6 +738,7 @@ void __init init_cpu_to_node(void)
 	BUG_ON(cpu_to_apicid == NULL);
 	rr = first_node(node_online_map);
 
+	setup_node_to_cpumask_map();
 	for_each_possible_cpu(cpu) {
 		int node = numa_cpu_node(cpu);
 
@@ -750,6 +751,7 @@ void __init init_cpu_to_node(void)
 		 */
 		if (node == NUMA_NO_NODE) {
 			numa_set_node(cpu, rr);
+			numa_add_cpu(cpu);
 			rr = next_node_in(rr, node_online_map);
 			continue;
 		}
@@ -758,6 +760,7 @@ void __init init_cpu_to_node(void)
 			init_memory_less_node(node);
 
 		numa_set_node(cpu, node);
+		numa_add_cpu(cpu);
 	}
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 6/6] x86/numa: build node fallback info after setting up node to cpumask map
  2019-02-24 12:34 [PATCH 0/6] make memblock allocator utilize the node's fallback info Pingfan Liu
                   ` (4 preceding siblings ...)
  2019-02-24 12:34 ` [PATCH 5/6] x86/numa: push forward the setup of node to cpumask map Pingfan Liu
@ 2019-02-24 12:34 ` Pingfan Liu
  2019-02-25 16:03 ` [PATCH 0/6] make memblock allocator utilize the node's fallback info Michal Hocko
  6 siblings, 0 replies; 29+ messages in thread
From: Pingfan Liu @ 2019-02-24 12:34 UTC (permalink / raw)
  To: x86, linux-mm
  Cc: Pingfan Liu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Vlastimil Babka, Mike Rapoport,
	Andrew Morton, Mel Gorman, Joonsoo Kim, Andy Lutomirski,
	Andi Kleen, Petr Tesarik, Michal Hocko, Stephen Rothwell,
	Jonathan Corbet, Nicholas Piggin, Daniel Vacek, linux-kernel

After the previous patches, on x86 it is safe to call
memblock_build_node_order() after init_cpu_to_node(), which has set up the node
to cpumask map. So call memblock_build_node_order() there to feed memblock with
the NUMA node fallback info.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Vlastimil Babka <vbabka@suse.cz>
CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Mel Gorman <mgorman@suse.de>
CC: Joonsoo Kim <iamjoonsoo.kim@lge.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Andi Kleen <ak@linux.intel.com>
CC: Petr Tesarik <ptesarik@suse.cz>
CC: Michal Hocko <mhocko@suse.com>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Nicholas Piggin <npiggin@gmail.com>
CC: Daniel Vacek <neelx@redhat.com>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/setup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3d872a5..3ec1a6e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1245,6 +1245,8 @@ void __init setup_arch(char **cmdline_p)
 	prefill_possible_map();
 
 	init_cpu_to_node();
+	/* node to cpumask map is ready */
+	memblock_build_node_order();
 
 	io_apic_init_mappings();
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm/memblock: make full utilization of numa info
  2019-02-24 12:34 ` [PATCH 2/6] mm/memblock: make full utilization of numa info Pingfan Liu
@ 2019-02-25  7:07   ` kbuild test robot
  2019-02-25  7:59   ` kbuild test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: kbuild test robot @ 2019-02-25  7:07 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: kbuild-all, x86, linux-mm, Pingfan Liu, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Dave Hansen,
	Vlastimil Babka, Mike Rapoport, Andrew Morton, Mel Gorman,
	Joonsoo Kim, Andy Lutomirski, Andi Kleen, Petr Tesarik,
	Michal Hocko, Stephen Rothwell, Jonathan Corbet, Nicholas Piggin,
	Daniel Vacek, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 924 bytes --]

Hi Pingfan,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.0-rc4]
[cannot apply to next-20190222]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Pingfan-Liu/mm-numa-extract-the-code-of-building-node-fall-back-list/20190225-143613
config: i386-tinyconfig (attached as .config)
compiler: gcc-8 (Debian 8.2.0-20) 8.2.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   ld: mm/memblock.o: in function `memblock_build_node_order':
>> memblock.c:(.init.text+0x310): undefined reference to `build_node_order'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6512 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm/memblock: make full utilization of numa info
  2019-02-24 12:34 ` [PATCH 2/6] mm/memblock: make full utilization of numa info Pingfan Liu
  2019-02-25  7:07   ` kbuild test robot
@ 2019-02-25  7:59   ` kbuild test robot
  2019-02-25 15:34   ` Dave Hansen
  2019-02-26 11:58   ` Mike Rapoport
  3 siblings, 0 replies; 29+ messages in thread
From: kbuild test robot @ 2019-02-25  7:59 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: kbuild-all, x86, linux-mm, Pingfan Liu, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Dave Hansen,
	Vlastimil Babka, Mike Rapoport, Andrew Morton, Mel Gorman,
	Joonsoo Kim, Andy Lutomirski, Andi Kleen, Petr Tesarik,
	Michal Hocko, Stephen Rothwell, Jonathan Corbet, Nicholas Piggin,
	Daniel Vacek, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1799 bytes --]

Hi Pingfan,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.0-rc4]
[cannot apply to next-20190222]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Pingfan-Liu/mm-numa-extract-the-code-of-building-node-fall-back-list/20190225-143613
config: i386-randconfig-a1-201908 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   ld: mm/memblock.o: in function `memblock_build_node_order':
>> mm/memblock.c:1364: undefined reference to `build_node_order'

vim +1364 mm/memblock.c

  1342	
  1343	/*
  1344	 * build_node_order() relies on cpumask_of_node(), hence arch should set up
  1345	 * cpumask before calling this func.
  1346	 */
  1347	void __init memblock_build_node_order(void)
  1348	{
  1349		int nid, i;
  1350		nodemask_t used_mask;
  1351	
  1352		node_fallback = memblock_alloc(MAX_NUMNODES * sizeof(int *),
  1353			sizeof(int *));
  1354		for_each_online_node(nid) {
  1355			node_fallback[nid] = memblock_alloc(
  1356				num_online_nodes() * sizeof(int), sizeof(int));
  1357			for (i = 0; i < num_online_nodes(); i++)
  1358				node_fallback[nid][i] = NUMA_NO_NODE;
  1359		}
  1360	
  1361		for_each_online_node(nid) {
  1362			nodes_clear(used_mask);
  1363			node_set(nid, used_mask);
> 1364			build_node_order(node_fallback[nid], num_online_nodes(),
  1365				nid, &used_mask);
  1366		}
  1367	}
  1368	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31378 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/6] x86/numa: define numa_init_array() conditional on CONFIG_NUMA
  2019-02-24 12:34 ` [PATCH 3/6] x86/numa: define numa_init_array() conditional on CONFIG_NUMA Pingfan Liu
@ 2019-02-25 15:23   ` Dave Hansen
  2019-02-26  5:40       ` Pingfan Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Hansen @ 2019-02-25 15:23 UTC (permalink / raw)
  To: Pingfan Liu, x86, linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Dave Hansen, Vlastimil Babka, Mike Rapoport, Andrew Morton,
	Mel Gorman, Joonsoo Kim, Andy Lutomirski, Andi Kleen,
	Petr Tesarik, Michal Hocko, Stephen Rothwell, Jonathan Corbet,
	Nicholas Piggin, Daniel Vacek, linux-kernel

On 2/24/19 4:34 AM, Pingfan Liu wrote:
> +#ifdef CONFIG_NUMA
>  /*
>   * There are unfortunately some poorly designed mainboards around that
>   * only connect memory to a single CPU. This breaks the 1:1 cpu->node
> @@ -618,6 +619,9 @@ static void __init numa_init_array(void)
>  		rr = next_node_in(rr, node_online_map);
>  	}
>  }
> +#else
> +static void __init numa_init_array(void) {}
> +#endif

What functional effect does this #ifdef have?

Let's look at the code:

> static void __init numa_init_array(void)
> {
>         int rr, i;
> 
>         rr = first_node(node_online_map);
>         for (i = 0; i < nr_cpu_ids; i++) {
>                 if (early_cpu_to_node(i) != NUMA_NO_NODE)
>                         continue;
>                 numa_set_node(i, rr);
>                 rr = next_node_in(rr, node_online_map);
>         }
> }

and "play compiler" for a bit.

The first iteration will see early_cpu_to_node(i)==0 because:

static inline int early_cpu_to_node(int cpu)
{
        return 0;
}

if CONFIG_NUMA=n.

In other words, I'm not sure this patch does *anything*.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] x86/numa: concentrate the code of setting cpu to node map
  2019-02-24 12:34 ` [PATCH 4/6] x86/numa: concentrate the code of setting cpu to node map Pingfan Liu
@ 2019-02-25 15:25   ` Dave Hansen
  0 siblings, 0 replies; 29+ messages in thread
From: Dave Hansen @ 2019-02-25 15:25 UTC (permalink / raw)
  To: Pingfan Liu, x86, linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Dave Hansen, Vlastimil Babka, Mike Rapoport, Andrew Morton,
	Mel Gorman, Joonsoo Kim, Andy Lutomirski, Andi Kleen,
	Petr Tesarik, Michal Hocko, Stephen Rothwell, Jonathan Corbet,
	Nicholas Piggin, Daniel Vacek, linux-kernel

On 2/24/19 4:34 AM, Pingfan Liu wrote:
> -#else
> -static void __init numa_init_array(void) {}
> -#endif
> -
>  static int __init numa_init(int (*init_func)(void))
>  {
>  	int i;
> @@ -675,7 +651,6 @@ static int __init numa_init(int (*init_func)(void))
>  		if (!node_online(nid))
>  			numa_clear_node(i);
>  	}
> -	numa_init_array();
>  
>  	return 0;
>  }

Huh, and now the (stub) code that was just added gets removed?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/6] x86/numa: push forward the setup of node to cpumask map
  2019-02-24 12:34 ` [PATCH 5/6] x86/numa: push forward the setup of node to cpumask map Pingfan Liu
@ 2019-02-25 15:30   ` Dave Hansen
  2019-02-26  5:40       ` Pingfan Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Hansen @ 2019-02-25 15:30 UTC (permalink / raw)
  To: Pingfan Liu, x86, linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Dave Hansen, Vlastimil Babka, Mike Rapoport, Andrew Morton,
	Mel Gorman, Joonsoo Kim, Andy Lutomirski, Andi Kleen,
	Petr Tesarik, Michal Hocko, Stephen Rothwell, Jonathan Corbet,
	Nicholas Piggin, Daniel Vacek, linux-kernel

On 2/24/19 4:34 AM, Pingfan Liu wrote:
> At present the node to cpumask map is not set up until the secondary cpus boot
> up, which is too late for building the node fallback list at the early boot
> stage. Since init_cpu_to_node() already owns the cpu to node map, it is a good
> place to set up the node to cpumask map too. So do it by calling
> numa_add_cpu(cpu) in init_cpu_to_node().

It sounds like you have carefully considered the ordering and
dependencies here.  However, none of that consideration has made it into
the code.

Could you please add some comments to the new call-sites to explain why
they *must* be where they are?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm/memblock: make full utilization of numa info
  2019-02-24 12:34 ` [PATCH 2/6] mm/memblock: make full utilization of numa info Pingfan Liu
  2019-02-25  7:07   ` kbuild test robot
  2019-02-25  7:59   ` kbuild test robot
@ 2019-02-25 15:34   ` Dave Hansen
  2019-02-26  5:40       ` Pingfan Liu
  2019-02-26 11:58   ` Mike Rapoport
  3 siblings, 1 reply; 29+ messages in thread
From: Dave Hansen @ 2019-02-25 15:34 UTC (permalink / raw)
  To: Pingfan Liu, x86, linux-mm
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Dave Hansen, Vlastimil Babka, Mike Rapoport, Andrew Morton,
	Mel Gorman, Joonsoo Kim, Andy Lutomirski, Andi Kleen,
	Petr Tesarik, Michal Hocko, Stephen Rothwell, Jonathan Corbet,
	Nicholas Piggin, Daniel Vacek, linux-kernel

On 2/24/19 4:34 AM, Pingfan Liu wrote:
> +/*
> + * build_node_order() relies on cpumask_of_node(), hence arch should 
> + * set up cpumask before calling this func.
> + */

Whenever I see comments like this, I wonder what happens if the arch
doesn't do this?  Do we just crash in early boot in wonderful new ways?
 Or do we get a nice message telling us?

> +void __init memblock_build_node_order(void)
> +{
> +	int nid, i;
> +	nodemask_t used_mask;
> +
> +	node_fallback = memblock_alloc(MAX_NUMNODES * sizeof(int *),
> +		sizeof(int *));
> +	for_each_online_node(nid) {
> +		node_fallback[nid] = memblock_alloc(
> +			num_online_nodes() * sizeof(int), sizeof(int));
> +		for (i = 0; i < num_online_nodes(); i++)
> +			node_fallback[nid][i] = NUMA_NO_NODE;
> +	}
> +
> +	for_each_online_node(nid) {
> +		nodes_clear(used_mask);
> +		node_set(nid, used_mask);
> +		build_node_order(node_fallback[nid], num_online_nodes(),
> +			nid, &used_mask);
> +	}
> +}

This doesn't get used until patch 6 as far as I can tell.  Was there a
reason to define it here?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/6] make memblock allocator utilize the node's fallback info
  2019-02-24 12:34 [PATCH 0/6] make memblock allocator utilize the node's fallback info Pingfan Liu
                   ` (5 preceding siblings ...)
  2019-02-24 12:34 ` [PATCH 6/6] x86/numa: build node fallback info after setting up " Pingfan Liu
@ 2019-02-25 16:03 ` Michal Hocko
  2019-02-26  5:47     ` Pingfan Liu
  6 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2019-02-25 16:03 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: x86, linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Vlastimil Babka, Mike Rapoport,
	Andrew Morton, Mel Gorman, Joonsoo Kim, Andy Lutomirski,
	Andi Kleen, Petr Tesarik, Stephen Rothwell, Jonathan Corbet,
	Nicholas Piggin, Daniel Vacek, linux-kernel

On Sun 24-02-19 20:34:03, Pingfan Liu wrote:
> There are NUMA machines with memory-less nodes. At present the page allocator builds the
> full fallback info via build_zonelists(), but the memblock allocator does not utilize this
> info: for a memory-less node it simply falls back to node 0 instead of using the nearest
> node. Unfortunately, the percpu section, which is accessed frequently after bootup, is
> allocated by memblock.
> 
> This series aims to improve the performance of the percpu section on memory-less nodes by
> feeding the node fallback info to the memblock allocator on x86, like we already do for the
> page allocator. On other archs, an independent effort is required to set up the node to
> cpumask map early enough.

Do you have any numbers to tell us how much this improves the
situation?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm/memblock: make full utilization of numa info
  2019-02-25 15:34   ` Dave Hansen
@ 2019-02-26  5:40       ` Pingfan Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Pingfan Liu @ 2019-02-26  5:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Vlastimil Babka, Mike Rapoport,
	Andrew Morton, Mel Gorman, Joonsoo Kim, Andy Lutomirski,
	Andi Kleen, Petr Tesarik, Michal Hocko, Stephen Rothwell,
	Jonathan Corbet, Nicholas Piggin, Daniel Vacek, LKML

On Mon, Feb 25, 2019 at 11:34 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 2/24/19 4:34 AM, Pingfan Liu wrote:
> > +/*
> > + * build_node_order() relies on cpumask_of_node(), hence arch should
> > + * set up cpumask before calling this func.
> > + */
>
> Whenever I see comments like this, I wonder what happens if the arch
> doesn't do this?  Do we just crash in early boot in wonderful new ways?
>  Or do we get a nice message telling us?
>
If the arch doesn't do this, this function will crash. It is a shame, but a
little hard to work around: since this function is called at the early boot
stage, things like cpumask_of_node(cpu_to_node(cpu)) cannot work reliably, and
we lack an abstract interface to get such information from all archs. So I
leave this to the arch developers.
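
A minimal sketch of one possible guard (illustration only, not part of the posted
patch), assuming an all-empty node-to-cpumask map is a good enough sign that the
arch has not set it up yet:

static bool __init node_cpumask_populated(void)
{
        int nid;

        /* any online node with a non-empty cpumask means the map is set up */
        for_each_online_node(nid)
                if (!cpumask_empty(cpumask_of_node(nid)))
                        return true;
        return false;
}

void __init memblock_build_node_order(void)
{
        if (!node_cpumask_populated()) {
                pr_warn("memblock: node to cpumask map not set up, skipping fallback table\n");
                return; /* memblock keeps today's fallback behaviour */
        }
        /* ... table-building body from this patch ... */
}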

> > +void __init memblock_build_node_order(void)
> > +{
> > +     int nid, i;
> > +     nodemask_t used_mask;
> > +
> > +     node_fallback = memblock_alloc(MAX_NUMNODES * sizeof(int *),
> > +             sizeof(int *));
> > +     for_each_online_node(nid) {
> > +             node_fallback[nid] = memblock_alloc(
> > +                     num_online_nodes() * sizeof(int), sizeof(int));
> > +             for (i = 0; i < num_online_nodes(); i++)
> > +                     node_fallback[nid][i] = NUMA_NO_NODE;
> > +     }
> > +
> > +     for_each_online_node(nid) {
> > +             nodes_clear(used_mask);
> > +             node_set(nid, used_mask);
> > +             build_node_order(node_fallback[nid], num_online_nodes(),
> > +                     nid, &used_mask);
> > +     }
> > +}
>
> This doesn't get used until patch 6 as far as I can tell.  Was there a
> reason to define it here?
>
Yes, it does not get used until patch 6. Patch 6 has two groups of
prerequisites, [1-2] and [3-5]. Do you think reordering the patches and moving
[3-5] ahead of [1-2] would be a better choice?

Thanks and regards,
Pingfan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/6] x86/numa: define numa_init_array() conditional on CONFIG_NUMA
  2019-02-25 15:23   ` Dave Hansen
@ 2019-02-26  5:40       ` Pingfan Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Pingfan Liu @ 2019-02-26  5:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Vlastimil Babka, Mike Rapoport,
	Andrew Morton, Mel Gorman, Joonsoo Kim, Andy Lutomirski,
	Andi Kleen, Petr Tesarik, Michal Hocko, Stephen Rothwell,
	Jonathan Corbet, Nicholas Piggin, Daniel Vacek, LKML

On Mon, Feb 25, 2019 at 11:24 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 2/24/19 4:34 AM, Pingfan Liu wrote:
> > +#ifdef CONFIG_NUMA
> >  /*
> >   * There are unfortunately some poorly designed mainboards around that
> >   * only connect memory to a single CPU. This breaks the 1:1 cpu->node
> > @@ -618,6 +619,9 @@ static void __init numa_init_array(void)
> >               rr = next_node_in(rr, node_online_map);
> >       }
> >  }
> > +#else
> > +static void __init numa_init_array(void) {}
> > +#endif
>
> What functional effect does this #ifdef have?
>
> Let's look at the code:
>
> > static void __init numa_init_array(void)
> > {
> >         int rr, i;
> >
> >         rr = first_node(node_online_map);
> >         for (i = 0; i < nr_cpu_ids; i++) {
> >                 if (early_cpu_to_node(i) != NUMA_NO_NODE)
> >                         continue;
> >                 numa_set_node(i, rr);
> >                 rr = next_node_in(rr, node_online_map);
> >         }
> > }
>
> and "play compiler" for a bit.
>
> The first iteration will see early_cpu_to_node(i)==0 because:
>
> static inline int early_cpu_to_node(int cpu)
> {
>         return 0;
> }
>
> if CONFIG_NUMA=n.
>
> In other words, I'm not sure this patch does *anything*.

I had thought separating [3/6] and [4/6] could ease the review. I will merge
them in the next version.

Thanks and regards,
Pingfan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/6] x86/numa: push forward the setup of node to cpumask map
  2019-02-25 15:30   ` Dave Hansen
@ 2019-02-26  5:40       ` Pingfan Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Pingfan Liu @ 2019-02-26  5:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Vlastimil Babka, Mike Rapoport,
	Andrew Morton, Mel Gorman, Joonsoo Kim, Andy Lutomirski,
	Andi Kleen, Petr Tesarik, Michal Hocko, Stephen Rothwell,
	Jonathan Corbet, Nicholas Piggin, Daniel Vacek, LKML

On Mon, Feb 25, 2019 at 11:30 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 2/24/19 4:34 AM, Pingfan Liu wrote:
> > At present the node to cpumask map is not set up until the secondary cpus boot
> > up, which is too late for building the node fallback list at the early boot
> > stage. Since init_cpu_to_node() already owns the cpu to node map, it is a good
> > place to set up the node to cpumask map too. So do it by calling
> > numa_add_cpu(cpu) in init_cpu_to_node().
>
> It sounds like you have carefully considered the ordering and
> dependencies here.  However, none of that consideration has made it into
> the code.
>
> Could you please add some comments to the new call-sites to explain why
> they *must* be where they are?

OK. How about: "building up the node fallback list needs the cpumask info, so
fill in the cpumask info here"?
Thanks for your kind review.

Regards,
Pingfan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/6] make memblock allocator utilize the node's fallback info
  2019-02-25 16:03 ` [PATCH 0/6] make memblock allocator utilize the node's fallback info Michal Hocko
@ 2019-02-26  5:47     ` Pingfan Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Pingfan Liu @ 2019-02-26  5:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: x86, linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Vlastimil Babka, Mike Rapoport,
	Andrew Morton, Mel Gorman, Joonsoo Kim, Andy Lutomirski,
	Andi Kleen, Petr Tesarik, Stephen Rothwell, Jonathan Corbet,
	Nicholas Piggin, Daniel Vacek, LKML

On Tue, Feb 26, 2019 at 12:04 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Sun 24-02-19 20:34:03, Pingfan Liu wrote:
> > There are NUMA machines with memory-less nodes. At present the page allocator builds the
> > full fallback info via build_zonelists(), but the memblock allocator does not utilize this
> > info: for a memory-less node it simply falls back to node 0 instead of using the nearest
> > node. Unfortunately, the percpu section, which is accessed frequently after bootup, is
> > allocated by memblock.
> >
> > This series aims to improve the performance of the percpu section on memory-less nodes by
> > feeding the node fallback info to the memblock allocator on x86, like we already do for the
> > page allocator. On other archs, an independent effort is required to set up the node to
> > cpumask map early enough.
>
> Do you have any numbers to tell us how much this improves the
> situation?

Not yet. At present it is just based on the fact that we prefer to allocate
the per cpu area on the local node.

Thanks,
Pingfan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm/memblock: make full utilization of numa info
  2019-02-24 12:34 ` [PATCH 2/6] mm/memblock: make full utilization of numa info Pingfan Liu
                     ` (2 preceding siblings ...)
  2019-02-25 15:34   ` Dave Hansen
@ 2019-02-26 11:58   ` Mike Rapoport
  2019-02-27  9:23       ` Pingfan Liu
  3 siblings, 1 reply; 29+ messages in thread
From: Mike Rapoport @ 2019-02-26 11:58 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: x86, linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Vlastimil Babka, Mike Rapoport,
	Andrew Morton, Mel Gorman, Joonsoo Kim, Andy Lutomirski,
	Andi Kleen, Petr Tesarik, Michal Hocko, Stephen Rothwell,
	Jonathan Corbet, Nicholas Piggin, Daniel Vacek, linux-kernel

On Sun, Feb 24, 2019 at 08:34:05PM +0800, Pingfan Liu wrote:
> There are NUMA machines with memory-less nodes. When allocating memory for a
> memory-less node, the memblock allocator falls back to node 0 without fully
> utilizing the nearest node. This hurts performance, especially for the percpu
> section. Fix this defect by building the full node fallback info for the
> memblock allocator, like what we have done for the page allocator.

Is it really necessary to build the full node fallback info for memblock and
then rebuild it for the page allocator?

I think it should be possible to split parts of build_all_zonelists_init()
that do not touch per-cpu areas into a separate function and call that
function after topology detection. Then it would be possible to use
local_memory_node() when calling memblock.
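
A minimal sketch of that alternative (illustration only, assuming
local_memory_node() has already been made valid by such an early, split-out
zonelist/node-order setup):

        /* in memblock_alloc_internal(), before the generic NUMA_NO_NODE retry */
        if (nid != NUMA_NO_NODE) {
                int mem_nid = local_memory_node(nid);   /* nearest node with memory */

                if (mem_nid != nid) {
                        alloc = memblock_find_in_range_node(size, align, min_addr,
                                                            max_addr, mem_nid, flags);
                        if (alloc && !memblock_reserve(alloc, size))
                                goto done;
                }
                /* then fall through to the existing NUMA_NO_NODE attempt */
        }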
 
> Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Ingo Molnar <mingo@redhat.com>
> CC: Borislav Petkov <bp@alien8.de>
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: Dave Hansen <dave.hansen@linux.intel.com>
> CC: Vlastimil Babka <vbabka@suse.cz>
> CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Mel Gorman <mgorman@suse.de>
> CC: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> CC: Andy Lutomirski <luto@kernel.org>
> CC: Andi Kleen <ak@linux.intel.com>
> CC: Petr Tesarik <ptesarik@suse.cz>
> CC: Michal Hocko <mhocko@suse.com>
> CC: Stephen Rothwell <sfr@canb.auug.org.au>
> CC: Jonathan Corbet <corbet@lwn.net>
> CC: Nicholas Piggin <npiggin@gmail.com>
> CC: Daniel Vacek <neelx@redhat.com>
> CC: linux-kernel@vger.kernel.org
> ---
>  include/linux/memblock.h |  3 +++
>  mm/memblock.c            | 68 ++++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 66 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 64c41cf..ee999c5 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -342,6 +342,9 @@ void *memblock_alloc_try_nid_nopanic(phys_addr_t size, phys_addr_t align,
>  void *memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align,
>  			     phys_addr_t min_addr, phys_addr_t max_addr,
>  			     int nid);
> +extern int build_node_order(int *node_oder_array, int sz,
> +	int local_node, nodemask_t *used_mask);
> +void memblock_build_node_order(void);
> 
>  static inline void * __init memblock_alloc(phys_addr_t size,  phys_addr_t align)
>  {
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 022d4cb..cf78850 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1338,6 +1338,47 @@ phys_addr_t __init memblock_phys_alloc_try_nid(phys_addr_t size, phys_addr_t ali
>  	return memblock_alloc_base(size, align, MEMBLOCK_ALLOC_ACCESSIBLE);
>  }
> 
> +static int **node_fallback __initdata;
> +
> +/*
> + * build_node_order() relies on cpumask_of_node(), hence arch should set up
> + * cpumask before calling this func.
> + */
> +void __init memblock_build_node_order(void)
> +{
> +	int nid, i;
> +	nodemask_t used_mask;
> +
> +	node_fallback = memblock_alloc(MAX_NUMNODES * sizeof(int *),
> +		sizeof(int *));
> +	for_each_online_node(nid) {
> +		node_fallback[nid] = memblock_alloc(
> +			num_online_nodes() * sizeof(int), sizeof(int));
> +		for (i = 0; i < num_online_nodes(); i++)
> +			node_fallback[nid][i] = NUMA_NO_NODE;
> +	}
> +
> +	for_each_online_node(nid) {
> +		nodes_clear(used_mask);
> +		node_set(nid, used_mask);
> +		build_node_order(node_fallback[nid], num_online_nodes(),
> +			nid, &used_mask);
> +	}
> +}
> +
> +static void __init memblock_free_node_order(void)
> +{
> +	int nid;
> +
> +	if (!node_fallback)
> +		return;
> +	for_each_online_node(nid)
> +		memblock_free(__pa(node_fallback[nid]),
> +			num_online_nodes() * sizeof(int));
> +	memblock_free(__pa(node_fallback), MAX_NUMNODES * sizeof(int *));
> +	node_fallback = NULL;
> +}
> +
>  /**
>   * memblock_alloc_internal - allocate boot memory block
>   * @size: size of memory block to be allocated in bytes
> @@ -1370,6 +1411,7 @@ static void * __init memblock_alloc_internal(
>  {
>  	phys_addr_t alloc;
>  	void *ptr;
> +	int node;
>  	enum memblock_flags flags = choose_memblock_flags();
> 
>  	if (WARN_ONCE(nid == MAX_NUMNODES, "Usage of MAX_NUMNODES is deprecated. Use NUMA_NO_NODE instead\n"))
> @@ -1397,11 +1439,26 @@ static void * __init memblock_alloc_internal(
>  		goto done;
> 
>  	if (nid != NUMA_NO_NODE) {
> -		alloc = memblock_find_in_range_node(size, align, min_addr,
> -						    max_addr, NUMA_NO_NODE,
> -						    flags);
> -		if (alloc && !memblock_reserve(alloc, size))
> -			goto done;
> +		if (!node_fallback) {
> +			alloc = memblock_find_in_range_node(size, align,
> +					min_addr, max_addr,
> +					NUMA_NO_NODE, flags);
> +			if (alloc && !memblock_reserve(alloc, size))
> +				goto done;
> +		} else {
> +			int i;
> +			for (i = 0; i < num_online_nodes(); i++) {
> +				node = node_fallback[nid][i];
> +				/* fallback list has all memory nodes */
> +				if (node == NUMA_NO_NODE)
> +					break;
> +				alloc = memblock_find_in_range_node(size,
> +						align, min_addr, max_addr,
> +						node, flags);
> +				if (alloc && !memblock_reserve(alloc, size))
> +					goto done;
> +			}
> +		}
>  	}
> 
>  	if (min_addr) {
> @@ -1969,6 +2026,7 @@ unsigned long __init memblock_free_all(void)
> 
>  	reset_all_zones_managed_pages();
> 
> +	memblock_free_node_order();
>  	pages = free_low_memory_core_early();
>  	totalram_pages_add(pages);
> 
> -- 
> 2.7.4
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/6] make memblock allocator utilize the node's fallback info
  2019-02-26  5:47     ` Pingfan Liu
  (?)
@ 2019-02-26 12:09     ` Michal Hocko
  2019-03-05 12:37         ` Pingfan Liu
  -1 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2019-02-26 12:09 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: x86, linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Vlastimil Babka, Mike Rapoport,
	Andrew Morton, Mel Gorman, Joonsoo Kim, Andy Lutomirski,
	Andi Kleen, Petr Tesarik, Stephen Rothwell, Jonathan Corbet,
	Nicholas Piggin, Daniel Vacek, LKML

On Tue 26-02-19 13:47:37, Pingfan Liu wrote:
> On Tue, Feb 26, 2019 at 12:04 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Sun 24-02-19 20:34:03, Pingfan Liu wrote:
> > > There are NUMA machines with memory-less node. At present page allocator builds the
> > > full fallback info by build_zonelists(). But memblock allocator does not utilize
> > > this info. And for memory-less node, memblock allocator just falls back "node 0",
> > > without utilizing the nearest node. Unfortunately, the percpu section is allocated
> > > by memblock, which is accessed frequently after bootup.
> > >
> > > This series aims to improve the performance of per cpu section on memory-less node
> > > by feeding node's fallback info to memblock allocator on x86, like we do for page
> > > allocator. On other archs, it requires independent effort to setup node to cpumask
> > > map ahead.
> >
> > Do you have any numbers to tell us how much does this improve the
> > situation?
> 
> > Not yet. At present it is just based on the fact that we prefer to
> > allocate the per-cpu area on the local node.

Yes, we _usually_ do. But the additional complexity should be worth it.
And if we find out that the final improvement is not all that great and
considering that memory-less setups are crippled anyway then it might
turn out we just do not care all that much.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm/memblock: make full utilization of numa info
  2019-02-26  5:40       ` Pingfan Liu
  (?)
@ 2019-02-26 12:37       ` Dave Hansen
  -1 siblings, 0 replies; 29+ messages in thread
From: Dave Hansen @ 2019-02-26 12:37 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: x86, linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Vlastimil Babka, Mike Rapoport,
	Andrew Morton, Mel Gorman, Joonsoo Kim, Andy Lutomirski,
	Andi Kleen, Petr Tesarik, Michal Hocko, Stephen Rothwell,
	Jonathan Corbet, Nicholas Piggin, Daniel Vacek, LKML

On 2/25/19 9:40 PM, Pingfan Liu wrote:
>> This doesn't get used until patch 6 as far as I can tell.  Was there a
>> reason to define it here?
>>
> Yes, it is not used until patch 6. Patch 6 has two groups of
> prerequisites, [1-2] and [3-5]. Do you think reordering the patches and
> moving [3-5] ahead of [1-2] is a better choice?

I'd rather that you just introduce the code along with its first user.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm/memblock: make full utilization of numa info
  2019-02-26 11:58   ` Mike Rapoport
@ 2019-02-27  9:23       ` Pingfan Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Pingfan Liu @ 2019-02-27  9:23 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: x86, linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Vlastimil Babka, Mike Rapoport,
	Andrew Morton, Mel Gorman, Joonsoo Kim, Andy Lutomirski,
	Andi Kleen, Petr Tesarik, Michal Hocko, Stephen Rothwell,
	Jonathan Corbet, Nicholas Piggin, Daniel Vacek, LKML

On Tue, Feb 26, 2019 at 7:58 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Sun, Feb 24, 2019 at 08:34:05PM +0800, Pingfan Liu wrote:
> > There are numa machines with memory-less node. When allocating memory for
> > the memory-less node, memblock allocator falls back to 'Node 0' without fully
> > utilizing the nearest node. This hurts the performance, especially for per
> > cpu section. Suppressing this defect by building the full node fall back
> > info for memblock allocator, like what we have done for page allocator.
>
> Is it really necessary to build full node fallback info for memblock and
> then rebuild it again for the page allocator?
>
Do you mean building the full node fallback info once and sharing it
between the memblock and page allocators? If so, then node online/offline
is the corner case that blocks this design.

> I think it should be possible to split parts of build_all_zonelists_init()
> that do not touch per-cpu areas into a separate function and call that
> function after topology detection. Then it would be possible to use
> local_memory_node() when calling memblock.
>
Yes, this is one way, but it may require more extensive code changes.
I will try it.
Thank you for your suggestion.

Best regards,
Pingfan
> > Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> > CC: Thomas Gleixner <tglx@linutronix.de>
> > CC: Ingo Molnar <mingo@redhat.com>
> > CC: Borislav Petkov <bp@alien8.de>
> > CC: "H. Peter Anvin" <hpa@zytor.com>
> > CC: Dave Hansen <dave.hansen@linux.intel.com>
> > CC: Vlastimil Babka <vbabka@suse.cz>
> > CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
> > CC: Andrew Morton <akpm@linux-foundation.org>
> > CC: Mel Gorman <mgorman@suse.de>
> > CC: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > CC: Andy Lutomirski <luto@kernel.org>
> > CC: Andi Kleen <ak@linux.intel.com>
> > CC: Petr Tesarik <ptesarik@suse.cz>
> > CC: Michal Hocko <mhocko@suse.com>
> > CC: Stephen Rothwell <sfr@canb.auug.org.au>
> > CC: Jonathan Corbet <corbet@lwn.net>
> > CC: Nicholas Piggin <npiggin@gmail.com>
> > CC: Daniel Vacek <neelx@redhat.com>
> > CC: linux-kernel@vger.kernel.org
> > ---
> >  include/linux/memblock.h |  3 +++
> >  mm/memblock.c            | 68 ++++++++++++++++++++++++++++++++++++++++++++----
> >  2 files changed, 66 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> > index 64c41cf..ee999c5 100644
> > --- a/include/linux/memblock.h
> > +++ b/include/linux/memblock.h
> > @@ -342,6 +342,9 @@ void *memblock_alloc_try_nid_nopanic(phys_addr_t size, phys_addr_t align,
> >  void *memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align,
> >                            phys_addr_t min_addr, phys_addr_t max_addr,
> >                            int nid);
> > +extern int build_node_order(int *node_oder_array, int sz,
> > +     int local_node, nodemask_t *used_mask);
> > +void memblock_build_node_order(void);
> >
> >  static inline void * __init memblock_alloc(phys_addr_t size,  phys_addr_t align)
> >  {
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 022d4cb..cf78850 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -1338,6 +1338,47 @@ phys_addr_t __init memblock_phys_alloc_try_nid(phys_addr_t size, phys_addr_t ali
> >       return memblock_alloc_base(size, align, MEMBLOCK_ALLOC_ACCESSIBLE);
> >  }
> >
> > +static int **node_fallback __initdata;
> > +
> > +/*
> > + * build_node_order() relies on cpumask_of_node(), hence arch should set up
> > + * cpumask before calling this func.
> > + */
> > +void __init memblock_build_node_order(void)
> > +{
> > +     int nid, i;
> > +     nodemask_t used_mask;
> > +
> > +     node_fallback = memblock_alloc(MAX_NUMNODES * sizeof(int *),
> > +             sizeof(int *));
> > +     for_each_online_node(nid) {
> > +             node_fallback[nid] = memblock_alloc(
> > +                     num_online_nodes() * sizeof(int), sizeof(int));
> > +             for (i = 0; i < num_online_nodes(); i++)
> > +                     node_fallback[nid][i] = NUMA_NO_NODE;
> > +     }
> > +
> > +     for_each_online_node(nid) {
> > +             nodes_clear(used_mask);
> > +             node_set(nid, used_mask);
> > +             build_node_order(node_fallback[nid], num_online_nodes(),
> > +                     nid, &used_mask);
> > +     }
> > +}
> > +
> > +static void __init memblock_free_node_order(void)
> > +{
> > +     int nid;
> > +
> > +     if (!node_fallback)
> > +             return;
> > +     for_each_online_node(nid)
> > +             memblock_free(__pa(node_fallback[nid]),
> > +                     num_online_nodes() * sizeof(int));
> > +     memblock_free(__pa(node_fallback), MAX_NUMNODES * sizeof(int *));
> > +     node_fallback = NULL;
> > +}
> > +
> >  /**
> >   * memblock_alloc_internal - allocate boot memory block
> >   * @size: size of memory block to be allocated in bytes
> > @@ -1370,6 +1411,7 @@ static void * __init memblock_alloc_internal(
> >  {
> >       phys_addr_t alloc;
> >       void *ptr;
> > +     int node;
> >       enum memblock_flags flags = choose_memblock_flags();
> >
> >       if (WARN_ONCE(nid == MAX_NUMNODES, "Usage of MAX_NUMNODES is deprecated. Use NUMA_NO_NODE instead\n"))
> > @@ -1397,11 +1439,26 @@ static void * __init memblock_alloc_internal(
> >               goto done;
> >
> >       if (nid != NUMA_NO_NODE) {
> > -             alloc = memblock_find_in_range_node(size, align, min_addr,
> > -                                                 max_addr, NUMA_NO_NODE,
> > -                                                 flags);
> > -             if (alloc && !memblock_reserve(alloc, size))
> > -                     goto done;
> > +             if (!node_fallback) {
> > +                     alloc = memblock_find_in_range_node(size, align,
> > +                                     min_addr, max_addr,
> > +                                     NUMA_NO_NODE, flags);
> > +                     if (alloc && !memblock_reserve(alloc, size))
> > +                             goto done;
> > +             } else {
> > +                     int i;
> > +                     for (i = 0; i < num_online_nodes(); i++) {
> > +                             node = node_fallback[nid][i];
> > +                             /* fallback list has all memory nodes */
> > +                             if (node == NUMA_NO_NODE)
> > +                                     break;
> > +                             alloc = memblock_find_in_range_node(size,
> > +                                             align, min_addr, max_addr,
> > +                                             node, flags);
> > +                             if (alloc && !memblock_reserve(alloc, size))
> > +                                     goto done;
> > +                     }
> > +             }
> >       }
> >
> >       if (min_addr) {
> > @@ -1969,6 +2026,7 @@ unsigned long __init memblock_free_all(void)
> >
> >       reset_all_zones_managed_pages();
> >
> > +     memblock_free_node_order();
> >       pages = free_low_memory_core_early();
> >       totalram_pages_add(pages);
> >
> > --
> > 2.7.4
> >
>
> --
> Sincerely yours,
> Mike.
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/6] make memblock allocator utilize the node's fallback info
  2019-02-26 12:09     ` Michal Hocko
@ 2019-03-05 12:37         ` Pingfan Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Pingfan Liu @ 2019-03-05 12:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: x86, linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Vlastimil Babka, Mike Rapoport,
	Andrew Morton, Mel Gorman, Joonsoo Kim, Andy Lutomirski,
	Andi Kleen, Petr Tesarik, Stephen Rothwell, Jonathan Corbet,
	Nicholas Piggin, Daniel Vacek, LKML

On Tue, Feb 26, 2019 at 8:09 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 26-02-19 13:47:37, Pingfan Liu wrote:
> > On Tue, Feb 26, 2019 at 12:04 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Sun 24-02-19 20:34:03, Pingfan Liu wrote:
> > > > There are NUMA machines with memory-less node. At present page allocator builds the
> > > > full fallback info by build_zonelists(). But memblock allocator does not utilize
> > > > this info. And for memory-less node, memblock allocator just falls back "node 0",
> > > > without utilizing the nearest node. Unfortunately, the percpu section is allocated
> > > > by memblock, which is accessed frequently after bootup.
> > > >
> > > > This series aims to improve the performance of per cpu section on memory-less node
> > > > by feeding node's fallback info to memblock allocator on x86, like we do for page
> > > > allocator. On other archs, it requires independent effort to setup node to cpumask
> > > > map ahead.
> > >
> > > Do you have any numbers to tell us how much does this improve the
> > > situation?
> >
> > Not yet. At present it is just based on the fact that we prefer to
> > allocate the per-cpu area on the local node.
>
> Yes, we _usually_ do. But the additional complexity should be worth it.
> And if we find out that the final improvement is not all that great and
> considering that memory-less setups are crippled anyway then it might
> turn out we just do not care all that much.
> --
I have finished some tests on a "Dell Inc. PowerEdge R7425/02MJ3T"
machine, which has 8 NUMA nodes. The topology is:
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            4096K
NUMA node0 CPU(s):   0,8,16,24
NUMA node1 CPU(s):   2,10,18,26
NUMA node2 CPU(s):   4,12,20,28
NUMA node3 CPU(s):   6,14,22,30
NUMA node4 CPU(s):   1,9,17,25
NUMA node5 CPU(s):   3,11,19,27
NUMA node6 CPU(s):   5,13,21,29
NUMA node7 CPU(s):   7,15,23,31

This is the basic info about the NUMA machine. cpu 0 and 16 share the
same L3 cache. Only node 1 and node 5 have memory. Using the local node
as the baseline, memory write performance drops by 25% on the nearest
node (i.e. writing data from node 0 to node 1), and by 78% on the
farthest node (i.e. writing from node 0 to node 5).

I used a user-space test case to measure the performance difference
between the nearest node and the farthest one. The case pins two tasks
on cpu 0 and 16. It uses two memory chunks: A, which emulates the small
footprint of the per-cpu section, and B, which emulates a large
footprint. Chunk B is always allocated on the nearest node, while chunk
A switches between the nearest node and the farthest one to produce
comparable results. To emulate around 2.5% of accesses going to the per
cpu area, each round of the test does two groups of writes: 1 write to
memory chunk A, then 40 writes to chunk B.
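
A minimal sketch of such a test case, assuming libnuma (the function and
parameter names are illustrative, and the real case additionally runs a
second task pinned on cpu 16):

#define _GNU_SOURCE
#include <sched.h>
#include <string.h>
#include <numa.h>

static void run(int cpu, int a_node, int b_node,
                size_t a_sz, size_t b_sz, long iters)
{
        cpu_set_t set;
        char *a, *b;
        long i, j;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);             /* e.g. cpu 0, on memory-less node 0 */
        sched_setaffinity(0, sizeof(set), &set);

        a = numa_alloc_onnode(a_sz, a_node);    /* emulated per-cpu footprint */
        b = numa_alloc_onnode(b_sz, b_node);    /* large chunk, ~L3 size */

        for (i = 0; i < iters; i++) {
                memset(a, i, a_sz);             /* 1 write to chunk A ... */
                for (j = 0; j < 40; j++)        /* ... then 40 writes to chunk B */
                        memset(b, j, b_sz);
        }

        numa_free(a, a_sz);
        numa_free(b, b_sz);
}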

On the nearest node, I used a 4MB footprint (the same size as the L3
cache), and varied the footprint emulating the per-cpu section from
2K -> 4K -> 8K. For 2K and 4K, the perf results cannot tell the
difference reliably, because the difference is smaller than the
variance. For 8K there is a 1.8% improvement, and the larger the
footprint, the higher the improvement in performance. But 8K would mean
that a module allocates 4K per cpu in the section, which does not happen
in practice.

So the changes may not be needed.

Regards,
Pingfan

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2019-03-05 12:38 UTC | newest]

Thread overview: 29+ messages
2019-02-24 12:34 [PATCH 0/6] make memblock allocator utilize the node's fallback info Pingfan Liu
2019-02-24 12:34 ` [PATCH 1/6] mm/numa: extract the code of building node fall back list Pingfan Liu
2019-02-24 12:34 ` [PATCH 2/6] mm/memblock: make full utilization of numa info Pingfan Liu
2019-02-25  7:07   ` kbuild test robot
2019-02-25  7:59   ` kbuild test robot
2019-02-25 15:34   ` Dave Hansen
2019-02-26  5:40     ` Pingfan Liu
2019-02-26  5:40       ` Pingfan Liu
2019-02-26 12:37       ` Dave Hansen
2019-02-26 11:58   ` Mike Rapoport
2019-02-27  9:23     ` Pingfan Liu
2019-02-27  9:23       ` Pingfan Liu
2019-02-24 12:34 ` [PATCH 3/6] x86/numa: define numa_init_array() conditional on CONFIG_NUMA Pingfan Liu
2019-02-25 15:23   ` Dave Hansen
2019-02-26  5:40     ` Pingfan Liu
2019-02-26  5:40       ` Pingfan Liu
2019-02-24 12:34 ` [PATCH 4/6] x86/numa: concentrate the code of setting cpu to node map Pingfan Liu
2019-02-25 15:25   ` Dave Hansen
2019-02-24 12:34 ` [PATCH 5/6] x86/numa: push forward the setup of node to cpumask map Pingfan Liu
2019-02-25 15:30   ` Dave Hansen
2019-02-26  5:40     ` Pingfan Liu
2019-02-26  5:40       ` Pingfan Liu
2019-02-24 12:34 ` [PATCH 6/6] x86/numa: build node fallback info after setting up " Pingfan Liu
2019-02-25 16:03 ` [PATCH 0/6] make memblock allocator utilize the node's fallback info Michal Hocko
2019-02-26  5:47   ` Pingfan Liu
2019-02-26  5:47     ` Pingfan Liu
2019-02-26 12:09     ` Michal Hocko
2019-03-05 12:37       ` Pingfan Liu
2019-03-05 12:37         ` Pingfan Liu
