linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH part2 0/4] Allow allocating pagetable on local node in movablemem_map.
@ 2013-03-21  9:21 Tang Chen
  2013-03-21  9:21 ` [PATCH part2 1/4] x86, mm, numa, acpi: Introduce numa_meminfo_all to store all the numa meminfo Tang Chen
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Tang Chen @ 2013-03-21  9:21 UTC
  To: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst
  Cc: x86, linux-doc, linux-kernel, linux-mm

Hi Yinghai, all,

This patch-set is based on Yinghai's tree:
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm

For mainline, we need to apply Yinghai's
"x86, ACPI, numa: Parse numa info early" patch-set first.
Please refer to:
v1: https://lkml.org/lkml/2013/3/7/642
v2: https://lkml.org/lkml/2013/3/10/47


In this part2 patch-set, we did the following things:
1) Introduce a "bool hotpluggable" member into struct numa_memblk so that we are
   able to know which memory ranges in numa_meminfo are hotpluggable.
   All the related APIs have been changed accordingly.
2) Introduce a new global variable "numa_meminfo_all" to store all the memory ranges
   recorded in SRAT, because numa_cleanup_meminfo() removes ranges higher than
   max_pfn.
   We need the full numa memory info to limit zone_movable_pfn[].
3) Move movablemem_map sanitization to after memory mapping initialization, so that
   pagetable allocation is not limited by movablemem_map.
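
Concretely, the new member and the changed prototype (both from patch 2/4
below) look like this:

	struct numa_memblk {
		u64	start;
		u64	end;
		int	nid;
		bool	hotpluggable;	/* new */
	};

	int __init numa_add_memblk(int nid, u64 start, u64 end,
				   bool hotpluggable);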


On the other hand, we may have another way to solve this problem:

Not only pagetable and vmemmap pages, but all data whose life cycle is the
same as the node's, could be put on the local node.

1) Introduce a flag into memblock, such as "LOCAL_NODE_DATA", to mark out which
   ranges have the same life cycle as the node (see the sketch below).
2) Only keep existing memory ranges in movablemem_map (no need to introduce
   numa_meminfo_all), and exclude these LOCAL_NODE_DATA ranges.
3) When hot-removing, we are able to find these ranges and free them first.
   This is very important.

The hot-add logic needs to be modified as well. As Yinghai mentioned before, I think
we can bring memblock alive when memory is hot-added, and go through the same
logic as at boot time.
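
To make the idea concrete, a rough sketch (all names below are invented
for discussion -- this is not existing memblock API, and it assumes a new
"flags" field in struct memblock_region):

	/* Hypothetical: range has the same life cycle as its node. */
	#define MEMBLOCK_LOCAL_NODE_DATA	0x1

	/* At boot, after allocating pagetable/vmemmap pages locally: */
	memblock_reserve(base, size);
	memblock_mark_local_node_data(base, size);	/* hypothetical */

	/* At hot-remove time, find the marked ranges and free them first: */
	for_each_memblock(reserved, reg) {
		if (reg->flags & MEMBLOCK_LOCAL_NODE_DATA)
			free_local_node_data(reg);	/* hypothetical */
	}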

What do you think?


Tang Chen (4):
  x86, mm, numa, acpi: Introduce numa_meminfo_all to store all the numa
    meminfo.
  x86, mm, numa, acpi: Introduce hotplug info into struct numa_meminfo.
  x86, mm, numa, acpi: Consider hotplug info when cleanup numa_meminfo.
  x86, mm, numa, acpi: Sanitize movablemem_map after memory mapping
    initialized.

 arch/x86/include/asm/numa.h     |    3 +-
 arch/x86/kernel/apic/numaq_32.c |    2 +-
 arch/x86/mm/amdtopology.c       |    3 +-
 arch/x86/mm/numa.c              |  161 +++++++++++++++++++++++++++++++++++++--
 arch/x86/mm/numa_internal.h     |    1 +
 arch/x86/mm/srat.c              |  141 +++++-----------------------------
 6 files changed, 178 insertions(+), 133 deletions(-)



* [PATCH part2 1/4] x86, mm, numa, acpi: Introduce numa_meminfo_all to store all the numa meminfo.
  2013-03-21  9:21 [RFC PATCH part2 0/4] Allow allocating pagetable on local node in movablemem_map Tang Chen
@ 2013-03-21  9:21 ` Tang Chen
  2013-03-21  9:21 ` [PATCH part2 2/4] x86, mm, numa, acpi: Introduce hotplug info into struct numa_meminfo Tang Chen
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Tang Chen @ 2013-03-21  9:21 UTC
  To: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst
  Cc: x86, linux-doc, linux-kernel, linux-mm

Yinghai has been trying to allocate pagetables and vmemmap pages on the
local node. If we limit memblock allocation with movablemem_map.map[], we
have to exclude the pagetables and vmemmap pages from that limit.

So we need the following sequence:
1) Parse SRAT, store numa_meminfo.
2) Initialize the memory mapping, allocating pagetables and vmemmap pages on
   the local node, and reserve this memory with memblock.
3) Sanitize movablemem_map.map[], excluding the pagetables and vmemmap pages.

When parsing SRAT, we add memory ranges to numa_meminfo. But
numa_cleanup_meminfo() then removes all the unused memory (anything outside
[0, max_pfn)) from numa_meminfo:

         const u64 low = 0;
         const u64 high = PFN_PHYS(max_pfn);

         /* first, trim all entries */
         for (i = 0; i < mi->nr_blks; i++) {
                 struct numa_memblk *bi = &mi->blk[i];

                 /* make sure all blocks are inside the limits */
                 bi->start = max(bi->start, low);
                 bi->end = min(bi->end, high);

                 /* and there's no empty block */
                 if (bi->start >= bi->end)
                         numa_remove_memblk_from(i--, mi);
         }

So after cleanup, numa_meminfo no longer holds the whole memory info.

In order to sanitize movablemem_map.map[] after memory mapping initialization,
we need the whole SRAT info.

So this patch introduces a global variable, numa_meminfo_all, to store the
whole numa memory info.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/numa.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4f754e6..4cf3b49 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -28,12 +28,20 @@ nodemask_t numa_nodes_parsed __initdata;
 struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
 
+/* e820 mapped memory info */
 static struct numa_meminfo numa_meminfo
 #ifndef CONFIG_MEMORY_HOTPLUG
 __initdata
 #endif
 ;
 
+/* All memory info */
+static struct numa_meminfo numa_meminfo_all
+#ifndef CONFIG_MEMORY_HOTPLUG
+__initdata
+#endif
+;
+
 static int numa_distance_cnt;
 static u8 *numa_distance;
 
@@ -599,10 +607,15 @@ static int __init numa_init(int (*init_func)(void))
 
 	nodes_clear(numa_nodes_parsed);
 	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
+	memset(&numa_meminfo_all, 0, sizeof(numa_meminfo_all));
 
 	ret = init_func();
 	if (ret < 0)
 		return ret;
+
+	/* Store the whole memory info before cleaning up numa_meminfo. */
+	memcpy(&numa_meminfo_all, &numa_meminfo, sizeof(numa_meminfo));
+
 	ret = numa_cleanup_meminfo(&numa_meminfo);
 	if (ret < 0)
 		return ret;
-- 
1.7.1



* [PATCH part2 2/4] x86, mm, numa, acpi: Introduce hotplug info into struct numa_meminfo.
  2013-03-21  9:21 [RFC PATCH part2 0/4] Allow allocating pagetable on local node in movablemem_map Tang Chen
  2013-03-21  9:21 ` [PATCH part2 1/4] x86, mm, numa, acpi: Introduce numa_meminfo_all to store all the numa meminfo Tang Chen
@ 2013-03-21  9:21 ` Tang Chen
  2013-03-21  9:21 ` [PATCH part2 3/4] x86, mm, numa, acpi: Consider hotplug info when cleanup numa_meminfo Tang Chen
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Tang Chen @ 2013-03-21  9:21 UTC
  To: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst
  Cc: x86, linux-doc, linux-kernel, linux-mm

Since we use struct numa_meminfo to store SRAT info and to sanitize
movablemem_map.map[], we need hotplug info in struct numa_meminfo.

This patch introduces a "bool hotpluggable" member into struct
numa_meminfo.

It also changes the prototypes of the following APIs to support it:
   - numa_add_memblk()
   - numa_add_memblk_to()

And it updates the following callers:
   - numaq_register_node()
   - dummy_numa_init()
   - amd_numa_init()
   - acpi_numa_memory_affinity_init() in x86

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/include/asm/numa.h     |    3 ++-
 arch/x86/kernel/apic/numaq_32.c |    2 +-
 arch/x86/mm/amdtopology.c       |    3 ++-
 arch/x86/mm/numa.c              |   10 +++++++---
 arch/x86/mm/numa_internal.h     |    1 +
 arch/x86/mm/srat.c              |    2 +-
 6 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 1b99ee5..73096b2 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -31,7 +31,8 @@ extern int numa_off;
 extern s16 __apicid_to_node[MAX_LOCAL_APIC];
 extern nodemask_t numa_nodes_parsed __initdata;
 
-extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
+extern int __init numa_add_memblk(int nodeid, u64 start, u64 end,
+				  bool hotpluggable);
 extern void __init numa_set_distance(int from, int to, int distance);
 
 static inline void set_apicid_to_node(int apicid, s16 node)
diff --git a/arch/x86/kernel/apic/numaq_32.c b/arch/x86/kernel/apic/numaq_32.c
index d661ee9..7a9c542 100644
--- a/arch/x86/kernel/apic/numaq_32.c
+++ b/arch/x86/kernel/apic/numaq_32.c
@@ -82,7 +82,7 @@ static inline void numaq_register_node(int node, struct sys_cfg_data *scd)
 	int ret;
 
 	node_set(node, numa_nodes_parsed);
-	ret = numa_add_memblk(node, start, end);
+	ret = numa_add_memblk(node, start, end, false);
 	BUG_ON(ret < 0);
 }
 
diff --git a/arch/x86/mm/amdtopology.c b/arch/x86/mm/amdtopology.c
index 5247d01..d521471 100644
--- a/arch/x86/mm/amdtopology.c
+++ b/arch/x86/mm/amdtopology.c
@@ -167,7 +167,8 @@ int __init amd_numa_init(void)
 			nodeid, base, limit);
 
 		prevbase = base;
-		numa_add_memblk(nodeid, base, limit);
+		/* Memory hotplug is not supported on AMD CPUs. */
+		numa_add_memblk(nodeid, base, limit, false);
 		node_set(nodeid, numa_nodes_parsed);
 	}
 
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4cf3b49..5f98bb5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -142,6 +142,7 @@ void __init setup_node_to_cpumask_map(void)
 }
 
 static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
+				     bool hotpluggable,
 				     struct numa_meminfo *mi)
 {
 	/* ignore zero length blks */
@@ -163,6 +164,7 @@ static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
 	mi->blk[mi->nr_blks].start = start;
 	mi->blk[mi->nr_blks].end = end;
 	mi->blk[mi->nr_blks].nid = nid;
+	mi->blk[mi->nr_blks].hotpluggable = hotpluggable;
 	mi->nr_blks++;
 	return 0;
 }
@@ -187,15 +189,17 @@ void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
  * @nid: NUMA node ID of the new memblk
  * @start: Start address of the new memblk
  * @end: End address of the new memblk
+ * @hotpluggable: True if memblk is hotpluggable
  *
  * Add a new memblk to the default numa_meminfo.
  *
  * RETURNS:
  * 0 on success, -errno on failure.
  */
-int __init numa_add_memblk(int nid, u64 start, u64 end)
+int __init numa_add_memblk(int nid, u64 start, u64 end,
+			   bool hotpluggable)
 {
-	return numa_add_memblk_to(nid, start, end, &numa_meminfo);
+	return numa_add_memblk_to(nid, start, end, hotpluggable, &numa_meminfo);
 }
 
 /* Initialize NODE_DATA for a node on the local memory */
@@ -644,7 +648,7 @@ static int __init dummy_numa_init(void)
 	       0LLU, PFN_PHYS(max_pfn) - 1);
 
 	node_set(0, numa_nodes_parsed);
-	numa_add_memblk(0, 0, PFN_PHYS(max_pfn));
+	numa_add_memblk(0, 0, PFN_PHYS(max_pfn), false);
 
 	return 0;
 }
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index bb2fbcc..1ce4e6b 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -8,6 +8,7 @@ struct numa_memblk {
 	u64			start;
 	u64			end;
 	int			nid;
+	bool			hotpluggable;
 };
 
 struct numa_meminfo {
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 4f443de..76c2eb4 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -290,7 +290,7 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 		goto out_err_bad_srat;
 	}
 
-	if (numa_add_memblk(node, start, end) < 0)
+	if (numa_add_memblk(node, start, end, hotpluggable) < 0)
 		goto out_err_bad_srat;
 
 	node_set(node, numa_nodes_parsed);
-- 
1.7.1



* [PATCH part2 3/4] x86, mm, numa, acpi: Consider hotplug info when cleanup numa_meminfo.
  2013-03-21  9:21 [RFC PATCH part2 0/4] Allow allocating pagetable on local node in movablemem_map Tang Chen
  2013-03-21  9:21 ` [PATCH part2 1/4] x86, mm, numa, acpi: Introduce numa_meminfo_all to store all the numa meminfo Tang Chen
  2013-03-21  9:21 ` [PATCH part2 2/4] x86, mm, numa, acpi: Introduce hotplug info into struct numa_meminfo Tang Chen
@ 2013-03-21  9:21 ` Tang Chen
  2013-03-21  9:21 ` [PATCH part2 4/4] x86, mm, numa, acpi: Sanitize movablemem_map after memory mapping initialized Tang Chen
  2013-03-27  1:43 ` [RFC PATCH part2 0/4] Allow allocating pagetable on local node in movablemem_map Tang Chen
  4 siblings, 0 replies; 6+ messages in thread
From: Tang Chen @ 2013-03-21  9:21 UTC
  To: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst
  Cc: x86, linux-doc, linux-kernel, linux-mm

Since we have introduced hotplug info into struct numa_meminfo, we need
to consider it when cleaning up numa_meminfo.

The original logic in numa_cleanup_meminfo() is:
merge blocks on the same node if the holes between them don't overlap
with memory on other nodes.

This patch changes the logic to:
merge blocks only if they are on the same node and have the same
hotpluggable flag, and the holes between them don't overlap with memory
on other nodes.
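
For example (illustration only, not part of the patch):

	blk[0] = { .start = 0,          .end = 4ULL << 30,
		   .nid = 0, .hotpluggable = false };
	blk[1] = { .start = 4ULL << 30, .end = 8ULL << 30,
		   .nid = 0, .hotpluggable = true };

The old logic would merge these into a single block [0, 8G) on node 0,
losing the hotplug attribute of the second half. With this patch, the two
blocks stay separate because their hotpluggable flags differ.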

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/numa.c |   13 +++++++++----
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 5f98bb5..0c3a278 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -304,18 +304,22 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
 			}
 
 			/*
-			 * Join together blocks on the same node, holes
-			 * between which don't overlap with memory on other
-			 * nodes.
+			 * Join together blocks on the same node with the same
+			 * hotpluggable flag, holes between which don't overlap
+			 * with memory on other nodes.
 			 */
 			if (bi->nid != bj->nid)
 				continue;
+			if (bi->hotpluggable != bj->hotpluggable)
+				continue;
+
 			start = min(bi->start, bj->start);
 			end = max(bi->end, bj->end);
 			for (k = 0; k < mi->nr_blks; k++) {
 				struct numa_memblk *bk = &mi->blk[k];
 
-				if (bi->nid == bk->nid)
+				if (bi->nid == bk->nid &&
+				    bi->hotpluggable == bk->hotpluggable)
 					continue;
 				if (start < bk->end && end > bk->start)
 					break;
@@ -335,6 +339,7 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
 	for (i = mi->nr_blks; i < ARRAY_SIZE(mi->blk); i++) {
 		mi->blk[i].start = mi->blk[i].end = 0;
 		mi->blk[i].nid = NUMA_NO_NODE;
+		mi->blk[i].hotpluggable = false;
 	}
 
 	return 0;
-- 
1.7.1



* [PATCH part2 4/4] x86, mm, numa, acpi: Sanitize movablemem_map after memory mapping initialized.
  2013-03-21  9:21 [RFC PATCH part2 0/4] Allow allocating pagetable on local node in movablemem_map Tang Chen
                   ` (2 preceding siblings ...)
  2013-03-21  9:21 ` [PATCH part2 3/4] x86, mm, numa, acpi: Consider hotplug info when cleanup numa_meminfo Tang Chen
@ 2013-03-21  9:21 ` Tang Chen
  2013-03-27  1:43 ` [RFC PATCH part2 0/4] Allow allocating pagetable on local node in movablemem_map Tang Chen
  4 siblings, 0 replies; 6+ messages in thread
From: Tang Chen @ 2013-03-21  9:21 UTC
  To: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst
  Cc: x86, linux-doc, linux-kernel, linux-mm

In order to support allocating pagetable and vmemmap pages on the local node,
we should first initialize the memory mapping without any memblock limitation,
use memblock to reserve the pagetable and vmemmap pages on the local node, and
only then sanitize movablemem_map.map[] to limit memblock.

In this way, we can keep allocations out of the movable area while the
pagetable and vmemmap pages (used by the kernel) still live on the local node.
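
The resulting boot ordering is, roughly (a simplified sketch of the
early_initmem_init() hunk below; step 2 is unchanged context and is
therefore not visible in the diff):

	void __init early_initmem_init(void)
	{
		/* 1) Parse SRAT, fill numa_meminfo and numa_meminfo_all. */
		early_x86_numa_init();

		/*
		 * 2) Initialize the memory mapping: pagetable and vmemmap
		 *    pages are allocated on the local node and reserved in
		 *    memblock, with no movablemem_map limitation yet.
		 */

		/* 3) Only now limit memblock with movablemem_map.map[]. */
		sanitize_movablemem_map();
	}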

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/numa.c |  125 ++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/mm/srat.c |  139 ++++++---------------------------------------------
 2 files changed, 142 insertions(+), 122 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 0c3a278..d0b9c5a 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -738,6 +738,129 @@ static void __init early_x86_numa_init_mapping(void)
 }
 #endif
 
+#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
+static void __init movablemem_map_handle_srat(struct numa_memblk mb)
+{
+	unsigned long start_pfn = PFN_DOWN(mb.start);
+	unsigned long end_pfn = PFN_UP(mb.end);
+	int nid = mb.nid;
+	bool hotpluggable = mb.hotpluggable;
+
+	/*
+	 * For movablemem_map=acpi:
+	 *
+	 * SRAT:                |_____| |_____| |_________| |_________| ......
+	 * node id:                0       1         1           2
+	 * hotpluggable:           n       y         y           n
+	 * movablemem_map:              |_____| |_________|
+	 *
+	 * Using movablemem_map, we can prevent memblock from allocating memory
+	 * on ZONE_MOVABLE at boot time.
+	 *
+	 * Before parsing SRAT, memblock has already reserved some memory
+	 * ranges for other purposes, such as the kernel image. We cannot
+	 * prevent the kernel from using this memory. Furthermore, if all
+	 * the memory is hotpluggable, the system won't have enough memory to
+	 * boot. So we always treat the nodes the kernel resides in as non-movable
+	 * by not calling this function for them in sanitize_movablemem_map().
+	 *
+	 * Known problem: We now allocate pagetable and vmemmap pages on the
+	 * local node and reserve them in memblock. But we cannot tell these
+	 * pages from other reserved memory, such as the kernel image.
+	 * Fortunately, reserved memory is never released to the buddy system,
+	 * so it won't impact the ZONE_MOVABLE limitation.
+	 */
+	if (!hotpluggable)
+		return;
+
+	/* If the range is hotpluggable, insert it into movablemem_map. */
+	insert_movablemem_map(start_pfn, end_pfn);
+
+	if (zone_movable_limit[nid])
+		zone_movable_limit[nid] = min(zone_movable_limit[nid],
+					      start_pfn);
+	else
+		zone_movable_limit[nid] = start_pfn;
+}
+
+static void __init movablemem_map_handle_user(struct numa_memblk mb)
+{
+	int overlap;
+	unsigned long start_pfn = PFN_DOWN(mb.start);
+	unsigned long end_pfn = PFN_UP(mb.end);
+	int nid = mb.nid;
+
+	/*
+	 * For movablemem_map=nn[KMG]@ss[KMG]:
+	 *
+	 * SRAT:                |_____| |_____| |_________| |_________| ......
+	 * node id:                0       1         1           2
+	 * user specified:                |__|                 |___|
+	 * movablemem_map:                |___| |_________|    |______| ......
+	 *
+	 * Using movablemem_map, we can prevent memblock from allocating memory
+	 * on ZONE_MOVABLE at boot time.
+	 *
+	 * NOTE: In this case, SRAT info will be ignored. Even if a memory
+	 * range is not hotpluggable in SRAT, it will be inserted into
+	 * movablemem_map. This is useful if the firmware is buggy.
+	 */
+	overlap = movablemem_map_overlap(start_pfn, end_pfn);
+	if (overlap >= 0) {
+		/*
+		 * If this range overlaps with movablemem_map, then update
+		 * zone_movable_limit[nid] if it has lower start pfn.
+		 */
+		start_pfn = max(start_pfn,
+				movablemem_map.map[overlap].start_pfn);
+
+		if (!zone_movable_limit[nid] ||
+		    zone_movable_limit[nid] > start_pfn)
+			zone_movable_limit[nid] = start_pfn;
+
+		/* Insert the higher part of the overlapped range. */
+		if (movablemem_map.map[overlap].end_pfn < end_pfn)
+			insert_movablemem_map(start_pfn, end_pfn);
+	} else {
+		/*
+		 * If this is a range higher than zone_movable_limit[nid],
+		 * insert it to movablemem_map because all ranges higher than
+		 * zone_movable_limit[nid] on this node will be ZONE_MOVABLE.
+		 */
+		if (zone_movable_limit[nid] &&
+		    start_pfn > zone_movable_limit[nid])
+			insert_movablemem_map(start_pfn, end_pfn);
+	}
+}
+
+static void __init sanitize_movablemem_map(void)
+{
+	int i;
+
+	if (movablemem_map.acpi) {
+		for (i = 0; i < numa_meminfo_all.nr_blks; i++) {
+			/*
+			 * In order to ensure the kernel has enough memory to
+			 * boot, we always set the node which the kernel
+			 * resides in as unhotpluggable.
+			 */
+			if (node_isset(numa_meminfo_all.blk[i].nid,
+					movablemem_map.numa_nodes_kernel))
+				continue;
+
+			movablemem_map_handle_srat(numa_meminfo_all.blk[i]);
+		}
+	} else {
+		for (i = 0; i < numa_meminfo_all.nr_blks; i++)
+			movablemem_map_handle_user(numa_meminfo_all.blk[i]);
+	}
+}
+#else		/* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+static inline void sanitize_movablemem_map(void)
+{
+}
+#endif		/* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+
 void __init early_initmem_init(void)
 {
 	early_x86_numa_init();
@@ -747,6 +870,8 @@ void __init early_initmem_init(void)
 	load_cr3(swapper_pg_dir);
 	__flush_tlb_all();
 
+	sanitize_movablemem_map();
+
 	early_memtest(0, max_pfn_mapped<<PAGE_SHIFT);
 }
 
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 76c2eb4..2c1f9a6 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -141,132 +141,14 @@ static inline int save_add_info(void) {return 1;}
 static inline int save_add_info(void) {return 0;}
 #endif
 
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-static void __init sanitize_movablemem_map(int nid, u64 start, u64 end,
-					   bool hotpluggable)
-{
-	int overlap, i;
-	unsigned long start_pfn, end_pfn;
-
-	start_pfn = PFN_DOWN(start);
-	end_pfn = PFN_UP(end);
-
-	/*
-	 * For movablemem_map=acpi:
-	 *
-	 * SRAT:                |_____| |_____| |_________| |_________| ......
-	 * node id:                0       1         1           2
-	 * hotpluggable:           n       y         y           n
-	 * movablemem_map:              |_____| |_________|
-	 *
-	 * Using movablemem_map, we can prevent memblock from allocating memory
-	 * on ZONE_MOVABLE at boot time.
-	 *
-	 * Before parsing SRAT, memblock has already reserve some memory ranges
-	 * for other purposes, such as for kernel image. We cannot prevent
-	 * kernel from using these memory, so we need to exclude these memory
-	 * even if it is hotpluggable.
-	 * Furthermore, to ensure the kernel has enough memory to boot, we make
-	 * all the memory on the node which the kernel resides in should be
-	 * un-hotpluggable.
-	 */
-	if (hotpluggable && movablemem_map.acpi) {
-		/* Exclude ranges reserved by memblock. */
-		struct memblock_type *rgn = &memblock.reserved;
-
-		for (i = 0; i < rgn->cnt; i++) {
-			if (end <= rgn->regions[i].base ||
-			    start >= rgn->regions[i].base +
-			    rgn->regions[i].size)
-				continue;
-
-			/*
-			 * If the memory range overlaps the memory reserved by
-			 * memblock, then the kernel resides in this node.
-			 */
-			node_set(nid, movablemem_map.numa_nodes_kernel);
-			zone_movable_limit[nid] = 0;
-
-			return;
-		}
-
-		/*
-		 * If the kernel resides in this node, then the whole node
-		 * should not be hotpluggable.
-		 */
-		if (node_isset(nid, movablemem_map.numa_nodes_kernel)) {
-			zone_movable_limit[nid] = 0;
-			return;
-		}
-
-		/*
-		 * Otherwise, if the range is hotpluggable, and the kernel is
-		 * not on this node, insert it into movablemem_map.
-		 */
-		insert_movablemem_map(start_pfn, end_pfn);
-		if (zone_movable_limit[nid])
-			zone_movable_limit[nid] = min(zone_movable_limit[nid],
-						      start_pfn);
-		else
-			zone_movable_limit[nid] = start_pfn;
-
-		return;
-	}
-
-	/*
-	 * For movablemem_map=nn[KMG]@ss[KMG]:
-	 *
-	 * SRAT:                |_____| |_____| |_________| |_________| ......
-	 * node id:                0       1         1           2
-	 * user specified:                |__|                 |___|
-	 * movablemem_map:                |___| |_________|    |______| ......
-	 *
-	 * Using movablemem_map, we can prevent memblock from allocating memory
-	 * on ZONE_MOVABLE at boot time.
-	 *
-	 * NOTE: In this case, SRAT info will be ingored.
-	 */
-	overlap = movablemem_map_overlap(start_pfn, end_pfn);
-	if (overlap >= 0) {
-		/*
-		 * If this range overlaps with movablemem_map, then update
-		 * zone_movable_limit[nid] if it has lower start pfn.
-		 */
-		start_pfn = max(start_pfn,
-				movablemem_map.map[overlap].start_pfn);
-
-		if (!zone_movable_limit[nid] ||
-		    zone_movable_limit[nid] > start_pfn)
-			zone_movable_limit[nid] = start_pfn;
-
-		/* Insert the higher part of the overlapped range. */
-		if (movablemem_map.map[overlap].end_pfn < end_pfn)
-			insert_movablemem_map(start_pfn, end_pfn);
-	} else {
-		/*
-		 * If this is a range higher than zone_movable_limit[nid],
-		 * insert it to movablemem_map because all ranges higher than
-		 * zone_movable_limit[nid] on this node will be ZONE_MOVABLE.
-		 */
-		if (zone_movable_limit[nid] &&
-		    start_pfn > zone_movable_limit[nid])
-			insert_movablemem_map(start_pfn, end_pfn);
-	}
-}
-#else		/* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
-static inline void sanitize_movablemem_map(int nid, u64 start, u64 end,
-					   bool hotpluggable)
-{
-}
-#endif		/* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
-
 /* Callback for parsing of the Proximity Domain <-> Memory Area mappings */
 int __init
 acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 {
 	u64 start, end;
 	u32 hotpluggable;
-	int node, pxm;
+	int node, pxm, i;
+	struct memblock_type *rgn = &memblock.reserved;
 
 	if (srat_disabled())
 		goto out_err;
@@ -295,14 +177,27 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 
 	node_set(node, numa_nodes_parsed);
 
+	/*
+	 * Data whose life cycle is the same as the node's, such as pagetables,
+	 * can be allocated on the local node by memblock. But at this point,
+	 * none of it has been initialized yet. So the kernel resides in the
+	 * nodes on which memblock has already reserved memory.
+	 */
+	for (i = 0; i < rgn->cnt; i++) {
+		if (end <= rgn->regions[i].base ||
+		    start >= rgn->regions[i].base + rgn->regions[i].size)
+			continue;
+
+		node_set(node, movablemem_map.numa_nodes_kernel);
+	}
+
 	printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx] %s\n",
 	       node, pxm,
 	       (unsigned long long) start, (unsigned long long) end - 1,
 	       hotpluggable ? "Hot Pluggable" : "");
 
-	sanitize_movablemem_map(node, start, end, hotpluggable);
-
 	return 0;
+
 out_err_bad_srat:
 	bad_srat();
 out_err:
-- 
1.7.1



* Re: [RFC PATCH part2 0/4] Allow allocating pagetable on local node in movablemem_map.
  2013-03-21  9:21 [RFC PATCH part2 0/4] Allow allocating pagetable on local node in movablemem_map Tang Chen
                   ` (3 preceding siblings ...)
  2013-03-21  9:21 ` [PATCH part2 4/4] x86, mm, numa, acpi: Sanitize movablemem_map after memory mapping initialized Tang Chen
@ 2013-03-27  1:43 ` Tang Chen
  4 siblings, 0 replies; 6+ messages in thread
From: Tang Chen @ 2013-03-27  1:43 UTC
  To: rob, tglx, mingo, hpa, yinghai, akpm, wency, trenn, liwanp,
	mgorman, walken, riel, khlebnikov, tj, minchan, m.szyprowski,
	mina86, laijs, isimatu.yasuaki, linfeng, jiang.liu,
	kosaki.motohiro, guz.fnst
  Cc: x86, linux-doc, linux-kernel, linux-mm

Hi Yinghai,

Would you please help review this patch-set?

And what do you think of the memblock flag idea?

FYI, Liu Jiang has proposed a similar idea before.
https://lkml.org/lkml/2012/12/6/422

But we may have the following differences:
1) It is a flag, not a tag, which means a range may carry several
    different attributes (see the example below).
2) Mark node-life-cycle data, put it on the local node, and free
    it when hot-removing.
3) Mark and reserve movable memory, as you did.
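
For point 1, a flag means a single range could carry several attributes
at once, e.g. (flag names invented for illustration):

	reg->flags |= MEMBLOCK_HOTPLUG | MEMBLOCK_LOCAL_NODE_DATA;

whereas a tag can only classify a range as one thing.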

Thanks. :)

On 03/21/2013 05:21 PM, Tang Chen wrote:
> Hi Yinghai, all,
>
> This patch-set is based on Yinghai's tree:
> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm
>
> For main line, we need to apply Yinghai's
> "x86, ACPI, numa: Parse numa info early" patch-set first.
> Please refer to:
> v1: https://lkml.org/lkml/2013/3/7/642
> v2: https://lkml.org/lkml/2013/3/10/47
>
>
> In this part2 patch-set, we did the following things:
> 1) Introduce a "bool hotpluggable" member into struct numa_memblk so that we are
>     able to know which memory ranges in numa_meminfo are hotpluggable.
>     All the related APIs have been changed.
> 2) Introduce a new global variable "numa_meminfo_all" to store all the memory ranges
>     recorded in SRAT, because numa_cleanup_meminfo() will remove ranges higher than
>     max_pfn.
>     We need full numa memory info to limit zone_movable_pfn[].
> 3) Move movablemem_map sanitization after memory mapping is initialized so that
>     pagetable allocation will not be limited by movablemem_map.
>
>
> On the other hand, we may have another way to solve this problem:
>
> Not only pagetable and vmemmap pages, but also all the data whose life cycle is the
> same as a node, could be put on local node.
>
> 1) Introduce a flag into memblock, such as "LOCAL_NODE_DATA", to mark out which
>     ranges have the same life cycle as the node.
> 2) Only keep existing memory ranges in movablemem_map (no need to introduce
>     numa_meminfo_all), and exclude these LOCAL_NODE_DATA ranges.
> 3) When hot-removing, we are able to find out these ranges, and free them first.
>     This is very important.
>
> Also, hot-add logic needs to be modified, too. As Yinghai mentioned before, I think
> we can make memblock alive when memory is hot-added. And go with the same logic
> as it is when booting.
>
> What do you think?
>
>
> Tang Chen (4):
>    x86, mm, numa, acpi: Introduce numa_meminfo_all to store all the numa
>      meminfo.
>    x86, mm, numa, acpi: Introduce hotplug info into struct numa_meminfo.
>    x86, mm, numa, acpi: Consider hotplug info when cleanup numa_meminfo.
>    x86, mm, numa, acpi: Sanitize movablemem_map after memory mapping
>      initialized.
>
>   arch/x86/include/asm/numa.h     |    3 +-
>   arch/x86/kernel/apic/numaq_32.c |    2 +-
>   arch/x86/mm/amdtopology.c       |    3 +-
>   arch/x86/mm/numa.c              |  161 +++++++++++++++++++++++++++++++++++++--
>   arch/x86/mm/numa_internal.h     |    1 +
>   arch/x86/mm/srat.c              |  141 +++++-----------------------------
>   6 files changed, 178 insertions(+), 133 deletions(-)
>

