* [PATCH v2 0/5] Add movablecore_map boot option
@ 2012-11-23 10:44 ` Tang Chen
  0 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-11-23 10:44 UTC (permalink / raw)
  To: hpa, akpm, rob, isimatu.yasuaki, tangchen, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty
  Cc: linux-kernel, linux-mm, linux-doc

[What we are doing]
This patchset provides a boot option that lets users specify the
ZONE_MOVABLE memory map for each node in the system.

movablecore_map=nn[KMG]@ss[KMG]

This option ensures that the memory range from ss to ss+nn is movable memory.


[Why we do this]
If we hot-remove memory, the removed range cannot contain kernel
memory, because Linux currently cannot migrate kernel memory.
Therefore, we have to guarantee that the hot-removed memory contains
only movable memory.

Linux has two boot options, kernelcore= and movablecore=, for
creating movable memory. These boot options specify the amount of
memory to use as kernel or movable memory. Using them, we can
create a ZONE_MOVABLE which has only movable memory.

But they do not fulfill the requirements of memory hot-remove,
because even with these boot options, movable memory is distributed
evenly across the nodes. So when we want to hot-remove memory in the
range 0x80000000-0xc0000000, we have no way to designate that
specific range as movable memory.

So we propose a new feature that specifies a memory range to use as
movable memory.


[Ways to do this]
There are two possible ways to specify movable memory.
 1. use firmware information
 2. use boot option

1. use firmware information
  According to the ACPI 5.0 spec, the SRAT table has a Memory
  Affinity Structure, and the structure has a Hot Pluggable field.
  See "5.2.16.2 Memory Affinity Structure". Using this information,
  we might be able to let firmware specify movable memory: for
  example, if the Hot Pluggable field is set, Linux treats the
  memory as movable.

2. use boot option
  This is our proposal. A new boot option can specify a memory range
  to use as movable memory.


[How we do this]
We chose the second way because, with the first, users cannot easily
change the memory range used as movable memory. We think that
creating movable memory may cause a NUMA-related performance
regression; in that case, with a boot option the user can easily
turn the feature off. A boot option also lets the user easily select
which memory to use as movable memory.


[How to use]
Specify the following boot option:
movablecore_map=nn[KMG]@ss[KMG]

This means the physical address range from ss to ss+nn will be
allocated as ZONE_MOVABLE.

The following points should be considered:

1) If the range falls within a single node, then from ss to the end of
   the node will be ZONE_MOVABLE.
2) If the range covers two or more nodes, then from ss to the end of
   the first node will be ZONE_MOVABLE, and all the other covered
   nodes will have only ZONE_MOVABLE.
3) If no range is in the node, then the node will have no ZONE_MOVABLE
   unless kernelcore or movablecore is specified.
4) This option can be specified at most MAX_NUMNODES times.
5) If kernelcore or movablecore is also specified, movablecore_map
   takes higher priority and is satisfied first.
6) This option does not conflict with the memmap option.
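For illustration, a kernel command line using the option might look like
this (the sizes and addresses below are made-up examples, not a
recommendation; any nn@ss pairs with K/M/G suffixes work):

```
# Hypothetical example: mark 4G of memory starting at physical
# address 8G, plus 2G starting at 20G, as ZONE_MOVABLE.
linux /boot/vmlinuz root=/dev/sda1 \
      movablecore_map=4G@8G movablecore_map=2G@20G
```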



Tang Chen (4):
  page_alloc: add movable_memmap kernel parameter
  page_alloc: Introduce zone_movable_limit[] to keep movable limit for
    nodes
  page_alloc: Make movablecore_map has higher priority
  page_alloc: Bootmem limit with movablecore_map

Yasuaki Ishimatsu (1):
  x86: get pg_data_t's memory from other node

 Documentation/kernel-parameters.txt |   17 +++
 arch/x86/mm/numa.c                  |   11 ++-
 include/linux/memblock.h            |    1 +
 include/linux/mm.h                  |   11 ++
 mm/memblock.c                       |   15 +++-
 mm/page_alloc.c                     |  216 ++++++++++++++++++++++++++++++++++-
 6 files changed, 263 insertions(+), 8 deletions(-)


* [PATCH v2 1/5] x86: get pg_data_t's memory from other node
  2012-11-23 10:44 ` Tang Chen
@ 2012-11-23 10:44   ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-11-23 10:44 UTC (permalink / raw)
  To: hpa, akpm, rob, isimatu.yasuaki, tangchen, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty
  Cc: linux-kernel, linux-mm, linux-doc

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

If the system creates a movable node, in which all of the node's
memory is allocated as ZONE_MOVABLE, setup_node_data() cannot
allocate the node's pg_data_t from that node. So when
memblock_alloc_nid() fails, setup_node_data() falls back to
memblock_alloc().

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/numa.c |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2d125be..734bbd2 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -224,9 +224,14 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
 	} else {
 		nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
 		if (!nd_pa) {
-			pr_err("Cannot find %zu bytes in node %d\n",
-			       nd_size, nid);
-			return;
+			pr_warn("Cannot find %zu bytes in node %d\n",
+				nd_size, nid);
+			nd_pa = memblock_alloc(nd_size, SMP_CACHE_BYTES);
+			if (!nd_pa) {
+				pr_err("Cannot find %zu bytes in other node\n",
+				       nd_size);
+				return;
+			}
 		}
 		nd = __va(nd_pa);
 	}
-- 
1.7.1



* [PATCH v2 2/5] page_alloc: add movable_memmap kernel parameter
  2012-11-23 10:44 ` Tang Chen
@ 2012-11-23 10:44   ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-11-23 10:44 UTC (permalink / raw)
  To: hpa, akpm, rob, isimatu.yasuaki, tangchen, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty
  Cc: linux-kernel, linux-mm, linux-doc

This patch adds functions to parse the movablecore_map boot option.
Since the option can be specified more than once, all the ranges are
stored in the global movablecore_map.map array.

The array is kept sorted by start_pfn in monotonically increasing
order, and overlapping ranges are merged.
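The sorted insert-and-merge behavior described above can be sketched as a
small standalone C helper (a simplified userspace re-implementation for
illustration; insert_range mirrors the patch's insert_movablecore_map(),
but is not the kernel code itself):

```c
#include <assert.h>
#include <string.h>

#define MAP_MAX 64

struct entry {
	unsigned long start;	/* start pfn of the range */
	unsigned long end;	/* end pfn of the range */
};

struct entry map[MAP_MAX];
int nr_map;

/* Insert [start, end) keeping the array sorted by .start and
 * coalescing any ranges that the new one overlaps. */
void insert_range(unsigned long start, unsigned long end)
{
	int pos, overlap;

	/* pos: first range the new one could touch, or the insertion point. */
	for (pos = 0; pos < nr_map; pos++)
		if (start <= map[pos].end)
			break;

	/* No overlap: shift the tail back one slot and insert. */
	if (pos == nr_map || end < map[pos].start) {
		memmove(&map[pos + 1], &map[pos],
			sizeof(struct entry) * (nr_map - pos));
		map[pos].start = start;
		map[pos].end = end;
		nr_map++;
		return;
	}

	/* overlap: last range overlapped by [start, end). */
	for (overlap = pos + 1; overlap < nr_map; overlap++)
		if (end < map[overlap].start)
			break;
	overlap--;

	/* Merge everything in [pos, overlap] into map[pos]... */
	map[pos].start = start < map[pos].start ? start : map[pos].start;
	map[pos].end = end > map[overlap].end ? end : map[overlap].end;

	/* ...and close the gap left by the merged entries. */
	memmove(&map[pos + 1], &map[overlap + 1],
		sizeof(struct entry) * (nr_map - overlap - 1));
	nr_map -= overlap - pos;
}
```

Inserting [10,20), [40,50), then [15,45) leaves a single merged entry
[10,50), matching the "merge all overlapped ranges" behavior described
above.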

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
---
 Documentation/kernel-parameters.txt |   17 +++++
 include/linux/mm.h                  |   11 +++
 mm/page_alloc.c                     |  126 +++++++++++++++++++++++++++++++++++
 3 files changed, 154 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9776f06..785f878 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1620,6 +1620,23 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			that the amount of memory usable for all allocations
 			is not too small.
 
+	movablecore_map=nn[KMG]@ss[KMG]
+			[KNL,X86,IA-64,PPC] This parameter is similar to
+			memmap except it specifies the memory map of
+			ZONE_MOVABLE.
+			If several areas fall within one node, then from the
+			lowest ss to the end of that node will be ZONE_MOVABLE.
+			If an area covers two or more nodes, the area from
+			ss to the end of the 1st node will be ZONE_MOVABLE,
+			and all the remaining nodes will have only ZONE_MOVABLE.
+			If memmap is specified at the same time, the
+			movablecore_map will be limited within the memmap
+			areas. If kernelcore or movablecore is also specified,
+			movablecore_map will have higher priority to be
+			satisfied. So the administrator should make sure that
+			the total movablecore_map area is not too large;
+			otherwise the kernel won't have enough memory to boot.
+
 	MTD_Partition=	[MTD]
 			Format: <name>,<region-number>,<size>,<offset>
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa06804..647c980 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1328,6 +1328,17 @@ extern void free_bootmem_with_active_regions(int nid,
 						unsigned long max_low_pfn);
 extern void sparse_memory_present_with_active_regions(int nid);
 
+#define MOVABLECORE_MAP_MAX MAX_NUMNODES
+struct movablecore_entry {
+	unsigned long start;    /* start pfn of memory segment */
+	unsigned long end;      /* end pfn of memory segment */
+};
+
+struct movablecore_map {
+	int nr_map;
+	struct movablecore_entry map[MOVABLECORE_MAP_MAX];
+};
+
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 #if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b74de6..fb5cf12 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -198,6 +198,9 @@ static unsigned long __meminitdata nr_all_pages;
 static unsigned long __meminitdata dma_reserve;
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
+/* Movable memory ranges, will also be used by memblock subsystem. */
+struct movablecore_map movablecore_map;
+
 static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
@@ -4986,6 +4989,129 @@ static int __init cmdline_parse_movablecore(char *p)
 early_param("kernelcore", cmdline_parse_kernelcore);
 early_param("movablecore", cmdline_parse_movablecore);
 
+/**
+ * insert_movablecore_map - Insert a memory range into movablecore_map.map.
+ * @start_pfn: start pfn of the range
+ * @end_pfn: end pfn of the range
+ *
+ * This function will also merge the overlapped ranges, and sort the array
+ * by start_pfn in monotonic increasing order.
+ */
+static void __init insert_movablecore_map(unsigned long start_pfn,
+					  unsigned long end_pfn)
+{
+	int pos, overlap;
+
+	/*
+	 * pos will be at the 1st overlapped range, or the position
+	 * where the element should be inserted.
+	 */
+	for (pos = 0; pos < movablecore_map.nr_map; pos++)
+		if (start_pfn <= movablecore_map.map[pos].end)
+			break;
+
+	/* If there is no overlapped range, just insert the element. */
+	if (pos == movablecore_map.nr_map ||
+	    end_pfn < movablecore_map.map[pos].start) {
+		/*
+		 * If pos is not the end of array, we need to move all
+		 * the rest elements backward.
+		 */
+		if (pos < movablecore_map.nr_map)
+			memmove(&movablecore_map.map[pos+1],
+				&movablecore_map.map[pos],
+				sizeof(struct movablecore_entry) *
+				(movablecore_map.nr_map - pos));
+		movablecore_map.map[pos].start = start_pfn;
+		movablecore_map.map[pos].end = end_pfn;
+		movablecore_map.nr_map++;
+		return;
+	}
+
+	/* overlap will be at the last overlapped range */
+	for (overlap = pos + 1; overlap < movablecore_map.nr_map; overlap++)
+		if (end_pfn < movablecore_map.map[overlap].start)
+			break;
+
+	/*
+	 * If there are more ranges overlapped, we need to merge them,
+	 * and move the rest elements forward.
+	 */
+	overlap--;
+	movablecore_map.map[pos].start = min(start_pfn,
+					     movablecore_map.map[pos].start);
+	movablecore_map.map[pos].end = max(end_pfn,
+					     movablecore_map.map[overlap].end);
+
+	if (pos != overlap && overlap + 1 != movablecore_map.nr_map)
+		memmove(&movablecore_map.map[pos+1],
+			&movablecore_map.map[overlap+1],
+			sizeof(struct movablecore_entry) *
+			(movablecore_map.nr_map - overlap - 1));
+
+	movablecore_map.nr_map -= overlap - pos;
+}
+
+/**
+ * movablecore_map_add_region - Add a memory range into movablecore_map.
+ * @start: physical start address of range
+ * @size: size of the range in bytes
+ *
+ * This function transforms the physical addresses into pfns, and then adds
+ * the range into movablecore_map by calling insert_movablecore_map().
+ */
+static void __init movablecore_map_add_region(u64 start, u64 size)
+{
+	unsigned long start_pfn, end_pfn;
+
+	/* In case size == 0 or start + size overflows */
+	if (start + size <= start)
+		return;
+
+	if (movablecore_map.nr_map >= ARRAY_SIZE(movablecore_map.map)) {
+		pr_err("movablecore_map: too many entries;"
+			" ignoring [mem %#010llx-%#010llx]\n",
+			(unsigned long long) start,
+			(unsigned long long) (start + size - 1));
+		return;
+	}
+
+	start_pfn = PFN_DOWN(start);
+	end_pfn = PFN_UP(start + size);
+	insert_movablecore_map(start_pfn, end_pfn);
+}
+
+/*
+ * movablecore_map=nn[KMG]@ss[KMG] sets the region of memory to be used as
+ * movable memory.
+ */
+static int __init cmdline_parse_movablecore_map(char *p)
+{
+	char *oldp;
+	u64 start_at, mem_size;
+
+	if (!p)
+		goto err;
+
+	oldp = p;
+	mem_size = memparse(p, &p);
+	if (p == oldp)
+		goto err;
+
+	if (*p == '@') {
+		oldp = ++p;
+		start_at = memparse(p, &p);
+		if (p == oldp || *p != '\0')
+			goto err;
+
+		movablecore_map_add_region(start_at, mem_size);
+		return 0;
+	}
+err:
+	return -EINVAL;
+}
+early_param("movablecore_map", cmdline_parse_movablecore_map);
+
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 /**
-- 
1.7.1



* [PATCH v2 3/5] page_alloc: Introduce zone_movable_limit[] to keep movable limit for nodes
  2012-11-23 10:44 ` Tang Chen
@ 2012-11-23 10:44   ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-11-23 10:44 UTC (permalink / raw)
  To: hpa, akpm, rob, isimatu.yasuaki, tangchen, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty
  Cc: linux-kernel, linux-mm, linux-doc

This patch introduces a new array, zone_movable_limit[], to store
the ZONE_MOVABLE limit from the movablecore_map boot option for all
nodes. The function sanitize_zone_movable_limit() finds out which
node each range in movablecore_map.map[] belongs to, and calculates
the low boundary of ZONE_MOVABLE for each node.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
---
 mm/page_alloc.c |   55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fb5cf12..f23d76a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -206,6 +206,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static unsigned long __meminitdata zone_movable_limit[MAX_NUMNODES];
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -4323,6 +4324,55 @@ static unsigned long __meminit zone_absent_pages_in_node(int nid,
 	return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
 }
 
+/**
+ * sanitize_zone_movable_limit - Sanitize the zone_movable_limit array.
+ *
+ * zone_movable_limit is initialized as 0. This function will try to get
+ * the first ZONE_MOVABLE pfn of each node from movablecore_map, and
+ * assign them to zone_movable_limit.
+ * zone_movable_limit[nid] == 0 means no limit for the node.
+ *
+ * Note: Each range is represented as [start_pfn, end_pfn)
+ */
+static void __meminit sanitize_zone_movable_limit(void)
+{
+	int map_pos = 0, i, nid;
+	unsigned long start_pfn, end_pfn;
+
+	if (!movablecore_map.nr_map)
+		return;
+
+	/* Iterate all ranges from minimum to maximum */
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
+		/*
+		 * If we have found the lowest ZONE_MOVABLE pfn of the node
+		 * specified by the user, just go on to check the next range.
+		 */
+		if (zone_movable_limit[nid])
+			continue;
+
+		while (map_pos < movablecore_map.nr_map) {
+			if (end_pfn <= movablecore_map.map[map_pos].start)
+				break;
+
+			if (start_pfn >= movablecore_map.map[map_pos].end) {
+				map_pos++;
+				continue;
+			}
+
+			/*
+			 * The start_pfn of ZONE_MOVABLE is either the minimum
+			 * pfn specified by movablecore_map, or 0, which means
+			 * the node has no ZONE_MOVABLE.
+			 */
+			zone_movable_limit[nid] = max(start_pfn,
+					movablecore_map.map[map_pos].start);
+
+			break;
+		}
+	}
+}
+
 #else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
 					unsigned long zone_type,
@@ -4341,6 +4391,10 @@ static inline unsigned long __meminit zone_absent_pages_in_node(int nid,
 	return zholes_size[zone_type];
 }
 
+static void __meminit sanitize_zone_movable_limit(void)
+{
+}
+
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
@@ -4906,6 +4960,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 
 	/* Find the PFNs that ZONE_MOVABLE begins at in each node */
 	memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
+	sanitize_zone_movable_limit();
 	find_zone_movable_pfns_for_nodes();
 
 	/* Print out the zone ranges */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 170+ messages in thread


* [PATCH v2 4/5] page_alloc: Make movablecore_map have higher priority
  2012-11-23 10:44 ` Tang Chen
@ 2012-11-23 10:44   ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-11-23 10:44 UTC (permalink / raw)
  To: hpa, akpm, rob, isimatu.yasuaki, tangchen, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty
  Cc: linux-kernel, linux-mm, linux-doc

If kernelcore or movablecore is specified together with
movablecore_map, movablecore_map takes priority and is
satisfied first.
Make find_zone_movable_pfns_for_nodes() calculate
zone_movable_pfn[] under the limits stored in
zone_movable_limit[].

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
---
 mm/page_alloc.c |   35 +++++++++++++++++++++++++++++++----
 1 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f23d76a..05bafbb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4800,12 +4800,25 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 		required_kernelcore = max(required_kernelcore, corepages);
 	}
 
-	/* If kernelcore was not specified, there is no ZONE_MOVABLE */
-	if (!required_kernelcore)
+	/*
+	 * No matter kernelcore/movablecore was limited or not, movable_zone
+	 * should always be set to a usable zone index.
+	 */
+	find_usable_zone_for_movable();
+
+	/*
+	 * If neither kernelcore/movablecore nor movablecore_map is specified,
+	 * there is no ZONE_MOVABLE. But if movablecore_map is specified, the
+	 * start pfn of ZONE_MOVABLE has been stored in zone_movable_limit[].
+	 */
+	if (!required_kernelcore) {
+		if (movablecore_map.nr_map)
+			memcpy(zone_movable_pfn, zone_movable_limit,
+				sizeof(zone_movable_pfn));
 		goto out;
+	}
 
 	/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
-	find_usable_zone_for_movable();
 	usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
 
 restart:
@@ -4833,10 +4846,24 @@ restart:
 		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
 			unsigned long size_pages;
 
+			/*
+			 * Find more memory for kernelcore in
+			 * [zone_movable_pfn[nid], zone_movable_limit[nid]).
+			 */
 			start_pfn = max(start_pfn, zone_movable_pfn[nid]);
 			if (start_pfn >= end_pfn)
 				continue;
 
+			if (zone_movable_limit[nid]) {
+				end_pfn = min(end_pfn, zone_movable_limit[nid]);
+				/* No range left for kernelcore in this node */
+				if (start_pfn >= end_pfn) {
+					zone_movable_pfn[nid] =
+							zone_movable_limit[nid];
+					break;
+				}
+			}
+
 			/* Account for what is only usable for kernelcore */
 			if (start_pfn < usable_startpfn) {
 				unsigned long kernel_pages;
@@ -4896,12 +4923,12 @@ restart:
 	if (usable_nodes && required_kernelcore > usable_nodes)
 		goto restart;
 
+out:
 	/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
 	for (nid = 0; nid < MAX_NUMNODES; nid++)
 		zone_movable_pfn[nid] =
 			roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
 
-out:
 	/* restore the node_state */
 	node_states[N_HIGH_MEMORY] = saved_node_state;
 }
-- 
1.7.1




* [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-11-23 10:44 ` Tang Chen
@ 2012-11-23 10:44   ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-11-23 10:44 UTC (permalink / raw)
  To: hpa, akpm, rob, isimatu.yasuaki, tangchen, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty
  Cc: linux-kernel, linux-mm, linux-doc

Make sure bootmem does not allocate memory from areas that may
become ZONE_MOVABLE. The map information comes from the
movablecore_map boot option.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
---
 include/linux/memblock.h |    1 +
 mm/memblock.c            |   15 ++++++++++++++-
 2 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index d452ee1..6e25597 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -42,6 +42,7 @@ struct memblock {
 
 extern struct memblock memblock;
 extern int memblock_debug;
+extern struct movablecore_map movablecore_map;
 
 #define memblock_dbg(fmt, ...) \
 	if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
diff --git a/mm/memblock.c b/mm/memblock.c
index 6259055..33b3b4d 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -101,6 +101,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 {
 	phys_addr_t this_start, this_end, cand;
 	u64 i;
+	int curr = movablecore_map.nr_map - 1;
 
 	/* pump up @end */
 	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
@@ -114,13 +115,25 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 		this_start = clamp(this_start, start, end);
 		this_end = clamp(this_end, start, end);
 
-		if (this_end < size)
+restart:
+		if (this_end <= this_start || this_end < size)
 			continue;
 
+		for (; curr >= 0; curr--) {
+			if (movablecore_map.map[curr].start < this_end)
+				break;
+		}
+
 		cand = round_down(this_end - size, align);
+		if (curr >= 0 && cand < movablecore_map.map[curr].end) {
+			this_end = movablecore_map.map[curr].start;
+			goto restart;
+		}
+
 		if (cand >= this_start)
 			return cand;
 	}
+
 	return 0;
 }
 
-- 
1.7.1




* Re: [PATCH v2 1/5] x86: get pg_data_t's memory from other node
  2012-11-23 10:44   ` Tang Chen
@ 2012-11-24  1:19     ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-24  1:19 UTC (permalink / raw)
  To: Tang Chen
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng, yinghai,
	kosaki.motohiro, minchan.kim, mgorman, rientjes, rusty,
	linux-kernel, linux-mm, linux-doc

On 2012-11-23 18:44, Tang Chen wrote:
> From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> 
> If system can create movable node which all memory of the
> node is allocated as ZONE_MOVABLE, setup_node_data() cannot
> allocate memory for the node's pg_data_t.
> So when memblock_alloc_nid() fails, setup_node_data() retries
> memblock_alloc().
> 
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> ---
>  arch/x86/mm/numa.c |   11 ++++++++---
>  1 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 2d125be..734bbd2 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -224,9 +224,14 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
>  	} else {
>  		nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
>  		if (!nd_pa) {
> -			pr_err("Cannot find %zu bytes in node %d\n",
> -			       nd_size, nid);
> -			return;
> +			pr_warn("Cannot find %zu bytes in node %d\n",
> +				nd_size, nid);
Hi Tang,
	Should this be a "pr_info", since the allocation failure is expected?
Regards!
Gerry




* Re: [PATCH v2 1/5] x86: get pg_data_t's memory from other node
  2012-11-24  1:19     ` Jiang Liu
@ 2012-11-26  1:19       ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-11-26  1:19 UTC (permalink / raw)
  To: Jiang Liu
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng, yinghai,
	kosaki.motohiro, minchan.kim, mgorman, rientjes, rusty,
	linux-kernel, linux-mm, linux-doc

On 11/24/2012 09:19 AM, Jiang Liu wrote:
> On 2012-11-23 18:44, Tang Chen wrote:
>> From: Yasuaki Ishimatsu<isimatu.yasuaki@jp.fujitsu.com>
>> @@ -224,9 +224,14 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
>>   	} else {
>>   		nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
>>   		if (!nd_pa) {
>> -			pr_err("Cannot find %zu bytes in node %d\n",
>> -			       nd_size, nid);
>> -			return;
>> +			pr_warn("Cannot find %zu bytes in node %d\n",
>> +				nd_size, nid);
> Hi Tang,
> 	Should this be an "pr_info" because the allocation failure is expected?

Hi Liu,

Sure, followed. Thanks. :)

> Regards!
> Gerry
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>




* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-11-23 10:44   ` Tang Chen
@ 2012-11-26 12:22     ` wujianguo
  -1 siblings, 0 replies; 170+ messages in thread
From: wujianguo @ 2012-11-26 12:22 UTC (permalink / raw)
  To: Tang Chen
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 2012-11-23 18:44, Tang Chen wrote:
> This patch make sure bootmem will not allocate memory from areas that
> may be ZONE_MOVABLE. The map info is from movablecore_map boot option.
> 
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
> Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
> ---
>  include/linux/memblock.h |    1 +
>  mm/memblock.c            |   15 ++++++++++++++-
>  2 files changed, 15 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index d452ee1..6e25597 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -42,6 +42,7 @@ struct memblock {
>  
>  extern struct memblock memblock;
>  extern int memblock_debug;
> +extern struct movablecore_map movablecore_map;
>  
>  #define memblock_dbg(fmt, ...) \
>  	if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 6259055..33b3b4d 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -101,6 +101,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>  {
>  	phys_addr_t this_start, this_end, cand;
>  	u64 i;
> +	int curr = movablecore_map.nr_map - 1;
>  
>  	/* pump up @end */
>  	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
> @@ -114,13 +115,25 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>  		this_start = clamp(this_start, start, end);
>  		this_end = clamp(this_end, start, end);
>  
> -		if (this_end < size)
> +restart:
> +		if (this_end <= this_start || this_end < size)
>  			continue;
>  
> +		for (; curr >= 0; curr--) {
> +			if (movablecore_map.map[curr].start < this_end)

movablecore_map.map[curr].start holds a pfn, so it should be compared as
movablecore_map.map[curr].start << PAGE_SHIFT here. Maybe you can rename
movablecore_map.map[].start/end to start_pfn/end_pfn to avoid confusion.

> +				break;
> +		}
> +
>  		cand = round_down(this_end - size, align);
> +		if (curr >= 0 && cand < movablecore_map.map[curr].end) {
> +			this_end = movablecore_map.map[curr].start;

Ditto.

> +			goto restart;
> +		}
> +
>  		if (cand >= this_start)
>  			return cand;
>  	}
> +
>  	return 0;
>  }
>  
> 




* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-11-23 10:44   ` Tang Chen
@ 2012-11-26 12:40     ` wujianguo
  -1 siblings, 0 replies; 170+ messages in thread
From: wujianguo @ 2012-11-26 12:40 UTC (permalink / raw)
  To: Tang Chen
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, wujianguo,
	qiuxishi

Hi Tang,
	I tested this patchset on x86_64, and found that this patch doesn't
work as expected.
	For example, if node2's memory pfn range is [0x680000-0x980000)
and I boot the kernel with movablecore_map=4G@0x680000000, all memory in
node2 will be in ZONE_MOVABLE, but bootmem can still be allocated from
[0x780000000-0x980000000); that means bootmem *is allocated* from
ZONE_MOVABLE. This is because movablecore_map only contains
[0x680000000-0x780000000). I think we can fix up movablecore_map; how
about this:

Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
---
 arch/x86/mm/srat.c |   15 +++++++++++++++
 include/linux/mm.h |    3 +++
 mm/page_alloc.c    |    2 +-
 3 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 4ddf497..f1aac08 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -147,6 +147,8 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 {
 	u64 start, end;
 	int node, pxm;
+	int i;
+	unsigned long start_pfn, end_pfn;

 	if (srat_disabled())
 		return -1;
@@ -181,6 +183,19 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 	printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]\n",
 	       node, pxm,
 	       (unsigned long long) start, (unsigned long long) end - 1);
+
+	start_pfn = PFN_DOWN(start);
+	end_pfn = PFN_UP(end);
+	for (i = 0; i < movablecore_map.nr_map; i++) {
+		if (end_pfn <= movablecore_map.map[i].start)
+			break;
+
+		if (movablecore_map.map[i].end < end_pfn) {
+			insert_movablecore_map(movablecore_map.map[i].end,
+						end_pfn);
+		}
+	}
+
 	return 0;
 }

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a65251..7a23403 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1356,6 +1356,9 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn);
 #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
 #endif

+extern void insert_movablecore_map(unsigned long start_pfn,
+					  unsigned long end_pfn);
+
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long,
 				unsigned long, enum memmap_context);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 544c829..e6b5090 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5089,7 +5089,7 @@ early_param("movablecore", cmdline_parse_movablecore);
  * This function will also merge the overlapped ranges, and sort the array
  * by start_pfn in monotonic increasing order.
  */
-static void __init insert_movablecore_map(unsigned long start_pfn,
+void __init insert_movablecore_map(unsigned long start_pfn,
 					  unsigned long end_pfn)
 {
 	int pos, overlap;
-- 
1.7.6.1

Thanks,
Jianguo Wu

On 2012-11-23 18:44, Tang Chen wrote:
> This patch make sure bootmem will not allocate memory from areas that
> may be ZONE_MOVABLE. The map info is from movablecore_map boot option.
> 
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
> Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
> ---
>  include/linux/memblock.h |    1 +
>  mm/memblock.c            |   15 ++++++++++++++-
>  2 files changed, 15 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index d452ee1..6e25597 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -42,6 +42,7 @@ struct memblock {
>  
>  extern struct memblock memblock;
>  extern int memblock_debug;
> +extern struct movablecore_map movablecore_map;
>  
>  #define memblock_dbg(fmt, ...) \
>  	if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 6259055..33b3b4d 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -101,6 +101,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>  {
>  	phys_addr_t this_start, this_end, cand;
>  	u64 i;
> +	int curr = movablecore_map.nr_map - 1;
>  
>  	/* pump up @end */
>  	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
> @@ -114,13 +115,25 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>  		this_start = clamp(this_start, start, end);
>  		this_end = clamp(this_end, start, end);
>  
> -		if (this_end < size)
> +restart:
> +		if (this_end <= this_start || this_end < size)
>  			continue;
>  
> +		for (; curr >= 0; curr--) {
> +			if (movablecore_map.map[curr].start < this_end)
> +				break;
> +		}
> +
>  		cand = round_down(this_end - size, align);
> +		if (curr >= 0 && cand < movablecore_map.map[curr].end) {
> +			this_end = movablecore_map.map[curr].start;
> +			goto restart;
> +		}
> +
>  		if (cand >= this_start)
>  			return cand;
>  	}
> +
>  	return 0;
>  }
>  
> 
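
For reference, the search logic in the hunk above can be exercised in userspace. The sketch below is a simplified model (hypothetical function and type names, no alignment handling, no outer free-region walk) of the top-down candidate search that retries below any movable range it collides with:

```c
#include <assert.h>

/* Simplified model of the patched memblock_find_in_range_node() loop:
 * ranges are [start, end) physical addresses, sorted by start ascending.
 * Alignment and the outer walk over memblock free regions are omitted. */
struct range { unsigned long long start, end; };

/* Return the highest base such that [base, base + size) fits inside
 * [this_start, this_end) without overlapping a movable range, or 0. */
unsigned long long find_top_down(unsigned long long this_start,
                                 unsigned long long this_end,
                                 unsigned long long size,
                                 const struct range *movable, int nr)
{
    int curr = nr - 1;

    while (this_end > this_start && this_end - this_start >= size) {
        unsigned long long cand;

        /* Walk past movable ranges lying entirely at or above this_end. */
        while (curr >= 0 && movable[curr].start >= this_end)
            curr--;

        cand = this_end - size;
        if (curr >= 0 && cand < movable[curr].end) {
            /* Candidate overlaps a movable range: retry below it. */
            this_end = movable[curr].start;
            continue;
        }
        return cand;
    }
    return 0;   /* nothing found */
}
```

With a movable range [0x100, 0x200), a search in [0x0, 0x300) lands above it, a search in [0x0, 0x220) is pushed below it, and a search entirely inside it fails.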


* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-11-26 12:22     ` wujianguo
@ 2012-11-26 12:53       ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-11-26 12:53 UTC (permalink / raw)
  To: wujianguo
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 11/26/2012 08:22 PM, wujianguo wrote:
> On 2012-11-23 18:44, Tang Chen wrote:
>> This patch make sure bootmem will not allocate memory from areas that
>> may be ZONE_MOVABLE. The map info is from movablecore_map boot option.
>>
>> Signed-off-by: Tang Chen<tangchen@cn.fujitsu.com>
>> Signed-off-by: Lai Jiangshan<laijs@cn.fujitsu.com>
>> Reviewed-by: Wen Congyang<wency@cn.fujitsu.com>
>> Tested-by: Lin Feng<linfeng@cn.fujitsu.com>
>> ---
>>   include/linux/memblock.h |    1 +
>>   mm/memblock.c            |   15 ++++++++++++++-
>>   2 files changed, 15 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>> index d452ee1..6e25597 100644
>> --- a/include/linux/memblock.h
>> +++ b/include/linux/memblock.h
>> @@ -42,6 +42,7 @@ struct memblock {
>>
>>   extern struct memblock memblock;
>>   extern int memblock_debug;
>> +extern struct movablecore_map movablecore_map;
>>
>>   #define memblock_dbg(fmt, ...) \
>>   	if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
>> diff --git a/mm/memblock.c b/mm/memblock.c
>> index 6259055..33b3b4d 100644
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
>> @@ -101,6 +101,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>>   {
>>   	phys_addr_t this_start, this_end, cand;
>>   	u64 i;
>> +	int curr = movablecore_map.nr_map - 1;
>>
>>   	/* pump up @end */
>>   	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
>> @@ -114,13 +115,25 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>>   		this_start = clamp(this_start, start, end);
>>   		this_end = clamp(this_end, start, end);
>>
>> -		if (this_end<  size)
>> +restart:
>> +		if (this_end<= this_start || this_end<  size)
>>   			continue;
>>
>> +		for (; curr>= 0; curr--) {
>> +			if (movablecore_map.map[curr].start<  this_end)
>
> movablecore_map[curr].start should be movablecore_map[curr].start<<  PAGE_SHIFT.
> May be you can change movablecore_map[].start/end to movablecore_map[].start_pfn/end_pfn
> to avoid confusion.

Hi Wu,

Yes, it was my mistake; I forgot to shift the pfn.
My partner caught this in testing as well, and I have fixed it in my v3
patch.

Thanks for the comments. :)

>
>> +				break;
>> +		}
>> +
>>   		cand = round_down(this_end - size, align);
>> +		if (curr>= 0&&  cand<  movablecore_map.map[curr].end) {
>> +			this_end = movablecore_map.map[curr].start;
>
> Ditto.
>
>> +			goto restart;
>> +		}
>> +
>>   		if (cand>= this_start)
>>   			return cand;
>>   	}
>> +
>>   	return 0;
>>   }
>>
>>
>
>
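
The unit mix-up pointed out above (movablecore_map stores PFNs while memblock compares physical addresses) is easy to demonstrate in isolation. A minimal sketch, assuming 4 KiB pages; the helper names are illustrative, not the kernel's:

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* assumed 4 KiB pages, as on x86 */

/* movablecore_map entries store page frame numbers, while memblock works
 * in physical addresses; comparing the two directly mixes units. */
unsigned long long pfn_to_phys(unsigned long long pfn)
{
    return pfn << PAGE_SHIFT;
}

/* Correct overlap test: convert the movable range's PFNs to physical
 * addresses before comparing against a physical candidate range. */
int overlaps_movable(unsigned long long cand, unsigned long long size,
                     unsigned long long start_pfn, unsigned long long end_pfn)
{
    return cand < pfn_to_phys(end_pfn) && cand + size > pfn_to_phys(start_pfn);
}
```

Without the shift, a physical candidate like 0x700000000 would be compared against the raw PFN 0x780000, and the movable range would never be seen as overlapping.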


* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-11-26 12:40     ` wujianguo
@ 2012-11-26 13:15       ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-11-26 13:15 UTC (permalink / raw)
  To: wujianguo
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, wujianguo,
	qiuxishi

On 11/26/2012 08:40 PM, wujianguo wrote:
> Hi Tang,
> 	I tested this patchset in x86_64, and I found that this patch didn't
> work as expected.
> 	For example, if node2's memory pfn range is [0x680000-0x980000),
> I boot kernel with movablecore_map=4G@0x680000000, all memory in node2 will be
> in ZONE_MOVABLE, but bootmem still can be allocated from [0x780000000-0x980000000),
> that means bootmem *is allocated* from ZONE_MOVABLE. This because movablecore_map
> only contains [0x680000000-0x780000000). I think we can fixup movablecore_map, how
> about this:

Hi Wu,

That is really a problem. Before NUMA memory is initialized, the
memblock subsystem is used to allocate memory, and I didn't find any
approach that could fully address this when I was making the patches.
There is always a risk that memblock allocates memory from ZONE_MOVABLE;
I think we can only do our best to prevent it from happening.

Your patch is very helpful. After a short look at the code, it seems
that acpi_numa_memory_affinity_init() is an architecture-dependent
function. Could we do this somewhere that does not depend on the
architecture?

Thanks. :)

>
> Signed-off-by: Jianguo Wu<wujianguo@huawei.com>
> Signed-off-by: Jiang Liu<jiang.liu@huawei.com>
> ---
>   arch/x86/mm/srat.c |   15 +++++++++++++++
>   include/linux/mm.h |    3 +++
>   mm/page_alloc.c    |    2 +-
>   3 files changed, 19 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
> index 4ddf497..f1aac08 100644
> --- a/arch/x86/mm/srat.c
> +++ b/arch/x86/mm/srat.c
> @@ -147,6 +147,8 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
>   {
>   	u64 start, end;
>   	int node, pxm;
> +	int i;
> +	unsigned long start_pfn, end_pfn;
>
>   	if (srat_disabled())
>   		return -1;
> @@ -181,6 +183,19 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
>   	printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]\n",
>   	       node, pxm,
>   	       (unsigned long long) start, (unsigned long long) end - 1);
> +
> +	start_pfn = PFN_DOWN(start);
> +	end_pfn = PFN_UP(end);
> +	for (i = 0; i<  movablecore_map.nr_map; i++) {
> +		if (end_pfn<= movablecore_map.map[i].start)
> +			break;
> +
> +		if (movablecore_map.map[i].end<  end_pfn) {
> +			insert_movablecore_map(movablecore_map.map[i].end,
> +						end_pfn);
> +		}
> +	}
> +
>   	return 0;
>   }
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5a65251..7a23403 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1356,6 +1356,9 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn);
>   #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
>   #endif
>
> +extern void insert_movablecore_map(unsigned long start_pfn,
> +					  unsigned long end_pfn);
> +
>   extern void set_dma_reserve(unsigned long new_dma_reserve);
>   extern void memmap_init_zone(unsigned long, int, unsigned long,
>   				unsigned long, enum memmap_context);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 544c829..e6b5090 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5089,7 +5089,7 @@ early_param("movablecore", cmdline_parse_movablecore);
>    * This function will also merge the overlapped ranges, and sort the array
>    * by start_pfn in monotonic increasing order.
>    */
> -static void __init insert_movablecore_map(unsigned long start_pfn,
> +void __init insert_movablecore_map(unsigned long start_pfn,
>   					  unsigned long end_pfn)
>   {
>   	int pos, overlap;
> -- 1.7.6.1
> .
>
> Thanks,
> Jianguo Wu
>
> On 2012-11-23 18:44, Tang Chen wrote:
>> This patch make sure bootmem will not allocate memory from areas that
>> may be ZONE_MOVABLE. The map info is from movablecore_map boot option.
>>
>> Signed-off-by: Tang Chen<tangchen@cn.fujitsu.com>
>> Signed-off-by: Lai Jiangshan<laijs@cn.fujitsu.com>
>> Reviewed-by: Wen Congyang<wency@cn.fujitsu.com>
>> Tested-by: Lin Feng<linfeng@cn.fujitsu.com>
>> ---
>>   include/linux/memblock.h |    1 +
>>   mm/memblock.c            |   15 ++++++++++++++-
>>   2 files changed, 15 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>> index d452ee1..6e25597 100644
>> --- a/include/linux/memblock.h
>> +++ b/include/linux/memblock.h
>> @@ -42,6 +42,7 @@ struct memblock {
>>
>>   extern struct memblock memblock;
>>   extern int memblock_debug;
>> +extern struct movablecore_map movablecore_map;
>>
>>   #define memblock_dbg(fmt, ...) \
>>   	if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
>> diff --git a/mm/memblock.c b/mm/memblock.c
>> index 6259055..33b3b4d 100644
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
>> @@ -101,6 +101,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>>   {
>>   	phys_addr_t this_start, this_end, cand;
>>   	u64 i;
>> +	int curr = movablecore_map.nr_map - 1;
>>
>>   	/* pump up @end */
>>   	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
>> @@ -114,13 +115,25 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>>   		this_start = clamp(this_start, start, end);
>>   		this_end = clamp(this_end, start, end);
>>
>> -		if (this_end<  size)
>> +restart:
>> +		if (this_end<= this_start || this_end<  size)
>>   			continue;
>>
>> +		for (; curr>= 0; curr--) {
>> +			if (movablecore_map.map[curr].start<  this_end)
>> +				break;
>> +		}
>> +
>>   		cand = round_down(this_end - size, align);
>> +		if (curr>= 0&&  cand<  movablecore_map.map[curr].end) {
>> +			this_end = movablecore_map.map[curr].start;
>> +			goto restart;
>> +		}
>> +
>>   		if (cand>= this_start)
>>   			return cand;
>>   	}
>> +
>>   	return 0;
>>   }
>>
>>
>
>
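
The fixup proposed in the quoted srat.c hunk can be modelled in a few lines. The sketch below uses a hypothetical simplified map and extends entries in place, rather than calling insert_movablecore_map() and re-merging as the actual patch does:

```c
#include <assert.h>

struct pfn_range { unsigned long start, end; };   /* [start, end) in PFNs */

/* If a user-specified movable range begins inside a hotpluggable node but
 * ends before the node does, grow it to the node's end_pfn so that early
 * allocations avoid the whole node. Entries are sorted by start. */
void fixup_movable_map(struct pfn_range *map, int nr,
                       unsigned long node_start_pfn,
                       unsigned long node_end_pfn)
{
    for (int i = 0; i < nr; i++) {
        if (node_end_pfn <= map[i].start)
            break;                      /* past the node: done */
        if (map[i].end > node_start_pfn && map[i].end < node_end_pfn)
            map[i].end = node_end_pfn;  /* grow to cover the node */
    }
}
```

In Wu's example, node2 spans PFNs [0x680000, 0x980000) but movablecore_map=4G@0x680000000 only records [0x680000, 0x780000); the fixup grows the entry to the node's end.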


* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-11-26 13:15       ` Tang Chen
@ 2012-11-26 15:48         ` H. Peter Anvin
  -1 siblings, 0 replies; 170+ messages in thread
From: H. Peter Anvin @ 2012-11-26 15:48 UTC (permalink / raw)
  To: Tang Chen
  Cc: wujianguo, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, wujianguo,
	qiuxishi

On 11/26/2012 05:15 AM, Tang Chen wrote:
> 
> Hi Wu,
> 
> That is really a problem. And, before numa memory got initialized,
> memblock subsystem would be used to allocate memory. I didn't find any
> approach that could fully address it when I making the patches. There
> always be risk that memblock allocates memory on ZONE_MOVABLE. I think
> we can only do our best to prevent it from happening.
> 
> Your patch is very helpful. And after a shot look at the code, it seems
> that acpi_numa_memory_affinity_init() is an architecture dependent
> function. Could we do this somewhere which is not depending on the
> architecture ?
> 

The movable memory should be classified as a non-RAM type in memblock;
that way we will not allocate from it early on.

	-hpa


* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-11-26 15:48         ` H. Peter Anvin
@ 2012-11-27  0:58           ` Jianguo Wu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jianguo Wu @ 2012-11-27  0:58 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Tang Chen, wujianguo, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, jiang.liu, yinghai, kosaki.motohiro, minchan.kim,
	mgorman, rientjes, rusty, linux-kernel, linux-mm, linux-doc,
	qiuxishi

On 2012/11/26 23:48, H. Peter Anvin wrote:

> On 11/26/2012 05:15 AM, Tang Chen wrote:
>>
>> Hi Wu,
>>
>> That is really a problem. And, before numa memory got initialized,
>> memblock subsystem would be used to allocate memory. I didn't find any
>> approach that could fully address it when I making the patches. There
>> always be risk that memblock allocates memory on ZONE_MOVABLE. I think
>> we can only do our best to prevent it from happening.
>>
>> Your patch is very helpful. And after a shot look at the code, it seems
>> that acpi_numa_memory_affinity_init() is an architecture dependent
>> function. Could we do this somewhere which is not depending on the
>> architecture ?
>>
> 
> The movable memory should be classified as a non-RAM type in memblock,
> that way we will not allocate from it early on.
> 
> 	-hpa


Yep, we can put movable memory in memblock's reserved.regions.
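
The idea amounts to recording the movable ranges in a list the early allocator refuses to touch. A toy model of that filtering (not the real memblock API; the actual change would add the ranges to memblock.reserved):

```c
#include <assert.h>

/* Toy model of treating user-declared movable ranges as reserved/non-RAM
 * during early boot: the allocator rejects any candidate that overlaps a
 * reserved region, so bootmem never lands in a future ZONE_MOVABLE area. */
struct region { unsigned long long base, size; };

int range_is_free(unsigned long long base, unsigned long long size,
                  const struct region *resv, int nr)
{
    for (int i = 0; i < nr; i++) {
        if (base < resv[i].base + resv[i].size &&
            base + size > resv[i].base)
            return 0;   /* overlaps a reserved (movable) region */
    }
    return 1;
}
```

With node2's whole span [0x680000000, 0x980000000) reserved up front, the partial-coverage problem from earlier in the thread cannot occur: any candidate inside the node is rejected.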

> 
> 
> .
> 




* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-11-26 15:48         ` H. Peter Anvin
@ 2012-11-27  1:12           ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-27  1:12 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Tang Chen, wujianguo, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, wujianguo,
	qiuxishi

On 2012-11-26 23:48, H. Peter Anvin wrote:
> On 11/26/2012 05:15 AM, Tang Chen wrote:
>>
>> Hi Wu,
>>
>> That is really a problem. And, before NUMA memory gets initialized,
>> the memblock subsystem is used to allocate memory. I didn't find any
>> approach that could fully address it when I was making the patches. There
>> is always a risk that memblock allocates memory in ZONE_MOVABLE. I think
>> we can only do our best to prevent it from happening.
>>
>> Your patch is very helpful. And after a short look at the code, it seems
>> that acpi_numa_memory_affinity_init() is an architecture-dependent
>> function. Could we do this somewhere that does not depend on the
>> architecture?
>>
> 
> The movable memory should be classified as a non-RAM type in memblock,
> that way we will not allocate from it early on.
Hi Peter,

I have tried to reserve movable memory from the bootmem allocator, but the
ACPICA subsystem is initialized after the movable zones have been set up.
So I'm still trying to figure out a way to set up/reserve movable zones
according to information from static ACPI tables such as SRAT/MPST.

Regards!
Gerry

> 
> 	-hpa
> 
> 
> .
> 



^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-11-27  1:12           ` Jiang Liu
@ 2012-11-27  1:20             ` H. Peter Anvin
  -1 siblings, 0 replies; 170+ messages in thread
From: H. Peter Anvin @ 2012-11-27  1:20 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Tang Chen, wujianguo, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, wujianguo,
	qiuxishi, Len Brown

On 11/26/2012 05:12 PM, Jiang Liu wrote:
> Hi Peter,
>
> I have tried to reserve movable memory from the bootmem allocator, but the
> ACPICA subsystem is initialized after the movable zones have been set up.
> So I'm still trying to figure out a way to set up/reserve movable zones
> according to information from static ACPI tables such as SRAT/MPST.
>

[Adding Len Brown]

Right, for the case of platform-configured memory.  Len, I'm wondering 
if there is any reasonable way we can get memory-map-related stuff out 
of ACPI before we initialize the full ACPICA... we could of course write 
an ad hoc static parser (these are just static tables, after all), but 
I'm not sure if that fits into your overall view of how the subsystem 
should work?

	-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-23 10:44 ` Tang Chen
@ 2012-11-27  3:10   ` wujianguo
  -1 siblings, 0 replies; 170+ messages in thread
From: wujianguo @ 2012-11-27  3:10 UTC (permalink / raw)
  To: Tang Chen
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 2012-11-23 18:44, Tang Chen wrote:
> [What we are doing]
> This patchset provides a boot option for users to specify the ZONE_MOVABLE memory
> map for each node in the system.
> 
> movablecore_map=nn[KMG]@ss[KMG]
> 

Hi Tang,
	A DMA address range can't be set as movable. If someone boots the kernel
with movablecore_map=4G@0xa00000, or with another memory region that contains
DMA addresses, the system may fail to boot. Should this case be handled or
mentioned in the change log and kernel-parameters.txt?

Thanks,
Jianguo Wu

> This option makes sure the memory range from ss to ss+nn is movable memory.
> 
> 
> [Why we do this]
> If we hot remove memory, that memory cannot contain kernel memory,
> because Linux currently cannot migrate kernel memory. Therefore,
> we have to guarantee that the hot removed memory contains only movable
> memory.
> 
> Linux has two boot options, kernelcore= and movablecore=, for
> creating movable memory. These boot options can specify the amount
> of memory use as kernel or movable memory. Using them, we can
> create ZONE_MOVABLE which has only movable memory.
> 
> But this does not fulfill a requirement of memory hot remove, because
> even if we specify these boot options, movable memory is distributed
> evenly across the nodes. So when we want to hot remove memory whose
> range is 0x80000000-0xc0000000, we have no way to specify
> that memory as movable memory.
> 
> So we proposed a new feature which specifies memory range to use as
> movable memory.
> 
> 
> [Ways to do this]
> There may be 2 ways to specify movable memory.
>  1. use firmware information
>  2. use boot option
> 
> 1. use firmware information
>   According to the ACPI 5.0 spec, the SRAT table has a memory affinity
>   structure, and the structure has a Hot Pluggable field. See "5.2.16.2
>   Memory Affinity Structure". Using that information, we might be able to
>   let firmware specify movable memory. For example, if the Hot Pluggable
>   field is set, Linux treats the memory as movable memory.
> 
> 2. use boot option
>   This is our proposal. New boot option can specify memory range to use
>   as movable memory.
> 
> 
> [How we do this]
> We chose the second way because, with the first, users cannot easily
> change the memory range used as movable memory. We think that creating
> movable memory may cause a NUMA performance regression; in that case,
> the user can easily turn the feature off if we provide a boot option.
> And with a boot option, the user can easily select which memory
> to use as movable memory.
> 
> 
> [How to use]
> Specify the following boot option:
> movablecore_map=nn[KMG]@ss[KMG]
> 
> That means physical address range from ss to ss+nn will be allocated as
> ZONE_MOVABLE.
> 
> And the following points should be considered.
> 
> 1) If the range is involved in a single node, then from ss to the end of
>    the node will be ZONE_MOVABLE.
> 2) If the range covers two or more nodes, then from ss to the end of
>    the node will be ZONE_MOVABLE, and all the other nodes will only
>    have ZONE_MOVABLE.
> 3) If no range is in the node, then the node will have no ZONE_MOVABLE
>    unless kernelcore or movablecore is specified.
> 4) This option could be specified at most MAX_NUMNODES times.
> 5) If kernelcore or movablecore is also specified, movablecore_map will have
>    higher priority to be satisfied.
> 6) This option has no conflict with memmap option.
> 
> 
> 
> Tang Chen (4):
>   page_alloc: add movable_memmap kernel parameter
>   page_alloc: Introduce zone_movable_limit[] to keep movable limit for
>     nodes
>   page_alloc: Make movablecore_map has higher priority
>   page_alloc: Bootmem limit with movablecore_map
> 
> Yasuaki Ishimatsu (1):
>   x86: get pg_data_t's memory from other node
> 
>  Documentation/kernel-parameters.txt |   17 +++
>  arch/x86/mm/numa.c                  |   11 ++-
>  include/linux/memblock.h            |    1 +
>  include/linux/mm.h                  |   11 ++
>  mm/memblock.c                       |   15 +++-
>  mm/page_alloc.c                     |  216 ++++++++++++++++++++++++++++++++++-
>  6 files changed, 263 insertions(+), 8 deletions(-)
> 
> 


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-11-26 15:48         ` H. Peter Anvin
@ 2012-11-27  3:15           ` Wen Congyang
  -1 siblings, 0 replies; 170+ messages in thread
From: Wen Congyang @ 2012-11-27  3:15 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Tang Chen, wujianguo, akpm, rob, isimatu.yasuaki, laijs, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, wujianguo,
	qiuxishi

At 11/26/2012 11:48 PM, H. Peter Anvin Wrote:
> On 11/26/2012 05:15 AM, Tang Chen wrote:
>>
>> Hi Wu,
>>
>> That is really a problem. And, before NUMA memory gets initialized,
>> the memblock subsystem is used to allocate memory. I didn't find any
>> approach that could fully address it when I was making the patches. There
>> is always a risk that memblock allocates memory in ZONE_MOVABLE. I think
>> we can only do our best to prevent it from happening.
>>
>> Your patch is very helpful. And after a short look at the code, it seems
>> that acpi_numa_memory_affinity_init() is an architecture-dependent
>> function. Could we do this somewhere that does not depend on the
>> architecture?
>>
> 
> The movable memory should be classified as a non-RAM type in memblock,
> that way we will not allocate from it early on.

Hi, hpa

The problem is this:
node1's address range is [18G, 34G), and the user specifies a movable map of [8G, 24G).
We don't know node1's address range before NUMA init, so we can't prevent
allocating boot memory in the range [24G, 34G).

You say the movable memory should be classified as a non-RAM type in
memblock. What do you mean by that? We don't save a type in memblock,
because we only add E820_RAM and E820_RESERVED_KERN to memblock.

Thanks
Wen Congyang

> 
> 	-hpa
> 
> 


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-11-27  0:58           ` Jianguo Wu
@ 2012-11-27  3:19             ` Wen Congyang
  -1 siblings, 0 replies; 170+ messages in thread
From: Wen Congyang @ 2012-11-27  3:19 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: H. Peter Anvin, Tang Chen, wujianguo, akpm, rob, isimatu.yasuaki,
	laijs, linfeng, jiang.liu, yinghai, kosaki.motohiro, minchan.kim,
	mgorman, rientjes, rusty, linux-kernel, linux-mm, linux-doc,
	qiuxishi

At 11/27/2012 08:58 AM, Jianguo Wu Wrote:
> On 2012/11/26 23:48, H. Peter Anvin wrote:
> 
>> On 11/26/2012 05:15 AM, Tang Chen wrote:
>>>
>>> Hi Wu,
>>>
>>> That is really a problem. And, before NUMA memory gets initialized,
>>> the memblock subsystem is used to allocate memory. I didn't find any
>>> approach that could fully address it when I was making the patches. There
>>> is always a risk that memblock allocates memory in ZONE_MOVABLE. I think
>>> we can only do our best to prevent it from happening.
>>>
>>> Your patch is very helpful. And after a short look at the code, it seems
>>> that acpi_numa_memory_affinity_init() is an architecture-dependent
>>> function. Could we do this somewhere that does not depend on the
>>> architecture?
>>>
>>
>> The movable memory should be classified as a non-RAM type in memblock,
>> that way we will not allocate from it early on.
>>
>> 	-hpa
> 
> 
> yep, we can put movable memory in reserved.regions in memblock.

Hmm, I don't think so. In that case, memory in reserved.regions would contain
two types of memory: bootmem and movable memory. We put all pages that are not
in reserved.regions into the buddy system. If we put movable memory in
reserved.regions, we have no chance to put those pages into the buddy system,
and we can't use them after the system boots.

Thanks
Wen Congyang

> 
>>
>>
>> .
>>
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-11-27  3:19             ` Wen Congyang
@ 2012-11-27  3:22               ` Jianguo Wu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jianguo Wu @ 2012-11-27  3:22 UTC (permalink / raw)
  To: Wen Congyang
  Cc: H. Peter Anvin, Tang Chen, wujianguo, akpm, rob, isimatu.yasuaki,
	laijs, linfeng, jiang.liu, yinghai, kosaki.motohiro, minchan.kim,
	mgorman, rientjes, rusty, linux-kernel, linux-mm, linux-doc,
	qiuxishi

On 2012/11/27 11:19, Wen Congyang wrote:

> At 11/27/2012 08:58 AM, Jianguo Wu Wrote:
>> On 2012/11/26 23:48, H. Peter Anvin wrote:
>>
>>> On 11/26/2012 05:15 AM, Tang Chen wrote:
>>>>
>>>> Hi Wu,
>>>>
>>>> That is really a problem. And, before NUMA memory gets initialized,
>>>> the memblock subsystem is used to allocate memory. I didn't find any
>>>> approach that could fully address it when I was making the patches. There
>>>> is always a risk that memblock allocates memory in ZONE_MOVABLE. I think
>>>> we can only do our best to prevent it from happening.
>>>>
>>>> Your patch is very helpful. And after a short look at the code, it seems
>>>> that acpi_numa_memory_affinity_init() is an architecture-dependent
>>>> function. Could we do this somewhere that does not depend on the
>>>> architecture?
>>>>
>>>
>>> The movable memory should be classified as a non-RAM type in memblock,
>>> that way we will not allocate from it early on.
>>>
>>> 	-hpa
>>
>>
>> yep, we can put movable memory in reserved.regions in memblock.
> 
> Hmm, I don't think so. In that case, memory in reserved.regions would contain
> two types of memory: bootmem and movable memory. We put all pages that are not
> in reserved.regions into the buddy system. If we put movable memory in
> reserved.regions, we have no chance to put those pages into the buddy system,
> and we can't use them after the system boots.
> 

Yes, you are right. Alternatively, we could fix up movablecore_map when adding memory regions to memblock.

> Thanks
> Wen Congyang
> 
>>
>>>
>>>
>>> .
>>>
>>
>>
>>
>>
> 
> 
> .
> 




^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-11-27  3:22               ` Jianguo Wu
@ 2012-11-27  3:34                 ` Wen Congyang
  -1 siblings, 0 replies; 170+ messages in thread
From: Wen Congyang @ 2012-11-27  3:34 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: H. Peter Anvin, Tang Chen, wujianguo, akpm, rob, isimatu.yasuaki,
	laijs, linfeng, jiang.liu, yinghai, kosaki.motohiro, minchan.kim,
	mgorman, rientjes, rusty, linux-kernel, linux-mm, linux-doc,
	qiuxishi

At 11/27/2012 11:22 AM, Jianguo Wu Wrote:
> On 2012/11/27 11:19, Wen Congyang wrote:
> 
>> At 11/27/2012 08:58 AM, Jianguo Wu Wrote:
>>> On 2012/11/26 23:48, H. Peter Anvin wrote:
>>>
>>>> On 11/26/2012 05:15 AM, Tang Chen wrote:
>>>>>
>>>>> Hi Wu,
>>>>>
>>>>> That is really a problem. And, before NUMA memory gets initialized,
>>>>> the memblock subsystem is used to allocate memory. I didn't find any
>>>>> approach that could fully address it when I was making the patches. There
>>>>> is always a risk that memblock allocates memory in ZONE_MOVABLE. I think
>>>>> we can only do our best to prevent it from happening.
>>>>>
>>>>> Your patch is very helpful. And after a short look at the code, it seems
>>>>> that acpi_numa_memory_affinity_init() is an architecture-dependent
>>>>> function. Could we do this somewhere that does not depend on the
>>>>> architecture?
>>>>>
>>>>
>>>> The movable memory should be classified as a non-RAM type in memblock,
>>>> that way we will not allocate from it early on.
>>>>
>>>> 	-hpa
>>>
>>>
>>> yep, we can put movable memory in reserved.regions in memblock.
>>
>> Hmm, I don't think so. In that case, memory in reserved.regions would contain
>> two types of memory: bootmem and movable memory. We put all pages that are not
>> in reserved.regions into the buddy system. If we put movable memory in
>> reserved.regions, we have no chance to put those pages into the buddy system,
>> and we can't use them after the system boots.
>>
> 
> Yes, you are right. Alternatively, we could fix up movablecore_map when adding memory regions to memblock.

If so, we would need to know each node's address range first...

Thanks
Wen Congyang

>> Thanks
>> Wen Congyang
>>
>>>
>>>>
>>>>
>>>> .
>>>>
>>>
>>>
>>>
>>>
>>
>>
>> .
>>
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
@ 2012-11-27  3:34                 ` Wen Congyang
  0 siblings, 0 replies; 170+ messages in thread
From: Wen Congyang @ 2012-11-27  3:34 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: H. Peter Anvin, Tang Chen, wujianguo, akpm, rob, isimatu.yasuaki,
	laijs, linfeng, jiang.liu, yinghai, kosaki.motohiro, minchan.kim,
	mgorman, rientjes, rusty, linux-kernel, linux-mm, linux-doc,
	qiuxishi

At 11/27/2012 11:22 AM, Jianguo Wu Wrote:
> On 2012/11/27 11:19, Wen Congyang wrote:
> 
>> At 11/27/2012 08:58 AM, Jianguo Wu Wrote:
>>> On 2012/11/26 23:48, H. Peter Anvin wrote:
>>>
>>>> On 11/26/2012 05:15 AM, Tang Chen wrote:
>>>>>
>>>>> Hi Wu,
>>>>>
>>>>> That is really a problem. And, before numa memory got initialized,
>>>>> memblock subsystem would be used to allocate memory. I didn't find any
>>>>> approach that could fully address it when I making the patches. There
>>>>> always be risk that memblock allocates memory on ZONE_MOVABLE. I think
>>>>> we can only do our best to prevent it from happening.
>>>>>
>>>>> Your patch is very helpful. And after a shot look at the code, it seems
>>>>> that acpi_numa_memory_affinity_init() is an architecture dependent
>>>>> function. Could we do this somewhere which is not depending on the
>>>>> architecture ?
>>>>>
>>>>
>>>> The movable memory should be classified as a non-RAM type in memblock,
>>>> that way we will not allocate from it early on.
>>>>
>>>> 	-hpa
>>>
>>>
>>> yep, we can put movable memory in reserved.regions in memblock.
>>
>> Hmm, I don't think so. If so, memory in reserved.regions contain two type
>> memory: bootmem and movable memory. We will put all pages not in reserved.regions
>> into buddy system. If we put movable memory in reserved.regions, we have
>> no chance to put them to buddy system, and can't use them after system boots.
>>
> 
> yes, you are right. Or we can fix movablecore_map when add memory region to memblock.

If so, we should know the nodes address range...

Thanks
Wen Congyang

>> Thanks
>> Wen Congyang
>>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-11-27  3:15           ` Wen Congyang
@ 2012-11-27  5:31             ` H. Peter Anvin
  -1 siblings, 0 replies; 170+ messages in thread
From: H. Peter Anvin @ 2012-11-27  5:31 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Tang Chen, wujianguo, akpm, rob, isimatu.yasuaki, laijs, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, wujianguo,
	qiuxishi

On 11/26/2012 07:15 PM, Wen Congyang wrote:
>
> Hi, hpa
>
> The problem is that:
> node1's address range is [18G, 34G), and the user specifies the movable map [8G, 24G).
> We don't know node1's address range before NUMA init, so we can't prevent
> allocating boot memory in the range [24G, 34G).
>
> You say the movable memory should be classified as a non-RAM type in
> memblock; what exactly do you mean? We don't save the type in memblock
> because we only add E820_RAM and E820_RESERVED_KERN to memblock.
>

We either need to keep the type or not add it to the memblocks.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
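For illustration, the choice hpa describes (keep a type per region, or keep movable ranges out of memblock entirely) can be modeled in a few lines of user-space C. Everything here is hypothetical: the memblock of this era has no per-region "movable" flag, and `early_alloc_region` is an invented name. The sketch only shows how an early allocator that knows such a flag could skip movable regions, so that unmovable boot-time allocations never land in the future ZONE_MOVABLE.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mini-model of memblock.memory with a per-region
 * "movable" flag; names are illustrative, not the kernel's API. */
struct region { uint64_t base, size; int movable; };

#define NR_REGIONS 3
struct region memory[NR_REGIONS] = {
    { 0x01000000ULL, 0x10000000ULL, 0 },  /* 256 MiB normal RAM */
    { 0x80000000ULL, 0x40000000ULL, 1 },  /*   1 GiB movable    */
    { 0xc0000000ULL, 0x20000000ULL, 0 },  /* 512 MiB normal RAM */
};

/* Early-allocator sketch: return the index of the first non-movable
 * region large enough for @size, or -1 on failure.  Skipping regions
 * with the movable flag set is what would keep unmovable boot-time
 * allocations out of the future ZONE_MOVABLE. */
int early_alloc_region(uint64_t size)
{
    for (int i = 0; i < NR_REGIONS; i++)
        if (!memory[i].movable && memory[i].size >= size)
            return i;
    return -1;
}
```

In this model a 384 MiB request lands in the third region: the first is too small, and the second, although large enough, carries the movable flag and is skipped.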


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-27  3:10   ` wujianguo
@ 2012-11-27  5:43     ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-11-27  5:43 UTC (permalink / raw)
  To: wujianguo
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 11/27/2012 11:10 AM, wujianguo wrote:
> On 2012-11-23 18:44, Tang Chen wrote:
>> [What we are doing]
>> This patchset provide a boot option for user to specify ZONE_MOVABLE memory
>> map for each node in the system.
>>
>> movablecore_map=nn[KMG]@ss[KMG]
>>
>
> Hi Tang,
> 	A DMA address can't be set as movable. If someone boots the kernel with
> movablecore_map=4G@0xa00000 or another memory region that contains a DMA
> address, the system may fail to boot. Should this case be handled or
> mentioned in the change log and kernel-parameters.txt?

Hi Wu,

Right, a DMA address can't be set as movable. And I should have mentioned
it in the doc more clearly. :)

Actually, the situation is not limited to DMA addresses. Because we limit
the memblock allocation, even if users do not specify a DMA address but
set too much memory as movable, leaving too little memory for the kernel
to use, the kernel will also fail to boot.

I added the following info into doc, but obviously it was not clear
enough. :)
+		If kernelcore or movablecore is also specified,
+		movablecore_map will have higher priority to be
+		satisfied. So the administrator should be careful that
+		the amount of movablecore_map areas are not too large.
+		Otherwise kernel won't have enough memory to start.


And about how to fix it: as you said, we can handle the case where the
user specifies a DMA address as movable. But how do we handle the "too
little memory for the kernel to start" case? Is there any info about the
minimum amount of memory the kernel needs?


Thanks for the comments. :)

>
> Thanks,
> Jianguo Wu
>



^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-27  5:43     ` Tang Chen
@ 2012-11-27  6:20       ` H. Peter Anvin
  -1 siblings, 0 replies; 170+ messages in thread
From: H. Peter Anvin @ 2012-11-27  6:20 UTC (permalink / raw)
  To: Tang Chen
  Cc: wujianguo, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 11/26/2012 09:43 PM, Tang Chen wrote:
>
> And about how to fix it, as you said, we can handle the situation if
> user specified DMA address as movable. But how to handle "too little
> memory for kernel to start" case ?  Is there any info about how much
> at least memory kernel needs ?
>

Not really, and it depends on so many variables.

	-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-27  5:43     ` Tang Chen
@ 2012-11-27  6:47       ` Jianguo Wu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jianguo Wu @ 2012-11-27  6:47 UTC (permalink / raw)
  To: Tang Chen
  Cc: wujianguo, hpa, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, jiang.liu, yinghai, kosaki.motohiro, minchan.kim,
	mgorman, rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 2012/11/27 13:43, Tang Chen wrote:

> On 11/27/2012 11:10 AM, wujianguo wrote:
>> On 2012-11-23 18:44, Tang Chen wrote:
>>> [What we are doing]
>>> This patchset provide a boot option for user to specify ZONE_MOVABLE memory
>>> map for each node in the system.
>>>
>>> movablecore_map=nn[KMG]@ss[KMG]
>>>
>>
>> Hi Tang,
>>     DMA address can't be set as movable, if some one boot kernel with
>> movablecore_map=4G@0xa00000 or other memory region that contains DMA address,
>> system maybe boot failed. Should this case be handled or mentioned
>> in the change log and kernel-parameters.txt?
> 
> Hi Wu,
> 
> Right, DMA address can't be set as movable. And I should have mentioned
> it in the doc more clear. :)
> 
> Actually, the situation is not only for DMA address. Because we limited
> the memblock allocation, even if users did not specified the DMA
> address, but set too much memory as movable, which means there was too
> little memory for kernel to use, kernel will also fail to boot.
> 
> I added the following info into doc, but obviously it was not clear
> enough. :)
> +        If kernelcore or movablecore is also specified,
> +        movablecore_map will have higher priority to be
> +        satisfied. So the administrator should be careful that
> +        the amount of movablecore_map areas are not too large.
> +        Otherwise kernel won't have enough memory to start.
> 
> 
> And about how to fix it, as you said, we can handle the situation if
> user specified DMA address as movable. But how to handle "too little
> memory for kernel to start" case ?  Is there any info about how much
> at least memory kernel needs ?
> 

As far as I know, bootmem is mostly used for page structs when
CONFIG_SPARSEMEM=y, but it is hard to calculate exactly how much bootmem
is needed.
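The dominant term Jianguo mentions can at least be bounded with simple arithmetic. A sketch, assuming 4 KiB pages and a 64-byte struct page (the real size depends on the kernel configuration), of the memmap cost that bootmem has to cover:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES    4096ULL
#define STRUCT_PAGE_BYTES    64ULL   /* assumed; varies with config */

/* Bytes of struct page metadata needed to describe ram_bytes of RAM:
 * one struct page per 4 KiB page frame. */
uint64_t memmap_bytes(uint64_t ram_bytes)
{
    return ram_bytes / PAGE_SIZE_BYTES * STRUCT_PAGE_BYTES;
}
```

Under these assumptions, 16 GiB of RAM needs 256 MiB of memmap, i.e. about 1.6% of RAM, which is a lower bound on what must stay outside the movable ranges.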

> 
> Thanks for the comments. :)
> 
>>
>> Thanks,
>> Jianguo Wu
>>
> 
> 
> 
> .
> 




^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-23 10:44 ` Tang Chen
@ 2012-11-27  8:00   ` Bob Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Bob Liu @ 2012-11-27  8:00 UTC (permalink / raw)
  To: Tang Chen
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

Hi Tang,

On Fri, Nov 23, 2012 at 6:44 PM, Tang Chen <tangchen@cn.fujitsu.com> wrote:
> [What we are doing]
> This patchset provides a boot option for users to specify the ZONE_MOVABLE
> memory map for each node in the system.
>
> movablecore_map=nn[KMG]@ss[KMG]
>
> This option makes sure the memory range from ss to ss+nn is movable memory.
>
>
> [Why we do this]
> If we hot remove memory, the memory cannot contain kernel memory,
> because Linux cannot migrate kernel memory currently. Therefore,
> we have to guarantee that the hot-removed memory has only movable
> memory.
>
> Linux has two boot options, kernelcore= and movablecore=, for
> creating movable memory. These boot options can specify the amount
> of memory to use as kernel or movable memory. Using them, we can
> create ZONE_MOVABLE, which has only movable memory.
>
> But this does not fulfill a requirement of memory hot remove, because
> even if we specify the boot options, movable memory is distributed
> evenly across the nodes. So when we want to hot remove memory whose
> range is 0x80000000-0xc0000000, we have no way to specify
> that memory as movable memory.
>

Sorry, I still don't get your idea.
Why do you need to specify a range that is movable?
Could you describe the requirement and situation a bit more?
Thank you.

> So we proposed a new feature which specifies memory range to use as
> movable memory.
>
>
> [Ways to do this]
> There may be 2 ways to specify movable memory.
>  1. use firmware information
>  2. use boot option
>
> 1. use firmware information
>   According to ACPI spec 5.0, the SRAT table has a memory affinity structure,
>   and the structure has a Hot Pluggable Field. See "5.2.16.2 Memory
>   Affinity Structure". If we use this information, we might be able to
>   specify movable memory via firmware. For example, if the Hot Pluggable
>   Field is enabled, Linux sets the memory as movable memory.
>
> 2. use boot option
>   This is our proposal. New boot option can specify memory range to use
>   as movable memory.
>
>
> [How we do this]
> We chose the second way, because with the first way users cannot easily
> change which memory range to use as movable memory. We think that if we
> create movable memory, a performance regression may occur due to NUMA.
> In this case, the user can easily turn off the feature if we provide the
> boot option. And with the boot option, the user can easily select which
> memory to use as movable memory.
>
>
> [How to use]
> Specify the following boot option:
> movablecore_map=nn[KMG]@ss[KMG]
>
> That means physical address range from ss to ss+nn will be allocated as
> ZONE_MOVABLE.
>
> And the following points should be considered.
>
> 1) If the range is contained within a single node, then from ss to the end of
>    that node will be ZONE_MOVABLE.
> 2) If the range covers two or more nodes, then from ss to the end of
>    the node containing ss will be ZONE_MOVABLE, and all the other nodes
>    will only have ZONE_MOVABLE.
> 3) If no range falls in a node, then that node will have no ZONE_MOVABLE
>    unless kernelcore or movablecore is specified.
> 4) This option can be specified at most MAX_NUMNODES times.
> 5) If kernelcore or movablecore is also specified, movablecore_map will have
>    higher priority to be satisfied.
> 6) This option has no conflict with the memmap option.
>
>
>
> Tang Chen (4):
>   page_alloc: add movable_memmap kernel parameter
>   page_alloc: Introduce zone_movable_limit[] to keep movable limit for
>     nodes
>   page_alloc: Make movablecore_map has higher priority
>   page_alloc: Bootmem limit with movablecore_map
>
> Yasuaki Ishimatsu (1):
>   x86: get pg_data_t's memory from other node
>
>  Documentation/kernel-parameters.txt |   17 +++
>  arch/x86/mm/numa.c                  |   11 ++-
>  include/linux/memblock.h            |    1 +
>  include/linux/mm.h                  |   11 ++
>  mm/memblock.c                       |   15 +++-
>  mm/page_alloc.c                     |  216 ++++++++++++++++++++++++++++++++++-
>  6 files changed, 263 insertions(+), 8 deletions(-)
>

-- 
Regards,
-Bob

^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-27  8:00   ` Bob Liu
@ 2012-11-27  8:29     ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-11-27  8:29 UTC (permalink / raw)
  To: Bob Liu
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 11/27/2012 04:00 PM, Bob Liu wrote:
> Hi Tang,
>
> On Fri, Nov 23, 2012 at 6:44 PM, Tang Chen<tangchen@cn.fujitsu.com>  wrote:
>> [What we are doing]
>> This patchset provide a boot option for user to specify ZONE_MOVABLE memory
>> map for each node in the system.
>>
>> movablecore_map=nn[KMG]@ss[KMG]
>>
>> This option makes sure the memory range from ss to ss+nn is movable memory.
>>
>>
>> [Why we do this]
>> If we hot remove memory, the memory cannot contain kernel memory,
>> because Linux cannot migrate kernel memory currently. Therefore,
>> we have to guarantee that the hot-removed memory has only movable
>> memory.
>>
>> Linux has two boot options, kernelcore= and movablecore=, for
>> creating movable memory. These boot options can specify the amount
>> of memory to use as kernel or movable memory. Using them, we can
>> create ZONE_MOVABLE which has only movable memory.
>>
>> But this does not fulfill a requirement of memory hot remove, because
>> even if we specify the boot options, movable memory is distributed
>> evenly across the nodes. So when we want to hot remove memory whose
>> range is 0x80000000-0xc0000000, we have no way to specify
>> that memory as movable memory.
>>
>
> Sorry, I'm still not get your idea.
> Why you need a specify range that is movable?
> Could you describe the requirement and situation a bit more?
> Thank you.

Hi Liu,

This feature is used for memory hotplug.

In order to implement whole-node hotplug, we need to make sure the
node contains no kernel memory, because memory used by the kernel cannot
be migrated. (Since kernel memory is directly mapped,
VA = PA + __PAGE_OFFSET, the physical address cannot be changed.)

The user can specify all the memory on a node as movable, so that the
node can be hot-removed.

Another approach would be something like:
movable_node=1,3-5,8
This would set all the memory on those nodes as movable, while the rest
of memory works as usual. But movablecore_map is more flexible.

Thanks. :)
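For illustration, the proposed nn[KMG]@ss[KMG] syntax could be parsed roughly as follows. This is a user-space sketch, not the kernel's actual parser (in the kernel this would go through memparse()); `parse_size` and `parse_movablecore_map` are invented names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a number with an optional K/M/G binary suffix, e.g. "4G". */
uint64_t parse_size(const char *s, const char **end)
{
    char *e;
    uint64_t v = strtoull(s, &e, 0);

    switch (*e) {
    case 'K': case 'k': v <<= 10; e++; break;
    case 'M': case 'm': v <<= 20; e++; break;
    case 'G': case 'g': v <<= 30; e++; break;
    }
    if (end)
        *end = e;
    return v;
}

/* Split "nn[KMG]@ss[KMG]" into (start, size).
 * Returns 0 on success, -1 if the '@' separator is missing. */
int parse_movablecore_map(const char *arg, uint64_t *start, uint64_t *size)
{
    const char *p;
    uint64_t nn = parse_size(arg, &p);

    if (*p != '@')
        return -1;
    *size = nn;
    *start = parse_size(p + 1, NULL);
    return 0;
}
```

With Jianguo's example, "4G@0xa00000" yields size = 4 GiB and start = 0xa00000, which makes overlaps with a DMA range straightforward to check for.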

>
>> So we proposed a new feature which specifies memory range to use as
>> movable memory.
>>
>>
>> [Ways to do this]
>> There may be 2 ways to specify movable memory.
>>   1. use firmware information
>>   2. use boot option
>>
>> 1. use firmware information
>>    According to ACPI spec 5.0, SRAT table has memory affinity structure
>>    and the structure has a Hot Pluggable Field. See "5.2.16.2 Memory
>>    Affinity Structure". If we use the information, we might be able to
>>    specify movable memory by firmware. For example, if the Hot Pluggable
>>    Field is enabled, Linux sets the memory as movable memory.
>>
>> 2. use boot option
>>    This is our proposal. New boot option can specify memory range to use
>>    as movable memory.
>>
>>
>> [How we do this]
>> We chose second way, because if we use first way, users cannot change
>> memory range to use as movable memory easily. We think if we create
>> movable memory, performance regression may occur by NUMA. In this case,
>> user can turn off the feature easily if we prepare the boot option.
>> And if we prepare the boot option, the user can select which memory
>> to use as movable memory easily.
>>
>>
>> [How to use]
>> Specify the following boot option:
>> movablecore_map=nn[KMG]@ss[KMG]
>>
>> That means physical address range from ss to ss+nn will be allocated as
>> ZONE_MOVABLE.
>>
>> And the following points should be considered.
>>
>> 1) If the range is involved in a single node, then from ss to the end of
>>     the node will be ZONE_MOVABLE.
>> 2) If the range covers two or more nodes, then from ss to the end of
>>     the node will be ZONE_MOVABLE, and all the other nodes will only
>>     have ZONE_MOVABLE.
>> 3) If no range is in the node, then the node will have no ZONE_MOVABLE
>>     unless kernelcore or movablecore is specified.
>> 4) This option could be specified at most MAX_NUMNODES times.
>> 5) If kernelcore or movablecore is also specified, movablecore_map will have
>>     higher priority to be satisfied.
>> 6) This option has no conflict with memmap option.
>>
>>
>>
>> Tang Chen (4):
>>    page_alloc: add movable_memmap kernel parameter
>>    page_alloc: Introduce zone_movable_limit[] to keep movable limit for
>>      nodes
>>    page_alloc: Make movablecore_map has higher priority
>>    page_alloc: Bootmem limit with movablecore_map
>>
>> Yasuaki Ishimatsu (1):
>>    x86: get pg_data_t's memory from other node
>>
>>   Documentation/kernel-parameters.txt |   17 +++
>>   arch/x86/mm/numa.c                  |   11 ++-
>>   include/linux/memblock.h            |    1 +
>>   include/linux/mm.h                  |   11 ++
>>   mm/memblock.c                       |   15 +++-
>>   mm/page_alloc.c                     |  216 ++++++++++++++++++++++++++++++++++-
>>   6 files changed, 263 insertions(+), 8 deletions(-)
>>
>


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
@ 2012-11-27  8:29     ` Tang Chen
  0 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-11-27  8:29 UTC (permalink / raw)
  To: Bob Liu
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 11/27/2012 04:00 PM, Bob Liu wrote:
> Hi Tang,
>
> On Fri, Nov 23, 2012 at 6:44 PM, Tang Chen<tangchen@cn.fujitsu.com>  wrote:
>> [What we are doing]
>> This patchset provide a boot option for user to specify ZONE_MOVABLE memory
>> map for each node in the system.
>>
>> movablecore_map=nn[KMG]@ss[KMG]
>>
>> This option make sure memory range from ss to ss+nn is movable memory.
>>
>>
>> [Why we do this]
>> If we hot remove a memroy, the memory cannot have kernel memory,
>> because Linux cannot migrate kernel memory currently. Therefore,
>> we have to guarantee that the hot removed memory has only movable
>> memoroy.
>>
>> Linux has two boot options, kernelcore= and movablecore=, for
>> creating movable memory. These boot options can specify the amount
>> of memory use as kernel or movable memory. Using them, we can
>> create ZONE_MOVABLE which has only movable memory.
>>
>> But it does not fulfill a requirement of memory hot remove, because
>> even if we specify the boot options, movable memory is distributed
>> in each node evenly. So when we want to hot remove memory which
>> memory range is 0x80000000-0c0000000, we have no way to specify
>> the memory as movable memory.
>>
>
> Sorry, I'm still not get your idea.
> Why you need a specify range that is movable?
> Could you describe the requirement and situation a bit more?
> Thank you.

Hi Liu,

This feature is used in memory hotplug.

In order to implement a whole node hotplug, we need to make sure the
node contains no kernel memory, because memory used by kernel could
not be migrated. (Since the kernel memory is directly mapped,
VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)

User could specify all the memory on a node to be movable, so that the
node could be hot-removed.

Another approach is like the following:
movable_node = 1,3-5,8
This could set all the memory on the nodes to be movable. And the rest
of memory works as usual. But movablecore_map is more flexible.

Thanks. :)

>
>> So we proposed a new feature which specifies memory range to use as
>> movable memory.
>>
>>
>> [Ways to do this]
>> There may be 2 ways to specify movable memory.
>>   1. use firmware information
>>   2. use boot option
>>
>> 1. use firmware information
>>    According to the ACPI 5.0 spec, the SRAT table has a memory affinity
>>    structure, and that structure has a Hot Pluggable Field. See "5.2.16.2
>>    Memory Affinity Structure". Using this information, we might be able to
>>    specify movable memory via firmware. For example, if the Hot Pluggable
>>    Field is enabled, Linux sets the memory as movable memory.
>>
>> 2. use boot option
>>    This is our proposal. New boot option can specify memory range to use
>>    as movable memory.
>>
>>
>> [How we do this]
>> We chose the second way, because with the first way users cannot easily
>> change which memory range is used as movable memory. Creating
>> movable memory may cause a NUMA-related performance regression; in that
>> case, the user can easily turn the feature off if we provide a boot option.
>> And with a boot option, the user can easily select which memory
>> to use as movable memory.
>>
>>
>> [How to use]
>> Specify the following boot option:
>> movablecore_map=nn[KMG]@ss[KMG]
>>
>> That means the physical address range from ss to ss+nn will be allocated as
>> ZONE_MOVABLE.
>>
>> And the following points should be considered.
>>
>> 1) If the range is contained in a single node, then from ss to the end of
>>     that node will be ZONE_MOVABLE.
>> 2) If the range covers two or more nodes, then from ss to the end of
>>     the first node will be ZONE_MOVABLE, and all the other covered nodes
>>     will have only ZONE_MOVABLE.
>> 3) If no range is in the node, then the node will have no ZONE_MOVABLE
>>     unless kernelcore or movablecore is specified.
>> 4) This option could be specified at most MAX_NUMNODES times.
>> 5) If kernelcore or movablecore is also specified, movablecore_map will take
>>     higher priority to be satisfied.
>> 6) This option has no conflict with memmap option.
>>
>>
>>
>> Tang Chen (4):
>>    page_alloc: add movable_memmap kernel parameter
>>    page_alloc: Introduce zone_movable_limit[] to keep movable limit for
>>      nodes
>>    page_alloc: Make movablecore_map has higher priority
>>    page_alloc: Bootmem limit with movablecore_map
>>
>> Yasuaki Ishimatsu (1):
>>    x86: get pg_data_t's memory from other node
>>
>>   Documentation/kernel-parameters.txt |   17 +++
>>   arch/x86/mm/numa.c                  |   11 ++-
>>   include/linux/memblock.h            |    1 +
>>   include/linux/mm.h                  |   11 ++
>>   mm/memblock.c                       |   15 +++-
>>   mm/page_alloc.c                     |  216 ++++++++++++++++++++++++++++++++++-
>>   6 files changed, 263 insertions(+), 8 deletions(-)
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email:<a href=mailto:"dont@kvack.org">  email@kvack.org</a>
>


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-27  8:29     ` Tang Chen
@ 2012-11-27  8:49       ` H. Peter Anvin
  -1 siblings, 0 replies; 170+ messages in thread
From: H. Peter Anvin @ 2012-11-27  8:49 UTC (permalink / raw)
  To: Tang Chen
  Cc: Bob Liu, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 11/27/2012 12:29 AM, Tang Chen wrote:
> Another approach is like the following:
> movable_node = 1,3-5,8
> This could set all the memory on the nodes to be movable. And the rest
> of memory works as usual. But movablecore_map is more flexible.

... but *much* harder for users, so movable_node is better in most cases.

	-hpa


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-27  8:49       ` H. Peter Anvin
@ 2012-11-27  9:47         ` Wen Congyang
  -1 siblings, 0 replies; 170+ messages in thread
From: Wen Congyang @ 2012-11-27  9:47 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Tang Chen, Bob Liu, akpm, rob, isimatu.yasuaki, laijs, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc,
	Rafael J. Wysocki

At 11/27/2012 04:49 PM, H. Peter Anvin Wrote:
> On 11/27/2012 12:29 AM, Tang Chen wrote:
>> Another approach is like the following:
>> movable_node = 1,3-5,8
>> This could set all the memory on the nodes to be movable. And the rest
>> of memory works as usual. But movablecore_map is more flexible.
> 
> ... but *much* harder for users, so movable_node is better in most cases.

But NUMA is initialized very late, and we need the information in the SRAT...

Thanks
Wen Congyang

> 
> 	-hpa
> 
> 


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-27  9:47         ` Wen Congyang
@ 2012-11-27  9:53           ` H. Peter Anvin
  -1 siblings, 0 replies; 170+ messages in thread
From: H. Peter Anvin @ 2012-11-27  9:53 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Tang Chen, Bob Liu, akpm, rob, isimatu.yasuaki, laijs, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc,
	Rafael J. Wysocki

On 11/27/2012 01:47 AM, Wen Congyang wrote:
> At 11/27/2012 04:49 PM, H. Peter Anvin Wrote:
>> On 11/27/2012 12:29 AM, Tang Chen wrote:
>>> Another approach is like the following:
>>> movable_node = 1,3-5,8
>>> This could set all the memory on the nodes to be movable. And the rest
>>> of memory works as usual. But movablecore_map is more flexible.
>>
>> ... but *much* harder for users, so movable_node is better in most cases.
>
> But numa is initialized very later, and we need the information in SRAT...
>
> Thanks
> Wen Congyang
>

I think you need to deal with it for usability reasons, though...


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-27  8:49       ` H. Peter Anvin
@ 2012-11-27  9:59         ` Yasuaki Ishimatsu
  -1 siblings, 0 replies; 170+ messages in thread
From: Yasuaki Ishimatsu @ 2012-11-27  9:59 UTC (permalink / raw)
  To: H. Peter Anvin, Tang Chen
  Cc: Bob Liu, akpm, rob, laijs, wency, linfeng, jiang.liu, yinghai,
	kosaki.motohiro, minchan.kim, mgorman, rientjes, rusty,
	linux-kernel, linux-mm, linux-doc

Hi HPA and Tang,

2012/11/27 17:49, H. Peter Anvin wrote:
> On 11/27/2012 12:29 AM, Tang Chen wrote:
>> Another approach is like the following:
>> movable_node = 1,3-5,8
>> This could set all the memory on the nodes to be movable. And the rest
>> of memory works as usual. But movablecore_map is more flexible.
>
> ... but *much* harder for users, so movable_node is better in most cases.

It seems that movable_node is easier to use than movablecore_map.
But I do not think movable_node is better, because the node number is
assigned by the OS and can change easily.


For example:
If the system has 4 nodes and we set movable_node=2, we can hot remove node2.

    node0   node1   node2   node3
   +-----+ +-----+ +-----+ +-----+
   |     | |     | |/////| |     |
   |     | |     | |/////| |     |
   |     | |     | |/////| |     |
   |     | |     | |/////| |     |
   +-----+ +-----+ +-----+ +-----+
                   movable
                    node

But if we hot remove node2 and reboot the system, node3 becomes node2
and is set as the movable node.

    node0   node1           node2
   +-----+ +-----+         +-----+
   |     | |     |         |/////|
   |     | |     |         |/////|
   |     | |     |         |/////|
   |     | |     |         |/////|
   +-----+ +-----+         +-----+
                           movable
                            node

Originally, node3 is not a movable node. Changing the node's attribute to
movable is not intended. So users of movable_node must confirm
that the boot option is still correct at every hotplug.

But memory ranges are set by firmware and do not change. So if we set node2
as a movable node with movablecore_map, this issue does not occur.

Thanks,
Yasuaki Ishimatsu

>
> 	-hpa
>



^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-27  8:29     ` Tang Chen
@ 2012-11-27 12:09       ` Bob Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Bob Liu @ 2012-11-27 12:09 UTC (permalink / raw)
  To: Tang Chen
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, m.szyprowski

On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen <tangchen@cn.fujitsu.com> wrote:
> On 11/27/2012 04:00 PM, Bob Liu wrote:
>>
>> Hi Tang,
>>
>> On Fri, Nov 23, 2012 at 6:44 PM, Tang Chen<tangchen@cn.fujitsu.com>
>> wrote:
>>>
>>> [What we are doing]
>>> This patchset provides a boot option for users to specify the ZONE_MOVABLE
>>> memory
>>> map for each node in the system.
>>>
>>> movablecore_map=nn[KMG]@ss[KMG]
>>>
>>> This option makes sure that the memory range from ss to ss+nn is movable.
>>>
>>>
>>> [Why we do this]
>>> When we hot remove memory, that memory cannot contain kernel memory,
>>> because Linux currently cannot migrate kernel memory. Therefore,
>>> we have to guarantee that the hot-removed memory contains only movable
>>> memory.
>>>
>>> Linux has two boot options, kernelcore= and movablecore=, for
>>> creating movable memory. These boot options can specify the amount
>>> of memory use as kernel or movable memory. Using them, we can
>>> create ZONE_MOVABLE which has only movable memory.
>>>
>>> But they do not fulfill a requirement of memory hot remove, because
>>> even if we specify the boot options, movable memory is distributed
>>> evenly across the nodes. So when we want to hot remove memory whose
>>> range is 0x80000000-0xc0000000, we have no way to specify
>>> that memory as movable memory.
>>>
>>
>> Sorry, I still don't get your idea.
>> Why do you need to specify a range that is movable?
>> Could you describe the requirement and situation a bit more?
>> Thank you.
>
>
> Hi Liu,
>
> This feature is used in memory hotplug.
>
> In order to implement a whole node hotplug, we need to make sure the
> node contains no kernel memory, because memory used by kernel could
> not be migrated. (Since the kernel memory is directly mapped,
> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>
> User could specify all the memory on a node to be movable, so that the
> node could be hot-removed.
>

Thank you for your explanation. It's reasonable.

But I think it's a bit duplicated with CMA. I'm not sure, but maybe we
can combine it with CMA, which is already in mainline?

> Another approach is like the following:
> movable_node = 1,3-5,8
> This could set all the memory on the nodes to be movable. And the rest
> of memory works as usual. But movablecore_map is more flexible.
>
> Thanks. :)
>
>
>>
>>> So we proposed a new feature which specifies memory range to use as
>>> movable memory.
>>>
>>>
>>> [Ways to do this]
>>> There may be 2 ways to specify movable memory.
>>>   1. use firmware information
>>>   2. use boot option
>>>
>>> 1. use firmware information
>>>    According to the ACPI 5.0 spec, the SRAT table has a memory affinity
>>>    structure, and that structure has a Hot Pluggable Field. See "5.2.16.2
>>>    Memory Affinity Structure". Using this information, we might be able to
>>>    specify movable memory via firmware. For example, if the Hot Pluggable
>>>    Field is enabled, Linux sets the memory as movable memory.
>>>
>>> 2. use boot option
>>>    This is our proposal. New boot option can specify memory range to use
>>>    as movable memory.
>>>
>>>
>>> [How we do this]
>>> We chose the second way, because with the first way users cannot easily
>>> change which memory range is used as movable memory. Creating
>>> movable memory may cause a NUMA-related performance regression; in that
>>> case, the user can easily turn the feature off if we provide a boot option.
>>> And with a boot option, the user can easily select which memory
>>> to use as movable memory.
>>>
>>>
>>> [How to use]
>>> Specify the following boot option:
>>> movablecore_map=nn[KMG]@ss[KMG]
>>>
>>> That means physical address range from ss to ss+nn will be allocated as
>>> ZONE_MOVABLE.
>>>
>>> And the following points should be considered.
>>>
>>> 1) If the range is involved in a single node, then from ss to the end of
>>>     the node will be ZONE_MOVABLE.
>>> 2) If the range covers two or more nodes, then from ss to the end of
>>>     the node will be ZONE_MOVABLE, and all the other nodes will only
>>>     have ZONE_MOVABLE.
>>> 3) If no range is in the node, then the node will have no ZONE_MOVABLE
>>>     unless kernelcore or movablecore is specified.
>>> 4) This option could be specified at most MAX_NUMNODES times.
>>> 5) If kernelcore or movablecore is also specified, movablecore_map will
>>> have
>>>     higher priority to be satisfied.
>>> 6) This option has no conflict with memmap option.
>>>
>>>
>>>
>>> Tang Chen (4):
>>>    page_alloc: add movable_memmap kernel parameter
>>>    page_alloc: Introduce zone_movable_limit[] to keep movable limit for
>>>      nodes
>>>    page_alloc: Make movablecore_map has higher priority
>>>    page_alloc: Bootmem limit with movablecore_map
>>>
>>> Yasuaki Ishimatsu (1):
>>>    x86: get pg_data_t's memory from other node
>>>
>>>   Documentation/kernel-parameters.txt |   17 +++
>>>   arch/x86/mm/numa.c                  |   11 ++-
>>>   include/linux/memblock.h            |    1 +
>>>   include/linux/mm.h                  |   11 ++
>>>   mm/memblock.c                       |   15 +++-
>>>   mm/page_alloc.c                     |  216
>>> ++++++++++++++++++++++++++++++++++-
>>>   6 files changed, 263 insertions(+), 8 deletions(-)
>>>
>>
>>
>
-- 
Regards,
--Bob

^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-27 12:09       ` Bob Liu
@ 2012-11-27 12:49         ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-11-27 12:49 UTC (permalink / raw)
  To: Bob Liu
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, m.szyprowski

On 11/27/2012 08:09 PM, Bob Liu wrote:
> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen<tangchen@cn.fujitsu.com>  wrote:
>> Hi Liu,
>>
>> This feature is used in memory hotplug.
>>
>> In order to implement a whole node hotplug, we need to make sure the
>> node contains no kernel memory, because memory used by kernel could
>> not be migrated. (Since the kernel memory is directly mapped,
>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>
>> User could specify all the memory on a node to be movable, so that the
>> node could be hot-removed.
>>
>
> Thank you for your explanation. It's reasonable.
>
> But i think it's a bit duplicated with CMA, i'm not sure but maybe we
> can combine it with CMA which already in mainline?
>
Hi Liu,

Thanks for your advice. :)

CMA is the Contiguous Memory Allocator, right? What I'm trying to do is
control where the start of ZONE_MOVABLE is on each node. Could
CMA do this job?

Also, after a short investigation, CMA seems to be based on
memblock. But we need to prevent memblock from allocating memory in
ZONE_MOVABLE. As a result, we need to know the ranges before memblock
can be used. I'm afraid we still need an approach to get the ranges,
such as a boot option, or static ACPI tables such as SRAT/MPST.

I don't know much about CMA for now. So if you have any better idea,
please share it with us, thanks. :)



^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-27 12:49         ` Tang Chen
@ 2012-11-28  3:24           ` Bob Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Bob Liu @ 2012-11-28  3:24 UTC (permalink / raw)
  To: Tang Chen
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, m.szyprowski

On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen <tangchen@cn.fujitsu.com> wrote:
> On 11/27/2012 08:09 PM, Bob Liu wrote:
>>
>> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen<tangchen@cn.fujitsu.com>
>> wrote:
>>>
>>> Hi Liu,
>>>
>>>
>>> This feature is used in memory hotplug.
>>>
>>> In order to implement a whole node hotplug, we need to make sure the
>>> node contains no kernel memory, because memory used by kernel could
>>> not be migrated. (Since the kernel memory is directly mapped,
>>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>>
>>> User could specify all the memory on a node to be movable, so that the
>>> node could be hot-removed.
>>>
>>
>> Thank you for your explanation. It's reasonable.
>>
>> But i think it's a bit duplicated with CMA, i'm not sure but maybe we
>> can combine it with CMA which already in mainline?
>>
> Hi Liu,
>
> Thanks for your advice. :)
>
> CMA is Contiguous Memory Allocator, right?  What I'm trying to do is
> controlling where is the start of ZONE_MOVABLE of each node. Could
> CMA do this job ?

CMA will not control the start of ZONE_MOVABLE on each node, but it
can declare a memory area that is always movable, and non-movable
allocation requests will never be served from that area.

Currently CMA uses a boot parameter, "cma=", to declare a memory size
that is always movable. I think it might fulfill your requirement if
the boot parameter were extended with a start address.

more info at http://lwn.net/Articles/468044/
>
> And also, after a short investigation, CMA seems need to base on
> memblock. But we need to limit memblock not to allocate memory on
> ZONE_MOVABLE. As a result, we need to know the ranges before memblock
> could be used. I'm afraid we still need an approach to get the ranges,
> such as a boot option, or from static ACPI tables such as SRAT/MPST.
>

Yes, it's based on memblock and configured with a boot option.
In setup_arch32(),
    dma_contiguous_reserve(0);
declares a CMA area using memblock_reserve().

> I'm don't know much about CMA for now. So if you have any better idea,
> please share with us, thanks. :)

My idea is to reuse CMA as in the patch below (not even compiled) and
boot with "cma=size@start_address".
I don't know whether it can work or whether it suits your requirement;
if not, forgive me for the noise.

diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 612afcc..564962a 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
  */
 static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
 static long size_cmdline = -1;
+static long cma_start_cmdline = -1;

 static int __init early_cma(char *p)
 {
+       char *oldp;
        pr_debug("%s(%s)\n", __func__, p);
+       oldp = p;
        size_cmdline = memparse(p, &p);
+
+       if (*p == '@')
+               cma_start_cmdline = memparse(p+1, &p);
+       printk("cma start: 0x%lx, size: 0x%lx\n", cma_start_cmdline, size_cmdline);
        return 0;
 }
 early_param("cma", early_cma);
@@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
        if (selected_size) {
                pr_debug("%s: reserving %ld MiB for global area\n", __func__,
                         selected_size / SZ_1M);
-
-               dma_declare_contiguous(NULL, selected_size, 0, limit);
+               if (cma_start_cmdline != -1)
+                       dma_declare_contiguous(NULL, selected_size, cma_start_cmdline, limit);
+               else
+                       dma_declare_contiguous(NULL, selected_size, 0, limit);
        }
 };

-- 
Regards,
--Bob

^ permalink raw reply related	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-27  3:10   ` wujianguo
@ 2012-11-28  3:47     ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-11-28  3:47 UTC (permalink / raw)
  To: wujianguo
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 11/27/2012 11:10 AM, wujianguo wrote:
>
> Hi Tang,
> 	DMA address can't be set as movable, if some one boot kernel with
> movablecore_map=4G@0xa00000 or other memory region that contains DMA address,
> system maybe boot failed. Should this case be handled or mentioned
> in the change log and kernel-parameters.txt?

Hi Wu,

I think we can use MAX_DMA_PFN and MAX_DMA32_PFN to prevent setting DMA
addresses as movable: just ignore the addresses lower than them, and set
the rest as movable. What do you think?

And, since we cannot figure out the minimum amount of memory the kernel
needs, I think for now we can just add some warnings to
kernel-parameters.txt.

Thanks. :)
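The clamping described above could look roughly like the sketch below.
This is a minimal illustration under assumed x86_64-style constants;
clamp_movable_start is a hypothetical helper, not code from the patchset:

```c
#include <assert.h>

/* Illustrative values: on x86_64, MAX_DMA32_PFN corresponds to 4GB. */
#define PAGE_SHIFT	12
#define MAX_DMA32_PFN	(0x100000000ULL >> PAGE_SHIFT)

/* Return the start PFN of the movable part of [start_pfn, end_pfn):
 * everything below MAX_DMA32_PFN is ignored (left usable for DMA).
 * Returns end_pfn when the whole range lies below the boundary,
 * i.e. nothing is marked movable. */
static unsigned long long clamp_movable_start(unsigned long long start_pfn,
					      unsigned long long end_pfn)
{
	if (end_pfn <= MAX_DMA32_PFN)
		return end_pfn;		/* nothing left to mark movable */
	if (start_pfn < MAX_DMA32_PFN)
		return MAX_DMA32_PFN;	/* drop the DMA-able low part */
	return start_pfn;		/* range already above the boundary */
}
```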

>
> Thanks,
> Jianguo Wu
>

^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-28  3:47     ` Tang Chen
@ 2012-11-28  4:01       ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-28  4:01 UTC (permalink / raw)
  To: Tang Chen
  Cc: wujianguo, hpa, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 2012-11-28 11:47, Tang Chen wrote:
> On 11/27/2012 11:10 AM, wujianguo wrote:
>>
>> Hi Tang,
>>     DMA address can't be set as movable, if some one boot kernel with
>> movablecore_map=4G@0xa00000 or other memory region that contains DMA address,
>> system maybe boot failed. Should this case be handled or mentioned
>> in the change log and kernel-parameters.txt?
> 
> Hi Wu,
> 
> I think we can use MAX_DMA_PFN and MAX_DMA32_PFN to prevent setting DMA
> address as movable. Just ignore the address lower than them, and set
> the rest as movable. How do you think ?
> 
> And, since we cannot figure out the minimum of memory kernel needs, I
> think for now, we can just add some warning into kernel-parameters.txt.
> 
> Thanks. :)
On another OS, there is a mechanism to dynamically convert pages from
movable zones into normal zones.

Regards!
Gerry

> 
>>
>> Thanks,
>> Jianguo Wu
>>
> 
> .
> 



^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-28  3:24           ` Bob Liu
@ 2012-11-28  4:08             ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-28  4:08 UTC (permalink / raw)
  To: Bob Liu
  Cc: Tang Chen, hpa, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, m.szyprowski

On 2012-11-28 11:24, Bob Liu wrote:
> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen <tangchen@cn.fujitsu.com> wrote:
>> On 11/27/2012 08:09 PM, Bob Liu wrote:
>>>
>>> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen<tangchen@cn.fujitsu.com>
>>> wrote:
>>>>
>>>> Hi Liu,
>>>>
>>>>
>>>> This feature is used in memory hotplug.
>>>>
>>>> In order to implement a whole node hotplug, we need to make sure the
>>>> node contains no kernel memory, because memory used by kernel could
>>>> not be migrated. (Since the kernel memory is directly mapped,
>>>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>>>
>>>> User could specify all the memory on a node to be movable, so that the
>>>> node could be hot-removed.
>>>>
>>>
>>> Thank you for your explanation. It's reasonable.
>>>
>>> But i think it's a bit duplicated with CMA, i'm not sure but maybe we
>>> can combine it with CMA which already in mainline?
>>>
>> Hi Liu,
>>
>> Thanks for your advice. :)
>>
>> CMA is Contiguous Memory Allocator, right?  What I'm trying to do is
>> controlling where is the start of ZONE_MOVABLE of each node. Could
>> CMA do this job ?
> 
> cma will not control the start of ZONE_MOVABLE of each node, but it
> can declare a memory that always movable
> and all non movable allocate request will not happen on that area.
> 
> Currently cma use a boot parameter "cma=" to declare a memory size
> that always movable.
> I think it might fulfill your requirement if extending the boot
> parameter with a start address.
> 
> more info at http://lwn.net/Articles/468044/
>>
>> And also, after a short investigation, CMA seems need to base on
>> memblock. But we need to limit memblock not to allocate memory on
>> ZONE_MOVABLE. As a result, we need to know the ranges before memblock
>> could be used. I'm afraid we still need an approach to get the ranges,
>> such as a boot option, or from static ACPI tables such as SRAT/MPST.
>>
> 
> Yes, it's based on memblock and with boot option.
> In setup_arch32()
>     dma_contiguous_reserve(0);   => will declare a cma area using
> memblock_reserve()
> 
>> I'm don't know much about CMA for now. So if you have any better idea,
>> please share with us, thanks. :)
> 
> My idea is reuse cma like below patch(even not compiled) and boot with
> "cma=size@start_address".
> I don't know whether it can work and whether suitable for your
> requirement, if not forgive me for this noises.
> 
> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
> index 612afcc..564962a 100644
> --- a/drivers/base/dma-contiguous.c
> +++ b/drivers/base/dma-contiguous.c
> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
>   */
>  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>  static long size_cmdline = -1;
> +static long cma_start_cmdline = -1;
> 
>  static int __init early_cma(char *p)
>  {
> +       char *oldp;
>         pr_debug("%s(%s)\n", __func__, p);
> +       oldp = p;
>         size_cmdline = memparse(p, &p);
> +
> +       if (*p == '@')
> +               cma_start_cmdline = memparse(p+1, &p);
> +       printk("cma start:0x%x, size: 0x%x\n", size_cmdline, cma_start_cmdline);
>         return 0;
>  }
>  early_param("cma", early_cma);
> @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>         if (selected_size) {
>                 pr_debug("%s: reserving %ld MiB for global area\n", __func__,
>                          selected_size / SZ_1M);
> -
> -               dma_declare_contiguous(NULL, selected_size, 0, limit);
> +               if (cma_size_cmdline != -1)
> +                       dma_declare_contiguous(NULL, selected_size,
> cma_start_cmdline, limit);
> +               else
> +                       dma_declare_contiguous(NULL, selected_size, 0, limit);
>         }
>  };
Reserving memory by reusing the CMA logic seems a good idea, though it
needs more investigation. One of CMA's goals is to ensure that pages in
a CMA area are really movable, and at first glance this patchset tries
to achieve the same goal.

 



^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-28  3:47     ` Tang Chen
@ 2012-11-28  4:53       ` Jianguo Wu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jianguo Wu @ 2012-11-28  4:53 UTC (permalink / raw)
  To: Tang Chen
  Cc: wujianguo, hpa, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, jiang.liu, yinghai, kosaki.motohiro, minchan.kim,
	mgorman, rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 2012/11/28 11:47, Tang Chen wrote:

> On 11/27/2012 11:10 AM, wujianguo wrote:
>>
>> Hi Tang,
>>     DMA address can't be set as movable, if some one boot kernel with
>> movablecore_map=4G@0xa00000 or other memory region that contains DMA address,
>> system maybe boot failed. Should this case be handled or mentioned
>> in the change log and kernel-parameters.txt?
> 
> Hi Wu,
> 
> I think we can use MAX_DMA_PFN and MAX_DMA32_PFN to prevent setting DMA
> address as movable. Just ignore the address lower than them, and set
> the rest as movable. How do you think ?
> 

I think it's OK for now.

> And, since we cannot figure out the minimum of memory kernel needs, I
> think for now, we can just add some warning into kernel-parameters.txt.
> 
> Thanks. :)
> 
>>
>> Thanks,
>> Jianguo Wu
>>
> 
> .
> 




^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-28  5:21         ` Wen Congyang
@ 2012-11-28  5:17           ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-28  5:17 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Tang Chen, wujianguo, hpa, akpm, rob, isimatu.yasuaki, laijs,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 2012-11-28 13:21, Wen Congyang wrote:
> At 11/28/2012 12:01 PM, Jiang Liu Wrote:
>> On 2012-11-28 11:47, Tang Chen wrote:
>>> On 11/27/2012 11:10 AM, wujianguo wrote:
>>>>
>>>> Hi Tang,
>>>>     DMA address can't be set as movable, if some one boot kernel with
>>>> movablecore_map=4G@0xa00000 or other memory region that contains DMA address,
>>>> system maybe boot failed. Should this case be handled or mentioned
>>>> in the change log and kernel-parameters.txt?
>>>
>>> Hi Wu,
>>>
>>> I think we can use MAX_DMA_PFN and MAX_DMA32_PFN to prevent setting DMA
>>> address as movable. Just ignore the address lower than them, and set
>>> the rest as movable. How do you think ?
>>>
>>> And, since we cannot figure out the minimum of memory kernel needs, I
>>> think for now, we can just add some warning into kernel-parameters.txt.
>>>
>>> Thanks. :)
>> On one other OS, there is a mechanism to dynamically convert pages from
>> movable zones into normal zones.
> 
> The OS auto does it? Or the user coverts it?
> 
> We can convert pages from movable zones into normal zones by the following
> interface:
> echo online_kernel >/sys/devices/system/memory/memoryX/state
> 
> We have posted a patchset to implement it, and it is in mm tree now.
The OS converts it automatically; no manual operation is needed.



^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-28  4:01       ` Jiang Liu
@ 2012-11-28  5:21         ` Wen Congyang
  -1 siblings, 0 replies; 170+ messages in thread
From: Wen Congyang @ 2012-11-28  5:21 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Tang Chen, wujianguo, hpa, akpm, rob, isimatu.yasuaki, laijs,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

At 11/28/2012 12:01 PM, Jiang Liu Wrote:
> On 2012-11-28 11:47, Tang Chen wrote:
>> On 11/27/2012 11:10 AM, wujianguo wrote:
>>>
>>> Hi Tang,
>>>     DMA address can't be set as movable, if some one boot kernel with
>>> movablecore_map=4G@0xa00000 or other memory region that contains DMA address,
>>> system maybe boot failed. Should this case be handled or mentioned
>>> in the change log and kernel-parameters.txt?
>>
>> Hi Wu,
>>
>> I think we can use MAX_DMA_PFN and MAX_DMA32_PFN to prevent setting DMA
>> address as movable. Just ignore the address lower than them, and set
>> the rest as movable. How do you think ?
>>
>> And, since we cannot figure out the minimum of memory kernel needs, I
>> think for now, we can just add some warning into kernel-parameters.txt.
>>
>> Thanks. :)
> On one other OS, there is a mechanism to dynamically convert pages from
> movable zones into normal zones.

Does the OS do it automatically, or does the user convert it?

We can convert pages from movable zones into normal zones by the following
interface:
echo online_kernel >/sys/devices/system/memory/memoryX/state

We have posted a patchset to implement it, and it is in mm tree now.
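Driving that sysfs interface programmatically only needs the per-block
state path. A small sketch; build_state_path and the block number 32 are
hypothetical, and the actual write requires root on a kernel with the
patchset applied, so it is shown only in a comment:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the sysfs state-file path for memory block N. */
static void build_state_path(int block, char *buf, size_t len)
{
	snprintf(buf, len, "/sys/devices/system/memory/memory%d/state", block);
}

/* On a real system, converting the block back to kernel memory would be:
 *
 *	FILE *f = fopen(path, "w");
 *	fputs("online_kernel", f);
 *	fclose(f);
 */
```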

Thanks
Wen Congyang

> 
> Regards!
> Gerry
> 
>>
>>>
>>> Thanks,
>>> Jianguo Wu
>>>
>>
>> .
>>
> 
> 
> 


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-28  4:08             ` Jiang Liu
@ 2012-11-28  6:16               ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-11-28  6:16 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Bob Liu, hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	yinghai, kosaki.motohiro, minchan.kim, mgorman, rientjes, rusty,
	linux-kernel, linux-mm, linux-doc, m.szyprowski

Hi Bob, Liu Jiang,

Could you give me more information about CMA?
Thanks for your patience and the nice advice. :)


1) I saw the following on http://lwn.net/Articles/447405/:

The "CMA" type is sticky; pages which are marked as being for CMA
should never have their migration type changed by the kernel.

As Wen said, we now support a user interface to change movable memory
into kernel memory. But from the above, memory specified as CMA can
never have its migration type changed, right?  If so, I don't think
reusing CMA is a good idea.


2) Is CMA implemented only on the ARM platform ?  I found the following
in kernel-parameters.txt.

cma=nn[MG]      [ARM,KNL]
         Sets the size of kernel global memory area for contiguous
         memory allocations. For more information, see
         include/linux/dma-contiguous.h

We are developing on x86. Could we use it ?


3) Is CMA used only for DMA ? I am a little confused here. :)
I see that the main CMA code is implemented in dma-contiguous.c.


4) The boot options cma=xxx and movablecore_map=xxx mean different
things to the user. I'm afraid reusing CMA could confuse users.

And even if we reuse the "cma=" option, we still need to do the work
in patches 3~5, right ?
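For illustration only (these addresses and sizes are made up, and the
@start form of cma= is Bob's proposed extension, not a mainline option),
the two options would look like this on a kernel command line:

```
cma=64M@0x20000000                  # one global contiguous area, size@start
movablecore_map=4G@0x100000000      # proposed: per-range movable memory, size@start
```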


Thanks. :)



On 11/28/2012 12:08 PM, Jiang Liu wrote:
> On 2012-11-28 11:24, Bob Liu wrote:
>> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen<tangchen@cn.fujitsu.com>  wrote:
>>> On 11/27/2012 08:09 PM, Bob Liu wrote:
>>>>
>>>> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen<tangchen@cn.fujitsu.com>
>>>> wrote:
>>>>>
>>>>> Hi Liu,
>>>>>
>>>>>
>>>>> This feature is used in memory hotplug.
>>>>>
>>>>> In order to implement a whole node hotplug, we need to make sure the
>>>>> node contains no kernel memory, because memory used by kernel could
>>>>> not be migrated. (Since the kernel memory is directly mapped,
>>>>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>>>>
>>>>> User could specify all the memory on a node to be movable, so that the
>>>>> node could be hot-removed.
>>>>>
>>>>
>>>> Thank you for your explanation. It's reasonable.
>>>>
>>>> But i think it's a bit duplicated with CMA, i'm not sure but maybe we
>>>> can combine it with CMA which already in mainline?
>>>>
>>> Hi Liu,
>>>
>>> Thanks for your advice. :)
>>>
>>> CMA is Contiguous Memory Allocator, right?  What I'm trying to do is
>>> controlling where is the start of ZONE_MOVABLE of each node. Could
>>> CMA do this job ?
>>
>> cma will not control the start of ZONE_MOVABLE of each node, but it
>> can declare a memory that always movable
>> and all non movable allocate request will not happen on that area.
>>
>> Currently cma use a boot parameter "cma=" to declare a memory size
>> that always movable.
>> I think it might fulfill your requirement if extending the boot
>> parameter with a start address.
>>
>> more info at http://lwn.net/Articles/468044/
>>>
>>> And also, after a short investigation, CMA seems need to base on
>>> memblock. But we need to limit memblock not to allocate memory on
>>> ZONE_MOVABLE. As a result, we need to know the ranges before memblock
>>> could be used. I'm afraid we still need an approach to get the ranges,
>>> such as a boot option, or from static ACPI tables such as SRAT/MPST.
>>>
>>
>> Yes, it's based on memblock and with boot option.
>> In setup_arch32()
>>      dma_contiguous_reserve(0);   =>  will declare a cma area using
>> memblock_reserve()
>>
>>> I'm don't know much about CMA for now. So if you have any better idea,
>>> please share with us, thanks. :)
>>
>> My idea is reuse cma like below patch(even not compiled) and boot with
>> "cma=size@start_address".
>> I don't know whether it can work and whether suitable for your
>> requirement, if not forgive me for this noises.
>>
>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>> index 612afcc..564962a 100644
>> --- a/drivers/base/dma-contiguous.c
>> +++ b/drivers/base/dma-contiguous.c
>> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
>>    */
>>   static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>>   static long size_cmdline = -1;
>> +static long cma_start_cmdline = -1;
>>
>>   static int __init early_cma(char *p)
>>   {
>> +       char *oldp;
>>          pr_debug("%s(%s)\n", __func__, p);
>> +       oldp = p;
>>          size_cmdline = memparse(p,&p);
>> +
>> +       if (*p == '@')
>> +               cma_start_cmdline = memparse(p+1,&p);
>> +       printk("cma start:0x%x, size: 0x%x\n", size_cmdline, cma_start_cmdline);
>>          return 0;
>>   }
>>   early_param("cma", early_cma);
>> @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>>          if (selected_size) {
>>                  pr_debug("%s: reserving %ld MiB for global area\n", __func__,
>>                           selected_size / SZ_1M);
>> -
>> -               dma_declare_contiguous(NULL, selected_size, 0, limit);
>> +               if (cma_size_cmdline != -1)
>> +                       dma_declare_contiguous(NULL, selected_size,
>> cma_start_cmdline, limit);
>> +               else
>> +                       dma_declare_contiguous(NULL, selected_size, 0, limit);
>>          }
>>   };
> Seems a good idea to reserve memory by reusing CMA logic, though need more
> investigation here. One of CMA goal is to ensure pages in CMA are really
> movable, and this patchset tries to achieve the same goal at a first glance.
>
>
>
>
>


^ permalink raw reply	[flat|nested] 170+ messages in thread


* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-28  6:16               ` Tang Chen
@ 2012-11-28  7:03                 ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-28  7:03 UTC (permalink / raw)
  To: Tang Chen
  Cc: Bob Liu, hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	yinghai, kosaki.motohiro, minchan.kim, mgorman, rientjes, rusty,
	linux-kernel, linux-mm, linux-doc, m.szyprowski

Hi Chen,

If a pageblock's migration type is movable, it may be converted to
reclaimable under memory pressure. CMA was introduced to guarantee
that pages in a CMA area won't be converted to other migratetypes.

Here we are trying to avoid allocating kernel/DMA memory from specific
memory ranges, so that pages can easily be reclaimed when hot-removing
memory devices.

I think the idea is not to reuse CMA directly for hotplug, but to
reuse its mechanism for reserving specific memory ranges from the
bootmem allocator, so that CMA and hotplug could share the same code.
Basically, we could try to reuse dma_declare_contiguous() so that
we don't need to add special logic to the bootmem allocator.
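As a toy model of that idea (declare ranges early, and have the early
allocator refuse to hand them out for kernel use), one might track the
declared ranges and check candidate allocations against them. This is
illustrative userspace code under those assumptions, not the patchset's
or the kernel's implementation:

```c
#include <stdint.h>

#define MAX_MOVABLE_RANGES 8

struct range { uint64_t start, end; };	/* half-open: [start, end) */

static struct range movable[MAX_MOVABLE_RANGES];
static int nr_movable;

/* Record a range the early allocator must keep free of kernel memory. */
static int movable_map_add(uint64_t start, uint64_t size)
{
	if (nr_movable >= MAX_MOVABLE_RANGES || size == 0)
		return -1;
	movable[nr_movable].start = start;
	movable[nr_movable].end = start + size;
	nr_movable++;
	return 0;
}

/* Toy stand-in for an allocator check: may [start, start+size) be used
 * for a non-movable allocation? Returns 1 if it avoids every declared
 * movable range, 0 if it overlaps one. */
static int alloc_allowed(uint64_t start, uint64_t size)
{
	for (int i = 0; i < nr_movable; i++)
		if (start < movable[i].end && start + size > movable[i].start)
			return 0;
	return 1;
}
```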

Regards!
Gerry

On 2012-11-28 14:16, Tang Chen wrote:
> Hi Bob, Liu Jiang,
> 
> About CMA, could you give me more info ?
> Thanks for your patent and nice advice. :)
> 
> 
> 1) I saw the following on http://lwn.net/Articles/447405/:
> 
> The "CMA" type is sticky; pages which are marked as being for CMA
> should never have their migration type changed by the kernel.
> 
> As Wen said, we now support a user interface to change movable memory
> into kernel memory. But seeing from above, the memory specified as
> CMA will not be able to be changed, right ?  If so, I don't think
> using CMA is a good idea.
> 
> 
> 2) Is CMA just implemented on ARM platform ?  I found the following in
> kernel-parameters.txt.
> 
> cma=nn[MG]      [ARM,KNL]
>         Sets the size of kernel global memory area for contiguous
>         memory allocations. For more information, see
>         include/linux/dma-contiguous.h
> 
> We are developing on x86. Could we use it ?
> 
> 
> 3) Is CMA just used for DMA ? I am a little confused here. :)
> I found the main code of CMA is implemented in dma-contiguous.c.
> 
> 
> 4) The boot options cma=xxx and movablecore_map=xxx have different
> meanings for user. Reusing CMA could make user confused, I'm afraid.
> 
> And, even if we reuse "cma=" option, we still need to do the work
> in patch 3~5, right ?
> 
> 
> Thanks. :)
> 
> 
> 
> On 11/28/2012 12:08 PM, Jiang Liu wrote:
>> On 2012-11-28 11:24, Bob Liu wrote:
>>> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen<tangchen@cn.fujitsu.com>  wrote:
>>>> On 11/27/2012 08:09 PM, Bob Liu wrote:
>>>>>
>>>>> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen<tangchen@cn.fujitsu.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi Liu,
>>>>>>
>>>>>>
>>>>>> This feature is used in memory hotplug.
>>>>>>
>>>>>> In order to implement a whole node hotplug, we need to make sure the
>>>>>> node contains no kernel memory, because memory used by kernel could
>>>>>> not be migrated. (Since the kernel memory is directly mapped,
>>>>>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>>>>>
>>>>>> User could specify all the memory on a node to be movable, so that the
>>>>>> node could be hot-removed.
>>>>>>
>>>>>
>>>>> Thank you for your explanation. It's reasonable.
>>>>>
>>>>> But i think it's a bit duplicated with CMA, i'm not sure but maybe we
>>>>> can combine it with CMA which already in mainline?
>>>>>
>>>> Hi Liu,
>>>>
>>>> Thanks for your advice. :)
>>>>
>>>> CMA is Contiguous Memory Allocator, right?  What I'm trying to do is
>>>> controlling where is the start of ZONE_MOVABLE of each node. Could
>>>> CMA do this job ?
>>>
>>> cma will not control the start of ZONE_MOVABLE of each node, but it
>>> can declare a memory that always movable
>>> and all non movable allocate request will not happen on that area.
>>>
>>> Currently cma use a boot parameter "cma=" to declare a memory size
>>> that always movable.
>>> I think it might fulfill your requirement if extending the boot
>>> parameter with a start address.
>>>
>>> more info at http://lwn.net/Articles/468044/
>>>>
>>>> And also, after a short investigation, CMA seems need to base on
>>>> memblock. But we need to limit memblock not to allocate memory on
>>>> ZONE_MOVABLE. As a result, we need to know the ranges before memblock
>>>> could be used. I'm afraid we still need an approach to get the ranges,
>>>> such as a boot option, or from static ACPI tables such as SRAT/MPST.
>>>>
>>>
>>> Yes, it's based on memblock and with boot option.
>>> In setup_arch32()
>>>      dma_contiguous_reserve(0);   =>  will declare a cma area using
>>> memblock_reserve()
>>>
>>>> I'm don't know much about CMA for now. So if you have any better idea,
>>>> please share with us, thanks. :)
>>>
>>> My idea is reuse cma like below patch(even not compiled) and boot with
>>> "cma=size@start_address".
>>> I don't know whether it can work and whether suitable for your
>>> requirement, if not forgive me for this noises.
>>>
>>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>>> index 612afcc..564962a 100644
>>> --- a/drivers/base/dma-contiguous.c
>>> +++ b/drivers/base/dma-contiguous.c
>>> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
>>>    */
>>>   static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>>>   static long size_cmdline = -1;
>>> +static long cma_start_cmdline = -1;
>>>
>>>   static int __init early_cma(char *p)
>>>   {
>>> +       char *oldp;
>>>          pr_debug("%s(%s)\n", __func__, p);
>>> +       oldp = p;
>>>          size_cmdline = memparse(p,&p);
>>> +
>>> +       if (*p == '@')
>>> +               cma_start_cmdline = memparse(p+1,&p);
>>> +       printk("cma start:0x%x, size: 0x%x\n", size_cmdline, cma_start_cmdline);
>>>          return 0;
>>>   }
>>>   early_param("cma", early_cma);
>>> @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>>>          if (selected_size) {
>>>                  pr_debug("%s: reserving %ld MiB for global area\n", __func__,
>>>                           selected_size / SZ_1M);
>>> -
>>> -               dma_declare_contiguous(NULL, selected_size, 0, limit);
>>> +               if (cma_size_cmdline != -1)
>>> +                       dma_declare_contiguous(NULL, selected_size,
>>> cma_start_cmdline, limit);
>>> +               else
>>> +                       dma_declare_contiguous(NULL, selected_size, 0, limit);
>>>          }
>>>   };
>> Seems a good idea to reserve memory by reusing CMA logic, though need more
>> investigation here. One of CMA goal is to ensure pages in CMA are really
>> movable, and this patchset tries to achieve the same goal at a first glance.
>>
>>
>>
>>
>>
> 
> 
> .
> 



^ permalink raw reply	[flat|nested] 170+ messages in thread


* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-28  8:29               ` Wen Congyang
@ 2012-11-28  8:28                 ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-28  8:28 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Bob Liu, Tang Chen, hpa, akpm, rob, isimatu.yasuaki, laijs,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, m.szyprowski

On 2012-11-28 16:29, Wen Congyang wrote:
> At 11/28/2012 12:08 PM, Jiang Liu Wrote:
>> On 2012-11-28 11:24, Bob Liu wrote:
>>> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen <tangchen@cn.fujitsu.com> wrote:
>>>> On 11/27/2012 08:09 PM, Bob Liu wrote:
>>>>>
>>>>> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen<tangchen@cn.fujitsu.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi Liu,
>>>>>>
>>>>>>
>>>>>> This feature is used in memory hotplug.
>>>>>>
>>>>>> In order to implement a whole node hotplug, we need to make sure the
>>>>>> node contains no kernel memory, because memory used by kernel could
>>>>>> not be migrated. (Since the kernel memory is directly mapped,
>>>>>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>>>>>
>>>>>> User could specify all the memory on a node to be movable, so that the
>>>>>> node could be hot-removed.
>>>>>>
>>>>>
>>>>> Thank you for your explanation. It's reasonable.
>>>>>
>>>>> But i think it's a bit duplicated with CMA, i'm not sure but maybe we
>>>>> can combine it with CMA which already in mainline?
>>>>>
>>>> Hi Liu,
>>>>
>>>> Thanks for your advice. :)
>>>>
>>>> CMA is Contiguous Memory Allocator, right?  What I'm trying to do is
>>>> controlling where is the start of ZONE_MOVABLE of each node. Could
>>>> CMA do this job ?
>>>
>>> cma will not control the start of ZONE_MOVABLE of each node, but it
>>> can declare a memory that always movable
>>> and all non movable allocate request will not happen on that area.
>>>
>>> Currently cma use a boot parameter "cma=" to declare a memory size
>>> that always movable.
>>> I think it might fulfill your requirement if extending the boot
>>> parameter with a start address.
>>>
>>> more info at http://lwn.net/Articles/468044/
>>>>
>>>> And also, after a short investigation, CMA seems need to base on
>>>> memblock. But we need to limit memblock not to allocate memory on
>>>> ZONE_MOVABLE. As a result, we need to know the ranges before memblock
>>>> could be used. I'm afraid we still need an approach to get the ranges,
>>>> such as a boot option, or from static ACPI tables such as SRAT/MPST.
>>>>
>>>
>>> Yes, it's based on memblock and with boot option.
>>> In setup_arch32()
>>>     dma_contiguous_reserve(0);   => will declare a cma area using
>>> memblock_reserve()
>>>
>>>> I'm don't know much about CMA for now. So if you have any better idea,
>>>> please share with us, thanks. :)
>>>
>>> My idea is reuse cma like below patch(even not compiled) and boot with
>>> "cma=size@start_address".
>>> I don't know whether it can work and whether suitable for your
>>> requirement, if not forgive me for this noises.
>>>
>>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>>> index 612afcc..564962a 100644
>>> --- a/drivers/base/dma-contiguous.c
>>> +++ b/drivers/base/dma-contiguous.c
>>> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
>>>   */
>>>  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>>>  static long size_cmdline = -1;
>>> +static long cma_start_cmdline = -1;
>>>
>>>  static int __init early_cma(char *p)
>>>  {
>>> +       char *oldp;
>>>         pr_debug("%s(%s)\n", __func__, p);
>>> +       oldp = p;
>>>         size_cmdline = memparse(p, &p);
>>> +
>>> +       if (*p == '@')
>>> +               cma_start_cmdline = memparse(p+1, &p);
>>> +       printk("cma start:0x%x, size: 0x%x\n", size_cmdline, cma_start_cmdline);
>>>         return 0;
>>>  }
>>>  early_param("cma", early_cma);
>>> @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>>>         if (selected_size) {
>>>                 pr_debug("%s: reserving %ld MiB for global area\n", __func__,
>>>                          selected_size / SZ_1M);
>>> -
>>> -               dma_declare_contiguous(NULL, selected_size, 0, limit);
>>> +               if (cma_size_cmdline != -1)
>>> +                       dma_declare_contiguous(NULL, selected_size,
>>> cma_start_cmdline, limit);
>>> +               else
>>> +                       dma_declare_contiguous(NULL, selected_size, 0, limit);
>>>         }
>>>  };
>> Seems a good idea to reserve memory by reusing CMA logic, though need more
>> investigation here. One of CMA goal is to ensure pages in CMA are really
>> movable, and this patchset tries to achieve the same goal at a first glance.
> 
> Hmm, I don't like to reuse CMA. Because CMA is used for DMA. If we reuse it
> for movable memory, I think movable zone is enough. And the start address is
> not acceptable, because we want to specify the start address for each node.
> 
> I think we can implement movablecore_map like that:
> 1. parse the parameter
> 2. reserve the memory after efi_reserve_boot_services()
This sounds good, but the code to reserve memory for movable
nodes would still look much like dma_declare_contiguous().
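Step 1 above (parsing the parameter) could follow the kernel's
memparse() convention for nn[KMG]@ss[KMG] values. A userspace sketch of
such parsing, with illustrative names (the real helper lives in the
kernel's lib/cmdline.c; nothing here is the patchset's actual code):

```c
#include <stdint.h>
#include <stdlib.h>

/* Userspace mimic of kernel memparse(): parse "nn[KMG]" and advance
 * *retp past the consumed characters. */
static uint64_t parse_mem(const char *s, const char **retp)
{
	char *end;
	uint64_t v = strtoull(s, &end, 0);

	switch (*end) {
	case 'G': case 'g': v <<= 30; end++; break;
	case 'M': case 'm': v <<= 20; end++; break;
	case 'K': case 'k': v <<= 10; end++; break;
	}
	if (retp)
		*retp = end;
	return v;
}

/* Parse one "size@start" entry, e.g. "4G@0x100000000".
 * Returns 0 on success, -1 on malformed input. */
static int parse_movable_entry(const char *p, uint64_t *size, uint64_t *start)
{
	const char *q;

	*size = parse_mem(p, &q);
	if (*q != '@' || *size == 0)
		return -1;
	*start = parse_mem(q + 1, &q);
	return *q == '\0' ? 0 : -1;
}
```

With that, "4G@0x100000000" yields size 4GiB and start 0x100000000;
steps 2 and 3 would then reserve and later release that range.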

> 3. release the memory in mem_init
> 
> What about this?
> 
> Thanks
> Wen Congyang
>>
>>  
>>
>>
>>
> 
> 
> .
> 



^ permalink raw reply	[flat|nested] 170+ messages in thread


* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-28  4:08             ` Jiang Liu
@ 2012-11-28  8:29               ` Wen Congyang
  -1 siblings, 0 replies; 170+ messages in thread
From: Wen Congyang @ 2012-11-28  8:29 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Bob Liu, Tang Chen, hpa, akpm, rob, isimatu.yasuaki, laijs,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, m.szyprowski

At 11/28/2012 12:08 PM, Jiang Liu Wrote:
> On 2012-11-28 11:24, Bob Liu wrote:
>> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen <tangchen@cn.fujitsu.com> wrote:
>>> On 11/27/2012 08:09 PM, Bob Liu wrote:
>>>>
>>>> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen<tangchen@cn.fujitsu.com>
>>>> wrote:
>>>>>
>>>>> Hi Liu,
>>>>>
>>>>>
>>>>> This feature is used in memory hotplug.
>>>>>
>>>>> In order to implement a whole node hotplug, we need to make sure the
>>>>> node contains no kernel memory, because memory used by kernel could
>>>>> not be migrated. (Since the kernel memory is directly mapped,
>>>>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>>>>
>>>>> User could specify all the memory on a node to be movable, so that the
>>>>> node could be hot-removed.
>>>>>
>>>>
>>>> Thank you for your explanation. It's reasonable.
>>>>
>>>> But I think it's a bit duplicated with CMA; I'm not sure, but maybe we
>>>> can combine it with CMA, which is already in mainline?
>>>>
>>> Hi Liu,
>>>
>>> Thanks for your advice. :)
>>>
>>> CMA is the Contiguous Memory Allocator, right?  What I'm trying to do is
>>> control where the start of ZONE_MOVABLE is on each node. Could
>>> CMA do this job?
>>
>> CMA will not control the start of ZONE_MOVABLE on each node, but it
>> can declare a memory area that is always movable, and non-movable
>> allocation requests will never be served from that area.
>>
>> Currently CMA uses the boot parameter "cma=" to declare a memory size
>> that is always movable. I think it might fulfill your requirement if
>> the boot parameter were extended with a start address.
>>
>> more info at http://lwn.net/Articles/468044/
>>>
>>> And also, after a short investigation, CMA seems to be based on
>>> memblock. But we need to limit memblock not to allocate memory on
>>> ZONE_MOVABLE. As a result, we need to know the ranges before memblock
>>> could be used. I'm afraid we still need an approach to get the ranges,
>>> such as a boot option, or from static ACPI tables such as SRAT/MPST.
>>>
>>
>> Yes, it's based on memblock and with boot option.
>> In setup_arch32()
>>     dma_contiguous_reserve(0);   => will declare a cma area using
>> memblock_reserve()
>>
>> I don't know much about CMA for now. So if you have any better idea,
>>> please share with us, thanks. :)
>>
>> My idea is to reuse CMA as in the patch below (not even compiled) and
>> boot with "cma=size@start_address".
>> I don't know whether it works or whether it suits your requirement;
>> if not, forgive me for the noise.
>>
>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>> index 612afcc..564962a 100644
>> --- a/drivers/base/dma-contiguous.c
>> +++ b/drivers/base/dma-contiguous.c
>> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
>>   */
>>  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>>  static long size_cmdline = -1;
>> +static long cma_start_cmdline = -1;
>>
>>  static int __init early_cma(char *p)
>>  {
>> +       char *oldp;
>>         pr_debug("%s(%s)\n", __func__, p);
>> +       oldp = p;
>>         size_cmdline = memparse(p, &p);
>> +
>> +       if (*p == '@')
>> +               cma_start_cmdline = memparse(p+1, &p);
>> +       printk("cma start: 0x%lx, size: 0x%lx\n", cma_start_cmdline, size_cmdline);
>>         return 0;
>>  }
>>  early_param("cma", early_cma);
>> @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>>         if (selected_size) {
>>                 pr_debug("%s: reserving %ld MiB for global area\n", __func__,
>>                          selected_size / SZ_1M);
>> -
>> -               dma_declare_contiguous(NULL, selected_size, 0, limit);
>> +               if (cma_start_cmdline != -1)
>> +                       dma_declare_contiguous(NULL, selected_size,
>> +                                              cma_start_cmdline, limit);
>> +               else
>> +                       dma_declare_contiguous(NULL, selected_size, 0, limit);
>>         }
>>  };
> Seems a good idea to reserve memory by reusing the CMA logic, though it needs
> more investigation. One of CMA's goals is to ensure pages in CMA are really
> movable, and at first glance this patchset tries to achieve the same goal.

Hmm, I don't like reusing CMA, because CMA is meant for DMA. If we reuse it
for movable memory, I think the movable zone is enough. And a single start
address is not acceptable, because we want to specify a start address for
each node.

I think we can implement movablecore_map like this:
1. parse the parameter
2. reserve the memory after efi_reserve_boot_services()
3. release the memory in mem_init

What about this?

Thanks
Wen Congyang
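Step 1 of the plan above, parsing the parameter, might look like the following user-space sketch. It accepts the documented `movablecore_map=nn[KMG]@ss[KMG]` form and, as an assumption, allows several comma-separated ranges (one per node); the struct and function names are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_MOVABLE_RANGES 32

struct movable_range {
        unsigned long long start;
        unsigned long long size;
};

static struct movable_range movable_map[MAX_MOVABLE_RANGES];
static int nr_movable_ranges;

/* Minimal memparse() stand-in: number with optional K/M/G suffix. */
static unsigned long long parse_mem(const char *s, char **end)
{
        unsigned long long v = strtoull(s, end, 0);

        if (**end == 'K' || **end == 'k')      { v <<= 10; (*end)++; }
        else if (**end == 'M' || **end == 'm') { v <<= 20; (*end)++; }
        else if (**end == 'G' || **end == 'g') { v <<= 30; (*end)++; }
        return v;
}

/* Parse "nn@ss[,nn@ss...]" into movable_map[]; returns 0 on success. */
static int parse_movablecore_map(const char *arg)
{
        char *p = (char *)arg;

        nr_movable_ranges = 0;
        while (*p) {
                unsigned long long size, start;

                size = parse_mem(p, &p);
                if (*p != '@')
                        return -1;      /* a start address is mandatory */
                start = parse_mem(p + 1, &p);
                if (nr_movable_ranges == MAX_MOVABLE_RANGES)
                        return -1;
                movable_map[nr_movable_ranges].start = start;
                movable_map[nr_movable_ranges].size = size;
                nr_movable_ranges++;
                if (*p == ',')
                        p++;
                else if (*p)
                        return -1;
        }
        return 0;
}
```

Steps 2 and 3 would then walk movable_map[] to reserve the ranges early and release them again in mem_init, as proposed.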
> 
>  
> 
> 
> 


^ permalink raw reply	[flat|nested] 170+ messages in thread


* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-28  8:28                 ` Jiang Liu
@ 2012-11-28  8:38                   ` Wen Congyang
  -1 siblings, 0 replies; 170+ messages in thread
From: Wen Congyang @ 2012-11-28  8:38 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Bob Liu, Tang Chen, hpa, akpm, rob, isimatu.yasuaki, laijs,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, m.szyprowski

At 11/28/2012 04:28 PM, Jiang Liu Wrote:
> On 2012-11-28 16:29, Wen Congyang wrote:
>> At 11/28/2012 12:08 PM, Jiang Liu Wrote:
>>> On 2012-11-28 11:24, Bob Liu wrote:
>>>> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen <tangchen@cn.fujitsu.com> wrote:
>>>>> On 11/27/2012 08:09 PM, Bob Liu wrote:
>>>>>>
>>>>>> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen<tangchen@cn.fujitsu.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Liu,
>>>>>>>
>>>>>>>
>>>>>>> This feature is used in memory hotplug.
>>>>>>>
>>>>>>> In order to implement a whole node hotplug, we need to make sure the
>>>>>>> node contains no kernel memory, because memory used by kernel could
>>>>>>> not be migrated. (Since the kernel memory is directly mapped,
>>>>>>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>>>>>>
>>>>>>> User could specify all the memory on a node to be movable, so that the
>>>>>>> node could be hot-removed.
>>>>>>>
>>>>>>
>>>>>> Thank you for your explanation. It's reasonable.
>>>>>>
>>>>>> But I think it's a bit duplicated with CMA; I'm not sure, but maybe we
>>>>>> can combine it with CMA, which is already in mainline?
>>>>>>
>>>>> Hi Liu,
>>>>>
>>>>> Thanks for your advice. :)
>>>>>
>>>>> CMA is the Contiguous Memory Allocator, right?  What I'm trying to do is
>>>>> control where the start of ZONE_MOVABLE is on each node. Could
>>>>> CMA do this job?
>>>>
>>>> CMA will not control the start of ZONE_MOVABLE on each node, but it
>>>> can declare a memory area that is always movable, and non-movable
>>>> allocation requests will never be served from that area.
>>>>
>>>> Currently CMA uses the boot parameter "cma=" to declare a memory size
>>>> that is always movable. I think it might fulfill your requirement if
>>>> the boot parameter were extended with a start address.
>>>>
>>>> more info at http://lwn.net/Articles/468044/
>>>>>
>>>>> And also, after a short investigation, CMA seems to be based on
>>>>> memblock. But we need to limit memblock not to allocate memory on
>>>>> ZONE_MOVABLE. As a result, we need to know the ranges before memblock
>>>>> could be used. I'm afraid we still need an approach to get the ranges,
>>>>> such as a boot option, or from static ACPI tables such as SRAT/MPST.
>>>>>
>>>>
>>>> Yes, it's based on memblock and with boot option.
>>>> In setup_arch32()
>>>>     dma_contiguous_reserve(0);   => will declare a cma area using
>>>> memblock_reserve()
>>>>
>>>>> I don't know much about CMA for now. So if you have any better idea,
>>>>> please share with us, thanks. :)
>>>>
>>>> My idea is to reuse CMA as in the patch below (not even compiled) and
>>>> boot with "cma=size@start_address".
>>>> I don't know whether it works or whether it suits your requirement;
>>>> if not, forgive me for the noise.
>>>>
>>>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>>>> index 612afcc..564962a 100644
>>>> --- a/drivers/base/dma-contiguous.c
>>>> +++ b/drivers/base/dma-contiguous.c
>>>> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
>>>>   */
>>>>  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>>>>  static long size_cmdline = -1;
>>>> +static long cma_start_cmdline = -1;
>>>>
>>>>  static int __init early_cma(char *p)
>>>>  {
>>>> +       char *oldp;
>>>>         pr_debug("%s(%s)\n", __func__, p);
>>>> +       oldp = p;
>>>>         size_cmdline = memparse(p, &p);
>>>> +
>>>> +       if (*p == '@')
>>>> +               cma_start_cmdline = memparse(p+1, &p);
>>>> +       printk("cma start: 0x%lx, size: 0x%lx\n", cma_start_cmdline, size_cmdline);
>>>>         return 0;
>>>>  }
>>>>  early_param("cma", early_cma);
>>>> @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>>>>         if (selected_size) {
>>>>                 pr_debug("%s: reserving %ld MiB for global area\n", __func__,
>>>>                          selected_size / SZ_1M);
>>>> -
>>>> -               dma_declare_contiguous(NULL, selected_size, 0, limit);
>>>> +               if (cma_start_cmdline != -1)
>>>> +                       dma_declare_contiguous(NULL, selected_size,
>>>> +                                              cma_start_cmdline, limit);
>>>> +               else
>>>> +                       dma_declare_contiguous(NULL, selected_size, 0, limit);
>>>>         }
>>>>  };
>>> Seems a good idea to reserve memory by reusing the CMA logic, though it needs
>>> more investigation. One of CMA's goals is to ensure pages in CMA are really
>>> movable, and at first glance this patchset tries to achieve the same goal.
>>
>> Hmm, I don't like reusing CMA, because CMA is meant for DMA. If we reuse it
>> for movable memory, I think the movable zone is enough. And a single start
>> address is not acceptable, because we want to specify a start address for
>> each node.
>>
>> I think we can implement movablecore_map like this:
>> 1. parse the parameter
>> 2. reserve the memory after efi_reserve_boot_services()
> This sounds good, but the code to reserve memory for movable
> nodes will be similar to dma_declare_contiguous().

Yes, it may be very similar. I think we can move it into mm/page_alloc.c, and
both cma and movablecore_map can use this function.

Thanks
Wen Congyang

> 
>> 3. release the memory in mem_init
>>
>> What about this?
>>
>> Thanks
>> Wen Congyang
>>>
>>>  
>>>
>>>
>>>
>>
>>
>> .
>>
> 
> 
> 


^ permalink raw reply	[flat|nested] 170+ messages in thread

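The shared helper suggested above would, at its core, be a check-and-record operation over physical ranges, in the spirit of memblock_reserve(). A user-space model of that idea (all names hypothetical; the real code would operate on memblock):

```c
#include <assert.h>

#define MAX_RESERVED 64

struct range {
        unsigned long long base;
        unsigned long long size;
};

static struct range reserved[MAX_RESERVED];
static int nr_reserved;

/* Half-open interval overlap test: [base, base+size) vs. *a. */
static int ranges_overlap(const struct range *a,
                          unsigned long long base, unsigned long long size)
{
        return base < a->base + a->size && a->base < base + size;
}

/* Reserve [base, base+size) unless it overlaps an existing reservation;
 * returns 0 on success, -1 on conflict or table overflow. */
static int reserve_range(unsigned long long base, unsigned long long size)
{
        int i;

        for (i = 0; i < nr_reserved; i++)
                if (ranges_overlap(&reserved[i], base, size))
                        return -1;
        if (nr_reserved == MAX_RESERVED)
                return -1;
        reserved[nr_reserved].base = base;
        reserved[nr_reserved].size = size;
        nr_reserved++;
        return 0;
}
```

Hoisting something like this into mm/page_alloc.c would let both the cma= path and a movablecore_map= path share the conflict checking instead of duplicating it.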

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-23 10:44 ` Tang Chen
@ 2012-11-28  8:47   ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-28  8:47 UTC (permalink / raw)
  To: Tang Chen
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng, yinghai,
	kosaki.motohiro, minchan.kim, mgorman, rientjes, rusty,
	linux-kernel, linux-mm, linux-doc, Len Brown, Tony Luck, Wang,
	Frank

Hi all,
	This seems a great chance to discuss the memory hotplug feature
within this thread, so I will try to give some high-level thoughts about the
memory hotplug feature on x86/IA64. Any comments are welcome!
	First of all, I think usability really matters. Ideally, the memory hotplug
feature should just work out of the box, and we shouldn't expect administrators
to add several extra platform-dependent parameters to enable memory hotplug.
But how do we enable memory (or CPU/node) hotplug out of the box? I think the
key point is to cooperate with the BIOS/ACPI/firmware/device management teams.
	I still position memory hotplug as an advanced feature for high-end
servers, and those systems may/should provide management interfaces to
configure CPU/memory/node hotplug features. The configuration UI may be
provided by the BIOS, the BMC, or a centralized system management suite. Once
an administrator enables the hotplug feature through those management UIs, the
OS should support system device hotplug out of the box. For example, the HP
SuperDome2 management suite provides an interface to configure a node as a
floating node (hot-removable), and OpenSolaris supports CPU/memory hotplug out
of the box without any extra configuration. So we should shape the interfaces
between firmware and OS to better support system device hotplug.
	On the other hand, I think there are no commercially available x86/IA64
platforms with system device hotplug capabilities in the field yet, or at most
a limited quantity. So backward compatibility is not a big issue for us now,
and I think it's feasible to rely on firmware to provide better support for
system device hotplug.
	Then what should be enhanced to better support system device hotplug?

1) The ACPI specification should be enhanced to provide a static table
describing components with hotplug features, so the OS could reserve special
resources for hotplug at early boot stages, for example, reserving enough CPU
ids for CPU hot-add. Currently we guess the maximum number of CPUs supported
by the platform by counting CPU entries in the APIC table, which is not reliable.

2) BIOS should implement SRAT, MPST and PMTT tables to better support memory
hotplug. SRAT associates memory ranges with proximity domains with an extra
"hotpluggable" flag. PMTT provides memory device topology information, such
as "socket->memory controller->DIMM". MPST is used for memory power management
and provides a way to associate memory ranges with memory devices in PMTT.
With all the information from SRAT, MPST and PMTT, the OS could figure out
hotplug memory ranges automatically, so no extra kernel parameters are needed.
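The derivation described above can be modeled with a toy SRAT walk; the structure below is a simplified stand-in for an ACPI memory affinity entry, not the real table layout:

```c
#include <assert.h>

/* Simplified stand-in for an SRAT memory affinity entry. */
struct srat_mem_entry {
        unsigned long long base;
        unsigned long long length;
        int proximity_domain;
        int hotpluggable;       /* models the ACPI "Hot Pluggable" flag */
};

/* Collect the hotpluggable entries into out[]; these are the ranges
 * that would be placed in ZONE_MOVABLE automatically, with no boot
 * parameter needed. Returns the number of ranges found. */
static int derive_movable_map(const struct srat_mem_entry *srat, int n,
                              struct srat_mem_entry *out, int out_max)
{
        int i, m = 0;

        for (i = 0; i < n && m < out_max; i++)
                if (srat[i].hotpluggable)
                        out[m++] = srat[i];
        return m;
}
```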

3) Enhance ACPICA to provide a method to scan static ACPI tables before the
memory subsystem has been initialized, because the OS needs to access SRAT,
MPST and PMTT when initializing the memory subsystem.

4) The last and most important issue is how to minimize the performance
drop caused by memory hotplug. As proposed by this patchset, once we
configure all memory of a NUMA node as movable, it essentially disables
NUMA optimization of kernel memory allocation from that node. In our
experience, that causes a huge performance drop. We have observed a
10-30% performance drop with memory hotplug enabled, and on another
OS the average performance drop caused by memory hotplug is about 10%.
If we can't resolve the performance drop, memory hotplug is just a demo
feature :( With help from hardware, we do have some chances to reduce
the performance penalty caused by memory hotplug.
	As we know, Linux can migrate movable pages, but can't migrate
non-movable pages used by the kernel, DMA, etc. And the hardest part is how
to deal with those unmovable pages when hot-removing a memory device.
Now hardware has given us a hand with a technology named memory migration,
which can transparently migrate memory between memory devices. There are
no OS-visible changes except the NUMA topology before and after hardware
memory migration.
	And if there are multiple memory devices within a NUMA node,
we could configure some memory devices to host unmovable memory and the
others to host movable memory. With this configuration, there won't be a
big performance drop because we have preserved all NUMA optimizations.
We could then achieve memory hot-remove by:
1) Use the existing page migration mechanism to reclaim movable pages.
2) For memory devices hosting unmovable pages, we need:
2.1) find a movable memory device on other nodes with enough capacity
and reclaim it.
2.2) use hardware migration technology to migrate unmovable memory to
the just reclaimed memory device on other nodes.

	With all of these implemented, I hope we can expect users to adopt
memory hotplug technology.

	Back to this patch, we could rely on the mechanism provided
by it to automatically mark memory ranges as movable with information
from ACPI SRAT/MPST/PMTT tables, so we don't need administrators to
manually configure kernel parameters to enable memory hotplug.

	Again, any comments are welcome!

Regards!
Gerry


On 2012-11-23 18:44, Tang Chen wrote:
> [What we are doing]
> This patchset provides a boot option for users to specify a ZONE_MOVABLE
> memory map for each node in the system.
> 
> movablecore_map=nn[KMG]@ss[KMG]
> 
> This option makes sure the memory range from ss to ss+nn is movable memory.
> 
> 
> [Why we do this]
> If we hot remove memory, that memory cannot contain kernel memory,
> because Linux cannot migrate kernel memory currently. Therefore,
> we have to guarantee that the hot-removed memory contains only movable
> memory.
> 
> Linux has two boot options, kernelcore= and movablecore=, for
> creating movable memory. These boot options can specify the amount
> of memory to use as kernel or movable memory. Using them, we can
> create ZONE_MOVABLE which has only movable memory.
> 
> But it does not fulfill a requirement of memory hot remove, because
> even if we specify the boot options, movable memory is distributed
> evenly across the nodes. So when we want to hot remove memory whose
> range is 0x80000000-0xc0000000, we have no way to specify
> that memory as movable memory.
> 
> So we proposed a new feature which specifies memory range to use as
> movable memory.
> 
> 
> [Ways to do this]
> There may be 2 ways to specify movable memory.
>  1. use firmware information
>  2. use boot option
> 
> 1. use firmware information
>   According to the ACPI 5.0 spec, the SRAT table has a memory affinity
>   structure, and the structure has a Hot Pluggable Field. See "5.2.16.2
>   Memory Affinity Structure". If we use this information, we might be able
>   to specify movable memory via firmware. For example, if the Hot Pluggable
>   Field is enabled, Linux sets the memory as movable memory.
> 
> 2. use boot option
>   This is our proposal. A new boot option can specify a memory range to
>   use as movable memory.
> 
> 
> [How we do this]
> We chose the second way, because with the first way users cannot easily
> change the memory range to use as movable memory. We think that if we
> create movable memory, a performance regression may occur due to NUMA.
> In that case, the user can easily turn the feature off if we provide a
> boot option. And with a boot option, the user can easily select which
> memory to use as movable memory.
> 
> 
> [How to use]
> Specify the following boot option:
> movablecore_map=nn[KMG]@ss[KMG]
> 
> That means the physical address range from ss to ss+nn will be allocated as
> ZONE_MOVABLE.
> 
> And the following points should be considered.
> 
> 1) If the range is contained in a single node, then from ss to the end of
>    that node will be ZONE_MOVABLE.
> 2) If the range covers two or more nodes, then from ss to the end of the
>    first node will be ZONE_MOVABLE, and all the other covered nodes will
>    only have ZONE_MOVABLE.
> 3) If no range is in a node, then that node will have no ZONE_MOVABLE
>    unless kernelcore or movablecore is specified.
> 4) This option could be specified at most MAX_NUMNODES times.
> 5) If kernelcore or movablecore is also specified, movablecore_map will have
>    higher priority to be satisfied.
> 6) This option has no conflict with memmap option.
> 
> 
> 
> Tang Chen (4):
>   page_alloc: add movable_memmap kernel parameter
>   page_alloc: Introduce zone_movable_limit[] to keep movable limit for
>     nodes
>   page_alloc: Make movablecore_map has higher priority
>   page_alloc: Bootmem limit with movablecore_map
> 
> Yasuaki Ishimatsu (1):
>   x86: get pg_data_t's memory from other node
> 
>  Documentation/kernel-parameters.txt |   17 +++
>  arch/x86/mm/numa.c                  |   11 ++-
>  include/linux/memblock.h            |    1 +
>  include/linux/mm.h                  |   11 ++
>  mm/memblock.c                       |   15 +++-
>  mm/page_alloc.c                     |  216 ++++++++++++++++++++++++++++++++++-
>  6 files changed, 263 insertions(+), 8 deletions(-)
> 
> 
> .
> 



^ permalink raw reply	[flat|nested] 170+ messages in thread


* RE: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-28  8:47   ` Jiang Liu
@ 2012-11-28 21:34     ` Luck, Tony
  -1 siblings, 0 replies; 170+ messages in thread
From: Luck, Tony @ 2012-11-28 21:34 UTC (permalink / raw)
  To: Jiang Liu, Tang Chen
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng, yinghai,
	kosaki.motohiro, minchan.kim, mgorman, rientjes, rusty,
	linux-kernel, linux-mm, linux-doc, Len Brown, Wang, Frank

> 1. use firmware information
>   According to the ACPI 5.0 spec, the SRAT table has a memory affinity
>   structure, and the structure has a Hot Pluggable Field. See "5.2.16.2
>   Memory Affinity Structure". If we use this information, we might be able
>   to specify movable memory via firmware. For example, if the Hot Pluggable
>   Field is enabled, Linux sets the memory as movable memory.
> 
> 2. use boot option
>   This is our proposal. New boot option can specify memory range to use
>   as movable memory.

Isn't this just moving the work to the user? To pick good values for the
movable areas, they need to know how the memory lines up across
node boundaries ... because they need to make sure to allow some
non-movable memory allocations on each node so that the kernel can
take advantage of node locality.

So the user would have to read at least the SRAT table, and perhaps
more, to figure out what to provide as arguments.

Since this is going to be used on a dynamic system where nodes might
be added and removed - the right values for these arguments might
change from one boot to the next. So even if the user gets them right
on day 1, a month later, when a new node has been added or a broken
node removed, the values would be stale.

-Tony

^ permalink raw reply	[flat|nested] 170+ messages in thread


* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-28 21:34     ` Luck, Tony
@ 2012-11-28 21:38       ` H. Peter Anvin
  -1 siblings, 0 replies; 170+ messages in thread
From: H. Peter Anvin @ 2012-11-28 21:38 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Jiang Liu, Tang Chen, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Wang, Frank

On 11/28/2012 01:34 PM, Luck, Tony wrote:
>>
>> 2. use boot option
>>   This is our proposal. New boot option can specify memory range to use
>>   as movable memory.
> 
> Isn't this just moving the work to the user? To pick good values for the
> movable areas, they need to know how the memory lines up across
> node boundaries ... because they need to make sure to allow some
> non-movable memory allocations on each node so that the kernel can
> take advantage of node locality.
> 
> So the user would have to read at least the SRAT table, and perhaps
> more, to figure out what to provide as arguments.
> 
> Since this is going to be used on a dynamic system where nodes might
> be added and removed - the right values for these arguments might
> change from one boot to the next. So even if the user gets them right
> on day 1, a month later, when a new node has been added or a broken
> node removed, the values would be stale.
> 

I gave this feedback in person at LCE: I consider the kernel
configuration option to be useless for anything other than debugging.
Trying to promote it as an actual solution, to be used by end users in
the field, is ridiculous at best.

	-hpa



^ permalink raw reply	[flat|nested] 170+ messages in thread


* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-28  8:29               ` Wen Congyang
@ 2012-11-29  0:43                 ` Jaegeuk Hanse
  -1 siblings, 0 replies; 170+ messages in thread
From: Jaegeuk Hanse @ 2012-11-29  0:43 UTC (permalink / raw)
  To: Tang Chen
  Cc: Wen Congyang, Jiang Liu, Bob Liu, Tang Chen, hpa, akpm, rob,
	isimatu.yasuaki, laijs, linfeng, yinghai, kosaki.motohiro,
	minchan.kim, mgorman, rientjes, rusty, linux-kernel, linux-mm,
	linux-doc, m.szyprowski

On Wed, Nov 28, 2012 at 04:29:01PM +0800, Wen Congyang wrote:
>At 11/28/2012 12:08 PM, Jiang Liu Wrote:
>> On 2012-11-28 11:24, Bob Liu wrote:
>>> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen <tangchen@cn.fujitsu.com> wrote:
>>>> On 11/27/2012 08:09 PM, Bob Liu wrote:
>>>>>
>>>>> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen<tangchen@cn.fujitsu.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi Liu,
>>>>>>
>>>>>>
>>>>>> This feature is used in memory hotplug.
>>>>>>
>>>>>> In order to implement a whole node hotplug, we need to make sure the
>>>>>> node contains no kernel memory, because memory used by kernel could
>>>>>> not be migrated. (Since the kernel memory is directly mapped,
>>>>>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>>>>>
>>>>>> User could specify all the memory on a node to be movable, so that the
>>>>>> node could be hot-removed.
>>>>>>
>>>>>
>>>>> Thank you for your explanation. It's reasonable.
>>>>>
>>>>> But i think it's a bit duplicated with CMA, i'm not sure but maybe we
>>>>> can combine it with CMA which already in mainline?
>>>>>
>>>> Hi Liu,
>>>>
>>>> Thanks for your advice. :)
>>>>
>>>> CMA is Contiguous Memory Allocator, right?  What I'm trying to do is
>>>> controlling where is the start of ZONE_MOVABLE of each node. Could
>>>> CMA do this job ?
>>>
>>> cma will not control the start of ZONE_MOVABLE of each node, but it
>>> can declare a memory area that is always movable,
>>> and non-movable allocation requests will not happen in that area.
>>>
>>> Currently cma uses a boot parameter "cma=" to declare a memory size
>>> that is always movable.
>>> I think it might fulfill your requirement if the boot
>>> parameter were extended with a start address.
>>>
>>> more info at http://lwn.net/Articles/468044/
>>>>
>>>> And also, after a short investigation, CMA seems to be based on
>>>> memblock. But we need to prevent memblock from allocating memory in
>>>> ZONE_MOVABLE. As a result, we need to know the ranges before memblock
>>>> can be used. I'm afraid we still need an approach to get the ranges,
>>>> such as a boot option, or from static ACPI tables such as SRAT/MPST.
>>>>
>>>
>>> Yes, it's based on memblock and with boot option.
>>> In setup_arch32()
>>>     dma_contiguous_reserve(0);   => will declare a cma area using
>>> memblock_reserve()
>>>
>>>> I don't know much about CMA for now. So if you have any better idea,
>>>> please share it with us, thanks. :)
>>>
>>> My idea is to reuse cma like the patch below (not even compiled) and
>>> boot with "cma=size@start_address".
>>> I don't know whether it can work or whether it is suitable for your
>>> requirement; if not, forgive me for the noise.
>>>
>>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>>> index 612afcc..564962a 100644
>>> --- a/drivers/base/dma-contiguous.c
>>> +++ b/drivers/base/dma-contiguous.c
>>> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
>>>   */
>>>  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>>>  static long size_cmdline = -1;
>>> +static long cma_start_cmdline = -1;
>>>
>>>  static int __init early_cma(char *p)
>>>  {
>>> +       char *oldp;
>>>         pr_debug("%s(%s)\n", __func__, p);
>>> +       oldp = p;
>>>         size_cmdline = memparse(p, &p);
>>> +
>>> +       if (*p == '@')
>>> +               cma_start_cmdline = memparse(p+1, &p);
>>> +       printk("cma size: 0x%lx, start: 0x%lx\n", size_cmdline, cma_start_cmdline);
>>>         return 0;
>>>  }
>>>  early_param("cma", early_cma);
>>> @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>>>         if (selected_size) {
>>>                 pr_debug("%s: reserving %ld MiB for global area\n", __func__,
>>>                          selected_size / SZ_1M);
>>> -
>>> -               dma_declare_contiguous(NULL, selected_size, 0, limit);
>>> +               if (cma_start_cmdline != -1)
>>> +                       dma_declare_contiguous(NULL, selected_size,
>>> cma_start_cmdline, limit);
>>> +               else
>>> +                       dma_declare_contiguous(NULL, selected_size, 0, limit);
>>>         }
>>>  };
>> Seems a good idea to reserve memory by reusing the CMA logic, though it
>> needs more investigation. One of CMA's goals is to ensure pages in CMA are
>> really movable, and this patchset tries to achieve the same goal at first
>> glance.
>
>Hmm, I don't like reusing CMA, because CMA is used for DMA. If we reuse it
>for movable memory, I think the movable zone is enough. And the start address
>is not acceptable, because we want to specify a start address for each node.
>
>I think we can implement movablecore_map like that:
>1. parse the parameter
>2. reserve the memory after efi_reserve_boot_services()
>3. release the memory in mem_init
>

Hi Tang,

I haven't read the patchset yet, but could you give a short description of
how you designed your implementation in this patchset?

Regards,
Jaegeuk

>What about this?
>
>Thanks
>Wen Congyang

^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
@ 2012-11-29  0:43                 ` Jaegeuk Hanse
  0 siblings, 0 replies; 170+ messages in thread
From: Jaegeuk Hanse @ 2012-11-29  0:43 UTC (permalink / raw)
  To: Tang Chen
  Cc: Wen Congyang, Jiang Liu, Bob Liu, hpa, akpm, rob,
	isimatu.yasuaki, laijs, linfeng, yinghai, kosaki.motohiro,
	minchan.kim, mgorman, rientjes, rusty, linux-kernel, linux-mm,
	linux-doc, m.szyprowski

On Wed, Nov 28, 2012 at 04:29:01PM +0800, Wen Congyang wrote:
>At 11/28/2012 12:08 PM, Jiang Liu Wrote:
>> On 2012-11-28 11:24, Bob Liu wrote:
>>> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen <tangchen@cn.fujitsu.com> wrote:
>>>> On 11/27/2012 08:09 PM, Bob Liu wrote:
>>>>>
>>>>> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen<tangchen@cn.fujitsu.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi Liu,
>>>>>>
>>>>>>
>>>>>> This feature is used in memory hotplug.
>>>>>>
>>>>>> In order to implement a whole node hotplug, we need to make sure the
>>>>>> node contains no kernel memory, because memory used by kernel could
>>>>>> not be migrated. (Since the kernel memory is directly mapped,
>>>>>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>>>>>
>>>>>> User could specify all the memory on a node to be movable, so that the
>>>>>> node could be hot-removed.
>>>>>>
>>>>>
>>>>> Thank you for your explanation. It's reasonable.
>>>>>
>>>>> But i think it's a bit duplicated with CMA, i'm not sure but maybe we
>>>>> can combine it with CMA which already in mainline?
>>>>>
>>>> Hi Liu,
>>>>
>>>> Thanks for your advice. :)
>>>>
>>>> CMA is Contiguous Memory Allocator, right?  What I'm trying to do is
>>>> controlling where is the start of ZONE_MOVABLE of each node. Could
>>>> CMA do this job ?
>>>
>>> cma will not control the start of ZONE_MOVABLE of each node, but it
>>> can declare a memory that always movable
>>> and all non movable allocate request will not happen on that area.
>>>
>>> Currently cma use a boot parameter "cma=" to declare a memory size
>>> that always movable.
>>> I think it might fulfill your requirement if extending the boot
>>> parameter with a start address.
>>>
>>> more info at http://lwn.net/Articles/468044/
>>>>
>>>> And also, after a short investigation, CMA seems need to base on
>>>> memblock. But we need to limit memblock not to allocate memory on
>>>> ZONE_MOVABLE. As a result, we need to know the ranges before memblock
>>>> could be used. I'm afraid we still need an approach to get the ranges,
>>>> such as a boot option, or from static ACPI tables such as SRAT/MPST.
>>>>
>>>
>>> Yes, it's based on memblock and with boot option.
>>> In setup_arch32()
>>>     dma_contiguous_reserve(0);   => will declare a cma area using
>>> memblock_reserve()
>>>
>>>> I'm don't know much about CMA for now. So if you have any better idea,
>>>> please share with us, thanks. :)
>>>
>>> My idea is reuse cma like below patch(even not compiled) and boot with
>>> "cma=size@start_address".
>>> I don't know whether it can work and whether suitable for your
>>> requirement, if not forgive me for this noises.
>>>
>>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>>> index 612afcc..564962a 100644
>>> --- a/drivers/base/dma-contiguous.c
>>> +++ b/drivers/base/dma-contiguous.c
>>> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
>>>   */
>>>  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>>>  static long size_cmdline = -1;
>>> +static long cma_start_cmdline = -1;
>>>
>>>  static int __init early_cma(char *p)
>>>  {
>>> +       char *oldp;
>>>         pr_debug("%s(%s)\n", __func__, p);
>>> +       oldp = p;
>>>         size_cmdline = memparse(p, &p);
>>> +
>>> +       if (*p == '@')
>>> +               cma_start_cmdline = memparse(p+1, &p);
>>> +       printk("cma start:0x%x, size: 0x%x\n", size_cmdline, cma_start_cmdline);
>>>         return 0;
>>>  }
>>>  early_param("cma", early_cma);
>>> @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>>>         if (selected_size) {
>>>                 pr_debug("%s: reserving %ld MiB for global area\n", __func__,
>>>                          selected_size / SZ_1M);
>>> -
>>> -               dma_declare_contiguous(NULL, selected_size, 0, limit);
>>> +               if (cma_size_cmdline != -1)
>>> +                       dma_declare_contiguous(NULL, selected_size,
>>> cma_start_cmdline, limit);
>>> +               else
>>> +                       dma_declare_contiguous(NULL, selected_size, 0, limit);
>>>         }
>>>  };
>> Reusing the CMA logic to reserve memory seems like a good idea, though it
>> needs more investigation. One of CMA's goals is to ensure that pages in CMA
>> are really movable, and at first glance this patchset tries to achieve the
>> same goal.
>
>Hmm, I don't like reusing CMA, because CMA is meant for DMA. If we reuse it
>for movable memory, I think the movable zone is enough. And a single start
>address is not acceptable, because we want to specify a start address for
>each node.
>
>I think we can implement movablecore_map like that:
>1. parse the parameter
>2. reserve the memory after efi_reserve_boot_services()
>3. release the memory in mem_init
>

Hi Tang,

I haven't read the patchset yet, but could you give a short description of
how you designed the implementation in this patchset?

Regards,
Jaegeuk

>What about this?
>
>Thanks
>Wen Congyang
>--
>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>the body to majordomo@kvack.org.  For more info on Linux MM,
>see: http://www.linux-mm.org/ .
>Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-29  0:43                 ` Jaegeuk Hanse
@ 2012-11-29  1:24                   ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-11-29  1:24 UTC (permalink / raw)
  To: Jaegeuk Hanse
  Cc: Wen Congyang, Jiang Liu, Bob Liu, hpa, akpm, rob,
	isimatu.yasuaki, laijs, linfeng, yinghai, kosaki.motohiro,
	minchan.kim, mgorman, rientjes, rusty, linux-kernel, linux-mm,
	linux-doc, m.szyprowski

On 11/29/2012 08:43 AM, Jaegeuk Hanse wrote:
> Hi Tang,
>
> I haven't read the patchset yet, but could you give a short description of
> how you designed the implementation in this patchset?
>
> Regards,
> Jaegeuk
>

Hi Jaegeuk,

Thanks for joining in. :)

This feature is used in memory hotplug.

In order to implement whole-node hotplug, we need to make sure the
node contains no kernel memory, because memory used by the kernel
cannot be migrated. (Kernel memory is directly mapped, so
VA = PA + __PAGE_OFFSET and the physical address cannot change.)

With this boot option, users can specify that all the memory on a node
is movable (i.e. in ZONE_MOVABLE), so that the node can be hot-removed.

Thanks.




^ permalink raw reply	[flat|nested] 170+ messages in thread


* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-28  8:47   ` Jiang Liu
@ 2012-11-29  1:42     ` Jaegeuk Hanse
  -1 siblings, 0 replies; 170+ messages in thread
From: Jaegeuk Hanse @ 2012-11-29  1:42 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Tang Chen, hpa, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Tony Luck, Wang, Frank

On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
>Hi all,
>	This seems like a great chance to discuss the memory hotplug feature
>within this thread, so I will try to give some high-level thoughts about the
>memory hotplug feature on x86/IA64. Any comments are welcome!
>	First of all, I think usability really matters. Ideally, memory hotplug
>feature should just work out of box, and we shouldn't expect administrators to 
>add several extra platform dependent parameters to enable memory hotplug. 
>But how to enable memory (or CPU/node) hotplug out of box? I think the key point
>is to cooperate with BIOS/ACPI/firmware/device management teams. 
>	I still position memory hotplug as an advanced feature for high end 
>servers and those systems may/should provide some management interfaces to 
>configure CPU/memory/node hotplug features. The configuration UI may be provided
>by BIOS, BMC or centralized system management suite. Once administrator enables
>hotplug feature through those management UI, OS should support system device
>hotplug out of box. For example, HP SuperDome2 management suite provides interface
>to configure a node as floating node(hot-removable). And OpenSolaris supports
>CPU/memory hotplug out of box without any extra configurations. So we should
>shape interfaces between firmware and OS to better support system device hotplug.
>	On the other hand, I think there are no commercially available x86/IA64
>platforms with system device hotplug capabilities in the field yet, or at most
>a limited quantity. So backward compatibility is not a big issue for us now.
>So I think it's doable to rely on firmware to provide better support for system
>device hotplug.
>	Then what should be enhanced to better support system device hotplug?
>
>1) ACPI specification should be enhanced to provide a static table to describe
>components with hotplug features, so OS could reserve special resources for
>hotplug at early boot stages. For example, to reserve enough CPU ids for CPU
>hot-add. Currently we guess maximum number of CPUs supported by the platform
>by counting CPU entries in APIC table, that's not reliable.
>
>2) BIOS should implement SRAT, MPST and PMTT tables to better support memory
>hotplug. SRAT associates memory ranges with proximity domains with an extra
>"hotpluggable" flag. PMTT provides memory device topology information, such
>as "socket->memory controller->DIMM". MPST is used for memory power management
>and provides a way to associate memory ranges with memory devices in PMTT.
>With all information from SRAT, MPST and PMTT, OS could figure out hotplug
>memory ranges automatically, so no extra kernel parameters needed.
>
>3) Enhance ACPICA to provide a method to scan static ACPI tables before
>memory subsystem has been initialized because OS need to access SRAT,
>MPST and PMTT when initializing memory subsystem.
>
>4) The last and the most important issue is how to minimize performance
>drop caused by memory hotplug. As proposed by this patchset, once we
>configure all memory of a NUMA node as movable, it essentially disables
>NUMA optimization of kernel memory allocation from that node. In our
>experience, that causes a huge performance drop. We have observed a
>10-30% performance drop with memory hotplug enabled, and on another
>OS the average performance drop caused by memory hotplug is about 10%.
>If we can't resolve the performance drop, memory hotplug is just a feature
>for demo:( With help from hardware, we do have some chances to reduce
>performance penalty caused by memory hotplug.
>	As we know, Linux can migrate movable pages, but can't migrate
>non-movable pages used by the kernel/DMA etc. And the hardest part is how
>to deal with those unmovable pages when hot-removing a memory device.
>Now hardware has given us a hand with a technology named memory migration,
>which could transparently migrate memory between memory devices. There's
>no OS visible changes except NUMA topology before and after hardware memory
>migration.
>	And if there are multiple memory devices within a NUMA node,
>we could configure some memory devices to host unmovable memory and the
>other to host movable memory. With this configuration, there won't be a
>big performance drop because we have preserved all the NUMA optimizations.
>We also could achieve memory hotplug remove by:
>1) Use existing page migration mechanism to reclaim movable pages.
>2) For memory devices hosting unmovable pages, we need:
>2.1) find a movable memory device on other nodes with enough capacity
>and reclaim it.
>2.2) use hardware migration technology to migrate unmovable memory to

Hi Jiang,

Could you give an explanation how hardware migration technology works?

Regards,
Jaegeuk

>the just reclaimed memory device on other nodes.
>
>	I hope we could expect users to adopt memory hotplug technology
>with all these implemented.
>
>	Back to this patch, we could rely on the mechanism provided
>by it to automatically mark memory ranges as movable with information
>from ACPI SRAT/MPST/PMTT tables. So we don't need administrator to
>manually configure kernel parameters to enable memory hotplug.
>
>	Again, any comments are welcomed!
>
>Regards!
>Gerry
>
>
>On 2012-11-23 18:44, Tang Chen wrote:
>> [What we are doing]
>> This patchset provides a boot option for users to specify the ZONE_MOVABLE
>> memory map for each node in the system.
>> 
>> movablecore_map=nn[KMG]@ss[KMG]
>> 
>> This option makes sure the memory range from ss to ss+nn is movable memory.
>> 
>> 
>> [Why we do this]
>> If we hot-remove memory, the memory cannot contain kernel memory,
>> because Linux cannot migrate kernel memory currently. Therefore,
>> we have to guarantee that the hot-removed memory contains only movable
>> memory.
>> 
>> Linux has two boot options, kernelcore= and movablecore=, for
>> creating movable memory. These boot options can specify the amount
>> of memory use as kernel or movable memory. Using them, we can
>> create ZONE_MOVABLE which has only movable memory.
>> 
>> But it does not fulfill a requirement of memory hot-remove, because
>> even if we specify these boot options, movable memory is distributed
>> evenly across the nodes. So when we want to hot-remove memory whose
>> range is 0x80000000-0xc0000000, we have no way to specify
>> that memory as movable memory.
>> 
>> So we proposed a new feature which specifies memory range to use as
>> movable memory.
>> 
>> 
>> [Ways to do this]
>> There may be 2 ways to specify movable memory.
>>  1. use firmware information
>>  2. use boot option
>> 
>> 1. use firmware information
>>   According to the ACPI 5.0 spec, the SRAT table has a memory affinity
>>   structure, and the structure has a Hot Pluggable field. See "5.2.16.2
>>   Memory Affinity Structure". If we use this information, we might be able
>>   to specify movable memory via firmware. For example, if the Hot Pluggable
>>   field is set, Linux treats the memory as movable memory.
>> 
>> 2. use boot option
>>   This is our proposal. New boot option can specify memory range to use
>>   as movable memory.
>> 
>> 
>> [How we do this]
>> We chose the second way, because with the first way, users cannot easily
>> change which memory range to use as movable memory. We think that if we
>> create movable memory, a performance regression may occur due to NUMA.
>> In that case, users can easily turn off the feature if we provide the
>> boot option. And with the boot option, users can easily select which
>> memory to use as movable memory.
>> 
>> 
>> [How to use]
>> Specify the following boot option:
>> movablecore_map=nn[KMG]@ss[KMG]
>> 
>> That means physical address range from ss to ss+nn will be allocated as
>> ZONE_MOVABLE.
>> 
>> And the following points should be considered.
>> 
>> 1) If the range is involved in a single node, then from ss to the end of
>>    the node will be ZONE_MOVABLE.
>> 2) If the range covers two or more nodes, then from ss to the end of
>>    the node will be ZONE_MOVABLE, and all the other nodes will only
>>    have ZONE_MOVABLE.
>> 3) If no range is in the node, then the node will have no ZONE_MOVABLE
>>    unless kernelcore or movablecore is specified.
>> 4) This option could be specified at most MAX_NUMNODES times.
>> 5) If kernelcore or movablecore is also specified, movablecore_map will have
>>    higher priority to be satisfied.
>> 6) This option has no conflict with memmap option.
>> 
>> 
>> 
>> Tang Chen (4):
>>   page_alloc: add movable_memmap kernel parameter
>>   page_alloc: Introduce zone_movable_limit[] to keep movable limit for
>>     nodes
>>   page_alloc: Make movablecore_map has higher priority
>>   page_alloc: Bootmem limit with movablecore_map
>> 
>> Yasuaki Ishimatsu (1):
>>   x86: get pg_data_t's memory from other node
>> 
>>  Documentation/kernel-parameters.txt |   17 +++
>>  arch/x86/mm/numa.c                  |   11 ++-
>>  include/linux/memblock.h            |    1 +
>>  include/linux/mm.h                  |   11 ++
>>  mm/memblock.c                       |   15 +++-
>>  mm/page_alloc.c                     |  216 ++++++++++++++++++++++++++++++++++-
>>  6 files changed, 263 insertions(+), 8 deletions(-)
>> 
>> 
>> .
>> 
>
>

^ permalink raw reply	[flat|nested] 170+ messages in thread


* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-29  1:42     ` Jaegeuk Hanse
@ 2012-11-29  2:25       ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-29  2:25 UTC (permalink / raw)
  To: Jaegeuk Hanse
  Cc: Tang Chen, hpa, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Tony Luck, Wang, Frank

On 2012-11-29 9:42, Jaegeuk Hanse wrote:
> On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
>> Hi all,
>> 	This seems like a great chance to discuss the memory hotplug feature
>> within this thread, so I will try to give some high-level thoughts about the
>> memory hotplug feature on x86/IA64. Any comments are welcome!
>> 	First of all, I think usability really matters. Ideally, memory hotplug
>> feature should just work out of box, and we shouldn't expect administrators to 
>> add several extra platform dependent parameters to enable memory hotplug. 
>> But how to enable memory (or CPU/node) hotplug out of box? I think the key point
>> is to cooperate with BIOS/ACPI/firmware/device management teams. 
>> 	I still position memory hotplug as an advanced feature for high end 
>> servers and those systems may/should provide some management interfaces to 
>> configure CPU/memory/node hotplug features. The configuration UI may be provided
>> by BIOS, BMC or centralized system management suite. Once administrator enables
>> hotplug feature through those management UI, OS should support system device
>> hotplug out of box. For example, HP SuperDome2 management suite provides interface
>> to configure a node as floating node(hot-removable). And OpenSolaris supports
>> CPU/memory hotplug out of box without any extra configurations. So we should
>> shape interfaces between firmware and OS to better support system device hotplug.
>> 	On the other hand, I think there are no commercially available x86/IA64
>> platforms with system device hotplug capabilities in the field yet, or at most
>> a limited quantity. So backward compatibility is not a big issue for us now.
>> So I think it's doable to rely on firmware to provide better support for system
>> device hotplug.
>> 	Then what should be enhanced to better support system device hotplug?
>>
>> 1) ACPI specification should be enhanced to provide a static table to describe
>> components with hotplug features, so OS could reserve special resources for
>> hotplug at early boot stages. For example, to reserve enough CPU ids for CPU
>> hot-add. Currently we guess maximum number of CPUs supported by the platform
>> by counting CPU entries in APIC table, that's not reliable.
>>
>> 2) BIOS should implement SRAT, MPST and PMTT tables to better support memory
>> hotplug. SRAT associates memory ranges with proximity domains with an extra
>> "hotpluggable" flag. PMTT provides memory device topology information, such
>> as "socket->memory controller->DIMM". MPST is used for memory power management
>> and provides a way to associate memory ranges with memory devices in PMTT.
>> With all information from SRAT, MPST and PMTT, OS could figure out hotplug
>> memory ranges automatically, so no extra kernel parameters needed.
>>
>> 3) Enhance ACPICA to provide a method to scan static ACPI tables before
>> memory subsystem has been initialized because OS need to access SRAT,
>> MPST and PMTT when initializing memory subsystem.
>>
>> 4) The last and the most important issue is how to minimize performance
>> drop caused by memory hotplug. As proposed by this patchset, once we
>> configure all memory of a NUMA node as movable, it essentially disables
>> NUMA optimization of kernel memory allocation from that node. In our
>> experience, that causes a huge performance drop. We have observed a
>> 10-30% performance drop with memory hotplug enabled, and on another
>> OS the average performance drop caused by memory hotplug is about 10%.
>> If we can't resolve the performance drop, memory hotplug is just a feature
>> for demo:( With help from hardware, we do have some chances to reduce
>> performance penalty caused by memory hotplug.
>> 	As we know, Linux can migrate movable pages, but can't migrate
>> non-movable pages used by the kernel/DMA etc. And the hardest part is how
>> to deal with those unmovable pages when hot-removing a memory device.
>> Now hardware has given us a hand with a technology named memory migration,
>> which could transparently migrate memory between memory devices. There's
>> no OS visible changes except NUMA topology before and after hardware memory
>> migration.
>> 	And if there are multiple memory devices within a NUMA node,
>> we could configure some memory devices to host unmovable memory and the
>> other to host movable memory. With this configuration, there won't be a
>> big performance drop because we have preserved all the NUMA optimizations.
>> We also could achieve memory hotplug remove by:
>> 1) Use existing page migration mechanism to reclaim movable pages.
>> 2) For memory devices hosting unmovable pages, we need:
>> 2.1) find a movable memory device on other nodes with enough capacity
>> and reclaim it.
>> 2.2) use hardware migration technology to migrate unmovable memory to
> 
> Hi Jiang,
> 
> Could you give an explanation how hardware migration technology works?
Hi Jaegeuk,
	Now some servers support a hardware memory RAS feature called memory
mirroring, something like RAID1. The mirrored memory devices are configured
with the same address and host the same contents, and you can transparently
hot-remove one of the mirrored memory devices without any help from the OS.

We can think of memory migration as an extension of the memory mirror
technology. The basic flow for memory migration is:
1) Find a spare memory device with enough capacity in the system.
2) OS issues a request to firmware to migrate from source memory device (A)
   to the spare memory device (B).
3) Firmware configures A and B into mirror mode, and configures A as master
   and B as slave.
4) Firmware resilvers the mirror to synchronize the contents from A to B.
5) Firmware reconfigures B as master and A as slave.
6) Firmware deconfigures the memory mirror and removes A.
7) Firmware reports the results to the OS.
8) Now user could hot-remove the source memory device A from system.

During memory migration, A and B are in mirror mode, so CPUs and IO devices
can access them as normal. After memory migration, memory device B will have
the same address ranges and contents as memory device A, so there are no
OS-visible changes except latency (because A and B may belong to different
NUMA domains).

So hardware memory migration can be used to migrate pages that can't be
migrated by the OS.

Regards!
Gerry

> 
> Regards,
> Jaegeuk
> 
>> the just reclaimed memory device on other nodes.
>>
>> 	I hope we could expect users to adopt memory hotplug technology
>> with all these implemented.
>>
>> 	Back to this patch, we could rely on the mechanism provided
>> by it to automatically mark memory ranges as movable with information
>>from ACPI SRAT/MPST/PMTT tables. So we don't need administrator to
>> manually configure kernel parameters to enable memory hotplug.
>>
>> 	Again, any comments are welcomed!
>>
>> Regards!
>> Gerry
>>
>>
>> On 2012-11-23 18:44, Tang Chen wrote:
>>> [What we are doing]
>>> This patchset provides a boot option for users to specify the ZONE_MOVABLE
>>> memory map for each node in the system.
>>>
>>> movablecore_map=nn[KMG]@ss[KMG]
>>>
>>> This option makes sure the memory range from ss to ss+nn is movable memory.
>>>
>>>
>>> [Why we do this]
>>> If we hot-remove memory, the memory cannot contain kernel memory,
>>> because Linux cannot migrate kernel memory currently. Therefore,
>>> we have to guarantee that the hot-removed memory contains only movable
>>> memory.
>>>
>>> Linux has two boot options, kernelcore= and movablecore=, for
>>> creating movable memory. These boot options can specify the amount
>>> of memory use as kernel or movable memory. Using them, we can
>>> create ZONE_MOVABLE which has only movable memory.
>>>
>>> But it does not fulfill a requirement of memory hot remove, because
>>> even if we specify the boot options, movable memory is distributed
>>> in each node evenly. So when we want to hot remove memory which
>>> memory range is 0x80000000-0xc0000000, we have no way to specify
>>> the memory as movable memory.
>>>
>>> So we proposed a new feature which specifies memory range to use as
>>> movable memory.
>>>
>>>
>>> [Ways to do this]
>>> There may be 2 ways to specify movable memory.
>>>  1. use firmware information
>>>  2. use boot option
>>>
>>> 1. use firmware information
>>>   According to ACPI spec 5.0, the SRAT table has a memory affinity structure
>>>   and the structure has a Hot Pluggable Field. See "5.2.16.2 Memory
>>>   Affinity Structure". If we use the information, we might be able to
>>>   specify movable memory by firmware. For example, if the Hot Pluggable
>>>   Field is enabled, Linux sets the memory as movable memory.
>>>
>>> 2. use boot option
>>>   This is our proposal. New boot option can specify memory range to use
>>>   as movable memory.
>>>
>>>
>>> [How we do this]
>>> We chose second way, because if we use first way, users cannot change
>>> memory range to use as movable memory easily. We think if we create
>>> movable memory, performance regression may occur by NUMA. In this case,
>>> user can turn off the feature easily if we prepare the boot option.
>>> And if we prepare the boot option, the user can select which memory
>>> to use as movable memory easily. 
>>>
>>>
>>> [How to use]
>>> Specify the following boot option:
>>> movablecore_map=nn[KMG]@ss[KMG]
>>>
>>> That means physical address range from ss to ss+nn will be allocated as
>>> ZONE_MOVABLE.
>>>
>>> And the following points should be considered.
>>>
>>> 1) If the range is involved in a single node, then from ss to the end of
>>>    the node will be ZONE_MOVABLE.
>>> 2) If the range covers two or more nodes, then from ss to the end of
>>>    the node will be ZONE_MOVABLE, and all the other nodes will only
>>>    have ZONE_MOVABLE.
>>> 3) If no range is in the node, then the node will have no ZONE_MOVABLE
>>>    unless kernelcore or movablecore is specified.
>>> 4) This option could be specified at most MAX_NUMNODES times.
>>> 5) If kernelcore or movablecore is also specified, movablecore_map will have
>>>    higher priority to be satisfied.
>>> 6) This option has no conflict with memmap option.
>>>
>>>
>>>
>>> Tang Chen (4):
>>>   page_alloc: add movable_memmap kernel parameter
>>>   page_alloc: Introduce zone_movable_limit[] to keep movable limit for
>>>     nodes
>>>   page_alloc: Make movablecore_map has higher priority
>>>   page_alloc: Bootmem limit with movablecore_map
>>>
>>> Yasuaki Ishimatsu (1):
>>>   x86: get pg_data_t's memory from other node
>>>
>>>  Documentation/kernel-parameters.txt |   17 +++
>>>  arch/x86/mm/numa.c                  |   11 ++-
>>>  include/linux/memblock.h            |    1 +
>>>  include/linux/mm.h                  |   11 ++
>>>  mm/memblock.c                       |   15 +++-
>>>  mm/page_alloc.c                     |  216 ++++++++++++++++++++++++++++++++++-
>>>  6 files changed, 263 insertions(+), 8 deletions(-)
>>>
>>>
>>> .
>>>
>>
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> .
> 



^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
@ 2012-11-29  2:25       ` Jiang Liu
  0 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-29  2:25 UTC (permalink / raw)
  To: Jaegeuk Hanse
  Cc: Tang Chen, hpa, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Tony Luck, Wang, Frank

On 2012-11-29 9:42, Jaegeuk Hanse wrote:
> On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
>> Hi all,
>> 	Seems it's a great chance to discuss about the memory hotplug feature
>> within this thread. So I will try to give some high level thoughts about memory
>> hotplug feature on x86/IA64. Any comments are welcomed!
>> 	First of all, I think usability really matters. Ideally, memory hotplug
>> feature should just work out of box, and we shouldn't expect administrators to 
>> add several extra platform dependent parameters to enable memory hotplug. 
>> But how to enable memory (or CPU/node) hotplug out of box? I think the key point
>> is to cooperate with BIOS/ACPI/firmware/device management teams. 
>> 	I still position memory hotplug as an advanced feature for high end 
>> servers and those systems may/should provide some management interfaces to 
>> configure CPU/memory/node hotplug features. The configuration UI may be provided
>> by BIOS, BMC or centralized system management suite. Once administrator enables
>> hotplug feature through those management UI, OS should support system device
>> hotplug out of box. For example, HP SuperDome2 management suite provides interface
>> to configure a node as floating node(hot-removable). And OpenSolaris supports
>> CPU/memory hotplug out of box without any extra configurations. So we should
>> shape interfaces between firmware and OS to better support system device hotplug.
>> 	On the other hand, I think there are no commercial available x86/IA64
>> platforms with system device hotplug capabilities in the field yet, at least only
>> limited quantity if any. So backward compatibility is not a big issue for us now.
>> So I think it's doable to rely on firmware to provide better support for system
>> device hotplug.
>> 	Then what should be enhanced to better support system device hotplug?
>>
>> 1) ACPI specification should be enhanced to provide a static table to describe
>> components with hotplug features, so OS could reserve special resources for
>> hotplug at early boot stages. For example, to reserve enough CPU ids for CPU
>> hot-add. Currently we guess maximum number of CPUs supported by the platform
>> by counting CPU entries in APIC table, that's not reliable.
>>
>> 2) BIOS should implement SRAT, MPST and PMTT tables to better support memory
>> hotplug. SRAT associates memory ranges with proximity domains with an extra
>> "hotpluggable" flag. PMTT provides memory device topology information, such
>> as "socket->memory controller->DIMM". MPST is used for memory power management
>> and provides a way to associate memory ranges with memory devices in PMTT.
>> With all information from SRAT, MPST and PMTT, OS could figure out hotplug
>> memory ranges automatically, so no extra kernel parameters needed.
>>
>> 3) Enhance ACPICA to provide a method to scan static ACPI tables before
>> memory subsystem has been initialized because OS need to access SRAT,
>> MPST and PMTT when initializing memory subsystem.
>>
>> 4) The last and the most important issue is how to minimize performance
>> drop caused by memory hotplug. As proposed by this patchset, once we
>> configure all memory of a NUMA node as movable, it essentially disable
>> NUMA optimization of kernel memory allocation from that node. According
>> to experience, that will cause huge performance drop. We have observed
>> 10-30% performance drop with memory hotplug enabled. And on another
>> OS the average performance drop caused by memory hotplug is about 10%.
>> If we can't resolve the performance drop, memory hotplug is just a feature
>> for demo:( With help from hardware, we do have some chances to reduce
>> performance penalty caused by memory hotplug.
>> 	As we know, Linux could migrate movable page, but can't migrate
>> non-movable pages used by kernel/DMA etc. And the most hard part is how
>> to deal with those unmovable pages when hot-removing a memory device.
>> Now hardware has given us a hand with a technology named memory migration,
>> which could transparently migrate memory between memory devices. There's
>> no OS visible changes except NUMA topology before and after hardware memory
>> migration.
>> 	And if there are multiple memory devices within a NUMA node,
>> we could configure some memory devices to host unmovable memory and the
>> other to host movable memory. With this configuration, there won't be
>> bigger performance drop because we have preserved all NUMA optimizations.
>> We also could achieve memory hotplug remove by:
>> 1) Use existing page migration mechanism to reclaim movable pages.
>> 2) For memory devices hosting unmovable pages, we need:
>> 2.1) find a movable memory device on other nodes with enough capacity
>> and reclaim it.
>> 2.2) use hardware migration technology to migrate unmovable memory to
> 
> Hi Jiang,
> 
> Could you give an explanation how hardware migration technology works?
Hi Jaegeuk,
	Now some servers support a hardware memory RAS feature called memory
mirror, something like RAID1. The mirrored memory devices are configured
with the same address and host the same contents. And you can transparently
hot-remove one of the mirrored memory devices without any help from the OS.

We could think of memory migration as an extension of the memory mirror technology.
The basic flow for memory migration is:
1) Find a spare memory device with enough capacity in the system.
2) The OS issues a request to firmware to migrate from the source memory device (A)
   to the spare memory device (B).
3) Firmware configures A and B into mirror mode, with A as master
   and B as slave.
4) Firmware resilvers the mirror to synchronize the content from A to B.
5) Firmware reconfigures B as master and A as slave.
6) Firmware deconfigures the memory mirror and removes A.
7) Firmware reports the results to the OS.
8) Now the user can hot-remove the source memory device A from the system.

During memory migration, A and B are in mirror mode, so CPUs and IO devices
can access them as normal. After memory migration, memory device B will have
the same address ranges and content as memory device A, so there are no
OS-visible changes except latency (because A and B may belong to different NUMA
domains).

So hardware memory migration could be used to migrate pages that can't be
migrated by the OS.

Regards!
Gerry

> 
> Regards,
> Jaegeuk
> 
>> the just reclaimed memory device on other nodes.
>>
>> 	I hope we could expect users to adopt memory hotplug technology
>> with all these implemented.
>>
>> 	Back to this patch, we could rely on the mechanism provided
>> by it to automatically mark memory ranges as movable with information
>> from ACPI SRAT/MPST/PMTT tables. So we don't need administrator to
>> manually configure kernel parameters to enable memory hotplug.
>>
>> 	Again, any comments are welcomed!
>>
>> Regards!
>> Gerry
>>
>>
>> On 2012-11-23 18:44, Tang Chen wrote:
>>> [What we are doing]
>>> This patchset provides a boot option for users to specify ZONE_MOVABLE memory
>>> map for each node in the system.
>>>
>>> movablecore_map=nn[KMG]@ss[KMG]
>>>
>>> This option makes sure the memory range from ss to ss+nn is movable memory.
>>>
>>>
>>> [Why we do this]
>>> If we hot remove memory, the memory cannot have kernel memory,
>>> because Linux cannot migrate kernel memory currently. Therefore,
>>> we have to guarantee that the hot-removed memory has only movable
>>> memory.
>>>
>>> Linux has two boot options, kernelcore= and movablecore=, for
>>> creating movable memory. These boot options can specify the amount
>>> of memory use as kernel or movable memory. Using them, we can
>>> create ZONE_MOVABLE which has only movable memory.
>>>
>>> But it does not fulfill a requirement of memory hot remove, because
>>> even if we specify the boot options, movable memory is distributed
>>> in each node evenly. So when we want to hot remove memory which
>>> memory range is 0x80000000-0xc0000000, we have no way to specify
>>> the memory as movable memory.
>>>
>>> So we proposed a new feature which specifies memory range to use as
>>> movable memory.
>>>
>>>
>>> [Ways to do this]
>>> There may be 2 ways to specify movable memory.
>>>  1. use firmware information
>>>  2. use boot option
>>>
>>> 1. use firmware information
>>>   According to ACPI spec 5.0, the SRAT table has a memory affinity structure
>>>   and the structure has a Hot Pluggable Field. See "5.2.16.2 Memory
>>>   Affinity Structure". If we use the information, we might be able to
>>>   specify movable memory by firmware. For example, if the Hot Pluggable
>>>   Field is enabled, Linux sets the memory as movable memory.
>>>
>>> 2. use boot option
>>>   This is our proposal. New boot option can specify memory range to use
>>>   as movable memory.
>>>
>>>
>>> [How we do this]
>>> We chose second way, because if we use first way, users cannot change
>>> memory range to use as movable memory easily. We think if we create
>>> movable memory, performance regression may occur by NUMA. In this case,
>>> user can turn off the feature easily if we prepare the boot option.
>>> And if we prepare the boot option, the user can select which memory
>>> to use as movable memory easily. 
>>>
>>>
>>> [How to use]
>>> Specify the following boot option:
>>> movablecore_map=nn[KMG]@ss[KMG]
>>>
>>> That means physical address range from ss to ss+nn will be allocated as
>>> ZONE_MOVABLE.
>>>
>>> And the following points should be considered.
>>>
>>> 1) If the range is involved in a single node, then from ss to the end of
>>>    the node will be ZONE_MOVABLE.
>>> 2) If the range covers two or more nodes, then from ss to the end of
>>>    the node will be ZONE_MOVABLE, and all the other nodes will only
>>>    have ZONE_MOVABLE.
>>> 3) If no range is in the node, then the node will have no ZONE_MOVABLE
>>>    unless kernelcore or movablecore is specified.
>>> 4) This option could be specified at most MAX_NUMNODES times.
>>> 5) If kernelcore or movablecore is also specified, movablecore_map will have
>>>    higher priority to be satisfied.
>>> 6) This option has no conflict with memmap option.
>>>
>>>
>>>
>>> Tang Chen (4):
>>>   page_alloc: add movable_memmap kernel parameter
>>>   page_alloc: Introduce zone_movable_limit[] to keep movable limit for
>>>     nodes
>>>   page_alloc: Make movablecore_map has higher priority
>>>   page_alloc: Bootmem limit with movablecore_map
>>>
>>> Yasuaki Ishimatsu (1):
>>>   x86: get pg_data_t's memory from other node
>>>
>>>  Documentation/kernel-parameters.txt |   17 +++
>>>  arch/x86/mm/numa.c                  |   11 ++-
>>>  include/linux/memblock.h            |    1 +
>>>  include/linux/mm.h                  |   11 ++
>>>  mm/memblock.c                       |   15 +++-
>>>  mm/page_alloc.c                     |  216 ++++++++++++++++++++++++++++++++++-
>>>  6 files changed, 263 insertions(+), 8 deletions(-)
>>>
>>>
>>> .
>>>
>>
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> .
> 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-29  2:25       ` Jiang Liu
  (?)
  (?)
@ 2012-11-29  2:49       ` Wanpeng Li
  2012-11-29  2:59           ` Jiang Liu
  -1 siblings, 1 reply; 170+ messages in thread
From: Wanpeng Li @ 2012-11-29  2:49 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Tang Chen, hpa, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Tony Luck, Wang, Frank

On Thu, Nov 29, 2012 at 10:25:40AM +0800, Jiang Liu wrote:
>On 2012-11-29 9:42, Jaegeuk Hanse wrote:
>> On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
>>> Hi all,
>>> 	Seems it's a great chance to discuss about the memory hotplug feature
>>> within this thread. So I will try to give some high level thoughts about memory
>>> hotplug feature on x86/IA64. Any comments are welcomed!
>>> 	First of all, I think usability really matters. Ideally, memory hotplug
>>> feature should just work out of box, and we shouldn't expect administrators to 
>>> add several extra platform dependent parameters to enable memory hotplug. 
>>> But how to enable memory (or CPU/node) hotplug out of box? I think the key point
>>> is to cooperate with BIOS/ACPI/firmware/device management teams. 
>>> 	I still position memory hotplug as an advanced feature for high end 
>>> servers and those systems may/should provide some management interfaces to 
>>> configure CPU/memory/node hotplug features. The configuration UI may be provided
>>> by BIOS, BMC or centralized system management suite. Once administrator enables
>>> hotplug feature through those management UI, OS should support system device
>>> hotplug out of box. For example, HP SuperDome2 management suite provides interface
>>> to configure a node as floating node(hot-removable). And OpenSolaris supports
>>> CPU/memory hotplug out of box without any extra configurations. So we should
>>> shape interfaces between firmware and OS to better support system device hotplug.
>>> 	On the other hand, I think there are no commercial available x86/IA64
>>> platforms with system device hotplug capabilities in the field yet, at least only
>>> limited quantity if any. So backward compatibility is not a big issue for us now.
>>> So I think it's doable to rely on firmware to provide better support for system
>>> device hotplug.
>>> 	Then what should be enhanced to better support system device hotplug?
>>>
>>> 1) ACPI specification should be enhanced to provide a static table to describe
>>> components with hotplug features, so OS could reserve special resources for
>>> hotplug at early boot stages. For example, to reserve enough CPU ids for CPU
>>> hot-add. Currently we guess maximum number of CPUs supported by the platform
>>> by counting CPU entries in APIC table, that's not reliable.
>>>
>>> 2) BIOS should implement SRAT, MPST and PMTT tables to better support memory
>>> hotplug. SRAT associates memory ranges with proximity domains with an extra
>>> "hotpluggable" flag. PMTT provides memory device topology information, such
>>> as "socket->memory controller->DIMM". MPST is used for memory power management
>>> and provides a way to associate memory ranges with memory devices in PMTT.
>>> With all information from SRAT, MPST and PMTT, OS could figure out hotplug
>>> memory ranges automatically, so no extra kernel parameters needed.
>>>
>>> 3) Enhance ACPICA to provide a method to scan static ACPI tables before
>>> memory subsystem has been initialized because OS need to access SRAT,
>>> MPST and PMTT when initializing memory subsystem.
>>>
>>> 4) The last and the most important issue is how to minimize performance
>>> drop caused by memory hotplug. As proposed by this patchset, once we
>>> configure all memory of a NUMA node as movable, it essentially disable
>>> NUMA optimization of kernel memory allocation from that node. According
>>> to experience, that will cause huge performance drop. We have observed
>>> 10-30% performance drop with memory hotplug enabled. And on another
>>> OS the average performance drop caused by memory hotplug is about 10%.
>>> If we can't resolve the performance drop, memory hotplug is just a feature
>>> for demo:( With help from hardware, we do have some chances to reduce
>>> performance penalty caused by memory hotplug.
>>> 	As we know, Linux could migrate movable page, but can't migrate
>>> non-movable pages used by kernel/DMA etc. And the most hard part is how
>>> to deal with those unmovable pages when hot-removing a memory device.
>>> Now hardware has given us a hand with a technology named memory migration,
>>> which could transparently migrate memory between memory devices. There's
>>> no OS visible changes except NUMA topology before and after hardware memory
>>> migration.
>>> 	And if there are multiple memory devices within a NUMA node,
>>> we could configure some memory devices to host unmovable memory and the
>>> other to host movable memory. With this configuration, there won't be
>>> bigger performance drop because we have preserved all NUMA optimizations.
>>> We also could achieve memory hotplug remove by:
>>> 1) Use existing page migration mechanism to reclaim movable pages.
>>> 2) For memory devices hosting unmovable pages, we need:
>>> 2.1) find a movable memory device on other nodes with enough capacity
>>> and reclaim it.
>>> 2.2) use hardware migration technology to migrate unmovable memory to
>> 
>> Hi Jiang,
>> 
>> Could you give an explanation how hardware migration technology works?
>Hi Jaegeuk,
>	Now some servers support a hardware memory RAS feature called memory
>mirror, something like RAID1. The mirrored memory devices are configured
>with the same address and host the same contents. And you can transparently
>hot-remove one of the mirrored memory devices without any help from the OS.
>
>We could think of memory migration as an extension of the memory mirror technology.
>The basic flow for memory migration is:
>1) Find a spare memory device with enough capacity in the system.
>2) The OS issues a request to firmware to migrate from the source memory device (A)
>   to the spare memory device (B).
>3) Firmware configures A and B into mirror mode, with A as master
>   and B as slave.

Hi Jiang,

Thanks for your detailed explanation. But why do we need to configure which
device is master and which is slave? It seems that, in your explanation, the
OS can only know about the change after firmware reports the results.

Regards,
Jaegeuk

>4) Firmware resilvers the mirror to synchronize the content from A to B.
>5) Firmware reconfigures B as master and A as slave.
>6) Firmware deconfigures the memory mirror and removes A.
>7) Firmware reports the results to the OS.
>8) Now the user can hot-remove the source memory device A from the system.
>
>During memory migration, A and B are in mirror mode, so CPUs and IO devices
>can access them as normal. After memory migration, memory device B will have
>the same address ranges and content as memory device A, so there are no
>OS-visible changes except latency (because A and B may belong to different NUMA
>domains).
>
>So hardware memory migration could be used to migrate pages that can't be
>migrated by the OS.
>
>Regards!
>Gerry
>
>> 
>> Regards,
>> Jaegeuk
>> 
>>> the just reclaimed memory device on other nodes.
>>>
>>> 	I hope we could expect users to adopt memory hotplug technology
>>> with all these implemented.
>>>
>>> 	Back to this patch, we could rely on the mechanism provided
>>> by it to automatically mark memory ranges as movable with information
>>> from ACPI SRAT/MPST/PMTT tables. So we don't need administrator to
>>> manually configure kernel parameters to enable memory hotplug.
>>>
>>> 	Again, any comments are welcomed!
>>>
>>> Regards!
>>> Gerry
>>>
>>>
>>> On 2012-11-23 18:44, Tang Chen wrote:
>>>> [What we are doing]
>>>> This patchset provides a boot option for users to specify ZONE_MOVABLE memory
>>>> map for each node in the system.
>>>>
>>>> movablecore_map=nn[KMG]@ss[KMG]
>>>>
>>>> This option makes sure the memory range from ss to ss+nn is movable memory.
>>>>
>>>>
>>>> [Why we do this]
>>>> If we hot remove memory, the memory cannot have kernel memory,
>>>> because Linux cannot migrate kernel memory currently. Therefore,
>>>> we have to guarantee that the hot-removed memory has only movable
>>>> memory.
>>>>
>>>> Linux has two boot options, kernelcore= and movablecore=, for
>>>> creating movable memory. These boot options can specify the amount
>>>> of memory use as kernel or movable memory. Using them, we can
>>>> create ZONE_MOVABLE which has only movable memory.
>>>>
>>>> But it does not fulfill a requirement of memory hot remove, because
>>>> even if we specify the boot options, movable memory is distributed
>>>> in each node evenly. So when we want to hot remove memory which
>>>> memory range is 0x80000000-0xc0000000, we have no way to specify
>>>> the memory as movable memory.
>>>>
>>>> So we proposed a new feature which specifies memory range to use as
>>>> movable memory.
>>>>
>>>>
>>>> [Ways to do this]
>>>> There may be 2 ways to specify movable memory.
>>>>  1. use firmware information
>>>>  2. use boot option
>>>>
>>>> 1. use firmware information
>>>>   According to ACPI spec 5.0, the SRAT table has a memory affinity structure
>>>>   and the structure has a Hot Pluggable Field. See "5.2.16.2 Memory
>>>>   Affinity Structure". If we use the information, we might be able to
>>>>   specify movable memory by firmware. For example, if the Hot Pluggable
>>>>   Field is enabled, Linux sets the memory as movable memory.
>>>>
>>>> 2. use boot option
>>>>   This is our proposal. New boot option can specify memory range to use
>>>>   as movable memory.
>>>>
>>>>
>>>> [How we do this]
>>>> We chose second way, because if we use first way, users cannot change
>>>> memory range to use as movable memory easily. We think if we create
>>>> movable memory, performance regression may occur by NUMA. In this case,
>>>> user can turn off the feature easily if we prepare the boot option.
>>>> And if we prepare the boot option, the user can select which memory
>>>> to use as movable memory easily. 
>>>>
>>>>
>>>> [How to use]
>>>> Specify the following boot option:
>>>> movablecore_map=nn[KMG]@ss[KMG]
>>>>
>>>> That means physical address range from ss to ss+nn will be allocated as
>>>> ZONE_MOVABLE.
>>>>
>>>> And the following points should be considered.
>>>>
>>>> 1) If the range is involved in a single node, then from ss to the end of
>>>>    the node will be ZONE_MOVABLE.
>>>> 2) If the range covers two or more nodes, then from ss to the end of
>>>>    the node will be ZONE_MOVABLE, and all the other nodes will only
>>>>    have ZONE_MOVABLE.
>>>> 3) If no range is in the node, then the node will have no ZONE_MOVABLE
>>>>    unless kernelcore or movablecore is specified.
>>>> 4) This option could be specified at most MAX_NUMNODES times.
>>>> 5) If kernelcore or movablecore is also specified, movablecore_map will have
>>>>    higher priority to be satisfied.
>>>> 6) This option has no conflict with memmap option.
>>>>
>>>>
>>>>
>>>> Tang Chen (4):
>>>>   page_alloc: add movable_memmap kernel parameter
>>>>   page_alloc: Introduce zone_movable_limit[] to keep movable limit for
>>>>     nodes
>>>>   page_alloc: Make movablecore_map has higher priority
>>>>   page_alloc: Bootmem limit with movablecore_map
>>>>
>>>> Yasuaki Ishimatsu (1):
>>>>   x86: get pg_data_t's memory from other node
>>>>
>>>>  Documentation/kernel-parameters.txt |   17 +++
>>>>  arch/x86/mm/numa.c                  |   11 ++-
>>>>  include/linux/memblock.h            |    1 +
>>>>  include/linux/mm.h                  |   11 ++
>>>>  mm/memblock.c                       |   15 +++-
>>>>  mm/page_alloc.c                     |  216 ++++++++++++++++++++++++++++++++++-
>>>>  6 files changed, 263 insertions(+), 8 deletions(-)
>>>>
>>>>
>>>> .
>>>>
>>>
>>>
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>> the body to majordomo@kvack.org.  For more info on Linux MM,
>>> see: http://www.linux-mm.org/ .
>>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>> 
>> .
>> 
>
>
>--
>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>the body to majordomo@kvack.org.  For more info on Linux MM,
>see: http://www.linux-mm.org/ .
>Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-29  2:25       ` Jiang Liu
  (?)
@ 2012-11-29  2:49       ` Wanpeng Li
  -1 siblings, 0 replies; 170+ messages in thread
From: Wanpeng Li @ 2012-11-29  2:49 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Tang Chen, hpa, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Tony Luck, Wang, Frank

On Thu, Nov 29, 2012 at 10:25:40AM +0800, Jiang Liu wrote:
>On 2012-11-29 9:42, Jaegeuk Hanse wrote:
>> On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
>>> Hi all,
>>> 	Seems it's a great chance to discuss about the memory hotplug feature
>>> within this thread. So I will try to give some high level thoughts about memory
>>> hotplug feature on x86/IA64. Any comments are welcomed!
>>> 	First of all, I think usability really matters. Ideally, memory hotplug
>>> feature should just work out of box, and we shouldn't expect administrators to 
>>> add several extra platform dependent parameters to enable memory hotplug. 
>>> But how to enable memory (or CPU/node) hotplug out of box? I think the key point
>>> is to cooperate with BIOS/ACPI/firmware/device management teams. 
>>> 	I still position memory hotplug as an advanced feature for high end 
>>> servers and those systems may/should provide some management interfaces to 
>>> configure CPU/memory/node hotplug features. The configuration UI may be provided
>>> by BIOS, BMC or centralized system management suite. Once administrator enables
>>> hotplug feature through those management UI, OS should support system device
>>> hotplug out of box. For example, HP SuperDome2 management suite provides interface
>>> to configure a node as floating node(hot-removable). And OpenSolaris supports
>>> CPU/memory hotplug out of box without any extra configurations. So we should
>>> shape interfaces between firmware and OS to better support system device hotplug.
>>> 	On the other hand, I think there are no commercial available x86/IA64
>>> platforms with system device hotplug capabilities in the field yet, at least only
>>> limited quantity if any. So backward compatibility is not a big issue for us now.
>>> So I think it's doable to rely on firmware to provide better support for system
>>> device hotplug.
>>> 	Then what should be enhanced to better support system device hotplug?
>>>
>>> 1) ACPI specification should be enhanced to provide a static table to describe
>>> components with hotplug features, so OS could reserve special resources for
>>> hotplug at early boot stages. For example, to reserve enough CPU ids for CPU
>>> hot-add. Currently we guess maximum number of CPUs supported by the platform
>>> by counting CPU entries in APIC table, that's not reliable.
>>>
>>> 2) BIOS should implement SRAT, MPST and PMTT tables to better support memory
>>> hotplug. SRAT associates memory ranges with proximity domains with an extra
>>> "hotpluggable" flag. PMTT provides memory device topology information, such
>>> as "socket->memory controller->DIMM". MPST is used for memory power management
>>> and provides a way to associate memory ranges with memory devices in PMTT.
>>> With all information from SRAT, MPST and PMTT, OS could figure out hotplug
>>> memory ranges automatically, so no extra kernel parameters needed.
>>>
>>> 3) Enhance ACPICA to provide a method to scan static ACPI tables before
>>> memory subsystem has been initialized because OS need to access SRAT,
>>> MPST and PMTT when initializing memory subsystem.
>>>
>>> 4) The last and most important issue is how to minimize the performance
>>> drop caused by memory hotplug. As proposed by this patchset, once we
>>> configure all memory of a NUMA node as movable, it essentially disables
>>> NUMA optimization of kernel memory allocation from that node. In our
>>> experience, that causes a huge performance drop: we have observed a
>>> 10-30% performance drop with memory hotplug enabled, and on another
>>> OS the average performance drop caused by memory hotplug is about 10%.
>>> If we can't resolve the performance drop, memory hotplug is just a demo
>>> feature:( With help from hardware, we do have some chances to reduce the
>>> performance penalty caused by memory hotplug.
>>> 	As we know, Linux can migrate movable pages, but can't migrate
>>> non-movable pages used by the kernel, DMA, etc. And the hardest part is how
>>> to deal with those unmovable pages when hot-removing a memory device.
>>> Now hardware has given us a hand with a technology named memory migration,
>>> which can transparently migrate memory between memory devices. There are
>>> no OS-visible changes except NUMA topology before and after hardware memory
>>> migration.
>>> 	And if there are multiple memory devices within a NUMA node,
>>> we could configure some memory devices to host unmovable memory and the
>>> others to host movable memory. With this configuration, there won't be a
>>> big performance drop, because we have preserved all NUMA optimizations.
>>> We also could achieve memory hotplug remove by:
>>> 1) Use existing page migration mechanism to reclaim movable pages.
>>> 2) For memory devices hosting unmovable pages, we need:
>>> 2.1) find a movable memory device on other nodes with enough capacity
>>> and reclaim it.
>>> 2.2) use hardware migration technology to migrate unmovable memory to
>> 
>> Hi Jiang,
>> 
>> Could you give an explanation how hardware migration technology works?
>Hi Jaegeuk,
>	Now some servers support a hardware memory RAS feature called memory
>mirror, something like RAID1. The mirrored memory devices are configured
>with the same address and host the same contents. And you can transparently
>hot-remove one of the mirrored memory devices without any help from the OS.
>
>We can think of memory migration as an extension of the memory mirror technology.
>The basic flow for memory migration is:
>1) Find a spare memory device with enough capacity in the system.
>2) OS issues a request to firmware to migrate from source memory device (A)
>   to the spare memory device (B).
>3) Firmware configures A and B into mirror mode, and configures A as master
>   and B as slave.

Hi Jiang,

Thanks for your detailed explanation. Then why should we configure which is
master and which is slave? It seems that in your explanation the OS can only 
learn about the change after firmware reports the results.

Regards,
Jaegeuk

>4) Firmware resilvers the mirror to synchronize the content from A to B.
>5) Firmware reconfigures B as master and A as slave.
>6) Firmware deconfigures the memory mirror and removes A.
>7) Firmware reports the results to the OS.
>8) Now user could hot-remove the source memory device A from system.
>
>During memory migration, A and B are in mirror mode, so CPUs and IO devices
>can access them as normal. After memory migration, memory device B will have
>the same address ranges and content as memory device A, so there are no
>OS-visible changes except latency (because A and B may belong to different
>NUMA domains).
>
>So hardware memory migration can be used to migrate pages that can't be
>migrated by the OS.
>
>Regards!
>Gerry
>
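The eight-step firmware flow quoted above can be mimicked with a toy user-space model. This is a purely illustrative Python sketch: the device class, fields, and the `migrate()` helper are invented for the example, since real memory migration happens inside the memory controller and firmware with no OS-visible API.

```python
# Toy model of mirror-based memory migration (steps 1-8 above).
# Hypothetical names only; no real firmware interface is modeled.

class MemoryDevice:
    def __init__(self, name, contents):
        self.name = name
        self.contents = dict(contents)   # addr -> data

def migrate(source, spare):
    """Mimic steps 3-7: mirror, resilver, swap roles, deconfigure."""
    master, slave = source, spare            # step 3: A master, B slave
    slave.contents = dict(master.contents)   # step 4: resilver A -> B
    master, slave = slave, master            # step 5: B master, A slave
    return master                            # steps 6-7: drop A, report B

a = MemoryDevice("A", {0x1000: "unmovable kernel page"})
b = MemoryDevice("B", {})
new_master = migrate(a, b)   # step 8: A can now be physically removed
```

After `migrate()` returns, device B holds the same addresses and contents that A did, matching the "no OS-visible changes except latency" property described in the quoted explanation.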
>> 
>> Regards,
>> Jaegeuk
>> 
>>> the just reclaimed memory device on other nodes.
>>>
>>> 	With all these implemented, I hope we can expect users to adopt
>>> memory hotplug technology.
>>>
>>> 	Back to this patch, we could rely on the mechanism provided
>>> by it to automatically mark memory ranges as movable with information
>>> from ACPI SRAT/MPST/PMTT tables. So we don't need the administrator to
>>> manually configure kernel parameters to enable memory hotplug.
>>>
>>> 	Again, any comments are welcomed!
>>>
>>> Regards!
>>> Gerry
>>>
>>>
>>> On 2012-11-23 18:44, Tang Chen wrote:
>>>> [What we are doing]
>>>> This patchset provides a boot option for users to specify a ZONE_MOVABLE memory
>>>> map for each node in the system.
>>>>
>>>> movablecore_map=nn[KMG]@ss[KMG]
>>>>
>>>> This option makes sure the memory range from ss to ss+nn is movable memory.
>>>>
>>>>
>>>> [Why we do this]
>>>> If we hot remove memory, the memory cannot contain kernel memory,
>>>> because Linux cannot currently migrate kernel memory. Therefore,
>>>> we have to guarantee that the hot-removed memory contains only movable
>>>> memory.
>>>>
>>>> Linux has two boot options, kernelcore= and movablecore=, for
>>>> creating movable memory. These boot options can specify the amount
>>>> of memory to use as kernel or movable memory. Using them, we can
>>>> create ZONE_MOVABLE, which has only movable memory.
>>>>
>>>> But it does not fulfill a requirement of memory hot remove, because
>>>> even if we specify the boot options, movable memory is distributed
>>>> evenly across the nodes. So when we want to hot remove memory whose
>>>> range is 0x80000000-0xc0000000, we have no way to specify that
>>>> memory as movable memory.
>>>>
>>>> So we proposed a new feature which specifies memory ranges to use as
>>>> movable memory.
>>>>
>>>>
>>>> [Ways to do this]
>>>> There may be 2 ways to specify movable memory.
>>>>  1. use firmware information
>>>>  2. use boot option
>>>>
>>>> 1. use firmware information
>>>>   According to ACPI spec 5.0, the SRAT table has a memory affinity structure,
>>>>   and the structure has a Hot Pluggable Field. See "5.2.16.2 Memory
>>>>   Affinity Structure". If we use this information, we might be able to
>>>>   specify movable memory via firmware. For example, if the Hot Pluggable
>>>>   Field is set, Linux treats the memory as movable memory.
>>>>
>>>> 2. use boot option
>>>>   This is our proposal. A new boot option can specify the memory range to
>>>>   use as movable memory.
>>>>
>>>>
>>>> [How we do this]
>>>> We chose the second way, because with the first way, users cannot easily
>>>> change the memory range to use as movable memory. We think that if we create
>>>> movable memory, a performance regression may occur due to NUMA. In this case,
>>>> the user can turn off the feature easily if we prepare the boot option.
>>>> And with the boot option, the user can easily select which memory
>>>> to use as movable memory.
>>>>
>>>>
>>>> [How to use]
>>>> Specify the following boot option:
>>>> movablecore_map=nn[KMG]@ss[KMG]
>>>>
>>>> That means physical address range from ss to ss+nn will be allocated as
>>>> ZONE_MOVABLE.
>>>>
>>>> And the following points should be considered.
>>>>
>>>> 1) If the range falls within a single node, then from ss to the end of
>>>>    that node will be ZONE_MOVABLE.
>>>> 2) If the range covers two or more nodes, then from ss to the end of
>>>>    the first node will be ZONE_MOVABLE, and all the other covered nodes
>>>>    will have only ZONE_MOVABLE.
>>>> 3) If no range is in a node, then the node will have no ZONE_MOVABLE
>>>>    unless kernelcore or movablecore is specified.
>>>> 4) This option could be specified at most MAX_NUMNODES times.
>>>> 5) If kernelcore or movablecore is also specified, movablecore_map will
>>>>    be satisfied with higher priority.
>>>> 6) This option has no conflict with memmap option.
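As a rough illustration of the nn[KMG]@ss[KMG] syntax described in the quoted cover letter, the option could be parsed as sketched below. This is a hypothetical user-space model in Python with invented helper names; the kernel itself would use its own helpers such as memparse().

```python
# Hypothetical parser for a movablecore_map=nn[KMG]@ss[KMG] argument.
# Returns the (start, end) physical address range that should become
# ZONE_MOVABLE; this is not the kernel's actual implementation.

_SUFFIXES = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

def parse_size(token):
    """Parse '512M', '1G', or a plain/hex number into bytes."""
    token = token.strip()
    if token and token[-1].upper() in _SUFFIXES:
        return int(token[:-1], 0) * _SUFFIXES[token[-1].upper()]
    return int(token, 0)

def parse_movablecore_map(arg):
    """Split 'nn[KMG]@ss[KMG]' into a (start, start + size) range."""
    size_str, sep, start_str = arg.partition("@")
    if sep != "@":
        raise ValueError("expected nn[KMG]@ss[KMG]")
    start = parse_size(start_str)
    return start, start + parse_size(size_str)
```

For example, parse_movablecore_map("1G@0x80000000") yields the range 0x80000000-0xc0000000, the same range used in the hot-remove example of the cover letter.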
>>>>
>>>>
>>>>
>>>> Tang Chen (4):
>>>>   page_alloc: add movable_memmap kernel parameter
>>>>   page_alloc: Introduce zone_movable_limit[] to keep movable limit for
>>>>     nodes
>>>>   page_alloc: Make movablecore_map has higher priority
>>>>   page_alloc: Bootmem limit with movablecore_map
>>>>
>>>> Yasuaki Ishimatsu (1):
>>>>   x86: get pg_data_t's memory from other node
>>>>
>>>>  Documentation/kernel-parameters.txt |   17 +++
>>>>  arch/x86/mm/numa.c                  |   11 ++-
>>>>  include/linux/memblock.h            |    1 +
>>>>  include/linux/mm.h                  |   11 ++
>>>>  mm/memblock.c                       |   15 +++-
>>>>  mm/page_alloc.c                     |  216 ++++++++++++++++++++++++++++++++++-
>>>>  6 files changed, 263 insertions(+), 8 deletions(-)
>>>>
>>>>
>>>> .
>>>>
>>>
>>>
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>> the body to majordomo@kvack.org.  For more info on Linux MM,
>>> see: http://www.linux-mm.org/ .
>>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>> 
>> .
>> 
>
>



* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-29  2:49       ` Wanpeng Li
@ 2012-11-29  2:59           ` Jiang Liu
  0 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-29  2:59 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Tang Chen, hpa, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Tony Luck, Wang, Frank

On 2012-11-29 10:49, Wanpeng Li wrote:
> On Thu, Nov 29, 2012 at 10:25:40AM +0800, Jiang Liu wrote:
>> On 2012-11-29 9:42, Jaegeuk Hanse wrote:
>>> On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
>>>> Hi all,
>>>> 	Seems it's a great chance to discuss the memory hotplug feature
>>>> within this thread. So I will try to give some high-level thoughts about the memory
>>>> hotplug feature on x86/IA64. Any comments are welcomed!
>>>> 	First of all, I think usability really matters. Ideally, the memory hotplug
>>>> feature should just work out of the box, and we shouldn't expect administrators to
>>>> add several extra platform-dependent parameters to enable memory hotplug.
>>>> But how to enable memory (or CPU/node) hotplug out of the box? I think the key point
>>>> is to cooperate with BIOS/ACPI/firmware/device management teams.
>>>> 	I still position memory hotplug as an advanced feature for high-end
>>>> servers, and those systems may/should provide some management interfaces to
>>>> configure CPU/memory/node hotplug features. The configuration UI may be provided
>>>> by the BIOS, a BMC, or a centralized system management suite. Once the administrator
>>>> enables the hotplug feature through those management UIs, the OS should support system
>>>> device hotplug out of the box. For example, the HP SuperDome2 management suite provides
>>>> an interface to configure a node as a floating node (hot-removable). And OpenSolaris
>>>> supports CPU/memory hotplug out of the box without any extra configuration. So we should
>>>> shape the interfaces between firmware and OS to better support system device hotplug.
>>>> 	On the other hand, I think there are no commercially available x86/IA64
>>>> platforms with system device hotplug capabilities in the field yet, or at most
>>>> in limited quantities. So backward compatibility is not a big issue for us now,
>>>> and I think it's doable to rely on firmware to provide better support for system
>>>> device hotplug.
>>>> 	Then what should be enhanced to better support system device hotplug?
>>>>
>>>> 1) ACPI specification should be enhanced to provide a static table to describe
>>>> components with hotplug features, so OS could reserve special resources for
>>>> hotplug at early boot stages. For example, to reserve enough CPU ids for CPU
>>>> hot-add. Currently we guess the maximum number of CPUs supported by the platform
>>>> by counting CPU entries in the APIC table, which is not reliable.
>>>>
>>>> 2) BIOS should implement SRAT, MPST and PMTT tables to better support memory
>>>> hotplug. SRAT associates memory ranges with proximity domains with an extra
>>>> "hotpluggable" flag. PMTT provides memory device topology information, such
>>>> as "socket->memory controller->DIMM". MPST is used for memory power management
>>>> and provides a way to associate memory ranges with memory devices in PMTT.
>>>> With all information from SRAT, MPST and PMTT, OS could figure out hotplug
>>>> memory ranges automatically, so no extra kernel parameters needed.
>>>>
>>>> 3) Enhance ACPICA to provide a method to scan static ACPI tables before
>>>> the memory subsystem has been initialized, because the OS needs to access
>>>> SRAT, MPST and PMTT when initializing the memory subsystem.
>>>>
>>>> 4) The last and most important issue is how to minimize the performance
>>>> drop caused by memory hotplug. As proposed by this patchset, once we
>>>> configure all memory of a NUMA node as movable, it essentially disables
>>>> NUMA optimization of kernel memory allocation from that node. In our
>>>> experience, that causes a huge performance drop: we have observed a
>>>> 10-30% performance drop with memory hotplug enabled, and on another
>>>> OS the average performance drop caused by memory hotplug is about 10%.
>>>> If we can't resolve the performance drop, memory hotplug is just a demo
>>>> feature:( With help from hardware, we do have some chances to reduce the
>>>> performance penalty caused by memory hotplug.
>>>> 	As we know, Linux can migrate movable pages, but can't migrate
>>>> non-movable pages used by the kernel, DMA, etc. And the hardest part is how
>>>> to deal with those unmovable pages when hot-removing a memory device.
>>>> Now hardware has given us a hand with a technology named memory migration,
>>>> which can transparently migrate memory between memory devices. There are
>>>> no OS-visible changes except NUMA topology before and after hardware memory
>>>> migration.
>>>> 	And if there are multiple memory devices within a NUMA node,
>>>> we could configure some memory devices to host unmovable memory and the
>>>> others to host movable memory. With this configuration, there won't be a
>>>> big performance drop, because we have preserved all NUMA optimizations.
>>>> We also could achieve memory hotplug remove by:
>>>> 1) Use existing page migration mechanism to reclaim movable pages.
>>>> 2) For memory devices hosting unmovable pages, we need:
>>>> 2.1) find a movable memory device on other nodes with enough capacity
>>>> and reclaim it.
>>>> 2.2) use hardware migration technology to migrate unmovable memory to
>>>
>>> Hi Jiang,
>>>
>>> Could you give an explanation how hardware migration technology works?
>> Hi Jaegeuk,
>> 	Now some servers support a hardware memory RAS feature called memory
>> mirror, something like RAID1. The mirrored memory devices are configured
>> with the same address and host the same contents. And you can transparently
>> hot-remove one of the mirrored memory devices without any help from the OS.
>>
>> We can think of memory migration as an extension of the memory mirror technology.
>> The basic flow for memory migration is:
>> 1) Find a spare memory device with enough capacity in the system.
>> 2) OS issues a request to firmware to migrate from source memory device (A)
>>   to the spare memory device (B).
>> 3) Firmware configures A and B into mirror mode, and configures A as master
>>   and B as slave.
> 
> Hi Jiang,
> 
> Thanks for your detailed explanation. Then why should we configure which is
> master and which is slave? It seems that in your explanation the OS can only 
> learn about the change after firmware reports the results.
Hi Wanpeng,
	It's a hardware requirement. The memory mirror is designed so that
1) all memory read/write transactions are directed to the master
2) the master synchronizes write transactions to the slave

But all these details are handled by the memory controller and are transparent
to the OS. From the Linux mm subsystem's view, it doesn't know/care whether
a memory range is mirrored or not. All the magic is hidden by the hardware.

Regards!
Gerry
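The two mirror rules above can be captured in a tiny model. This is a hypothetical Python sketch only: the real behavior lives in the memory controller and is invisible to software.

```python
# Toy model of the master/slave mirror semantics described above:
# reads and writes hit the master, the master propagates every write
# to the slave, so the roles can be swapped with no visible change.

class Mirror:
    def __init__(self):
        self.master = {}
        self.slave = {}

    def write(self, addr, value):
        self.master[addr] = value   # 1) transactions go to the master
        self.slave[addr] = value    # 2) master syncs the write to the slave

    def read(self, addr):
        return self.master[addr]    # reads are served by the master

    def swap_roles(self):
        # Both sides hold identical contents after every write,
        # so swapping master and slave is invisible to readers.
        self.master, self.slave = self.slave, self.master

m = Mirror()
m.write(0x2000, "data")
m.swap_roles()
```

After `swap_roles()`, reads still return the same data, which is why the OS never needs to know which device currently plays the master role.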

> 
> Regards,
> Jaegeuk
> 
>> 4) Firmware resilvers the mirror to synchronize the content from A to B.
>> 5) Firmware reconfigures B as master and A as slave.
>> 6) Firmware deconfigures the memory mirror and removes A.
>> 7) Firmware reports the results to the OS.
>> 8) Now user could hot-remove the source memory device A from system.
>>
>> During memory migration, A and B are in mirror mode, so CPUs and IO devices
>> can access them as normal. After memory migration, memory device B will have
>> the same address ranges and content as memory device A, so there are no
>> OS-visible changes except latency (because A and B may belong to different
>> NUMA domains).
>>
>> So hardware memory migration can be used to migrate pages that can't be
>> migrated by the OS.
>>
>> Regards!
>> Gerry
>>
>>>
>>> Regards,
>>> Jaegeuk
>>>
>>>> the just reclaimed memory device on other nodes.
>>>>
>>>> 	With all these implemented, I hope we can expect users to adopt
>>>> memory hotplug technology.
>>>>
>>>> 	Back to this patch, we could rely on the mechanism provided
>>>> by it to automatically mark memory ranges as movable with information
>>>> from ACPI SRAT/MPST/PMTT tables. So we don't need the administrator to
>>>> manually configure kernel parameters to enable memory hotplug.
>>>>
>>>> 	Again, any comments are welcomed!
>>>>
>>>> Regards!
>>>> Gerry
>>>>
>>>>
>>>> On 2012-11-23 18:44, Tang Chen wrote:
>>>>> [What we are doing]
>>>>> This patchset provides a boot option for users to specify a ZONE_MOVABLE memory
>>>>> map for each node in the system.
>>>>>
>>>>> movablecore_map=nn[KMG]@ss[KMG]
>>>>>
>>>>> This option makes sure the memory range from ss to ss+nn is movable memory.
>>>>>
>>>>>
>>>>> [Why we do this]
>>>>> If we hot remove memory, the memory cannot contain kernel memory,
>>>>> because Linux cannot currently migrate kernel memory. Therefore,
>>>>> we have to guarantee that the hot-removed memory contains only movable
>>>>> memory.
>>>>>
>>>>> Linux has two boot options, kernelcore= and movablecore=, for
>>>>> creating movable memory. These boot options can specify the amount
>>>>> of memory to use as kernel or movable memory. Using them, we can
>>>>> create ZONE_MOVABLE, which has only movable memory.
>>>>>
>>>>> But it does not fulfill a requirement of memory hot remove, because
>>>>> even if we specify the boot options, movable memory is distributed
>>>>> evenly across the nodes. So when we want to hot remove memory whose
>>>>> range is 0x80000000-0xc0000000, we have no way to specify that
>>>>> memory as movable memory.
>>>>>
>>>>> So we proposed a new feature which specifies memory ranges to use as
>>>>> movable memory.
>>>>>
>>>>>
>>>>> [Ways to do this]
>>>>> There may be 2 ways to specify movable memory.
>>>>>  1. use firmware information
>>>>>  2. use boot option
>>>>>
>>>>> 1. use firmware information
>>>>>   According to ACPI spec 5.0, the SRAT table has a memory affinity structure,
>>>>>   and the structure has a Hot Pluggable Field. See "5.2.16.2 Memory
>>>>>   Affinity Structure". If we use this information, we might be able to
>>>>>   specify movable memory via firmware. For example, if the Hot Pluggable
>>>>>   Field is set, Linux treats the memory as movable memory.
>>>>>
>>>>> 2. use boot option
>>>>>   This is our proposal. A new boot option can specify the memory range to
>>>>>   use as movable memory.
>>>>>
>>>>>
>>>>> [How we do this]
>>>>> We chose the second way, because with the first way, users cannot easily
>>>>> change the memory range to use as movable memory. We think that if we create
>>>>> movable memory, a performance regression may occur due to NUMA. In this case,
>>>>> the user can turn off the feature easily if we prepare the boot option.
>>>>> And with the boot option, the user can easily select which memory
>>>>> to use as movable memory.
>>>>>
>>>>>
>>>>> [How to use]
>>>>> Specify the following boot option:
>>>>> movablecore_map=nn[KMG]@ss[KMG]
>>>>>
>>>>> That means physical address range from ss to ss+nn will be allocated as
>>>>> ZONE_MOVABLE.
>>>>>
>>>>> And the following points should be considered.
>>>>>
>>>>> 1) If the range falls within a single node, then from ss to the end of
>>>>>    that node will be ZONE_MOVABLE.
>>>>> 2) If the range covers two or more nodes, then from ss to the end of
>>>>>    the first node will be ZONE_MOVABLE, and all the other covered nodes
>>>>>    will have only ZONE_MOVABLE.
>>>>> 3) If no range is in a node, then the node will have no ZONE_MOVABLE
>>>>>    unless kernelcore or movablecore is specified.
>>>>> 4) This option could be specified at most MAX_NUMNODES times.
>>>>> 5) If kernelcore or movablecore is also specified, movablecore_map will
>>>>>    be satisfied with higher priority.
>>>>> 6) This option has no conflict with memmap option.
>>>>>
>>>>>
>>>>>
>>>>> Tang Chen (4):
>>>>>   page_alloc: add movable_memmap kernel parameter
>>>>>   page_alloc: Introduce zone_movable_limit[] to keep movable limit for
>>>>>     nodes
>>>>>   page_alloc: Make movablecore_map has higher priority
>>>>>   page_alloc: Bootmem limit with movablecore_map
>>>>>
>>>>> Yasuaki Ishimatsu (1):
>>>>>   x86: get pg_data_t's memory from other node
>>>>>
>>>>>  Documentation/kernel-parameters.txt |   17 +++
>>>>>  arch/x86/mm/numa.c                  |   11 ++-
>>>>>  include/linux/memblock.h            |    1 +
>>>>>  include/linux/mm.h                  |   11 ++
>>>>>  mm/memblock.c                       |   15 +++-
>>>>>  mm/page_alloc.c                     |  216 ++++++++++++++++++++++++++++++++++-
>>>>>  6 files changed, 263 insertions(+), 8 deletions(-)
>>>>>
>>>>>
>>>>> .
>>>>>
>>>>
>>>>
>>>
>>> .
>>>
>>
>>
> 
> 
> .
> 




* Re: [PATCH v2 0/5] Add movablecore_map boot option
@ 2012-11-29  2:59           ` Jiang Liu
  0 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-29  2:59 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Tang Chen, hpa, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Tony Luck, Wang, Frank

On 2012-11-29 10:49, Wanpeng Li wrote:
> On Thu, Nov 29, 2012 at 10:25:40AM +0800, Jiang Liu wrote:
>> On 2012-11-29 9:42, Jaegeuk Hanse wrote:
>>> On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
>>>> Hi all,
>>>> 	Seems it's a great chance to discuss about the memory hotplug feature
>>>> within this thread. So I will try to give some high level thoughts about memory
>>>> hotplug feature on x86/IA64. Any comments are welcomed!
>>>> 	First of all, I think usability really matters. Ideally, memory hotplug
>>>> feature should just work out of box, and we shouldn't expect administrators to 
>>>> add several extra platform dependent parameters to enable memory hotplug. 
>>>> But how to enable memory (or CPU/node) hotplug out of box? I think the key point
>>>> is to cooperate with BIOS/ACPI/firmware/device management teams. 
>>>> 	I still position memory hotplug as an advanced feature for high end 
>>>> servers and those systems may/should provide some management interfaces to 
>>>> configure CPU/memory/node hotplug features. The configuration UI may be provided
>>>> by BIOS, BMC or centralized system management suite. Once administrator enables
>>>> hotplug feature through those management UI, OS should support system device
>>>> hotplug out of box. For example, HP SuperDome2 management suite provides interface
>>>> to configure a node as floating node(hot-removable). And OpenSolaris supports
>>>> CPU/memory hotplug out of box without any extra configurations. So we should
>>>> shape interfaces between firmware and OS to better support system device hotplug.
>>>> 	On the other hand, I think there are no commercial available x86/IA64
>>>> platforms with system device hotplug capabilities in the field yet, at least only
>>>> limited quantity if any. So backward compatibility is not a big issue for us now.
>>>> So I think it's doable to rely on firmware to provide better support for system
>>>> device hotplug.
>>>> 	Then what should be enhanced to better support system device hotplug?
>>>>
>>>> 1) ACPI specification should be enhanced to provide a static table to describe
>>>> components with hotplug features, so OS could reserve special resources for
>>>> hotplug at early boot stages. For example, to reserve enough CPU ids for CPU
>>>> hot-add. Currently we guess maximum number of CPUs supported by the platform
>>>> by counting CPU entries in APIC table, that's not reliable.
>>>>
>>>> 2) BIOS should implement SRAT, MPST and PMTT tables to better support memory
>>>> hotplug. SRAT associates memory ranges with proximity domains with an extra
>>>> "hotpluggable" flag. PMTT provides memory device topology information, such
>>>> as "socket->memory controller->DIMM". MPST is used for memory power management
>>>> and provides a way to associate memory ranges with memory devices in PMTT.
>>>> With all information from SRAT, MPST and PMTT, OS could figure out hotplug
>>>> memory ranges automatically, so no extra kernel parameters needed.
>>>>
>>>> 3) Enhance ACPICA to provide a method to scan static ACPI tables before
>>>> memory subsystem has been initialized because OS need to access SRAT,
>>>> MPST and PMTT when initializing memory subsystem.
>>>>
>>>> 4) The last and the most important issue is how to minimize performance
>>>> drop caused by memory hotplug. As proposed by this patchset, once we
>>>> configure all memory of a NUMA node as movable, it essentially disable
>>>> NUMA optimization of kernel memory allocation from that node. According
>>>> to experience, that will cause huge performance drop. We have observed
>>>> 10-30% performance drop with memory hotplug enabled. And on another
>>>> OS the average performance drop caused by memory hotplug is about 10%.
>>>> If we can't resolve the performance drop, memory hotplug is just a feature
>>>> for demo:( With help from hardware, we do have some chances to reduce
>>>> performance penalty caused by memory hotplug.
>>>> 	As we know, Linux could migrate movable page, but can't migrate
>>>> non-movable pages used by kernel/DMA etc. And the most hard part is how
>>>> to deal with those unmovable pages when hot-removing a memory device.
>>>> Now hardware has given us a hand with a technology named memory migration,
>>>> which could transparently migrate memory between memory devices. There's
>>>> no OS visible changes except NUMA topology before and after hardware memory
>>>> migration.
>>>> 	And if there are multiple memory devices within a NUMA node,
>>>> we could configure some memory devices to host unmovable memory and the
>>>> other to host movable memory. With this configuration, there won't be
>>>> bigger performance drop because we have preserved all NUMA optimizations.
>>>> We also could achieve memory hotplug remove by:
>>>> 1) Use existing page migration mechanism to reclaim movable pages.
>>>> 2) For memory devices hosting unmovable pages, we need:
>>>> 2.1) find a movable memory device on other nodes with enough capacity
>>>> and reclaim it.
>>>> 2.2) use hardware migration technology to migrate unmovable memory to
>>>
>>> Hi Jiang,
>>>
>>> Could you give an explanation how hardware migration technology works?
>> Hi Jaegeuk,
>> 	Now some severs support a hardware memory RAS feature called memory
>> mirror, something like RAID1. The mirrored memory devices will be configured
>> with the same address and host same contents. And you could transparently
>> hot-remove one of the mirrored memory device without any help from OS.
>>
>> We could think memory migration as an extension to the memory mirror technology.
>> The basic flow for memory migration is:
>> 1) Find a spare memory device with enough capacity in the system.
>> 2) OS issues a request to firmware to migrate from source memory device (A)
>>   to the spare memory device (B).
>> 3) Firmware configures A and B into memory mode, and configure A as master
>>   and B as slave.
> 
> Hi Jiang,
> 
> THanks for your detail explanation. Then why should configure who is
> master and who is slave? It seems that in your explanation OS only can 
> know the change after firmware report the results.
Hi Wanpeng,
	It's a hardware requirement. The memory mirror is designed that
1) all memory read/write transactions will be directed to the master
2) master will synchronize write transactions to the slave

But all these details are handled by memory controller and transparent
to OS. From Linux mm subsystem's view, it doesn't know/care about whether
a memory range is mirrored or not. All the magics are hidden by hardware.

Regards!
Gerry

> 
> Regards,
> Jaegeuk
> 
>> 4) Firmware resilver the mirror to synchronize the content from A to B
>> 5) Firmware reconfigure B as master and A as slave.
>> 6) Firmware deconfigures the memory mirror and removes A
>> 7) Firmware report results to OS.
>> 8) Now user could hot-remove the source memory device A from system.
>>
>> During memory migration, A and B are in mirror mode, so CPUs and IO devices
>> could access it as normal. After memory migration, memory device B will have
>> the same address ranges and content as memory device A, so there's no OS 
>> visible changes except latency (because A and B may belong to different NUMA
>> domains).
>>
>> So hardware memory migration could be used to migrate pages can't be migrated
>> by OS.
>>
>> Regards!
>> Gerry
>>
>>>
>>> Regards,
>>> Jaegeuk
>>>
>>>> the just reclaimed memory device on other nodes.
>>>>
>>>> 	I hope we could expect users to adopt memory hotplug technology
>>>> with all these implemented.
>>>>
>>>> 	Back to this patch, we could rely on the mechanism provided
>>>> by it to automatically mark memory ranges as movable with information
>>> >from ACPI SRAT/MPST/PMTT tables. So we don't need administrator to
>>>> manually configure kernel parameters to enable memory hotplug.
>>>>
>>>> 	Again, any comments are welcomed!
>>>>
>>>> Regards!
>>>> Gerry
>>>>
>>>>
>>>> On 2012-11-23 18:44, Tang Chen wrote:
>>>>> [What we are doing]
>>>>> This patchset provide a boot option for user to specify ZONE_MOVABLE memory
>>>>> map for each node in the system.
>>>>>
>>>>> movablecore_map=nn[KMG]@ss[KMG]
>>>>>
>>>>> This option make sure memory range from ss to ss+nn is movable memory.
>>>>>
>>>>>
>>>>> [Why we do this]
>>>>> If we hot remove a memroy, the memory cannot have kernel memory,
>>>>> because Linux cannot migrate kernel memory currently. Therefore,
>>>>> we have to guarantee that the hot removed memory has only movable
>>>>> memoroy.
>>>>>
>>>>> Linux has two boot options, kernelcore= and movablecore=, for
>>>>> creating movable memory. These boot options can specify the amount
>>>>> of memory use as kernel or movable memory. Using them, we can
>>>>> create ZONE_MOVABLE which has only movable memory.
>>>>>
>>>>> But it does not fulfill a requirement of memory hot remove, because
>>>>> even if we specify the boot options, movable memory is distributed
>>>>> in each node evenly. So when we want to hot remove memory which
>>>>> memory range is 0x80000000-0c0000000, we have no way to specify
>>>>> the memory as movable memory.
>>>>>
>>>>> So we proposed a new feature which specifies memory range to use as
>>>>> movable memory.
>>>>>
>>>>>
>>>>> [Ways to do this]
>>>>> There may be 2 ways to specify movable memory.
>>>>>  1. use firmware information
>>>>>  2. use boot option
>>>>>
>>>>> 1. use firmware information
>>>>>   According to the ACPI 5.0 spec, the SRAT table has a memory affinity
>>>>>   structure, and the structure has a Hot Pluggable field. See "5.2.16.2
>>>>>   Memory Affinity Structure". Using this information, we might be able
>>>>>   to specify movable memory via firmware. For example, if the Hot
>>>>>   Pluggable field is set, Linux treats the memory as movable memory.
>>>>>
>>>>> 2. use boot option
>>>>>   This is our proposal. A new boot option can specify a memory range
>>>>>   to use as movable memory.
>>>>>
>>>>>
>>>>> [How we do this]
>>>>> We chose the second way, because with the first way, users cannot
>>>>> easily change which memory range to use as movable memory. We think
>>>>> that creating movable memory may cause a performance regression due
>>>>> to NUMA. In that case, the user can easily turn the feature off if we
>>>>> provide the boot option. And with the boot option, the user can
>>>>> easily select which memory to use as movable memory.
>>>>>
>>>>>
>>>>> [How to use]
>>>>> Specify the following boot option:
>>>>> movablecore_map=nn[KMG]@ss[KMG]
>>>>>
>>>>> That means physical address range from ss to ss+nn will be allocated as
>>>>> ZONE_MOVABLE.
>>>>>
>>>>> And the following points should be considered.
>>>>>
>>>>> 1) If the range is involved in a single node, then from ss to the end of
>>>>>    the node will be ZONE_MOVABLE.
>>>>> 2) If the range covers two or more nodes, then from ss to the end of
>>>>>    that node will be ZONE_MOVABLE, and all the other covered nodes
>>>>>    will have only ZONE_MOVABLE.
>>>>> 3) If no range is in the node, then the node will have no ZONE_MOVABLE
>>>>>    unless kernelcore or movablecore is specified.
>>>>> 4) This option could be specified at most MAX_NUMNODES times.
>>>>> 5) If kernelcore or movablecore is also specified, movablecore_map will have
>>>>>    higher priority to be satisfied.
>>>>> 6) This option has no conflict with memmap option.
>>>>>
>>>>>
>>>>>
>>>>> Tang Chen (4):
>>>>>   page_alloc: add movable_memmap kernel parameter
>>>>>   page_alloc: Introduce zone_movable_limit[] to keep movable limit for
>>>>>     nodes
>>>>>   page_alloc: Make movablecore_map has higher priority
>>>>>   page_alloc: Bootmem limit with movablecore_map
>>>>>
>>>>> Yasuaki Ishimatsu (1):
>>>>>   x86: get pg_data_t's memory from other node
>>>>>
>>>>>  Documentation/kernel-parameters.txt |   17 +++
>>>>>  arch/x86/mm/numa.c                  |   11 ++-
>>>>>  include/linux/memblock.h            |    1 +
>>>>>  include/linux/mm.h                  |   11 ++
>>>>>  mm/memblock.c                       |   15 +++-
>>>>>  mm/page_alloc.c                     |  216 ++++++++++++++++++++++++++++++++++-
>>>>>  6 files changed, 263 insertions(+), 8 deletions(-)
>>>>>
>>>>>
>>>>> .
>>>>>
>>>>
>>>>
>>>> --
>>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>>> the body to majordomo@kvack.org.  For more info on Linux MM,
>>>> see: http://www.linux-mm.org/ .
>>>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-28 21:34     ` Luck, Tony
@ 2012-11-29 10:38       ` Yasuaki Ishimatsu
  -1 siblings, 0 replies; 170+ messages in thread
From: Yasuaki Ishimatsu @ 2012-11-29 10:38 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Jiang Liu, Tang Chen, hpa, akpm, rob, laijs, wency, linfeng,
	yinghai, kosaki.motohiro, minchan.kim, mgorman, rientjes, rusty,
	linux-kernel, linux-mm, linux-doc, Len Brown, Wang, Frank

Hi Tony,

2012/11/29 6:34, Luck, Tony wrote:
>> 1. use firmware information
>>    According to ACPI spec 5.0, SRAT table has memory affinity structure
>>    and the structure has Hot Pluggable Filed. See "5.2.16.2 Memory
>>    Affinity Structure". If we use the information, we might be able to
>>    specify movable memory by firmware. For example, if Hot Pluggable
>>    Filed is enabled, Linux sets the memory as movable memory.
>>
>> 2. use boot option
>>    This is our proposal. New boot option can specify memory range to use
>>    as movable memory.
>
> Isn't this just moving the work to the user? To pick good values for the

Yes.

> movable areas, they need to know how the memory lines up across
> node boundaries ... because they need to make sure to allow some
> non-movable memory allocations on each node so that the kernel can
> take advantage of node locality.

There is no problem.
Linux already has two boot options, kernelcore= and movablecore=.
If we use them, non-movable memory is divided evenly across nodes.

But there is currently no way to specify that a node should be used as
movable memory. So we proposed the new boot option.

> So the user would have to read at least the SRAT table, and perhaps
> more, to figure out what to provide as arguments.
>

> Since this is going to be used on a dynamic system where nodes might
> be added and removed - the right values for these arguments might
> change from one boot to the next. So even if the user gets them right
> on day 1, a month later when a new node has been added, or a broken
> node removed, the values would be stale.

I don't think so. Even if we hot-add/remove a node, the memory range of
each memory device does not change. So we don't need to change the boot
option.

Thanks,
Yasuaki Ishimatsu

>
> -Tony
>



^ permalink raw reply	[flat|nested] 170+ messages in thread


* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-28 21:38       ` H. Peter Anvin
@ 2012-11-29 11:00         ` Mel Gorman
  -1 siblings, 0 replies; 170+ messages in thread
From: Mel Gorman @ 2012-11-29 11:00 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Luck, Tony, Jiang Liu, Tang Chen, akpm, rob, isimatu.yasuaki,
	laijs, wency, linfeng, yinghai, kosaki.motohiro, minchan.kim,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Wang, Frank

On Wed, Nov 28, 2012 at 01:38:47PM -0800, H. Peter Anvin wrote:
> On 11/28/2012 01:34 PM, Luck, Tony wrote:
> >>
> >> 2. use boot option
> >>   This is our proposal. New boot option can specify memory range to use
> >>   as movable memory.
> > 
> > Isn't this just moving the work to the user? To pick good values for the
> > movable areas, they need to know how the memory lines up across
> > node boundaries ... because they need to make sure to allow some
> > non-movable memory allocations on each node so that the kernel can
> > take advantage of node locality.
> > 
> > So the user would have to read at least the SRAT table, and perhaps
> > more, to figure out what to provide as arguments.
> > 
> > Since this is going to be used on a dynamic system where nodes might
> > be added and removed - the right values for these arguments might
> > change from one boot to the next. So even if the user gets them right
> > on day 1, a month later when a new node has been added, or a broken
> > node removed, the values would be stale.
> > 
> 
> I gave this feedback in person at LCE: I consider the kernel
> configuration option to be useless for anything other than debugging.
> Trying to promote it as an actual solution, to be used by end users in
> the field, is ridiculous at best.
> 

I've not been paying a whole pile of attention to this because it's not an
area I'm active in but I agree that configuring ZONE_MOVABLE like
this at boot-time is going to be problematic. As awkward as it is, it
would probably work out better to only boot with one node by default and
then hot-add the nodes at runtime using either an online sysfs file or
an online-reserved file that hot-adds the memory to ZONE_MOVABLE. Still
clumsy but better than specifying addresses on the command line.

That said, I also find using ZONE_MOVABLE to be a problem in itself that
will cause problems down the road. Maybe this was discussed already but
just in case I'll describe the problems I see.

If any significant percentage of memory is in ZONE_MOVABLE then the memory
hotplug people will have to deal with all the lowmem/highmem problems
that used to be faced by 32-bit x86 with PAE enabled. As a simple example,
metadata intensive workloads will not be able to use all of memory because
the kernel allocations will be confined to a subset of memory. A more
complex example is that page table page allocations are also restricted
meaning it's possible that a process will not even be able to mmap() a high
percentage of memory simply because it cannot allocate the page tables to
store the mappings. ZONE_MOVABLE works up to a *point*, but it's a hack. It
was a hack when it was introduced, but at least then the expectation was
that ZONE_MOVABLE was going to be used for huge pages, and there was at
least an expectation that it would not be available for normal usage.

Fundamentally the reason one would want to use ZONE_MOVABLE is because
we cannot migrate a lot of kernel memory -- slab pages, page table pages,
device-allocated buffers etc.  My understanding is that other OS's get around
this by requiring that subsystems and drivers have callbacks that allow the
core VM to force certain memory to be released but that may be impractical
for Linux. I don't know for sure though, this is just what I heard.

For Linux, the hotplug people need to start thinking about how to get
around this migration problem. The first problem faced is the memory model
and how it maps virt->phys addresses. We have a 1:1 mapping because it's
fast but not because it's a fundamental requirement. Start considering
what happens if the memory model is changed to allow some sections to have
fast lookup for virt_to_phys and other sections to have slow lookups. On
hotplug, try and empty all the sections. If the section cannot be emptied
because of kernel pages then the section gets marked as "offline-migrated"
or something. Stop the whole machine (yes, I mean stop_machine), copy
those unmovable pages to another location, update the kernel virt->phys
mapping for the section being offlined so the virt addresses point to the
new physical addresses and resume.  Virt->phys lookups are going to be
a lot slower because a full section lookup will be necessary every time
effectively breaking SPARSE_VMEMMAP and there will be a performance penalty
but it should work. This will cover some slab pages where the data is only
accessed via the virtual address -- inode caches, dcache etc.

It will not work where the physical address is used. The obvious example
is page table pages. For page tables, during stop_machine you will have to
walk all processes' page tables looking for references to the page you're
trying to move and update them. It is possible to just plain migrate
page table pages but when it was last implemented years ago there was a
constant performance penalty for everybody and it was not popular.  Taking a
heavy-handed approach just during memory hot-remove might be more palatable.

For the remaining pages such as those that have been handed to devices
or are pinned for DMA then your options become more limited. You may
still have to restrict allocating these pages (where possible) to a
region that cannot be hot-removed but at least this will be relatively
few pages.

The big downside of this proposal is that it's unproven, not designed,
would be extremely intrusive and I expect it would be a *massive* amount
of development effort that will be difficult to get right. The upside is
configuring it will be a lot easier because all you'll need is a variation
of kernelcore= to reserve a percentage of memory for allocations we *really*
cannot migrate because the physical pages are owned by a device that cannot
release them, potentially forever. The other upside is that it does not
hit crazy lowmem/highmem style problems.

ZONE_MOVABLE will at least allow a node to be removed very quickly, but
because it will paint you into a corner, there should be a plan for what
you're going to replace it with.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 170+ messages in thread


* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-29 10:38       ` Yasuaki Ishimatsu
@ 2012-11-29 11:05         ` Mel Gorman
  -1 siblings, 0 replies; 170+ messages in thread
From: Mel Gorman @ 2012-11-29 11:05 UTC (permalink / raw)
  To: Yasuaki Ishimatsu
  Cc: Luck, Tony, Jiang Liu, Tang Chen, hpa, akpm, rob, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, rientjes, rusty,
	linux-kernel, linux-mm, linux-doc, Len Brown, Wang, Frank

On Thu, Nov 29, 2012 at 07:38:26PM +0900, Yasuaki Ishimatsu wrote:
> Hi Tony,
> 
> 2012/11/29 6:34, Luck, Tony wrote:
> >>1. use firmware information
> >>   According to ACPI spec 5.0, SRAT table has memory affinity structure
> >>   and the structure has Hot Pluggable Filed. See "5.2.16.2 Memory
> >>   Affinity Structure". If we use the information, we might be able to
> >>   specify movable memory by firmware. For example, if Hot Pluggable
> >>   Filed is enabled, Linux sets the memory as movable memory.
> >>
> >>2. use boot option
> >>   This is our proposal. New boot option can specify memory range to use
> >>   as movable memory.
> >
> >Isn't this just moving the work to the user? To pick good values for the
> 
> Yes.
> 
> >movable areas, they need to know how the memory lines up across
> >node boundaries ... because they need to make sure to allow some
> >non-movable memory allocations on each node so that the kernel can
> >take advantage of node locality.
> 
> There is no problem.
> Linux has already two boot options, kernelcore= and movablecore=.
> So if we use them, non-movable memory is divided into each node evenly.
> 

The motivation for those options was to reserve a percentage of memory
to be used for hugepage allocation. If hugepages were not being used at
a particular time then they could be used for other purposes. While the
system could in theory face lowmem/highmem style problems, in practice
it did not happen because the memory would be allocated as hugetlbfs
pages and unavailable anyway. The same does not really apply to a general
purpose system that you want to support memory hot-remove on so be wary of
lowmem/highmem style problems caused by relying too heavily on ZONE_MOVABLE.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 170+ messages in thread


* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-29 10:38       ` Yasuaki Ishimatsu
@ 2012-11-29 15:47         ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-29 15:47 UTC (permalink / raw)
  To: Yasuaki Ishimatsu
  Cc: Luck, Tony, Jiang Liu, Tang Chen, hpa, akpm, rob, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Wang, Frank

On 11/29/2012 06:38 PM, Yasuaki Ishimatsu wrote:
> Hi Tony,
> 
> 2012/11/29 6:34, Luck, Tony wrote:
>>> 1. use firmware information
>>>    According to ACPI spec 5.0, SRAT table has memory affinity structure
>>>    and the structure has Hot Pluggable Filed. See "5.2.16.2 Memory
>>>    Affinity Structure". If we use the information, we might be able to
>>>    specify movable memory by firmware. For example, if Hot Pluggable
>>>    Filed is enabled, Linux sets the memory as movable memory.
>>>
>>> 2. use boot option
>>>    This is our proposal. New boot option can specify memory range to use
>>>    as movable memory.
>>
>> Isn't this just moving the work to the user? To pick good values for the
> 
> Yes.
> 
>> movable areas, they need to know how the memory lines up across
>> node boundaries ... because they need to make sure to allow some
>> non-movable memory allocations on each node so that the kernel can
>> take advantage of node locality.
> 
> There is no problem.
> Linux has already two boot options, kernelcore= and movablecore=.
> So if we use them, non-movable memory is divided into each node evenly.
> 
> But there is no way to specify a node used as movable currently. So
> we proposed the new boot option.
> 
>> So the user would have to read at least the SRAT table, and perhaps
>> more, to figure out what to provide as arguments.
>>
> 
>> Since this is going to be used on a dynamic system where nodes might
>> be added and removed - the right values for these arguments might
>> change from one boot to the next. So even if the user gets them right
>> on day 1, a month later when a new node has been added, or a broken
>> node removed, the values would be stale.
> 
> I don't think so. Even if we hot add/remove node, the memory range of
> each memory device is not changed. So we don't need to change the boot
> option.
Hi Yasuaki,
	Addresses assigned to each memory device may change under different
hardware configurations.
	According to my experience with some hotplug-capable Xeon and Itanium
systems, a typical algorithm adopted by the BIOS to support memory hotplug is:
1) For backward compatibility, the BIOS assigns contiguous addresses to memory
devices present at boot time. In other words, there are no holes in the memory
addresses except the hole just below 4G reserved for MMIO and other
arch-specific usage.
2) To support memory hotplug, the BIOS reserves enough memory address ranges
at the high end.
 
	Let's take a typical 4-socket system as an example. Say we have four
sockets S0-S3, and each socket supports two memory devices (M0-M1) at maximum.
Each memory device supports 128G of memory at maximum. At boot, all memory
slots are fully populated with 4GB memory devices. Then the address assignment
looks like:
0-2G: 		S0.M0
2-4G: 		MMIO
4-8G: 		S0.M1
8-12G: 		S1.M0
12-16G: 	S1.M1
16-20G: 	S2.M0
20-24G:		S2.M1
24-28G: 	S3.M0
28-32G:		S3.M1
32-34G:		S0.M0 (memory recovered from the MMIO hole)
1024-1152G:	reserved for S0.M0
1152-1280G:	reserved for S0.M1
1280-1408G:	reserved for S1.M0
1408-1536G:	reserved for S1.M1
1536-1664G:	reserved for S2.M0
1664-1792G:	reserved for S2.M1
1792-1920G:	reserved for S3.M0
1920-2048G:	reserved for S3.M1

If we hot-remove S2.M0 and add back a bigger memory device with 8G memory, it will
be assigned a new memory address range 1536-1544G.

Based on the above algorithm, suppose we configure 16-24G (S2.M0 and S2.M1) as
movable memory. Then:
1) memory on S3 will be configured as movable if S2 isn't present at boot time
   (the same effect as "movable_node" in the discussion at
   https://lkml.org/lkml/2012/11/27/154);
2) S2.M0 will be configured as non-movable and S3.M0 will be configured as
   movable if S1.M0 isn't present at boot;
3) and what happens if we replace S1.M0 with an 8GB memory device?

To summarize, a kernel parameter configuring movable memory for hotplug can
easily become invalid if the hardware configuration changes, and that may
confuse administrators. I still think the most reliable way is to figure out
movable memory for hotplug by parsing hardware configuration information from
the BIOS.

Regards!
Gerry


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 0/5] Add movablecore_map boot option
@ 2012-11-29 15:47         ` Jiang Liu
  0 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-29 15:47 UTC (permalink / raw)
  To: Yasuaki Ishimatsu
  Cc: Luck, Tony, Jiang Liu, Tang Chen, hpa, akpm, rob, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Wang, Frank

On 11/29/2012 06:38 PM, Yasuaki Ishimatsu wrote:
> Hi Tony,
> 
> 2012/11/29 6:34, Luck, Tony wrote:
>>> 1. use firmware information
>>>    According to ACPI spec 5.0, SRAT table has memory affinity structure
>>>    and the structure has Hot Pluggable Filed. See "5.2.16.2 Memory
>>>    Affinity Structure". If we use the information, we might be able to
>>>    specify movable memory by firmware. For example, if Hot Pluggable
>>>    Filed is enabled, Linux sets the memory as movable memory.
>>>
>>> 2. use boot option
>>>    This is our proposal. New boot option can specify memory range to use
>>>    as movable memory.
>>
>> Isn't this just moving the work to the user? To pick good values for the
> 
> Yes.
> 
>> movable areas, they need to know how the memory lines up across
>> node boundaries ... because they need to make sure to allow some
>> non-movable memory allocations on each node so that the kernel can
>> take advantage of node locality.
> 
> There is no problem.
> Linux already has two boot options, kernelcore= and movablecore=.
> If we use them, non-movable memory is distributed evenly across nodes.
> 
> But there is currently no way to specify that a particular node be used
> as movable memory. So we proposed the new boot option.
> 
>> So the user would have to read at least the SRAT table, and perhaps
>> more, to figure out what to provide as arguments.
>>
> 
>> Since this is going to be used on a dynamic system where nodes might
>> be added an removed - the right values for these arguments might
>> change from one boot to the next. So even if the user gets them right
>> on day 1, a month later when a new node has been added, or a broken
>> node removed the values would be stale.
> 
> I don't think so. Even if we hot-add/remove a node, the memory range of
> each memory device does not change. So we don't need to change the boot
> option.
Hi Yasuaki,
	Addresses assigned to each memory device may change under different 
hardware configurations.
	According to my experience with some hotplug-capable Xeon and Itanium
systems, a typical algorithm adopted by the BIOS to support memory hotplug is:
1) For backward compatibility, the BIOS assigns contiguous addresses to memory
devices present at boot time. In other words, there are no holes in the memory
addresses except the hole just below 4G reserved for MMIO and other
arch-specific usage.
2) To support memory hotplug, BIOS reserves enough memory address ranges 
at the high end.
 
	Let's take a typical 4 sockets system as an example. Say we have four
sockets S0-S3, and each socket supports two memory devices(M0-M1) at maximum. 
Each memory device supports 128G memory at maximum. And at boot, all memory
slots are fully populated with 4GB memory. Then the address assignment looks
like:
0-2G: 		S0.M0
2-4G: 		MMIO
4-8G: 		S0.M1
8-12G: 		S1.M0
12-16G: 	S1.M1
16-20G: 	S2.M0
20-24G:		S2.M1
24-28G: 	S3.M0
28-32G:		S3.M1
32-34G:		S0.M0 (memory recovered from the MMIO hole)
1024-1152G:	reserved for S0.M0
1152-1280G:	reserved for S0.M1
1280-1408G:	reserved for S1.M0
1408-1536G:	reserved for S1.M1
1536-1664G:	reserved for S2.M0
1664-1792G:	reserved for S2.M1
1792-1920G:	reserved for S3.M0
1920-2048G:	reserved for S3.M1

If we hot-remove S2.M0 and add back a bigger, 8G memory device, it will
be assigned the new memory address range 1536-1544G.

Based on the above algorithm, suppose we configure 16-24G (S2.M0 and S2.M1) as movable memory:
1) memory on S3 will be configured as movable if S2 isn't present at boot time (the
   same effect as "movable_node" in the discussion at https://lkml.org/lkml/2012/11/27/154);
2) S2.M0 will be configured as non-movable and S3.M0 will be configured as movable
   if S1.M0 isn't present at boot;
3) and what if S1.M0 is replaced with an 8GB memory device?
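
For reference, the 16-24G movable range in this example would be requested with the
proposed option like this (nn[KMG]@ss[KMG] syntax from the cover letter; the boot
entry shown is a hypothetical sketch):

```shell
# Hypothetical boot entry: mark 8G starting at 16G (S2.M0 and S2.M1 above)
# as ZONE_MOVABLE. This is exactly the address binding that goes stale
# when S1.M0 is absent or replaced.
linux /boot/vmlinuz ... movablecore_map=8G@16G
```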

To summarize, kernel parameter to configure movable memory for hotplug will easily
become invalid if hardware configuration changes, and that may confuse administrators.
I still think the most reliable way is to figure out movable memory for hotplug by
parsing hardware configuration information from BIOS.

Regards!
Gerry



* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-29 10:38       ` Yasuaki Ishimatsu
@ 2012-11-29 15:53         ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-29 15:53 UTC (permalink / raw)
  To: Yasuaki Ishimatsu
  Cc: Luck, Tony, Jiang Liu, Tang Chen, hpa, akpm, rob, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Wang, Frank

Hi Yasuaki,
	Forgot to mention that I have no objection to this patchset.
I think it's a good starting point, but we still need to improve the
usability of memory hotplug by passing platform-specific information from
the BIOS. The mechanism provided by this patchset may be used to improve
usability too.

Regards!
Gerry

On 11/29/2012 06:38 PM, Yasuaki Ishimatsu wrote:
> Hi Tony,
> 
> 2012/11/29 6:34, Luck, Tony wrote:
>>> 1. use firmware information
>>>    According to the ACPI 5.0 spec, the SRAT table has a memory affinity
>>>    structure, and the structure has a Hot Pluggable Field. See "5.2.16.2
>>>    Memory Affinity Structure". If we use this information, we might be able
>>>    to specify movable memory via firmware. For example, if the Hot Pluggable
>>>    Field is enabled, Linux sets the memory as movable memory.
>>>
>>> 2. use boot option
>>>    This is our proposal. New boot option can specify memory range to use
>>>    as movable memory.
>>
>> Isn't this just moving the work to the user? To pick good values for the
> 
> Yes.
> 
>> movable areas, they need to know how the memory lines up across
>> node boundaries ... because they need to make sure to allow some
>> non-movable memory allocations on each node so that the kernel can
>> take advantage of node locality.
> 
> There is no problem.
> Linux already has two boot options, kernelcore= and movablecore=.
> If we use them, non-movable memory is distributed evenly across nodes.
> 
> But there is currently no way to specify that a particular node be used
> as movable memory. So we proposed the new boot option.
> 
>> So the user would have to read at least the SRAT table, and perhaps
>> more, to figure out what to provide as arguments.
>>
> 
>> Since this is going to be used on a dynamic system where nodes might
>> be added and removed - the right values for these arguments might
>> change from one boot to the next. So even if the user gets them right
>> on day 1, a month later when a new node has been added, or a broken
>> node removed the values would be stale.
> 
> I don't think so. Even if we hot-add/remove a node, the memory range of
> each memory device does not change. So we don't need to change the boot
> option.
> 
> Thanks,
> Yasuaki Ishimatsu
> 
>>
>> -Tony
>>
> 
> 




* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-29 11:00         ` Mel Gorman
@ 2012-11-29 16:07           ` H. Peter Anvin
  -1 siblings, 0 replies; 170+ messages in thread
From: H. Peter Anvin @ 2012-11-29 16:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Luck, Tony, Jiang Liu, Tang Chen, akpm, rob, isimatu.yasuaki,
	laijs, wency, linfeng, yinghai, kosaki.motohiro, minchan.kim,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Wang, Frank

On 11/29/2012 03:00 AM, Mel Gorman wrote:
> 
> I've not been paying a whole pile of attention to this because it's not an
> area I'm active in but I agree that configuring ZONE_MOVABLE like
> this at boot-time is going to be problematic. As awkward as it is, it
> would probably work out better to only boot with one node by default and
> then hot-add the nodes at runtime using either an online sysfs file or
> an online-reserved file that hot-adds the memory to ZONE_MOVABLE. Still
> clumsy but better than specifying addresses on the command line.
> 
> That said, I also find using ZONE_MOVABLE to be a problem in itself that
> will cause problems down the road. Maybe this was discussed already but
> just in case I'll describe the problems I see.
> 

Yes, and it does mean that we definitely don't want everything that can
be in ZONE_MOVABLE to be there without administrator control.  I suspect
that a lot of users of such platforms actually will not use the feature,
and don't want to take the substantial penalty.

The other bit is that if you really, really want high reliability, memory
mirroring is the way to go; it is the only way you will be able to
hot-remove memory without needing a pre-event to migrate the
memory away from the affected node before the memory is offlined.

	-hpa




* RE: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-29 16:07           ` H. Peter Anvin
@ 2012-11-29 22:41             ` Luck, Tony
  -1 siblings, 0 replies; 170+ messages in thread
From: Luck, Tony @ 2012-11-29 22:41 UTC (permalink / raw)
  To: H. Peter Anvin, Mel Gorman
  Cc: Jiang Liu, Tang Chen, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, rientjes, rusty,
	linux-kernel, linux-mm, linux-doc, Len Brown, Wang, Frank

> The other bit is that if you really, really want high reliability, memory
> mirroring is the way to go; it is the only way you will be able to
> hot-remove memory without needing a pre-event to migrate the
> memory away from the affected node before the memory is offlined.

Some platforms don't support cross-node mirrors ... but we still want to
be able to remove a node.

-Tony



* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-29 22:41             ` Luck, Tony
@ 2012-11-29 22:45               ` H. Peter Anvin
  -1 siblings, 0 replies; 170+ messages in thread
From: H. Peter Anvin @ 2012-11-29 22:45 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Mel Gorman, Jiang Liu, Tang Chen, akpm, rob, isimatu.yasuaki,
	laijs, wency, linfeng, yinghai, kosaki.motohiro, minchan.kim,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Wang, Frank

On 11/29/2012 02:41 PM, Luck, Tony wrote:
>> The other bit is that if you really, really want high reliability, memory
>> mirroring is the way to go; it is the only way you will be able to
>> hot-remove memory without needing a pre-event to migrate the
>> memory away from the affected node before the memory is offlined.
> 
> Some platforms don't support cross-node mirrors ... but we still want to
> be able to remove a node.
> 

Yes, well, those platforms don't support that degree of "really really
high reliability", since the unannounced failure of the node controller
will bring down the system.

	-hpa





* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-29 11:00         ` Mel Gorman
@ 2012-11-30  2:56           ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-30  2:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: H. Peter Anvin, Luck, Tony, Tang Chen, akpm, rob,
	isimatu.yasuaki, laijs, wency, linfeng, yinghai, kosaki.motohiro,
	minchan.kim, rientjes, rusty, linux-kernel, linux-mm, linux-doc,
	Len Brown, Wang, Frank

Hi Mel,
	Thanks for your great comments!

On 2012-11-29 19:00, Mel Gorman wrote:
> On Wed, Nov 28, 2012 at 01:38:47PM -0800, H. Peter Anvin wrote:
>> On 11/28/2012 01:34 PM, Luck, Tony wrote:
>>>>
>>>> 2. use boot option
>>>>   This is our proposal. New boot option can specify memory range to use
>>>>   as movable memory.
>>>
>>> Isn't this just moving the work to the user? To pick good values for the
>>> movable areas, they need to know how the memory lines up across
>>> node boundaries ... because they need to make sure to allow some
>>> non-movable memory allocations on each node so that the kernel can
>>> take advantage of node locality.
>>>
>>> So the user would have to read at least the SRAT table, and perhaps
>>> more, to figure out what to provide as arguments.
>>>
>>> Since this is going to be used on a dynamic system where nodes might
>>> be added and removed - the right values for these arguments might
>>> change from one boot to the next. So even if the user gets them right
>>> on day 1, a month later when a new node has been added, or a broken
>>> node removed the values would be stale.
>>>
>>
>> I gave this feedback in person at LCE: I consider the kernel
>> configuration option to be useless for anything other than debugging.
>> Trying to promote it as an actual solution, to be used by end users in
>> the field, is ridiculous at best.
>>
> 
> I've not been paying a whole pile of attention to this because it's not an
> area I'm active in but I agree that configuring ZONE_MOVABLE like
> this at boot-time is going to be problematic. As awkward as it is, it
> would probably work out better to only boot with one node by default and
> then hot-add the nodes at runtime using either an online sysfs file or
> an online-reserved file that hot-adds the memory to ZONE_MOVABLE. Still
> clumsy but better than specifying addresses on the command line.
> 
> That said, I also find using ZONE_MOVABLE to be a problem in itself that
> will cause problems down the road. Maybe this was discussed already but
> just in case I'll describe the problems I see.
> 
> If any significant percentage of memory is in ZONE_MOVABLE then the memory
> hotplug people will have to deal with all the lowmem/highmem problems
> that used to be faced by 32-bit x86 with PAE enabled. As a simple example,
> metadata intensive workloads will not be able to use all of memory because
> the kernel allocations will be confined to a subset of memory. A more
> complex example is that page table page allocations are also restricted
> meaning it's possible that a process will not even be able to mmap() a high
> percentage of memory simply because it cannot allocate the page tables to
> store the mappings. ZONE_MOVABLE works up to a *point*, but it's a hack. It
> was a hack when it was introduced but at least then the expectation was
> that ZONE_MOVABLE was going to be used for huge pages and there at least
> an expectation that it would not be available for normal usage.
> 
> Fundamentally the reason one would want to use ZONE_MOVABLE is because
> we cannot migrate a lot of kernel memory -- slab pages, page table pages,
> device-allocated buffers etc.  My understanding is that other OS's get around
> this by requiring that subsystems and drivers have callbacks that allow the
> core VM to force certain memory to be released but that may be impractical
> for Linux. I don't know for sure though, this is just what I heard.
As far as I know, one other OS limits immovable pages to the low end, and
the limit grows on demand. But the drawback of this solution is a serious
performance drop (about 10% on average) because it essentially disables NUMA
optimization for kernel/DMA memory allocations.

> For Linux, the hotplug people need to start thinking about how to get
> around this migration problem. The first problem faced is the memory model
> and how it maps virt->phys addresses. We have a 1:1 mapping because it's
> fast but not because it's a fundamental requirement. Start considering
> what happens if the memory model is changed to allow some sections to have
> fast lookup for virt_to_phys and other sections to have slow lookups. On
> hotplug, try and empty all the sections. If the section cannot be emptied
> because of kernel pages then the section gets marked as "offline-migrated"
> or something. Stop the whole machine (yes, I mean stop_machine), copy
> those unmovable pages to another location, update the kernel virt->phys
> mapping for the section being offlined so the virt addresses point to the
> new physical addresses and resume.  Virt->phys lookups are going to be
> a lot slower because a full section lookup will be necessary every time
> effectively breaking SPARSE_VMEMMAP and there will be a performance penalty
> but it should work. This will cover some slab pages where the data is only
> accessed via the virtual address -- inode caches, dcache etc.
> 
> It will not work where the physical address is used. The obvious example
> is page table pages. For page tables, during stop machine you will have to
> walk all processes page tables looking for references to the page you're
> trying to move and update them. It is possible to just plain migrate
> page table pages but when it was last implemented years ago there was a
> constant performance penalty for everybody and it was not popular.  Taking a
> heavy-handed approach just during memory hot-remove might be more palatable.
> 
> For the remaining pages such as those that have been handed to devices
> or are pinned for DMA then your options become more limited. You may
> still have to restrict allocating these pages (where possible) to a
> region that cannot be hot-removed but at least this will be relatively
> few pages.
> 
> The big downside of this proposal is that it's unproven, not designed,
> would be extremely intrusive and I expect it would be a *massive* amount
> of development effort that will be difficult to get right. The upside is
> configuring it will be a lot easier because all you'll need is a variation
> of kernelcore= to reserve a percentage of memory for allocations we *really*
> cannot migrate because the physical pages are owned by a device that cannot
> release them, potentially forever. The other upside is that it does not
> hit crazy lowmem/highmem style problems.
> 
> ZONE_MOVABLE at least will allow a node to be removed very quickly but
> because it will paint you into a corner there should be a plan on what
> you're going to replace it with.

I have some thoughts here. The basic idea is that implementing a flexible
memory hotplug solution needs cooperation between the OS, BIOS and hardware.

As you mentioned, ZONE_MOVABLE is a quick but slightly dirty solution.
It's quick because we can rely on existing mechanisms to configure the
movable zone, and no changes to the memory model are needed.
It's a little dirty because:
1) We need to handle running out of immovable pages. The hotplug
implementation shouldn't cause extra service interruptions when normal zones
are under pressure; otherwise it would be a joke that service interruptions
were caused by a feature meant to improve service availability.
2) We still can't handle normal kernel pages used by the kernel, devices etc.
3) It may cause a serious performance drop if we configure all memory
on a NUMA node as ZONE_MOVABLE.

For the first issue, I think we could automatically convert pages
from movable zones into normal zones. Congyan from Fujitsu has provided
a patchset to manually convert pages from movable zones into normal zones;
I think we could extend that mechanism to convert automatically when
normal zones are under pressure, by hooking into the slow page allocation
path.

We rely on hardware features to solve the second and third issues.
Some new platforms provide a RAS feature called "hardware memory
migration", which transparently migrates memory from one memory device
to another. With hardware memory migration, we could configure one
memory device on a NUMA node to host the normal zone and the other memory
devices to host the movable zone. With this configuration there is no
performance drop because each NUMA node still has a local normal zone.
When trying to remove a memory device hosting the normal zone, we just
need to find another spare memory device and use hardware memory migration
to transparently migrate the memory contents to the spare one. The drawback
is a strong dependency on hardware features, so it's not a common
solution for all architectures.

Regards!
Gerry




* Re: [PATCH v2 0/5] Add movablecore_map boot option
@ 2012-11-30  2:56           ` Jiang Liu
  0 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-30  2:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: H. Peter Anvin, Luck, Tony, Tang Chen, akpm, rob,
	isimatu.yasuaki, laijs, wency, linfeng, yinghai, kosaki.motohiro,
	minchan.kim, rientjes, rusty, linux-kernel, linux-mm, linux-doc,
	Len Brown, Wang, Frank

Hi Mel,
	Thanks for your great comments!

On 2012-11-29 19:00, Mel Gorman wrote:
> On Wed, Nov 28, 2012 at 01:38:47PM -0800, H. Peter Anvin wrote:
>> On 11/28/2012 01:34 PM, Luck, Tony wrote:
>>>>
>>>> 2. use boot option
>>>>   This is our proposal. New boot option can specify memory range to use
>>>>   as movable memory.
>>>
>>> Isn't this just moving the work to the user? To pick good values for the
>>> movable areas, they need to know how the memory lines up across
>>> node boundaries ... because they need to make sure to allow some
>>> non-movable memory allocations on each node so that the kernel can
>>> take advantage of node locality.
>>>
>>> So the user would have to read at least the SRAT table, and perhaps
>>> more, to figure out what to provide as arguments.
>>>
>>> Since this is going to be used on a dynamic system where nodes might
>>> be added an removed - the right values for these arguments might
>>> change from one boot to the next. So even if the user gets them right
>>> on day 1, a month later when a new node has been added, or a broken
>>> node removed the values would be stale.
>>>
>>
>> I gave this feedback in person at LCE: I consider the kernel
>> configuration option to be useless for anything other than debugging.
>> Trying to promote it as an actual solution, to be used by end users in
>> the field, is ridiculous at best.
>>
> 
> I've not been paying a whole pile of attention to this because it's not an
> area I'm active in but I agree that configuring ZONE_MOVABLE like
> this at boot-time is going to be problematic. As awkward as it is, it
> would probably work out better to only boot with one node by default and
> then hot-add the nodes at runtime using either an online sysfs file or
> an online-reserved file that hot-adds the memory to ZONE_MOVABLE. Still
> clumsy but better than specifying addresses on the command line.
> 
> That said, I also find using ZONE_MOVABLE to be a problem in itself that
> will cause problems down the road. Maybe this was discussed already but
> just in case I'll describe the problems I see.
> 
> If any significant percentage of memory is in ZONE_MOVABLE then the memory
> hotplug people will have to deal with all the lowmem/highmem problems
> that used to be faced by 32-bit x86 with PAE enabled. As a simple example,
> metadata intensive workloads will not be able to use all of memory because
> the kernel allocations will be confined to a subset of memory. A more
> complex example is that page table page allocations are also restricted
> meaning it's possible that a process will not even be able to mmap() a high
> percentage of memory simply because it cannot allocate the page tables to
> store the mappings. ZONE_MOVABLE works up to a *point*, but it's a hack. It
> was a hack when it was introduced but at least then the expectation was
> that ZONE_MOVABLE was going to be used for huge pages and there at least
> an expectation that it would not be available for normal usage.
> 
> Fundamentally the reason one would want to use ZONE_MOVABLE is because
> we cannot migrate a lot of kernel memory -- slab pages, page table pages,
> device-allocated buffers etc.  My understanding is that other OS's get around
> this by requiring that subsystems and drivers have callbacks that allow the
> core VM to force certain memory to be released but that may be impractical
> for Linux. I don't know for sure though, this is just what I heard.
As I know, one other OS limits immovable pages at low end, and the limit
will increase on demand. But the drawback of this solution is serious
performance drop (average about 10%) because it essentially disable NUMA
optimization for kernel/DMA memory allocations.

> For Linux, the hotplug people need to start thinking about how to get
> around this migration problem. The first problem faced is the memory model
> and how it maps virt->phys addresses. We have a 1:1 mapping because it's
> fast but not because it's a fundamental requirement. Start considering
> what happens if the memory model is changed to allow some sections to have
> fast lookup for virt_to_phys and other sections to have slow lookups. On
> hotplug, try and empty all the sections. If the section cannot be emptied
> because of kernel pages then the section gets marked as "offline-migrated"
> or something. Stop the whole machine (yes, I mean stop_machine), copy
> those unmovable pages to another location, update the kernel virt->phys
> mapping for the section being offlined so the virt addresses point to the
> new physical addresses and resume.  Virt->phys lookups are going to be
> a lot slower because a full section lookup will be necessary every time
> effectively breaking SPARSE_VMEMMAP and there will be a performance penalty
> but it should work. This will cover some slab pages where the data is only
> accessed via the virtual address -- inode caches, dcache etc.
> 
> It will not work where the physical address is used. The obvious example
> is page table pages. For page tables, during stop machine you will have to
> walk all processes' page tables looking for references to the page you're
> trying to move and update them. It is possible to just plain migrate
> page table pages but when it was last implemented years ago there was a
> constant performance penalty for everybody and it was not popular.  Taking a
> heavy-handed approach just during memory hot-remove might be more palatable.
> 
> For the remaining pages such as those that have been handed to devices
> or are pinned for DMA then your options become more limited. You may
> still have to restrict allocating these pages (where possible) to a
> region that cannot be hot-removed but at least this will be relatively
> few pages.
> 
> The big downside of this proposal is that it's unproven, not designed,
> would be extremely intrusive and I expect it would be a *massive* amount
> of development effort that will be difficult to get right. The upside is
> configuring it will be a lot easier because all you'll need is a variation
> of kernelcore= to reserve a percentage of memory for allocations we *really*
> cannot migrate because the physical pages are owned by a device that cannot
> release them, potentially forever. The other upside is that it does not
> hit crazy lowmem/highmem style problems.
> 
> ZONE_MOVABLE will at least allow a node to be removed very quickly, but
> because it will paint you into a corner there should be a plan on what
> you're going to replace it with.
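The two-tier virt->phys lookup suggested above can be sketched as a
user-space toy model. To be clear, this is purely illustrative: the section
size, the direct-map base, and every name below are invented for the sketch;
none of them are existing kernel interfaces.

```c
/* Toy user-space model (NOT kernel code) of a two-tier virt->phys
 * lookup: most sections keep the usual linear offset (fast path); a
 * section whose pages were evacuated during hot-remove gets an
 * explicit remap entry and takes the slow path. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SECTION_SHIFT 27                         /* 128 MiB sections */
#define SECTION_SIZE  (1ULL << SECTION_SHIFT)
#define NR_SECTIONS   64
#define LINEAR_BASE   0xffff880000000000ULL      /* x86-64 direct-map base */

struct section_map {
	bool     remapped;   /* false: fast 1:1 path */
	uint64_t new_phys;   /* slow path: section's new physical base */
};

static struct section_map sections[NR_SECTIONS];

static uint64_t toy_virt_to_phys(uint64_t virt)
{
	uint64_t linear = virt - LINEAR_BASE;
	struct section_map *s = &sections[linear >> SECTION_SHIFT];

	if (!s->remapped)
		return linear;                           /* fast path */
	/* slow path: section was migrated, consult the remap table */
	return s->new_phys + (linear & (SECTION_SIZE - 1));
}

/* "hot-remove": pretend the section's contents were copied to new_phys */
static void toy_migrate_section(size_t idx, uint64_t new_phys)
{
	sections[idx].remapped = true;
	sections[idx].new_phys = new_phys;
}
```

What the sketch makes concrete is the cost: every lookup now pays a table
indirection, and a real implementation would have to keep this path lockless
and cheap, since virt_to_phys() sits on many hot paths; that is exactly the
performance penalty described above.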

I have some thoughts here. The basic idea is that it needs cooperation
between OS, BIOS and hardware to implement a flexible memory hotplug
solution.

As you have mentioned, ZONE_MOVABLE is a quick but somewhat dirty
solution. It's quick because we can rely on existing mechanisms to
configure the movable zone, with no changes to the memory model needed.
It's a little dirty because:
1) We need to handle running out of immovable pages. The hotplug
implementation shouldn't cause extra service interruptions when normal
zones are under pressure; it would be absurd for a feature meant to
improve service availability to itself cause service interruptions.
2) We still can't handle normal pages used by the kernel, devices, etc.
3) It may cause a serious performance drop if we configure all memory
on a NUMA node as ZONE_MOVABLE.

For the first issue, I think we could automatically convert pages
from movable zones into normal zones. Congyang from Fujitsu has provided
a patchset to manually convert pages from movable zones into normal zones;
I think we could extend that mechanism to convert automatically when
normal zones are under pressure, by hooking into the slow page allocation
path.
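Purely as an illustration (this is a toy model, not Congyang's actual
patchset or any real kernel interface; all names and the batch size are
made up), hooking the slow allocation path to convert movable capacity on
demand could look like:

```c
/* Toy user-space model of converting movable-zone capacity into
 * normal-zone capacity when a normal allocation would otherwise fail,
 * mimicking a hook in the slow allocation path. */
#include <assert.h>
#include <stdbool.h>

struct zone_model {
	long normal_free;    /* free pages usable for kernel allocations */
	long movable_free;   /* free pages reserved for movable use */
};

#define CONVERT_BATCH 32     /* pages converted per rebalance, arbitrary */

static bool alloc_normal_page(struct zone_model *z)
{
	if (z->normal_free == 0) {
		/* slow-path hook: shrink the movable pool, grow the
		 * normal pool, instead of failing the allocation */
		long grab = z->movable_free < CONVERT_BATCH
				? z->movable_free : CONVERT_BATCH;
		if (grab == 0)
			return false;        /* truly out of memory */
		z->movable_free -= grab;
		z->normal_free  += grab;
	}
	z->normal_free--;
	return true;
}
```

A real conversion would operate on whole pageblocks and adjust zone
spanned/present page accounting rather than simple counters, but the
trigger point, a failing normal-zone allocation, would be the same.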

We rely on hardware features to solve the second and third issues.
Some new platforms provide a RAS feature called "hardware memory
migration", which transparently migrates memory from one memory device
to another. With hardware memory migration, we could configure one
memory device on a NUMA node to host the normal zone and the other memory
devices to host the movable zone. This configuration avoids the
performance drop because each NUMA node still has a local normal zone.
When trying to remove a memory device hosting a normal zone, we just
need to find a spare memory device and use hardware memory migration
to transparently migrate the memory contents to it. The drawback is
the strong dependency on hardware features, so it is not a common
solution for all architectures.

Regards!
Gerry


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 170+ messages in thread

* RE: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-29 11:00         ` Mel Gorman
@ 2012-11-30  2:58           ` Luck, Tony
  -1 siblings, 0 replies; 170+ messages in thread
From: Luck, Tony @ 2012-11-30  2:58 UTC (permalink / raw)
  To: Mel Gorman, H. Peter Anvin
  Cc: Jiang Liu, Tang Chen, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, rientjes, rusty,
	linux-kernel, linux-mm, linux-doc, Len Brown, Wang, Frank

> If any significant percentage of memory is in ZONE_MOVABLE then the memory
> hotplug people will have to deal with all the lowmem/highmem problems
> that used to be faced by 32-bit x86 with PAE enabled. 

While these problems may still exist on large systems - I think it becomes
harder to construct workloads that run into them.  In those bad old days
a significant fraction of lowmem was consumed by the kernel ... so it was
pretty easy to find metadata-intensive workloads that would push it over
a cliff.  Here we are talking about systems with, say, 128GB per node divided
into 64GB movable and 64GB non-movable (and I'd regard this as a rather
low-end machine).  Unless the workload consists of zillions of tiny processes
all mapping shared memory blocks, the percentage of memory allocated to
the kernel is going to be tiny compared with the old 4GB days.
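A back-of-the-envelope check of the point above, assuming x86-64 with
4 KiB pages and 8-byte PTEs, so one 4 KiB last-level table (512 PTEs)
maps 2 MiB (the helper names are invented for this sketch):

```c
/* Rough page-table overhead estimate: how much memory the last-level
 * page tables themselves consume when mapping a given amount of RAM. */
#include <assert.h>
#include <stdint.h>

/* bytes of last-level page tables needed to map `mapped` bytes */
static uint64_t pte_overhead(uint64_t mapped)
{
	return mapped / 512;     /* one 4 KiB table maps 2 MiB */
}

/* that overhead as parts per million of total node memory */
static uint64_t overhead_ppm(uint64_t mapped, uint64_t total)
{
	return pte_overhead(mapped) * 1000000ULL / total;
}
```

Mapping all 64GB of movable memory once costs about 128MB of last-level
page tables, roughly 0.1% of a 128GB node, whereas in the 32-bit days the
same metadata had to fit into under 1GB of lowmem.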

-Tony


^ permalink raw reply	[flat|nested] 170+ messages in thread


* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-30  2:56           ` Jiang Liu
@ 2012-11-30  3:15             ` Yasuaki Ishimatsu
  -1 siblings, 0 replies; 170+ messages in thread
From: Yasuaki Ishimatsu @ 2012-11-30  3:15 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Mel Gorman, H. Peter Anvin, Luck, Tony, Tang Chen, akpm, rob,
	laijs, wency, linfeng, yinghai, kosaki.motohiro, minchan.kim,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Wang, Frank

Hi Jiang,

2012/11/30 11:56, Jiang Liu wrote:
> Hi Mel,
> 	Thanks for your great comments!
>
> On 2012-11-29 19:00, Mel Gorman wrote:
>> On Wed, Nov 28, 2012 at 01:38:47PM -0800, H. Peter Anvin wrote:
>>> On 11/28/2012 01:34 PM, Luck, Tony wrote:
>>>>>
>>>>> 2. use boot option
>>>>>    This is our proposal. New boot option can specify memory range to use
>>>>>    as movable memory.
>>>>
>>>> Isn't this just moving the work to the user? To pick good values for the
>>>> movable areas, they need to know how the memory lines up across
>>>> node boundaries ... because they need to make sure to allow some
>>>> non-movable memory allocations on each node so that the kernel can
>>>> take advantage of node locality.
>>>>
>>>> So the user would have to read at least the SRAT table, and perhaps
>>>> more, to figure out what to provide as arguments.
>>>>
>>>> Since this is going to be used on a dynamic system where nodes might
>>>> be added and removed - the right values for these arguments might
>>>> change from one boot to the next. So even if the user gets them right
>>>> on day 1, a month later when a new node has been added, or a broken
>>>> node removed the values would be stale.
>>>>
>>>
>>> I gave this feedback in person at LCE: I consider the kernel
>>> configuration option to be useless for anything other than debugging.
>>> Trying to promote it as an actual solution, to be used by end users in
>>> the field, is ridiculous at best.
>>>
>>
>> I've not been paying a whole pile of attention to this because it's not an
>> area I'm active in but I agree that configuring ZONE_MOVABLE like
>> this at boot-time is going to be problematic. As awkward as it is, it
>> would probably work out better to only boot with one node by default and
>> then hot-add the nodes at runtime using either an online sysfs file or
>> an online-reserved file that hot-adds the memory to ZONE_MOVABLE. Still
>> clumsy but better than specifying addresses on the command line.
>>
>> That said, I also find using ZONE_MOVABLE to be a problem in itself that
>> will cause problems down the road. Maybe this was discussed already but
>> just in case I'll describe the problems I see.
>>
>> If any significant percentage of memory is in ZONE_MOVABLE then the memory
>> hotplug people will have to deal with all the lowmem/highmem problems
>> that used to be faced by 32-bit x86 with PAE enabled. As a simple example,
>> metadata intensive workloads will not be able to use all of memory because
>> the kernel allocations will be confined to a subset of memory. A more
>> complex example is that page table page allocations are also restricted
>> meaning it's possible that a process will not even be able to mmap() a high
>> percentage of memory simply because it cannot allocate the page tables to
>> store the mappings. ZONE_MOVABLE works up to a *point*, but it's a hack. It
>> was a hack when it was introduced but at least then the expectation was
>> that ZONE_MOVABLE was going to be used for huge pages and there was at
>> least an expectation that it would not be available for normal usage.
>>
>> Fundamentally the reason one would want to use ZONE_MOVABLE is because
>> we cannot migrate a lot of kernel memory -- slab pages, page table pages,
>> device-allocated buffers etc.  My understanding is that other OS's get around
>> this by requiring that subsystems and drivers have callbacks that allow the
>> core VM to force certain memory to be released but that may be impractical
>> for Linux. I don't know for sure though, this is just what I heard.
> As I know, one other OS limits immovable pages at low end, and the limit
> will increase on demand. But the drawback of this solution is serious
> performance drop (average about 10%) because it essentially disable NUMA
> optimization for kernel/DMA memory allocations.
>
>> For Linux, the hotplug people need to start thinking about how to get
>> around this migration problem. The first problem faced is the memory model
>> and how it maps virt->phys addresses. We have a 1:1 mapping because it's
>> fast but not because it's a fundamental requirement. Start considering
>> what happens if the memory model is changed to allow some sections to have
>> fast lookup for virt_to_phys and other sections to have slow lookups. On
>> hotplug, try and empty all the sections. If the section cannot be emptied
>> because of kernel pages then the section gets marked as "offline-migrated"
>> or something. Stop the whole machine (yes, I mean stop_machine), copy
>> those unmovable pages to another location, update the kernel virt->phys
>> mapping for the section being offlined so the virt addresses point to the
>> new physical addresses and resume.  Virt->phys lookups are going to be
>> a lot slower because a full section lookup will be necessary every time
>> effectively breaking SPARSE_VMEMMAP and there will be a performance penalty
>> but it should work. This will cover some slab pages where the data is only
>> accessed via the virtual address -- inode caches, dcache etc.
>>
>> It will not work where the physical address is used. The obvious example
>> is page table pages. For page tables, during stop machine you will have to
>> walk all processes page tables looking for references to the page you're
>> trying to move and update them. It is possible to just plain migrate
>> page table pages but when it was last implemented years ago there was a
>> constant performance penalty for everybody and it was not popular.  Taking a
>> heavy-handed approach just during memory hot-remove might be more palatable.
>>
>> For the remaining pages such as those that have been handed to devices
>> or are pinned for DMA then your options become more limited. You may
>> still have to restrict allocating these pages (where possible) to a
>> region that cannot be hot-removed but at least this will be relatively
>> few pages.
>>
>> The big downside of this proposal is that it's unproven, not designed,
>> would be extremely intrusive and I expect it would be a *massive* amount
>> of development effort that will be difficult to get right. The upside is
>> configuring it will be a lot easier because all you'll need is a variation
>> of kernelcore= to reserve a percentage of memory for allocations we *really*
>> cannot migrate because the physical pages are owned by a device that cannot
>> release them, potentially forever. The other upside is that it does not
>> hit crazy lowmem/highmem style problems.
>>
>> ZONE_MOVABLE will at least allow a node to be removed very quickly, but
>> because it will paint you into a corner there should be a plan on what
>> you're going to replace it with.
>
> I have some thoughts here. The basic idea is that it needs cooperation
> between OS, BIOS and hardware to implement a flexible memory hotplug
> solution.
>
> As you have mentioned, ZONE_MOVABLE is a quick but a little dirty
> solution. It's quick because we could rely on existing mechanism
> to configure movable zone and no changes to the memory model needed.
> It's a little dirty because:
> 1) We need to handle cases of running out of immovable pages. The hotplug
> implementation shouldn't cause extra service interruption when normal zones
> are under pressure. Otherwise it's really a joke that some service
> interruptions are really caused by features trying to improve service
> availabilities.
> 2) We still can't handle normal kernel pages used by kernel, device etc.
> 3) It may cause serious performance drop if we configure all memory
> on a NUMA node as ZONE_MOVABLE.
>
> For the first issue, I think we could automatically convert pages
> from movable zones into normal zones. Congyan from Fujitsu has provided
> a patchset to manually convert pages from movable zones into normal zones,
> I think we could extend that mechanism to automatically convert when
> normal zones are under pressure by hooking into the slow page allocation
> path.
>
> We rely on hardware features to solve the second and third issues.
> Some new platforms provide a new RAS feature called "hardware memory
> migration", which transparent migrate memory from one memory device
> to another. With hardware memory migration, we could configure one
> memory device on a NUMA node to host normal zone, and the other memory
> devices to host movable zone. By this configuration, it won't cause
> performance drop because each NUMA node still has local normal zone.
> When trying to remove a memory device hosting normal zone, we just
> need to find another spare memory device and use hardware memory migration
> to transparently migrate memory content to the spare one. The drawback
> is we have strong dependency on hardware features so it's not a common
> solution for all architectures.

I agree with you. If the BIOS and hardware support memory hotplug, the OS
should use them. But if the OS cannot use them, we need to solve the
problem in the OS. I think our ZONE_MOVABLE-based proposal is a first
step toward supporting memory hotplug.

Thanks,
Yasuaki Ishimatsu

>
> Regards!
> Gerry
>
>



^ permalink raw reply	[flat|nested] 170+ messages in thread


* RE: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-30  2:58           ` Luck, Tony
@ 2012-11-30  3:28             ` H. Peter Anvin
  -1 siblings, 0 replies; 170+ messages in thread
From: H. Peter Anvin @ 2012-11-30  3:28 UTC (permalink / raw)
  To: Luck, Tony, Mel Gorman
  Cc: Jiang Liu, Tang Chen, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, rientjes, rusty,
	linux-kernel, linux-mm, linux-doc, Len Brown, Wang, Frank

Disk I/O is still a big consumer of lowmem.

"Luck, Tony" <tony.luck@intel.com> wrote:

>> If any significant percentage of memory is in ZONE_MOVABLE then the memory
>> hotplug people will have to deal with all the lowmem/highmem problems
>> that used to be faced by 32-bit x86 with PAE enabled.
>
>While these problems may still exist on large systems - I think it becomes
>harder to construct workloads that run into problems.  In those bad old days
>a significant fraction of lowmem was consumed by the kernel ... so it was
>pretty easy to find meta-data intensive workloads that would push it over
>a cliff.  Here we are talking about systems with say 128GB per node divided
>into 64GB moveable and 64GB non-moveable (and I'd regard this as a rather
>low-end machine).  Unless the workload consists of zillions of tiny processes
>all mapping shared memory blocks, the percentage of memory allocated to
>the kernel is going to be tiny compared with the old 4GB days.
>
>-Tony

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 170+ messages in thread


* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-28  4:08             ` Jiang Liu
@ 2012-11-30  9:20               ` Lai Jiangshan
  -1 siblings, 0 replies; 170+ messages in thread
From: Lai Jiangshan @ 2012-11-30  9:20 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Bob Liu, Tang Chen, hpa, akpm, rob, isimatu.yasuaki, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, m.szyprowski

On 11/28/2012 12:08 PM, Jiang Liu wrote:
> On 2012-11-28 11:24, Bob Liu wrote:
>> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen <tangchen@cn.fujitsu.com> wrote:
>>> On 11/27/2012 08:09 PM, Bob Liu wrote:
>>>>
>>>> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen<tangchen@cn.fujitsu.com>
>>>> wrote:
>>>>>
>>>>> Hi Liu,
>>>>>
>>>>>
>>>>> This feature is used in memory hotplug.
>>>>>
>>>>> In order to implement a whole node hotplug, we need to make sure the
>>>>> node contains no kernel memory, because memory used by kernel could
>>>>> not be migrated. (Since the kernel memory is directly mapped,
>>>>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>>>>
>>>>> User could specify all the memory on a node to be movable, so that the
>>>>> node could be hot-removed.
>>>>>
>>>>
>>>> Thank you for your explanation. It's reasonable.
>>>>
>>>> But I think it's a bit duplicated with CMA; I'm not sure, but maybe we
>>>> can combine it with CMA, which is already in mainline?
>>>>
>>> Hi Liu,
>>>
>>> Thanks for your advice. :)
>>>
>>> CMA is Contiguous Memory Allocator, right?  What I'm trying to do is
>>> controlling where is the start of ZONE_MOVABLE of each node. Could
>>> CMA do this job ?
>>
>> CMA will not control the start of ZONE_MOVABLE of each node, but it
>> can declare a memory area that is always movable, and non-movable
>> allocation requests will never land in that area.
>>
>> Currently CMA uses the boot parameter "cma=" to declare a memory size
>> that is always movable.
>> I think it might fulfill your requirement if the boot parameter were
>> extended with a start address.
>>
>> more info at http://lwn.net/Articles/468044/
>>>
>>> And also, after a short investigation, CMA seems to be based on
>>> memblock. But we need to prevent memblock from allocating memory in
>>> ZONE_MOVABLE. As a result, we need to know the ranges before memblock
>>> can be used. I'm afraid we still need an approach to get the ranges,
>>> such as a boot option, or static ACPI tables such as SRAT/MPST.
>>>
>>
>> Yes, it's based on memblock and with boot option.
>> In setup_arch32()
>>     dma_contiguous_reserve(0);   => will declare a cma area using
>> memblock_reserve()
>>
>>> I don't know much about CMA for now. So if you have any better idea,
>>> please share it with us, thanks. :)
>>
>> My idea is to reuse CMA as in the patch below (not even compiled) and
>> boot with "cma=size@start_address".
>> I don't know whether it can work or whether it suits your requirement;
>> if not, forgive me for the noise.
>>
>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>> index 612afcc..564962a 100644
>> --- a/drivers/base/dma-contiguous.c
>> +++ b/drivers/base/dma-contiguous.c
>> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
>>   */
>>  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>>  static long size_cmdline = -1;
>> +static long cma_start_cmdline = -1;
>>
>>  static int __init early_cma(char *p)
>>  {
>> +       char *oldp;
>>         pr_debug("%s(%s)\n", __func__, p);
>> +       oldp = p;
>>         size_cmdline = memparse(p, &p);
>> +
>> +       if (*p == '@')
>> +               cma_start_cmdline = memparse(p+1, &p);
>> +       printk("cma size: 0x%lx, start: 0x%lx\n", size_cmdline, cma_start_cmdline);
>>         return 0;
>>  }
>>  early_param("cma", early_cma);
>> @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>>         if (selected_size) {
>>                 pr_debug("%s: reserving %ld MiB for global area\n", __func__,
>>                          selected_size / SZ_1M);
>> -
>> -               dma_declare_contiguous(NULL, selected_size, 0, limit);
>> +               if (cma_start_cmdline != -1)
>> +                       dma_declare_contiguous(NULL, selected_size,
>> +                                              cma_start_cmdline, limit);
>> +               else
>> +                       dma_declare_contiguous(NULL, selected_size, 0, limit);
>>         }
>>  };
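
A self-contained userspace sketch of the "size@start" parsing convention
the patch above implements (my_memparse() and parse_size_at_start() are
hypothetical stand-ins for the kernel's memparse() and early_cma(), not
actual kernel code):

```c
#include <stdlib.h>

/* Minimal userspace stand-in for the kernel's memparse():
 * parse a number with an optional K/M/G binary suffix. */
static unsigned long long my_memparse(const char *p, char **retp)
{
	unsigned long long v = strtoull(p, retp, 0);

	switch (**retp) {
	case 'G': case 'g':
		v <<= 10;	/* fall through */
	case 'M': case 'm':
		v <<= 10;	/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*retp)++;
		break;
	default:
		break;
	}
	return v;
}

/* Parse "size@start", as in cma=64M@0x20000000.
 * Returns 0 on success, -1 if no "@start" part is present
 * (size is still filled in, matching the patch's behavior). */
int parse_size_at_start(const char *arg, unsigned long long *size,
			unsigned long long *start)
{
	char *p;

	*size = my_memparse(arg, &p);
	if (*p != '@')
		return -1;
	*start = my_memparse(p + 1, &p);
	return 0;
}
```

With input "64M@0x20000000" this yields size 0x4000000 and start
0x20000000, which is what dma_declare_contiguous() would then receive.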
> Seems a good idea to reserve memory by reusing the CMA logic, though it
> needs more investigation. One of CMA's goals is to ensure pages in CMA
> are really movable, and at first glance this patchset tries to achieve
> the same goal.
> 

The approach is already implemented: https://lkml.org/lkml/2012/7/4/145
(it adds a new MIGRATE_HOTREMOVE type rather than reusing MIGRATE_CMA)

MIGRATE_HOTREMOVE and MIGRATE_CMA both have this problem:
https://lkml.org/lkml/2012/7/5/83

R.I.P for this idea.

zone->managed_pages (which you proposed, but which accounts for neither
MIGRATE_HOTREMOVE nor MIGRATE_CMA) plus a proxy zone (handling all of the
node's MIGRATE_HOTREMOVE, MIGRATE_CMA and ZONE_MOVABLE pages)
may be a good idea.

Thanks,
Lai



* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-30  2:58           ` Luck, Tony
@ 2012-11-30 10:19             ` Glauber Costa
  -1 siblings, 0 replies; 170+ messages in thread
From: Glauber Costa @ 2012-11-30 10:19 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Mel Gorman, H. Peter Anvin, Jiang Liu, Tang Chen, akpm, rob,
	isimatu.yasuaki, laijs, wency, linfeng, yinghai, kosaki.motohiro,
	minchan.kim, rientjes, rusty, linux-kernel, linux-mm, linux-doc,
	Len Brown, Wang, Frank

On 11/30/2012 06:58 AM, Luck, Tony wrote:
>> If any significant percentage of memory is in ZONE_MOVABLE then the memory
>> hotplug people will have to deal with all the lowmem/highmem problems
>> that used to be faced by 32-bit x86 with PAE enabled. 
> 
> While these problems may still exist on large systems - I think it becomes
> harder to construct workloads that run into problems.  In those bad old days
> a significant fraction of lowmem was consumed by the kernel ... so it was
> pretty easy to find meta-data intensive workloads that would push it over
> a cliff.  Here we  are talking about systems with say 128GB per node divided
> into 64GB moveable and 64GB non-moveable (and I'd regard this as a rather
> low-end machine).  Unless the workload consists of zillions of tiny processes
> all mapping shared memory blocks, the percentage of memory allocated to
> the kernel is going to be tiny compared with the old 4GB days.
> 

Which is a perfectly common workload for containers, where you can have
hundreds of machines (per node) sold to third parties, many of them
consuming every single bit of metadata they can.






* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-30  2:58           ` Luck, Tony
@ 2012-11-30 10:52             ` Mel Gorman
  -1 siblings, 0 replies; 170+ messages in thread
From: Mel Gorman @ 2012-11-30 10:52 UTC (permalink / raw)
  To: Luck, Tony
  Cc: H. Peter Anvin, Jiang Liu, Tang Chen, akpm, rob, isimatu.yasuaki,
	laijs, wency, linfeng, yinghai, kosaki.motohiro, minchan.kim,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Wang, Frank

On Fri, Nov 30, 2012 at 02:58:40AM +0000, Luck, Tony wrote:
> > If any significant percentage of memory is in ZONE_MOVABLE then the memory
> > hotplug people will have to deal with all the lowmem/highmem problems
> > that used to be faced by 32-bit x86 with PAE enabled. 
> 
> While these problems may still exist on large systems - I think it becomes
> harder to construct workloads that run into problems.  In those bad old days
> a significant fraction of lowmem was consumed by the kernel ... so it was
> pretty easy to find meta-data intensive workloads that would push it over
> a cliff.  Here we  are talking about systems with say 128GB per node divided
> into 64GB moveable and 64GB non-moveable (and I'd regard this as a rather
> low-end machine).  Unless the workload consists of zillions of tiny processes
> all mapping shared memory blocks, the percentage of memory allocated to
> the kernel is going to be tiny compared with the old 4GB days.
> 

Sure, if that's how the end-user decides to configure it. My concern is
what they'll do is configure node-0 to be ZONE_NORMAL and all other nodes
to be ZONE_MOVABLE -- 3 to 1 ratio "highmem" to "lowmem" effectively on
a 4-node machine or 7 to 1 on an 8-node. It'll be harder than it was in
the old days to trigger the problems but it'll still be possible and it
will generate bug reports down the road. Some will be obvious at least --
OOM killer triggered for GFP_KERNEL with plenty of free memory but all in
ZONE_MOVABLE. Others will be less obvious -- major stalls during IO tests
while ramping up with large amounts of reclaim activity visible even though
only 20-40% of memory is in use.
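
The ratios above follow from dedicating one of n equally sized nodes to
ZONE_NORMAL and the remaining n-1 to ZONE_MOVABLE; a trivial illustration
(hypothetical helper, not kernel code):

```c
/* Effective "highmem":"lowmem" ratio when one of n equally sized
 * nodes hosts ZONE_NORMAL and the other n-1 host ZONE_MOVABLE. */
int movable_to_normal_ratio(int nodes)
{
	return nodes - 1;	/* (n-1) movable nodes : 1 normal node */
}
```

So a 4-node machine gives 3:1 and an 8-node machine 7:1, which is where
the comparison with the old 32-bit PAE highmem/lowmem split comes from.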

I'm not even getting into the impact this has on NUMA performance.

I'm not saying that ZONE_MOVABLE will not work. It will and it'll work
in the short-term but it's far from being a great long-term solution and
it is going to generate bug reports that will have to be supported by
distributions. Even if the interface to how it is configured gets ironed
out there still should be a replacement plan in place. FWIW, I dislike the
command-line configuration option. If it was me, I would have gone with
starting a machine with memory mostly off-lined and used sysfs files or
different sysfs strings written to the "online" file to determine if a
section was ZONE_MOVABLE or the next best alternative.

-- 
Mel Gorman
SUSE Labs



* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-30  3:15             ` Yasuaki Ishimatsu
@ 2012-11-30 15:36               ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-11-30 15:36 UTC (permalink / raw)
  To: Yasuaki Ishimatsu
  Cc: Jiang Liu, Mel Gorman, H. Peter Anvin, Luck, Tony, Tang Chen,
	akpm, rob, laijs, wency, linfeng, yinghai, kosaki.motohiro,
	minchan.kim, rientjes, rusty, linux-kernel, linux-mm, linux-doc,
	Len Brown, Wang, Frank

On 11/30/2012 11:15 AM, Yasuaki Ishimatsu wrote:
> Hi Jiang,
> 
>>
>> For the first issue, I think we could automatically convert pages
>> from movable zones into normal zones. Congyan from Fujitsu has provided
>> a patchset to manually convert pages from movable zones into normal zones,
>> I think we could extend that mechanism to automatically convert when
>> normal zones are under pressure by hooking into the slow page allocation
>> path.
>>
>> We rely on hardware features to solve the second and third issues.
>> Some new platforms provide a new RAS feature called "hardware memory
>> migration", which transparently migrates memory from one memory device
>> to another. With hardware memory migration, we could configure one
>> memory device on a NUMA node to host normal zone, and the other memory
>> devices to host movable zone. By this configuration, it won't cause
>> performance drop because each NUMA node still has local normal zone.
>> When trying to remove a memory device hosting normal zone, we just
>> need to find another spare memory device and use hardware memory migration
>> to transparently migrate memory content to the spare one. The drawback
>> is we have strong dependency on hardware features so it's not a common
>> solution for all architectures.
> 
> I agree with you. If the BIOS and hardware support memory hotplug, the
> OS should use them. But if the OS cannot use them, we need to solve it
> in the OS. I think that our proposal, which uses ZONE_MOVABLE, is a
> first step toward supporting memory hotplug.
Hi Yasuaki,
	It's true, we should start with the first step and then improve it.
Regards!
Gerry




* Re: [PATCH v2 0/5] Add movablecore_map boot option
  2012-11-29  2:25       ` Jiang Liu
@ 2012-11-30 22:27         ` Toshi Kani
  -1 siblings, 0 replies; 170+ messages in thread
From: Toshi Kani @ 2012-11-30 22:27 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Jaegeuk Hanse, Tang Chen, hpa, akpm, rob, isimatu.yasuaki, laijs,
	wency, linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Tony Luck, Wang, Frank

On Thu, 2012-11-29 at 10:25 +0800, Jiang Liu wrote:
> On 2012-11-29 9:42, Jaegeuk Hanse wrote:
> > On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
> >> Hi all,
> >> 	This seems a great chance to discuss the memory hotplug feature
> >> within this thread. So I will try to give some high-level thoughts about memory
> >> hotplug feature on x86/IA64. Any comments are welcomed!
> >> 	First of all, I think usability really matters. Ideally, the memory hotplug
> >> feature should just work out of the box, and we shouldn't expect administrators
> >> to add several extra platform-dependent parameters to enable memory hotplug.
> >> But how do we enable memory (or CPU/node) hotplug out of the box? I think the
> >> key point is to cooperate with BIOS/ACPI/firmware/device management teams.
> >> 	I still position memory hotplug as an advanced feature for high end 
> >> servers and those systems may/should provide some management interfaces to 
> >> configure CPU/memory/node hotplug features. The configuration UI may be provided
> >> by BIOS, BMC or centralized system management suite. Once administrator enables
> >> hotplug feature through those management UI, OS should support system device
> >> hotplug out of the box. For example, the HP SuperDome2 management suite provides
> >> an interface to configure a node as a floating (hot-removable) node. And
> >> OpenSolaris supports CPU/memory hotplug out of the box without any extra
> >> configuration. So we should
> >> shape interfaces between firmware and OS to better support system device hotplug.

Well described.  I agree with you.  I am also OK to have the boot option
for the time being, but we should be able to get the info from ACPI for
better TCE.

> >> 	On the other hand, I think there are no commercially available x86/IA64
> >> platforms with system device hotplug capabilities in the field yet, at least only
> >> limited quantity if any. So backward compatibility is not a big issue for us now.

HP SuperDome is IA64-based and supports node hotplug when running with
HP-UX.  It implements vendor-unique ACPI interface to describe movable
memory ranges.

> >> So I think it's doable to rely on firmware to provide better support for system
> >> device hotplug.
> >> 	Then what should be enhanced to better support system device hotplug?
> >>
> >> 1) ACPI specification should be enhanced to provide a static table to describe
> >> components with hotplug features, so OS could reserve special resources for
> >> hotplug at early boot stages. For example, to reserve enough CPU ids for CPU
> >> hot-add. Currently we guess the maximum number of CPUs supported by the platform
> >> by counting CPU entries in the APIC table, which is not reliable.

Right.  HP SuperDome implements vendor-unique ACPI interface for this as
well.  For Linux, it is nice to have a standard interface defined.

> >> 2) BIOS should implement SRAT, MPST and PMTT tables to better support memory
> >> hotplug. SRAT associates memory ranges with proximity domains with an extra
> >> "hotpluggable" flag. PMTT provides memory device topology information, such
> >> as "socket->memory controller->DIMM". MPST is used for memory power management
> >> and provides a way to associate memory ranges with memory devices in PMTT.
> >> With all information from SRAT, MPST and PMTT, OS could figure out hotplug
> >> memory ranges automatically, so no extra kernel parameters needed.

I agree that using SRAT is a good compromise.  The hotpluggable flag is
supposed to indicate the platform's capability, but could be used for this
purpose until we have a better interface defined.
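
A minimal sketch of consuming that flag, with the field layout following
the ACPI 5.0 Memory Affinity Structure (section 5.2.16.2); the struct and
macro names here are illustrative rather than the kernel's, and the spec's
low/high dword pairs are folded into 64-bit fields for brevity:

```c
#include <stdint.h>

/* SRAT Memory Affinity Structure, ACPI 5.0 sec. 5.2.16.2 (40 bytes). */
struct srat_mem_affinity {
	uint8_t  type;			/* 1 = memory affinity */
	uint8_t  length;		/* 40 */
	uint32_t proximity_domain;
	uint16_t reserved1;
	uint64_t base_address;		/* base low/high dwords combined */
	uint64_t range_length;		/* length low/high dwords combined */
	uint32_t reserved2;
	uint32_t flags;
	uint64_t reserved3;
} __attribute__((packed));

#define SRAT_MEM_ENABLED	(1u << 0)	/* flags bit 0 */
#define SRAT_MEM_HOT_PLUGGABLE	(1u << 1)	/* flags bit 1 */
#define SRAT_MEM_NON_VOLATILE	(1u << 2)	/* flags bit 2 */

/* A range the firmware marks enabled and hot-pluggable is a candidate
 * for ZONE_MOVABLE placement at boot. */
int range_is_hotpluggable(const struct srat_mem_affinity *m)
{
	uint32_t want = SRAT_MEM_ENABLED | SRAT_MEM_HOT_PLUGGABLE;

	return (m->flags & want) == want;
}
```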

> >> 3) Enhance ACPICA to provide a method to scan static ACPI tables before
> >> memory subsystem has been initialized because OS need to access SRAT,
> >> MPST and PMTT when initializing memory subsystem.

I do not think this is an ACPICA issue.  HP-UX also uses ACPICA, and can
access ACPI tables and walk ACPI namespace during early boot-time.  This
is achieved by the acpi_os layer to use special early boot-time memory
allocator at early boot-time.  Therefore, boot-time and hot-add config
code are very consistent in HP-UX.

> >> 4) The last and the most important issue is how to minimize performance
> >> drop caused by memory hotplug. As proposed by this patchset, once we
> >> configure all memory of a NUMA node as movable, it essentially disable
> >> NUMA optimization of kernel memory allocation from that node. According
> >> to experience, that will cause huge performance drop. We have observed
> >> 10-30% performance drop with memory hotplug enabled. And on another
> >> OS the average performance drop caused by memory hotplug is about 10%.
> >> If we can't resolve the performance drop, memory hotplug is just a feature
> >> for demo:( With help from hardware, we do have some chances to reduce
> >> performance penalty caused by memory hotplug.
> >> 	As we know, Linux can migrate movable pages, but can't migrate
> >> non-movable pages used by the kernel/DMA etc. And the hardest part is
> >> how to deal with those unmovable pages when hot-removing a memory device.
> >> Now hardware has given us a hand with a technology named memory migration,
> >> which could transparently migrate memory between memory devices. There's
> >> no OS visible changes except NUMA topology before and after hardware memory
> >> migration.
> >> 	And if there are multiple memory devices within a NUMA node,
> >> we could configure some memory devices to host unmovable memory and the
> >> other to host movable memory. With this configuration, there won't be
> >> bigger performance drop because we have preserved all NUMA optimizations.
> >> We also could achieve memory hotplug remove by:
> >> 1) Use existing page migration mechanism to reclaim movable pages.
> >> 2) For memory devices hosting unmovable pages, we need:
> >> 2.1) find a movable memory device on other nodes with enough capacity
> >> and reclaim it.
> >> 2.2) use hardware migration technology to migrate unmovable memory to
> >> the just reclaimed memory device on other nodes.
> >>
> >> 	I hope we could expect users to adopt memory hotplug technology
> >> with all these implemented.
> >>
> >> 	Back to this patch, we could rely on the mechanism provided
> >> by it to automatically mark memory ranges as movable with information
> >> from ACPI SRAT/MPST/PMTT tables. So we don't need administrators to
> >> manually configure kernel parameters to enable memory hotplug.

Right.

Thanks,
-Toshi




* Re: [PATCH v2 0/5] Add movablecore_map boot option
@ 2012-11-30 22:27         ` Toshi Kani
  0 siblings, 0 replies; 170+ messages in thread
From: Toshi Kani @ 2012-11-30 22:27 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Jaegeuk Hanse, Tang Chen, hpa, akpm, rob, isimatu.yasuaki, laijs,
	wency, linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, Len Brown,
	Tony Luck, Wang, Frank

On Thu, 2012-11-29 at 10:25 +0800, Jiang Liu wrote:
> On 2012-11-29 9:42, Jaegeuk Hanse wrote:
> > On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
> >> Hi all,
> >> 	Seems it's a great chance to discuss about the memory hotplug feature
> >> within this thread. So I will try to give some high level thoughts about memory
> >> hotplug feature on x86/IA64. Any comments are welcomed!
> >> 	First of all, I think usability really matters. Ideally, memory hotplug
> >> feature should just work out of box, and we shouldn't expect administrators to 
> >> add several extra platform dependent parameters to enable memory hotplug. 
> >> But how to enable memory (or CPU/node) hotplug out of box? I think the key point
> >> is to cooperate with BIOS/ACPI/firmware/device management teams. 
> >> 	I still position memory hotplug as an advanced feature for high end 
> >> servers and those systems may/should provide some management interfaces to 
> >> configure CPU/memory/node hotplug features. The configuration UI may be provided
> >> by BIOS, BMC or centralized system management suite. Once administrator enables
> >> hotplug feature through those management UI, OS should support system device
> >> hotplug out of box. For example, HP SuperDome2 management suite provides interface
> >> to configure a node as floating node(hot-removable). And OpenSolaris supports
> >> CPU/memory hotplug out of box without any extra configurations. So we should
> >> shape interfaces between firmware and OS to better support system device hotplug.

Well described.  I agree with you.  I am also OK with having the boot option
for the time being, but we should eventually be able to get this information
from ACPI for a better TCE.

> >> 	On the other hand, I think there are no commercial available x86/IA64
> >> platforms with system device hotplug capabilities in the field yet, at least only
> >> limited quantity if any. So backward compatibility is not a big issue for us now.

HP SuperDome is IA64-based and supports node hotplug when running HP-UX.
It implements a vendor-unique ACPI interface to describe movable memory
ranges.

> >> So I think it's doable to rely on firmware to provide better support for system
> >> device hotplug.
> >> 	Then what should be enhanced to better support system device hotplug?
> >>
> >> 1) ACPI specification should be enhanced to provide a static table to describe
> >> components with hotplug features, so OS could reserve special resources for
> >> hotplug at early boot stages. For example, to reserve enough CPU ids for CPU
> >> hot-add. Currently we guess maximum number of CPUs supported by the platform
> >> by counting CPU entries in APIC table, that's not reliable.

Right.  HP SuperDome implements a vendor-unique ACPI interface for this as
well.  For Linux, it would be nice to have a standard interface defined.

> >> 2) BIOS should implement SRAT, MPST and PMTT tables to better support memory
> >> hotplug. SRAT associates memory ranges with proximity domains with an extra
> >> "hotpluggable" flag. PMTT provides memory device topology information, such
> >> as "socket->memory controller->DIMM". MPST is used for memory power management
> >> and provides a way to associate memory ranges with memory devices in PMTT.
> >> With all information from SRAT, MPST and PMTT, OS could figure out hotplug
> >> memory ranges automatically, so no extra kernel parameters needed.

I agree that using SRAT is a good compromise.  The hotpluggable flag is
supposed to indicate the platform's capability, but it could be used for
this purpose until we have a better interface defined.

> >> 3) Enhance ACPICA to provide a method to scan static ACPI tables before
> >> memory subsystem has been initialized because OS need to access SRAT,
> >> MPST and PMTT when initializing memory subsystem.

I do not think this is an ACPICA issue.  HP-UX also uses ACPICA, and it can
access ACPI tables and walk the ACPI namespace during early boot.  This is
achieved by having the acpi_os layer use a special early boot-time memory
allocator.  As a result, the boot-time and hot-add configuration code paths
are very consistent in HP-UX.

> >> 4) The last and the most important issue is how to minimize performance
> >> drop caused by memory hotplug. As proposed by this patchset, once we
> >> configure all memory of a NUMA node as movable, it essentially disable
> >> NUMA optimization of kernel memory allocation from that node. According
> >> to experience, that will cause huge performance drop. We have observed
> >> 10-30% performance drop with memory hotplug enabled. And on another
> >> OS the average performance drop caused by memory hotplug is about 10%.
> >> If we can't resolve the performance drop, memory hotplug is just a feature
> >> for demo:( With help from hardware, we do have some chances to reduce
> >> performance penalty caused by memory hotplug.
> >> 	As we know, Linux could migrate movable page, but can't migrate
> >> non-movable pages used by kernel/DMA etc. And the most hard part is how
> >> to deal with those unmovable pages when hot-removing a memory device.
> >> Now hardware has given us a hand with a technology named memory migration,
> >> which could transparently migrate memory between memory devices. There's
> >> no OS visible changes except NUMA topology before and after hardware memory
> >> migration.
> >> 	And if there are multiple memory devices within a NUMA node,
> >> we could configure some memory devices to host unmovable memory and the
> >> other to host movable memory. With this configuration, there won't be
> >> bigger performance drop because we have preserved all NUMA optimizations.
> >> We also could achieve memory hotplug remove by:
> >> 1) Use existing page migration mechanism to reclaim movable pages.
> >> 2) For memory devices hosting unmovable pages, we need:
> >> 2.1) find a movable memory device on other nodes with enough capacity
> >> and reclaim it.
> >> 2.2) use hardware migration technology to migrate unmovable memory to
> >> the just reclaimed memory device on other nodes.
> >>
> >> 	I hope we could expect users to adopt memory hotplug technology
> >> with all these implemented.
> >>
> >> 	Back to this patch, we could rely on the mechanism provided
> >> by it to automatically mark memory ranges as movable with information
> >> from ACPI SRAT/MPST/PMTT tables. So we don't need administrator to
> >> manually configure kernel parameters to enable memory hotplug.

Right.

Thanks,
-Toshi


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 1/5] x86: get pg_data_t's memory from other node
  2012-11-23 10:44   ` Tang Chen
@ 2012-12-02 15:11     ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-12-02 15:11 UTC (permalink / raw)
  To: Tang Chen
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 11/23/2012 06:44 PM, Tang Chen wrote:
> From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> 
> If system can create movable node which all memory of the
> node is allocated as ZONE_MOVABLE, setup_node_data() cannot
> allocate memory for the node's pg_data_t.
> So when memblock_alloc_nid() fails, setup_node_data() retries
> memblock_alloc().
> 
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> ---
>  arch/x86/mm/numa.c |   11 ++++++++---
>  1 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 2d125be..734bbd2 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -224,9 +224,14 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
>  	} else {
>  		nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
>  		if (!nd_pa) {
> -			pr_err("Cannot find %zu bytes in node %d\n",
> -			       nd_size, nid);
> -			return;
> +			pr_warn("Cannot find %zu bytes in node %d\n",
> +				nd_size, nid);
> +			nd_pa = memblock_alloc(nd_size, SMP_CACHE_BYTES);
> +			if (!nd_pa) {
> +				pr_err("Cannot find %zu bytes in other node\n",
> +				       nd_size);
> +				return;
> +			}
Hi Tang,
	It seems memblock_alloc_try_nid() serves the same purpose, so you may
simply replace memblock_alloc_nid() with memblock_alloc_try_nid().

Regards!
Gerry

>  		}
>  		nd = __va(nd_pa);
>  	}
> 


^ permalink raw reply	[flat|nested] 170+ messages in thread


* Re: [PATCH v2 4/5] page_alloc: Make movablecore_map has higher priority
  2012-11-23 10:44   ` Tang Chen
@ 2012-12-05 15:43     ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-12-05 15:43 UTC (permalink / raw)
  To: Tang Chen
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

If we make "movablecore_map" take precedence over "movablecore/kernelcore",
the logic could be simplified. I don't think it is worth supporting both
"movablecore_map" and "movablecore/kernelcore" at the same time.

On 11/23/2012 06:44 PM, Tang Chen wrote:
> If kernelcore or movablecore is specified at the same time
> with movablecore_map, movablecore_map will have higher
> priority to be satisfied.
> This patch will make find_zone_movable_pfns_for_nodes()
> calculate zone_movable_pfn[] with the limit from
> zone_movable_limit[].
> 
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
> ---
>  mm/page_alloc.c |   35 +++++++++++++++++++++++++++++++----
>  1 files changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f23d76a..05bafbb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4800,12 +4800,25 @@ static void __init find_zone_movable_pfns_for_nodes(void)
>  		required_kernelcore = max(required_kernelcore, corepages);
>  	}
>  
> -	/* If kernelcore was not specified, there is no ZONE_MOVABLE */
> -	if (!required_kernelcore)
> +	/*
> +	 * No matter kernelcore/movablecore was limited or not, movable_zone
> +	 * should always be set to a usable zone index.
> +	 */
> +	find_usable_zone_for_movable();
> +
> +	/*
> +	 * If neither kernelcore/movablecore nor movablecore_map is specified,
> +	 * there is no ZONE_MOVABLE. But if movablecore_map is specified, the
> +	 * start pfn of ZONE_MOVABLE has been stored in zone_movable_limit[].
> +	 */
> +	if (!required_kernelcore) {
> +		if (movablecore_map.nr_map)
> +			memcpy(zone_movable_pfn, zone_movable_limit,
> +				sizeof(zone_movable_pfn));
>  		goto out;
> +	}
>  
>  	/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
> -	find_usable_zone_for_movable();
>  	usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
>  
>  restart:
> @@ -4833,10 +4846,24 @@ restart:
>  		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>  			unsigned long size_pages;
>  
> +			/*
> +			 * Find more memory for kernelcore in
> +			 * [zone_movable_pfn[nid], zone_movable_limit[nid]).
> +			 */
>  			start_pfn = max(start_pfn, zone_movable_pfn[nid]);
>  			if (start_pfn >= end_pfn)
>  				continue;
>  
> +			if (zone_movable_limit[nid]) {
> +				end_pfn = min(end_pfn, zone_movable_limit[nid]);
> +				/* No range left for kernelcore in this node */
> +				if (start_pfn >= end_pfn) {
> +					zone_movable_pfn[nid] =
> +							zone_movable_limit[nid];
> +					break;
> +				}
> +			}
> +
>  			/* Account for what is only usable for kernelcore */
>  			if (start_pfn < usable_startpfn) {
>  				unsigned long kernel_pages;
> @@ -4896,12 +4923,12 @@ restart:
>  	if (usable_nodes && required_kernelcore > usable_nodes)
>  		goto restart;
>  
> +out:
>  	/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
>  	for (nid = 0; nid < MAX_NUMNODES; nid++)
>  		zone_movable_pfn[nid] =
>  			roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
>  
> -out:
>  	/* restore the node_state */
>  	node_states[N_HIGH_MEMORY] = saved_node_state;
>  }
> 


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 3/5] page_alloc: Introduce zone_movable_limit[] to keep movable limit for nodes
  2012-11-23 10:44   ` Tang Chen
@ 2012-12-05 15:46     ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-12-05 15:46 UTC (permalink / raw)
  To: Tang Chen
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 11/23/2012 06:44 PM, Tang Chen wrote:
> This patch introduces a new array zone_movable_limit[] to store the
> ZONE_MOVABLE limit from movablecore_map boot option for all nodes.
> The function sanitize_zone_movable_limit() will find out to which
> node the ranges in movable_map.map[] belongs, and calculates the
> low boundary of ZONE_MOVABLE for each node.
> 
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
> ---
>  mm/page_alloc.c |   55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 55 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fb5cf12..f23d76a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -206,6 +206,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
>  static unsigned long __initdata required_kernelcore;
>  static unsigned long __initdata required_movablecore;
>  static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
> +static unsigned long __meminitdata zone_movable_limit[MAX_NUMNODES];
>  
>  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
>  int movable_zone;
> @@ -4323,6 +4324,55 @@ static unsigned long __meminit zone_absent_pages_in_node(int nid,
>  	return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
>  }
>  
> +/**
> + * sanitize_zone_movable_limit - Sanitize the zone_movable_limit array.
> + *
> + * zone_movable_limit is initialized as 0. This function will try to get
> + * the first ZONE_MOVABLE pfn of each node from movablecore_map, and
> + * assigne them to zone_movable_limit.
> + * zone_movable_limit[nid] == 0 means no limit for the node.
> + *
> + * Note: Each range is represented as [start_pfn, end_pfn)
> + */
> +static void __meminit sanitize_zone_movable_limit(void)
> +{
> +	int map_pos = 0, i, nid;
> +	unsigned long start_pfn, end_pfn;
> +
> +	if (!movablecore_map.nr_map)
> +		return;
> +
> +	/* Iterate all ranges from minimum to maximum */
> +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> +		/*
> +		 * If we have found lowest pfn of ZONE_MOVABLE of the node
> +		 * specified by user, just go on to check next range.
> +		 */
> +		if (zone_movable_limit[nid])
> +			continue;
Low memory needs special handling here on systems with highmem; otherwise
this will configure both lowmem and highmem as ZONE_MOVABLE.

> +
> +		while (map_pos < movablecore_map.nr_map) {
> +			if (end_pfn <= movablecore_map.map[map_pos].start)
> +				break;
> +
> +			if (start_pfn >= movablecore_map.map[map_pos].end) {
> +				map_pos++;
> +				continue;
> +			}
> +
> +			/*
> +			 * The start_pfn of ZONE_MOVABLE is either the minimum
> +			 * pfn specified by movablecore_map, or 0, which means
> +			 * the node has no ZONE_MOVABLE.
> +			 */
> +			zone_movable_limit[nid] = max(start_pfn,
> +					movablecore_map.map[map_pos].start);
> +
> +			break;
> +		}
> +	}
> +}
> +
>  #else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>  static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
>  					unsigned long zone_type,
> @@ -4341,6 +4391,10 @@ static inline unsigned long __meminit zone_absent_pages_in_node(int nid,
>  	return zholes_size[zone_type];
>  }
>  
> +static void __meminit sanitize_zone_movable_limit(void)
> +{
> +}
> +
>  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>  
>  static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
> @@ -4906,6 +4960,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
>  
>  	/* Find the PFNs that ZONE_MOVABLE begins at in each node */
>  	memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
> +	sanitize_zone_movable_limit();
>  	find_zone_movable_pfns_for_nodes();
>  
>  	/* Print out the zone ranges */
> 


^ permalink raw reply	[flat|nested] 170+ messages in thread


* Re: [PATCH v2 3/5] page_alloc: Introduce zone_movable_limit[] to keep movable limit for nodes
  2012-12-05 15:46     ` Jiang Liu
@ 2012-12-06  1:20       ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-12-06  1:20 UTC (permalink / raw)
  To: Jiang Liu
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 12/05/2012 11:46 PM, Jiang Liu wrote:
> On 11/23/2012 06:44 PM, Tang Chen wrote:
>> This patch introduces a new array zone_movable_limit[] to store the
>> ZONE_MOVABLE limit from movablecore_map boot option for all nodes.
>> The function sanitize_zone_movable_limit() will find out to which
>> node the ranges in movable_map.map[] belongs, and calculates the
>> low boundary of ZONE_MOVABLE for each node.
>>
>> Signed-off-by: Tang Chen<tangchen@cn.fujitsu.com>
>> Reviewed-by: Wen Congyang<wency@cn.fujitsu.com>
>> Reviewed-by: Lai Jiangshan<laijs@cn.fujitsu.com>
>> Tested-by: Lin Feng<linfeng@cn.fujitsu.com>
>> ---
>>   mm/page_alloc.c |   55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 files changed, 55 insertions(+), 0 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index fb5cf12..f23d76a 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -206,6 +206,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
>>   static unsigned long __initdata required_kernelcore;
>>   static unsigned long __initdata required_movablecore;
>>   static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
>> +static unsigned long __meminitdata zone_movable_limit[MAX_NUMNODES];
>>
>>   /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
>>   int movable_zone;
>> @@ -4323,6 +4324,55 @@ static unsigned long __meminit zone_absent_pages_in_node(int nid,
>>   	return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
>>   }
>>
>> +/**
>> + * sanitize_zone_movable_limit - Sanitize the zone_movable_limit array.
>> + *
>> + * zone_movable_limit is initialized as 0. This function will try to get
>> + * the first ZONE_MOVABLE pfn of each node from movablecore_map, and
>> + * assigne them to zone_movable_limit.
>> + * zone_movable_limit[nid] == 0 means no limit for the node.
>> + *
>> + * Note: Each range is represented as [start_pfn, end_pfn)
>> + */
>> +static void __meminit sanitize_zone_movable_limit(void)
>> +{
>> +	int map_pos = 0, i, nid;
>> +	unsigned long start_pfn, end_pfn;
>> +
>> +	if (!movablecore_map.nr_map)
>> +		return;
>> +
>> +	/* Iterate all ranges from minimum to maximum */
>> +	for_each_mem_pfn_range(i, MAX_NUMNODES,&start_pfn,&end_pfn,&nid) {
>> +		/*
>> +		 * If we have found lowest pfn of ZONE_MOVABLE of the node
>> +		 * specified by user, just go on to check next range.
>> +		 */
>> +		if (zone_movable_limit[nid])
>> +			continue;
> Need special handling of low memory here on systems with highmem, otherwise
> it will cause us to configure both lowmem and highmem as movable_zone.

Hi Liu,

Yes, and also the DMA address checking you mentioned before.

Thanks. :)

>
>> +
>> +		while (map_pos<  movablecore_map.nr_map) {
>> +			if (end_pfn<= movablecore_map.map[map_pos].start)
>> +				break;
>> +
>> +			if (start_pfn>= movablecore_map.map[map_pos].end) {
>> +				map_pos++;
>> +				continue;
>> +			}
>> +
>> +			/*
>> +			 * The start_pfn of ZONE_MOVABLE is either the minimum
>> +			 * pfn specified by movablecore_map, or 0, which means
>> +			 * the node has no ZONE_MOVABLE.
>> +			 */
>> +			zone_movable_limit[nid] = max(start_pfn,
>> +					movablecore_map.map[map_pos].start);
>> +
>> +			break;
>> +		}
>> +	}
>> +}
>> +
>>   #else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>>   static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
>>   					unsigned long zone_type,
>> @@ -4341,6 +4391,10 @@ static inline unsigned long __meminit zone_absent_pages_in_node(int nid,
>>   	return zholes_size[zone_type];
>>   }
>>
>> +static void __meminit sanitize_zone_movable_limit(void)
>> +{
>> +}
>> +
>>   #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>>
>>   static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
>> @@ -4906,6 +4960,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
>>
>>   	/* Find the PFNs that ZONE_MOVABLE begins at in each node */
>>   	memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
>> +	sanitize_zone_movable_limit();
>>   	find_zone_movable_pfns_for_nodes();
>>
>>   	/* Print out the zone ranges */
>>
>
>


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 3/5] page_alloc: Introduce zone_movable_limit[] to keep movable limit for nodes
@ 2012-12-06  1:20       ` Tang Chen
  0 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-12-06  1:20 UTC (permalink / raw)
  To: Jiang Liu
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 12/05/2012 11:46 PM, Jiang Liu wrote:
> On 11/23/2012 06:44 PM, Tang Chen wrote:
>> This patch introduces a new array zone_movable_limit[] to store the
>> ZONE_MOVABLE limit from movablecore_map boot option for all nodes.
>> The function sanitize_zone_movable_limit() will find out to which
>> node the ranges in movable_map.map[] belongs, and calculates the
>> low boundary of ZONE_MOVABLE for each node.
>>
>> Signed-off-by: Tang Chen<tangchen@cn.fujitsu.com>
>> Reviewed-by: Wen Congyang<wency@cn.fujitsu.com>
>> Reviewed-by: Lai Jiangshan<laijs@cn.fujitsu.com>
>> Tested-by: Lin Feng<linfeng@cn.fujitsu.com>
>> ---
>>   mm/page_alloc.c |   55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 files changed, 55 insertions(+), 0 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index fb5cf12..f23d76a 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -206,6 +206,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
>>   static unsigned long __initdata required_kernelcore;
>>   static unsigned long __initdata required_movablecore;
>>   static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
>> +static unsigned long __meminitdata zone_movable_limit[MAX_NUMNODES];
>>
>>   /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
>>   int movable_zone;
>> @@ -4323,6 +4324,55 @@ static unsigned long __meminit zone_absent_pages_in_node(int nid,
>>   	return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
>>   }
>>
>> +/**
>> + * sanitize_zone_movable_limit - Sanitize the zone_movable_limit array.
>> + *
>> + * zone_movable_limit is initialized as 0. This function will try to get
>> + * the first ZONE_MOVABLE pfn of each node from movablecore_map, and
>> + * assign them to zone_movable_limit.
>> + * zone_movable_limit[nid] == 0 means no limit for the node.
>> + *
>> + * Note: Each range is represented as [start_pfn, end_pfn)
>> + */
>> +static void __meminit sanitize_zone_movable_limit(void)
>> +{
>> +	int map_pos = 0, i, nid;
>> +	unsigned long start_pfn, end_pfn;
>> +
>> +	if (!movablecore_map.nr_map)
>> +		return;
>> +
>> +	/* Iterate all ranges from minimum to maximum */
>> +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
>> +		/*
>> +		 * If we have found lowest pfn of ZONE_MOVABLE of the node
>> +		 * specified by user, just go on to check next range.
>> +		 */
>> +		if (zone_movable_limit[nid])
>> +			continue;
> Need special handling of low memory here on systems with highmem, otherwise
> it will cause us to configure both lowmem and highmem as movable_zone.

Hi Liu,

Yes, and also the DMA address checking you mentioned before.

Thanks. :)
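
For readers following along, the overlap scan the quoted patch performs can be
sketched outside the kernel as follows. This is a standalone illustration, not
the kernel code: the struct, function names, and PFN values are assumptions,
and it keeps the same quirk that 0 means "no limit".

```c
#include <assert.h>

struct pfn_range { unsigned long start, end; };   /* [start, end) */

/*
 * For each node's memory span, find the lowest PFN that falls inside a
 * user-specified movablecore_map range.  mem[] is indexed by node id;
 * map[] holds the user ranges, sorted ascending and non-overlapping,
 * as movablecore_map guarantees.  limit[nid] == 0 means no limit.
 */
static void sanitize_limits(const struct pfn_range *mem, int nr_nodes,
			    const struct pfn_range *map, int nr_map,
			    unsigned long *limit)
{
	int map_pos = 0;

	for (int nid = 0; nid < nr_nodes; nid++) {
		if (limit[nid])
			continue;	/* lowest movable PFN already found */
		while (map_pos < nr_map) {
			if (mem[nid].end <= map[map_pos].start)
				break;	/* user range lies above this node */
			if (mem[nid].start >= map[map_pos].end) {
				map_pos++;	/* below this node: try next */
				continue;
			}
			/* Overlap: ZONE_MOVABLE starts at the higher start. */
			limit[nid] = (mem[nid].start > map[map_pos].start) ?
				     mem[nid].start : map[map_pos].start;
			break;
		}
	}
}
```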

>
>> +
>> +		while (map_pos < movablecore_map.nr_map) {
>> +			if (end_pfn <= movablecore_map.map[map_pos].start)
>> +				break;
>> +
>> +			if (start_pfn >= movablecore_map.map[map_pos].end) {
>> +				map_pos++;
>> +				continue;
>> +			}
>> +
>> +			/*
>> +			 * The start_pfn of ZONE_MOVABLE is either the minimum
>> +			 * pfn specified by movablecore_map, or 0, which means
>> +			 * the node has no ZONE_MOVABLE.
>> +			 */
>> +			zone_movable_limit[nid] = max(start_pfn,
>> +					movablecore_map.map[map_pos].start);
>> +
>> +			break;
>> +		}
>> +	}
>> +}
>> +
>>   #else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>>   static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
>>   					unsigned long zone_type,
>> @@ -4341,6 +4391,10 @@ static inline unsigned long __meminit zone_absent_pages_in_node(int nid,
>>   	return zholes_size[zone_type];
>>   }
>>
>> +static void __meminit sanitize_zone_movable_limit(void)
>> +{
>> +}
>> +
>>   #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>>
>>   static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
>> @@ -4906,6 +4960,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
>>
>>   	/* Find the PFNs that ZONE_MOVABLE begins at in each node */
>>   	memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
>> +	sanitize_zone_movable_limit();
>>   	find_zone_movable_pfns_for_nodes();
>>
>>   	/* Print out the zone ranges */
>>
>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>

^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 4/5] page_alloc: Make movablecore_map has higher priority
  2012-12-05 15:43     ` Jiang Liu
@ 2012-12-06  1:26       ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-12-06  1:26 UTC (permalink / raw)
  To: Jiang Liu
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 12/05/2012 11:43 PM, Jiang Liu wrote:
> If we make "movablecore_map" take precedence over "movablecore/kernelcore",
> the logic could be simplified. I think it's not so attractive to support
> both "movablecore_map" and "movablecore/kernelcore" at the same time.

Hi Liu,

Thanks for your advice. :)

Memory hotplug needs different support on different hardware. We are
trying to figure out a way to satisfy as many users as we can.
Since it is a little difficult, it may take some time. :)

But I still think we need a boot option to support it. It is just a matter
of making it easier to use. :)

Thanks. :)

>
> On 11/23/2012 06:44 PM, Tang Chen wrote:
>> If kernelcore or movablecore is specified at the same time
>> with movablecore_map, movablecore_map will have higher
>> priority to be satisfied.
>> This patch will make find_zone_movable_pfns_for_nodes()
>> calculate zone_movable_pfn[] with the limit from
>> zone_movable_limit[].
>>
>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>> Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
>> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
>> Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
>> ---
>>   mm/page_alloc.c |   35 +++++++++++++++++++++++++++++++----
>>   1 files changed, 31 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index f23d76a..05bafbb 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4800,12 +4800,25 @@ static void __init find_zone_movable_pfns_for_nodes(void)
>>   		required_kernelcore = max(required_kernelcore, corepages);
>>   	}
>>
>> -	/* If kernelcore was not specified, there is no ZONE_MOVABLE */
>> -	if (!required_kernelcore)
>> +	/*
>> +	 * No matter kernelcore/movablecore was limited or not, movable_zone
>> +	 * should always be set to a usable zone index.
>> +	 */
>> +	find_usable_zone_for_movable();
>> +
>> +	/*
>> +	 * If neither kernelcore/movablecore nor movablecore_map is specified,
>> +	 * there is no ZONE_MOVABLE. But if movablecore_map is specified, the
>> +	 * start pfn of ZONE_MOVABLE has been stored in zone_movable_limit[].
>> +	 */
>> +	if (!required_kernelcore) {
>> +		if (movablecore_map.nr_map)
>> +			memcpy(zone_movable_pfn, zone_movable_limit,
>> +				sizeof(zone_movable_pfn));
>>   		goto out;
>> +	}
>>
>>   	/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
>> -	find_usable_zone_for_movable();
>>   	usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
>>
>>   restart:
>> @@ -4833,10 +4846,24 @@ restart:
>>   		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>>   			unsigned long size_pages;
>>
>> +			/*
>> +			 * Find more memory for kernelcore in
>> +			 * [zone_movable_pfn[nid], zone_movable_limit[nid]).
>> +			 */
>>   			start_pfn = max(start_pfn, zone_movable_pfn[nid]);
>>   			if (start_pfn >= end_pfn)
>>   				continue;
>>
>> +			if (zone_movable_limit[nid]) {
>> +				end_pfn = min(end_pfn, zone_movable_limit[nid]);
>> +				/* No range left for kernelcore in this node */
>> +				if (start_pfn >= end_pfn) {
>> +					zone_movable_pfn[nid] =
>> +							zone_movable_limit[nid];
>> +					break;
>> +				}
>> +			}
>> +
>>   			/* Account for what is only usable for kernelcore */
>>   			if (start_pfn < usable_startpfn) {
>>   				unsigned long kernel_pages;
>> @@ -4896,12 +4923,12 @@ restart:
>>   	if (usable_nodes && required_kernelcore > usable_nodes)
>>   		goto restart;
>>
>> +out:
>>   	/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
>>   	for (nid = 0; nid < MAX_NUMNODES; nid++)
>>   		zone_movable_pfn[nid] =
>>   			roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
>>
>> -out:
>>   	/* restore the node_state */
>>   	node_states[N_HIGH_MEMORY] = saved_node_state;
>>   }
>>
>
>


^ permalink raw reply	[flat|nested] 170+ messages in thread


* Re: [PATCH v2 4/5] page_alloc: Make movablecore_map has higher priority
  2012-12-06  1:26       ` Tang Chen
@ 2012-12-06  2:26         ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-12-06  2:26 UTC (permalink / raw)
  To: Tang Chen
  Cc: Jiang Liu, hpa, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

[-- Attachment #1: Type: text/plain, Size: 4428 bytes --]

On 2012-12-6 9:26, Tang Chen wrote:
> On 12/05/2012 11:43 PM, Jiang Liu wrote:
>> If we make "movablecore_map" take precedence over "movablecore/kernelcore",
>> the logic could be simplified. I think it's not so attractive to support
>> both "movablecore_map" and "movablecore/kernelcore" at the same time.
> 
> Hi Liu,
> 
> Thanks for you advice. :)
> 
> Memory hotplug needs different support on different hardware. We are
> trying to figure out a way to satisfy as many users as we can.
> Since it is a little difficult, it may take sometime. :)
> 
> But I still think we need a boot option to support it. Just a metter of
> how to make it easier to use. :)
> 
> Thanks. :)
> 
>>
>> On 11/23/2012 06:44 PM, Tang Chen wrote:
>>> If kernelcore or movablecore is specified at the same time
>>> with movablecore_map, movablecore_map will have higher
>>> priority to be satisfied.
>>> This patch will make find_zone_movable_pfns_for_nodes()
>>> calculate zone_movable_pfn[] with the limit from
>>> zone_movable_limit[].
>>>
>>> Signed-off-by: Tang Chen<tangchen@cn.fujitsu.com>
>>> Reviewed-by: Wen Congyang<wency@cn.fujitsu.com>
>>> Reviewed-by: Lai Jiangshan<laijs@cn.fujitsu.com>
>>> Tested-by: Lin Feng<linfeng@cn.fujitsu.com>
>>> ---
>>>   mm/page_alloc.c |   35 +++++++++++++++++++++++++++++++----
>>>   1 files changed, 31 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index f23d76a..05bafbb 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -4800,12 +4800,25 @@ static void __init find_zone_movable_pfns_for_nodes(void)
>>>           required_kernelcore = max(required_kernelcore, corepages);
>>>       }
>>>
>>> -    /* If kernelcore was not specified, there is no ZONE_MOVABLE */
>>> -    if (!required_kernelcore)
>>> +    /*
>>> +     * No matter kernelcore/movablecore was limited or not, movable_zone
>>> +     * should always be set to a usable zone index.
>>> +     */
>>> +    find_usable_zone_for_movable();
>>> +
>>> +    /*
>>> +     * If neither kernelcore/movablecore nor movablecore_map is specified,
>>> +     * there is no ZONE_MOVABLE. But if movablecore_map is specified, the
>>> +     * start pfn of ZONE_MOVABLE has been stored in zone_movable_limit[].
>>> +     */
>>> +    if (!required_kernelcore) {
>>> +        if (movablecore_map.nr_map)
>>> +            memcpy(zone_movable_pfn, zone_movable_limit,
>>> +                sizeof(zone_movable_pfn));
>>>           goto out;
>>> +    }
>>>
>>>       /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
>>> -    find_usable_zone_for_movable();
>>>       usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
>>>
>>>   restart:
>>> @@ -4833,10 +4846,24 @@ restart:
>>>           for_each_mem_pfn_range(i, nid,&start_pfn,&end_pfn, NULL) {
>>>               unsigned long size_pages;
>>>
>>> +            /*
>>> +             * Find more memory for kernelcore in
>>> +             * [zone_movable_pfn[nid], zone_movable_limit[nid]).
>>> +             */
>>>               start_pfn = max(start_pfn, zone_movable_pfn[nid]);
>>>               if (start_pfn>= end_pfn)
>>>                   continue;
>>>
>>> +            if (zone_movable_limit[nid]) {
>>> +                end_pfn = min(end_pfn, zone_movable_limit[nid]);
>>> +                /* No range left for kernelcore in this node */
>>> +                if (start_pfn>= end_pfn) {
>>> +                    zone_movable_pfn[nid] =
>>> +                            zone_movable_limit[nid];
>>> +                    break;
>>> +                }
>>> +            }
Hi Tang,
	I've just removed the above logic, so the implementation is greatly
simplified. Please refer to the attachment.
Regards!
Gerry

>>> +
>>>               /* Account for what is only usable for kernelcore */
>>>               if (start_pfn<  usable_startpfn) {
>>>                   unsigned long kernel_pages;
>>> @@ -4896,12 +4923,12 @@ restart:
>>>       if (usable_nodes&&  required_kernelcore>  usable_nodes)
>>>           goto restart;
>>>
>>> +out:
>>>       /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
>>>       for (nid = 0; nid<  MAX_NUMNODES; nid++)
>>>           zone_movable_pfn[nid] =
>>>               roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
>>>
>>> -out:
>>>       /* restore the node_state */
>>>       node_states[N_HIGH_MEMORY] = saved_node_state;
>>>   }
>>>
>>
>>
> 
> 
> .
> 


[-- Attachment #2: 0003-page_alloc-Introduce-zone_movable_limit-to-keep-mova.patch --]
[-- Type: text/x-patch, Size: 4379 bytes --]

>From 120759daa8410e1bf61d19210ddeb52ef32d002a Mon Sep 17 00:00:00 2001
From: Jiang Liu <jiang.liu@huawei.com>
Date: Wed, 5 Dec 2012 23:58:42 +0800
Subject: [PATCH 3/6] page_alloc: Introduce zone_movable_limit[] to keep
 movable limit for nodes

This patch introduces a new array zone_movable_limit[] to store the
ZONE_MOVABLE limit from movablecore_map boot option for all nodes.
The function sanitize_zone_movable_limit() will find out to which
node the ranges in movable_map.map[] belongs, and calculates the
low boundary of ZONE_MOVABLE for each node.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>

page_alloc: Make movablecore_map has higher priority

If kernelcore or movablecore is specified at the same time
with movablecore_map, movablecore_map will have higher
priority to be satisfied.
This patch will make find_zone_movable_pfns_for_nodes()
calculate zone_movable_pfn[] with the limit from
zone_movable_limit[].

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
---
 mm/page_alloc.c |   66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e35ee27..41c3b51 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4338,6 +4338,60 @@ static unsigned long __meminit zone_absent_pages_in_node(int nid,
 	return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
 }
 
+/**
+ * Try to figure out zone_movable_pfn[] from movablecore_map.
+ */
+static int __init find_zone_movable_from_movablecore_map(void)
+{
+	int map_pos = 0, i, nid;
+	unsigned long start_pfn, end_pfn;
+
+	if (!movablecore_map.nr_map)
+		return 0;
+
+	/* Iterate all ranges from minimum to maximum */
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
+		/*
+		 * If we have found lowest pfn of ZONE_MOVABLE of the node
+		 * specified by user, just go on to check next range.
+		 */
+		if (zone_movable_pfn[nid])
+			continue;
+
+#ifdef CONFIG_HIGHMEM
+		/* Skip lowmem if ZONE_MOVABLE is highmem */
+		if (zone_movable_is_highmem() &&
+		    start_pfn < arch_zone_lowest_possible_pfn[ZONE_HIGHMEM])
+			start_pfn = arch_zone_lowest_possible_pfn[ZONE_HIGHMEM];
+		if (start_pfn > end_pfn)
+			continue;
+#endif
+
+		while (map_pos < movablecore_map.nr_map) {
+			if (end_pfn < movablecore_map.map[map_pos].start)
+				break;
+
+			if (start_pfn > movablecore_map.map[map_pos].end) {
+				map_pos++;
+				continue;
+			}
+
+			/*
+			 * The start_pfn of ZONE_MOVABLE is either the minimum
+			 * pfn specified by movablecore_map, or 0, which means
+			 * the node has no ZONE_MOVABLE.
+			 */
+			start_pfn = max(start_pfn,
+					movablecore_map.map[map_pos].start);
+			zone_movable_pfn[nid] = roundup(zone_movable_pfn[nid],
+							MAX_ORDER_NR_PAGES);
+			break;
+		}
+	}
+
+	return 1;
+}
+
 #else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
 					unsigned long zone_type,
@@ -4356,6 +4410,11 @@ static inline unsigned long __meminit zone_absent_pages_in_node(int nid,
 	return zholes_size[zone_type];
 }
 
+static int __init find_zone_movable_from_movablecore_map(void)
+{
+	return 0;
+}
+
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
@@ -4739,6 +4798,12 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 	unsigned long totalpages = early_calculate_totalpages();
 	int usable_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
 
+	find_usable_zone_for_movable();
+
+	/* movablecore_map takes precedence over movablecore/kernelcore */
+	if (find_zone_movable_from_movablecore_map())
+		return;
+
 	/*
 	 * If movablecore was specified, calculate what size of
 	 * kernelcore that corresponds so that memory usable for
@@ -4766,7 +4831,6 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 		goto out;
 
 	/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
-	find_usable_zone_for_movable();
 	usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
 
 restart:
-- 
1.7.9.5

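
The control flow the attached patch sets up — movablecore_map short-circuiting
the kernelcore/movablecore balancing — can be sketched in a standalone form.
All names below are stand-ins for the kernel symbols, and the globals are
simplified assumptions, not the kernel's actual state:

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 4

/* Hypothetical stand-ins for the kernel's globals. */
static unsigned long zone_movable_pfn[MAX_NODES];
static unsigned long map_derived_pfn[MAX_NODES];	/* from movablecore_map */
static int nr_map;					/* user ranges given? */
static int balancing_ran;				/* did the fallback run? */

/*
 * Returns 1 when movablecore_map decided the layout outright, so the
 * caller can skip the kernelcore/movablecore balancing entirely.
 */
static int movable_from_map(void)
{
	if (!nr_map)
		return 0;
	memcpy(zone_movable_pfn, map_derived_pfn, sizeof(zone_movable_pfn));
	return 1;
}

static void find_zone_movable_pfns(void)
{
	if (movable_from_map())
		return;			/* movablecore_map takes precedence */
	balancing_ran = 1;		/* kernelcore/movablecore path here */
}
```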

^ permalink raw reply related	[flat|nested] 170+ messages in thread


* Re: [PATCH v2 4/5] page_alloc: Make movablecore_map has higher priority
  2012-12-06  2:26         ` Jiang Liu
@ 2012-12-06  2:51           ` Jianguo Wu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jianguo Wu @ 2012-12-06  2:51 UTC (permalink / raw)
  To: Tang Chen
  Cc: Jiang Liu, Jiang Liu, hpa, akpm, rob, isimatu.yasuaki, laijs,
	wency, linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

Hi Tang,

There is a bug in Gerry's patch; please apply this patch on top to fix it.

---
 mm/page_alloc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 41c3b51..d981810 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4383,7 +4383,7 @@ static int __init find_zone_movable_from_movablecore_map(void)
 			 */
 			start_pfn = max(start_pfn,
 					movablecore_map.map[map_pos].start);
-			zone_movable_pfn[nid] = roundup(zone_movable_pfn[nid],
+			zone_movable_pfn[nid] = roundup(start_pfn,
 							MAX_ORDER_NR_PAGES);
 			break;
 		}
-- 
1.7.6.1
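
A standalone sketch of why this one-liner matters: at that point in the loop,
zone_movable_pfn[nid] is still zero, so rounding it up discards the start_pfn
just computed, leaving the node with no movable limit. The MAX_ORDER_NR_PAGES
value below is an assumption (the common order-10 default), not taken from the
thread:

```c
#include <assert.h>

#define MAX_ORDER_NR_PAGES (1UL << 10)	/* assumed order-10 default */
#define roundup(x, y) ((((x) + (y) - 1) / (y)) * (y))

/* Before the fix: rounds zone_movable_pfn[nid], which is still 0 here,
 * so the computed start_pfn is silently discarded. */
static unsigned long buggy_limit(unsigned long zone_movable_pfn_nid,
				 unsigned long start_pfn)
{
	(void)start_pfn;
	return roundup(zone_movable_pfn_nid, MAX_ORDER_NR_PAGES);
}

/* After the fix: aligns the computed start up to MAX_ORDER_NR_PAGES. */
static unsigned long fixed_limit(unsigned long start_pfn)
{
	return roundup(start_pfn, MAX_ORDER_NR_PAGES);
}
```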

On 2012/12/6 10:26, Jiang Liu wrote:

> On 2012-12-6 9:26, Tang Chen wrote:
>> On 12/05/2012 11:43 PM, Jiang Liu wrote:
>>> If we make "movablecore_map" take precedence over "movablecore/kernelcore",
>>> the logic could be simplified. I think it's not so attractive to support
>>> both "movablecore_map" and "movablecore/kernelcore" at the same time.
>>
>> Hi Liu,
>>
>> Thanks for you advice. :)
>>
>> Memory hotplug needs different support on different hardware. We are
>> trying to figure out a way to satisfy as many users as we can.
>> Since it is a little difficult, it may take sometime. :)
>>
>> But I still think we need a boot option to support it. Just a metter of
>> how to make it easier to use. :)
>>
>> Thanks. :)
>>
>>>
>>> On 11/23/2012 06:44 PM, Tang Chen wrote:
>>>> If kernelcore or movablecore is specified at the same time
>>>> with movablecore_map, movablecore_map will have higher
>>>> priority to be satisfied.
>>>> This patch will make find_zone_movable_pfns_for_nodes()
>>>> calculate zone_movable_pfn[] with the limit from
>>>> zone_movable_limit[].
>>>>
>>>> Signed-off-by: Tang Chen<tangchen@cn.fujitsu.com>
>>>> Reviewed-by: Wen Congyang<wency@cn.fujitsu.com>
>>>> Reviewed-by: Lai Jiangshan<laijs@cn.fujitsu.com>
>>>> Tested-by: Lin Feng<linfeng@cn.fujitsu.com>
>>>> ---
>>>>   mm/page_alloc.c |   35 +++++++++++++++++++++++++++++++----
>>>>   1 files changed, 31 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index f23d76a..05bafbb 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -4800,12 +4800,25 @@ static void __init find_zone_movable_pfns_for_nodes(void)
>>>>           required_kernelcore = max(required_kernelcore, corepages);
>>>>       }
>>>>
>>>> -    /* If kernelcore was not specified, there is no ZONE_MOVABLE */
>>>> -    if (!required_kernelcore)
>>>> +    /*
>>>> +     * No matter kernelcore/movablecore was limited or not, movable_zone
>>>> +     * should always be set to a usable zone index.
>>>> +     */
>>>> +    find_usable_zone_for_movable();
>>>> +
>>>> +    /*
>>>> +     * If neither kernelcore/movablecore nor movablecore_map is specified,
>>>> +     * there is no ZONE_MOVABLE. But if movablecore_map is specified, the
>>>> +     * start pfn of ZONE_MOVABLE has been stored in zone_movable_limit[].
>>>> +     */
>>>> +    if (!required_kernelcore) {
>>>> +        if (movablecore_map.nr_map)
>>>> +            memcpy(zone_movable_pfn, zone_movable_limit,
>>>> +                sizeof(zone_movable_pfn));
>>>>           goto out;
>>>> +    }
>>>>
>>>>       /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
>>>> -    find_usable_zone_for_movable();
>>>>       usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
>>>>
>>>>   restart:
>>>> @@ -4833,10 +4846,24 @@ restart:
>>>>           for_each_mem_pfn_range(i, nid,&start_pfn,&end_pfn, NULL) {
>>>>               unsigned long size_pages;
>>>>
>>>> +            /*
>>>> +             * Find more memory for kernelcore in
>>>> +             * [zone_movable_pfn[nid], zone_movable_limit[nid]).
>>>> +             */
>>>>               start_pfn = max(start_pfn, zone_movable_pfn[nid]);
>>>>               if (start_pfn>= end_pfn)
>>>>                   continue;
>>>>
>>>> +            if (zone_movable_limit[nid]) {
>>>> +                end_pfn = min(end_pfn, zone_movable_limit[nid]);
>>>> +                /* No range left for kernelcore in this node */
>>>> +                if (start_pfn>= end_pfn) {
>>>> +                    zone_movable_pfn[nid] =
>>>> +                            zone_movable_limit[nid];
>>>> +                    break;
>>>> +                }
>>>> +            }
> Hi Tang,
> 	I just to remove the above logic, so the implementation will be greatly
> simplified. Please refer to the attachment.
> Regards!
> Gerry
> 
>>>> +
>>>>               /* Account for what is only usable for kernelcore */
>>>>               if (start_pfn < usable_startpfn) {
>>>>                   unsigned long kernel_pages;
>>>> @@ -4896,12 +4923,12 @@ restart:
>>>>       if (usable_nodes && required_kernelcore > usable_nodes)
>>>>           goto restart;
>>>>
>>>> +out:
>>>>       /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
>>>>       for (nid = 0; nid < MAX_NUMNODES; nid++)
>>>>           zone_movable_pfn[nid] =
>>>>               roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
>>>>
>>>> -out:
>>>>       /* restore the node_state */
>>>>       node_states[N_HIGH_MEMORY] = saved_node_state;
>>>>   }
>>>>
>>>
>>>
>>
>>
>> .
>>
> 





* Re: [PATCH v2 4/5] page_alloc: Make movablecore_map has higher priority
@ 2012-12-06  2:51           ` Jianguo Wu
  0 siblings, 0 replies; 170+ messages in thread
From: Jianguo Wu @ 2012-12-06  2:51 UTC (permalink / raw)
  To: Tang Chen
  Cc: Jiang Liu, Jiang Liu, hpa, akpm, rob, isimatu.yasuaki, laijs,
	wency, linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

Hi Tang,

There is a bug in Gerry's patch; please apply this patch to fix it.

---
 mm/page_alloc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 41c3b51..d981810 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4383,7 +4383,7 @@ static int __init find_zone_movable_from_movablecore_map(void)
 			 */
 			start_pfn = max(start_pfn,
 					movablecore_map.map[map_pos].start);
-			zone_movable_pfn[nid] = roundup(zone_movable_pfn[nid],
+			zone_movable_pfn[nid] = roundup(start_pfn,
 							MAX_ORDER_NR_PAGES);
 			break;
 		}
-- 
1.7.6.1

On 2012/12/6 10:26, Jiang Liu wrote:

> On 2012-12-6 9:26, Tang Chen wrote:
>> On 12/05/2012 11:43 PM, Jiang Liu wrote:
>>> If we make "movablecore_map" take precedence over "movablecore/kernelcore",
>>> the logic could be simplified. I think it's not so attractive to support
>>> both "movablecore_map" and "movablecore/kernelcore" at the same time.
>>
>> Hi Liu,
>>
>> Thanks for you advice. :)
>>
>> Memory hotplug needs different support on different hardware. We are
>> trying to figure out a way to satisfy as many users as we can.
>> Since it is a little difficult, it may take some time. :)
>>
>> But I still think we need a boot option to support it. Just a matter of
>> how to make it easier to use. :)
>>
>> Thanks. :)
>>
>>>
>>> On 11/23/2012 06:44 PM, Tang Chen wrote:
>>>> If kernelcore or movablecore is specified at the same time
>>>> as movablecore_map, movablecore_map will take higher
>>>> priority and be satisfied first.
>>>> This patch will make find_zone_movable_pfns_for_nodes()
>>>> calculate zone_movable_pfn[] with the limit from
>>>> zone_movable_limit[].
>>>>
>>>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>>>> Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
>>>> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
>>>> Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
>>>> ---
>>>>   mm/page_alloc.c |   35 +++++++++++++++++++++++++++++++----
>>>>   1 files changed, 31 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index f23d76a..05bafbb 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -4800,12 +4800,25 @@ static void __init find_zone_movable_pfns_for_nodes(void)
>>>>           required_kernelcore = max(required_kernelcore, corepages);
>>>>       }
>>>>
>>>> -    /* If kernelcore was not specified, there is no ZONE_MOVABLE */
>>>> -    if (!required_kernelcore)
>>>> +    /*
>>>> +     * No matter kernelcore/movablecore was limited or not, movable_zone
>>>> +     * should always be set to a usable zone index.
>>>> +     */
>>>> +    find_usable_zone_for_movable();
>>>> +
>>>> +    /*
>>>> +     * If neither kernelcore/movablecore nor movablecore_map is specified,
>>>> +     * there is no ZONE_MOVABLE. But if movablecore_map is specified, the
>>>> +     * start pfn of ZONE_MOVABLE has been stored in zone_movable_limit[].
>>>> +     */
>>>> +    if (!required_kernelcore) {
>>>> +        if (movablecore_map.nr_map)
>>>> +            memcpy(zone_movable_pfn, zone_movable_limit,
>>>> +                sizeof(zone_movable_pfn));
>>>>           goto out;
>>>> +    }
>>>>
>>>>       /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
>>>> -    find_usable_zone_for_movable();
>>>>       usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
>>>>
>>>>   restart:
>>>> @@ -4833,10 +4846,24 @@ restart:
>>>>           for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>>>>               unsigned long size_pages;
>>>>
>>>> +            /*
>>>> +             * Find more memory for kernelcore in
>>>> +             * [zone_movable_pfn[nid], zone_movable_limit[nid]).
>>>> +             */
>>>>               start_pfn = max(start_pfn, zone_movable_pfn[nid]);
>>>>               if (start_pfn >= end_pfn)
>>>>                   continue;
>>>>
>>>> +            if (zone_movable_limit[nid]) {
>>>> +                end_pfn = min(end_pfn, zone_movable_limit[nid]);
>>>> +                /* No range left for kernelcore in this node */
>>>> +                if (start_pfn >= end_pfn) {
>>>> +                    zone_movable_pfn[nid] =
>>>> +                            zone_movable_limit[nid];
>>>> +                    break;
>>>> +                }
>>>> +            }
> Hi Tang,
> 	I'd just remove the above logic; the implementation will be greatly
> simplified. Please refer to the attachment.
> Regards!
> Gerry
> 
>>>> +
>>>>               /* Account for what is only usable for kernelcore */
>>>>               if (start_pfn < usable_startpfn) {
>>>>                   unsigned long kernel_pages;
>>>> @@ -4896,12 +4923,12 @@ restart:
>>>>       if (usable_nodes && required_kernelcore > usable_nodes)
>>>>           goto restart;
>>>>
>>>> +out:
>>>>       /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
>>>>       for (nid = 0; nid < MAX_NUMNODES; nid++)
>>>>           zone_movable_pfn[nid] =
>>>>               roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
>>>>
>>>> -out:
>>>>       /* restore the node_state */
>>>>       node_states[N_HIGH_MEMORY] = saved_node_state;
>>>>   }
>>>>
>>>
>>>
>>
>>
>> .
>>
> 



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH v2 4/5] page_alloc: Make movablecore_map has higher priority
  2012-12-06  2:51           ` Jianguo Wu
@ 2012-12-06  2:57             ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-12-06  2:57 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: Jiang Liu, Jiang Liu, hpa, akpm, rob, isimatu.yasuaki, laijs,
	wency, linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

Hi Liu, Wu,

I got it, thank you very much. The idea is very helpful. :)
I'll apply your patches and do some tests later.

Thanks. :)


On 12/06/2012 10:51 AM, Jianguo Wu wrote:
> Hi Tang,
>
> There is a bug in Gerry's patch, please apply this patch to fix it.
>
> ---
>   mm/page_alloc.c |    2 +-
>   1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 41c3b51..d981810 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4383,7 +4383,7 @@ static int __init find_zone_movable_from_movablecore_map(void)
>   			 */
>   			start_pfn = max(start_pfn,
>   					movablecore_map.map[map_pos].start);
> -			zone_movable_pfn[nid] = roundup(zone_movable_pfn[nid],
> +			zone_movable_pfn[nid] = roundup(start_pfn,
>   							MAX_ORDER_NR_PAGES);
>   			break;
>   		}




* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-11-27  5:31             ` H. Peter Anvin
@ 2012-12-06 17:28               ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-12-06 17:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Wen Congyang, Tang Chen, wujianguo, akpm, rob, isimatu.yasuaki,
	laijs, linfeng, jiang.liu, yinghai, kosaki.motohiro, minchan.kim,
	mgorman, rientjes, rusty, linux-kernel, linux-mm, linux-doc,
	wujianguo, qiuxishi

Hi hpa and Tang,
	What do you think of the attached patches, which reserve memory
for hotplug from the memblock/bootmem allocator at early boot stages?
	Logically we split the task into three parts:
1) Provide a mechanism to specify zone_movable[] by kernel parameter.
   Patches 1-4 from Tang achieve this goal by adding the "movablecore_map" kernel
   parameter.
2) Reserve memory for hotplug by reusing the information provided by "movablecore_map".
   Patch 5 from Tang achieves this goal, and the attached patches provide
   another way to achieve the same goal by calling memblock_reserve() and newly
   introduced memblock interfaces.
3) Automatically reserve memory for hotplug according to firmware-provided
   information, based on the attached patches.

Regards!
Gerry

On 11/27/2012 01:31 PM, H. Peter Anvin wrote:
> On 11/26/2012 07:15 PM, Wen Congyang wrote:
>>
>> Hi, hpa
>>
>> The problem is that:
>> node1's address range is [18G, 34G), and the user-specified movable map is [8G, 24G).
>> We don't know node1's address range before NUMA init, so we can't prevent
>> allocating boot memory in the range [24G, 34G).
>>
>> The movable memory should be classified as a non-RAM type in memblock. Is that
>> what you mean? We don't save the type in memblock because we only
>> add E820_RAM and E820_RESERVED_KERN to memblock.
>>
> 
> We either need to keep the type or not add it to the memblocks.
> 
>     -hpa
> 




* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-12-06 17:28               ` Jiang Liu
@ 2012-12-06 17:41                 ` H. Peter Anvin
  -1 siblings, 0 replies; 170+ messages in thread
From: H. Peter Anvin @ 2012-12-06 17:41 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Wen Congyang, Tang Chen, wujianguo, akpm, rob, isimatu.yasuaki,
	laijs, linfeng, jiang.liu, yinghai, kosaki.motohiro, minchan.kim,
	mgorman, rientjes, rusty, linux-kernel, linux-mm, linux-doc,
	wujianguo, qiuxishi

On 12/06/2012 09:28 AM, Jiang Liu wrote:
> Hi hpa and Tang,
> 	How do you think about the attached patches, which reserves memory
> for hotplug from memblock/bootmem allocator at early booting stages?

I don't see any attached patches?

	-hpa




* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-12-06 17:41                 ` H. Peter Anvin
@ 2012-12-07  0:18                   ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-12-07  0:18 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Wen Congyang, Tang Chen, wujianguo, akpm, rob, isimatu.yasuaki,
	laijs, linfeng, jiang.liu, yinghai, kosaki.motohiro, minchan.kim,
	mgorman, rientjes, rusty, linux-kernel, linux-mm, linux-doc,
	wujianguo, qiuxishi

[-- Attachment #1: Type: text/plain, Size: 363 bytes --]

On 12/07/2012 01:41 AM, H. Peter Anvin wrote:
> On 12/06/2012 09:28 AM, Jiang Liu wrote:
>> Hi hpa and Tang,
>> 	How do you think about the attached patches, which reserves memory
>> for hotplug from memblock/bootmem allocator at early booting stages?
> 
> I don't see any attached patches?
> 
> 	-hpa
> 
Sorry, I was a little sleepy and missed the attachment.



[-- Attachment #2: 0001-memblock-introduce-interfaces-to-assoicate-tag-and-d.patch --]
[-- Type: text/x-patch, Size: 5499 bytes --]

>From 0ba5a0996d307d89f19ef79cf5fed1f8c4a7ed27 Mon Sep 17 00:00:00 2001
From: Jiang Liu <jiang.liu@huawei.com>
Date: Sun, 2 Dec 2012 20:54:32 +0800
Subject: [PATCH 1/3] memblock: introduce interfaces to associate tag and data
 with reserved regions

Currently some subsystems use private static arrays to store information
associated with memory blocks allocated/reserved from the memblock subsystem.
For example, dma-contiguous.c uses cma_reserved[] to store information
associated with allocated memory blocks.

So introduce interfaces to associate a tag (type) and caller-specific data
with allocated/reserved memblock regions. Users of the memblock subsystem
may be simplified by using these new interfaces.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
---
 include/linux/memblock.h |   33 ++++++++++++++++++++++++++
 mm/Kconfig               |    3 +++
 mm/memblock.c            |   58 +++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index d452ee1..40dea53 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -22,6 +22,10 @@
 struct memblock_region {
 	phys_addr_t base;
 	phys_addr_t size;
+#ifdef CONFIG_HAVE_MEMBLOCK_TAG
+	void *data;
+	int tag;
+#endif
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	int nid;
 #endif
@@ -118,6 +122,35 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
 	     i != (u64)ULLONG_MAX;					\
 	     __next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid))
 
+#ifdef CONFIG_HAVE_MEMBLOCK_TAG
+#define	MEMBLOCK_TAG_DEFAULT	0x0 /* default tag for bootmem allocator */
+
+int memblock_mark_tag(phys_addr_t base, phys_addr_t size, int tag, void *data);
+void memblock_free_all_with_tag(int tag);
+
+/* Only merge regions with default tag */
+static inline bool memblock_tag_mergeable(struct memblock_region *prev,
+					  struct memblock_region *next)
+{
+	return prev->tag == MEMBLOCK_TAG_DEFAULT &&
+	       next->tag == MEMBLOCK_TAG_DEFAULT;
+}
+
+static inline void memblock_init_tag(struct memblock_region *reg)
+{
+	reg->tag = MEMBLOCK_TAG_DEFAULT;
+	reg->data = NULL;
+}
+#else /* CONFIG_HAVE_MEMBLOCK_TAG */
+static inline bool memblock_tag_mergeable(struct memblock_region *prev,
+					  struct memblock_region *next)
+{
+	return true;
+}
+
+static inline void memblock_init_tag(struct memblock_region *reg) {}
+#endif /* CONFIG_HAVE_MEMBLOCK_TAG */
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);
 
diff --git a/mm/Kconfig b/mm/Kconfig
index a3f8ddd..5080390 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -131,6 +131,9 @@ config SPARSEMEM_VMEMMAP
 config HAVE_MEMBLOCK
 	boolean
 
+config HAVE_MEMBLOCK_TAG
+	boolean
+
 config HAVE_MEMBLOCK_NODE_MAP
 	boolean
 
diff --git a/mm/memblock.c b/mm/memblock.c
index 6259055..c2c644e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -307,7 +307,8 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)
 
 		if (this->base + this->size != next->base ||
 		    memblock_get_region_node(this) !=
-		    memblock_get_region_node(next)) {
+		    memblock_get_region_node(next) ||
+		    !memblock_tag_mergeable(this, next)) {
 			BUG_ON(this->base + this->size > next->base);
 			i++;
 			continue;
@@ -339,6 +340,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
 	memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));
 	rgn->base = base;
 	rgn->size = size;
+	memblock_init_tag(rgn);
 	memblock_set_region_node(rgn, nid);
 	type->cnt++;
 	type->total_size += size;
@@ -764,6 +766,60 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
 }
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
+#ifdef CONFIG_HAVE_MEMBLOCK_TAG
+/**
+ * memblock_mark_tag - associate @tag and @data with reserved regions
+ * @base: base of the area to mark
+ * @size: size of the area to mark
+ * @tag: tag (type) to associate with the reserved regions
+ * @data: caller-specific data to associate with the reserved regions
+ *
+ * Associate @tag (type) and caller-specific @data with the reserved memblock
+ * regions in [@base, @base+@size).
+ * Regions which cross the area boundaries are split as necessary.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_mark_tag(phys_addr_t base, phys_addr_t size,
+				      int tag, void *data)
+{
+	struct memblock_type *type = &memblock.reserved;
+	int start_rgn, end_rgn;
+	int i, ret;
+
+	ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
+	if (ret)
+		return ret;
+
+	for (i = start_rgn; i < end_rgn; i++) {
+		type->regions[i].tag = tag;
+		type->regions[i].data = data;
+	}
+
+	memblock_merge_regions(type);
+
+	return 0;
+}
+
+/**
+ * memblock_free_all_with_tag - free all reserved regions with @tag
+ * @tag: tag to identify reserved memblock regions to be freed
+ *
+ * Free all reserved memblock regions with tag (type) of @tag
+ */
+void __init_memblock memblock_free_all_with_tag(int tag)
+{
+	int i;
+	struct memblock_type *type = &memblock.reserved;
+
+	/* scan backward because it may remove current region */
+	for (i = type->cnt - 1; i >= 0; i--)
+		if (type->regions[i].tag == tag)
+			memblock_remove_region(type, i);
+}
+#endif /* CONFIG_HAVE_MEMBLOCK_TAG */
+
 static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
 					phys_addr_t align, phys_addr_t max_addr,
 					int nid)
-- 
1.7.9.5


[-- Attachment #3: 0002-x86-memhotplug-reserve-memory-from-bootmem-allocator.patch --]
[-- Type: text/x-patch, Size: 6076 bytes --]

>From ba05910c7915e3f95a0cd0893b9abc6cd98ab22e Mon Sep 17 00:00:00 2001
From: Jiang Liu <jiang.liu@huawei.com>
Date: Sun, 2 Dec 2012 21:26:21 +0800
Subject: [PATCH 2/3] x86, memhotplug: reserve memory from bootmem allocator
 for memory hotplug

There's no mechanism to migrate pages allocated from the bootmem allocator,
so a memory device may become irremovable if bootmem allocates any
pages from it.

This patch introduces a mechanism to
1) reserve memory from the bootmem allocator for hotplug early 'enough'
   during boot.
2) free the reserved memory into the buddy system late in boot, once the
   memory hotplug infrastructure has been initialized.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
---
 arch/x86/kernel/setup.c        |   11 ++++++++
 arch/x86/mm/init.c             |   56 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/mm/init_32.c          |    2 ++
 arch/x86/mm/init_64.c          |    2 ++
 include/linux/memblock.h       |    1 +
 include/linux/memory_hotplug.h |    5 ++++
 mm/Kconfig                     |    1 +
 7 files changed, 78 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ca45696..93f6f10 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -940,6 +940,17 @@ void __init setup_arch(char **cmdline_p)
 		max_low_pfn = max_pfn;
 	}
 #endif
+
+	/*
+	 * Try to reserve memory from bootmem allocator for memory hotplug
+	 * before updating memblock.current_limit to cover all low memory.
+	 * Until now memblock.current_limit is still set to the initial value
+	 * of max_pfn_mapped, which is 512M on x86_64 and xxx on i386. And
+	 * memblock allocates available memory in reverse order, so we almost
+	 * have no chance to reserve memory below 512M for memory hotplug.
+	 */
+	reserve_memory_for_hotplug();
+
 	memblock.current_limit = get_max_mapped();
 	dma_contiguous_reserve(0);
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index d7aea41..36bb5c2 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -424,3 +424,59 @@ void __init zone_sizes_init(void)
 	free_area_init_nodes(max_zone_pfns);
 }
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+static int __init reserve_bootmem_for_hotplug(phys_addr_t base,
+					      phys_addr_t size)
+{
+	if (memblock_is_region_reserved(base, size) ||
+	    memblock_reserve(base, size) < 0)
+		return -EBUSY;
+
+	BUG_ON(memblock_mark_tag(base, size, MEMBLOCK_TAG_HOTPLUG, NULL));
+
+	return 0;
+}
+
+/*
+ * Try to reserve low memory for hotplug according to the user-configured
+ * movablecore_map. The movable zone hasn't been determined yet, so we can't
+ * rely on zone_movable_is_highmem(); instead, reserve all low memory covered
+ * by the movablecore_map parameter.
+ * Assume entries in movablecore_map.map are sorted in increasing order.
+ */
+static int __init reserve_hotplug_memory_from_movable_map(void)
+{
+	int i;
+	phys_addr_t start, end;
+	struct movablecore_entry *ep;
+
+	if (movablecore_map.nr_map == 0)
+		return 0;
+
+	for (i = 0; i < movablecore_map.nr_map; i++) {
+		ep = &movablecore_map.map[i];
+		start = ep->start << PAGE_SHIFT;
+		end = (min(ep->end, max_low_pfn) + 1) << PAGE_SHIFT;
+		if (end <= start)
+			break;
+
+		if (reserve_bootmem_for_hotplug(start, end - start))
+			pr_warn("mm: failed to reserve lowmem [%#016llx-%#016llx] for hotplug.",
+				(unsigned long long)start,
+				(unsigned long long)end - 1);
+	}
+
+	return 1;
+}
+
+void __init reserve_memory_for_hotplug(void)
+{
+	if (reserve_hotplug_memory_from_movable_map())
+		return;
+}
+
+void __init free_memory_reserved_for_hotplug(void)
+{
+	memblock_free_all_with_tag(MEMBLOCK_TAG_HOTPLUG);
+}
+#endif
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 11a5800..815700a 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -745,6 +745,8 @@ void __init mem_init(void)
 	 */
 	set_highmem_pages_init();
 
+	free_memory_reserved_for_hotplug();
+
 	/* this will put all low memory onto the freelists */
 	totalram_pages += free_all_bootmem();
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3baff25..1a92fd6 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -695,6 +695,8 @@ void __init mem_init(void)
 
 	reservedpages = 0;
 
+	free_memory_reserved_for_hotplug();
+
 	/* this will put all low memory onto the freelists */
 #ifdef CONFIG_NUMA
 	totalram_pages = numa_free_all_bootmem();
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 40dea53..5420ed9 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -124,6 +124,7 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
 
 #ifdef CONFIG_HAVE_MEMBLOCK_TAG
 #define	MEMBLOCK_TAG_DEFAULT	0x0 /* default tag for bootmem allocator */
+#define	MEMBLOCK_TAG_HOTPLUG	0x1 /* reserved for memory hotplug */
 
 int memblock_mark_tag(phys_addr_t base, phys_addr_t size, int tag, void *data);
 void memblock_free_all_with_tag(int tag);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 95573ec..edf183d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -222,6 +222,8 @@ static inline void unlock_memory_hotplug(void) {}
 #ifdef CONFIG_MEMORY_HOTREMOVE
 
 extern int is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
+extern void reserve_memory_for_hotplug(void);
+extern void free_memory_reserved_for_hotplug(void);
 
 #else
 static inline int is_mem_section_removable(unsigned long pfn,
@@ -229,6 +231,9 @@ static inline int is_mem_section_removable(unsigned long pfn,
 {
 	return 0;
 }
+
+static inline void reserve_memory_for_hotplug(void) {}
+static inline void free_memory_reserved_for_hotplug(void) {}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
 extern int mem_online_node(int nid);
diff --git a/mm/Kconfig b/mm/Kconfig
index 5080390..9d69e5d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -160,6 +160,7 @@ config MEMORY_HOTPLUG_SPARSE
 
 config MEMORY_HOTREMOVE
 	bool "Allow for memory hot remove"
+	select HAVE_MEMBLOCK_TAG
 	depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
 	depends on MIGRATION
 
-- 
1.7.9.5


[-- Attachment #4: 0003-CMA-use-new-memblock-interfaces-to-simplify-implemen.patch --]
[-- Type: text/x-patch, Size: 3922 bytes --]

>From d1ddc6e2196758923c71d649d52b9a14d678419b Mon Sep 17 00:00:00 2001
From: Jiang Liu <jiang.liu@huawei.com>
Date: Sun, 2 Dec 2012 21:00:52 +0800
Subject: [PATCH 3/3] CMA: use new memblock interfaces to simplify
 implementation

This patch simplifies dma-contiguous.c by using the new memblock interfaces.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
---
 drivers/base/Kconfig          |    1 +
 drivers/base/dma-contiguous.c |   36 +++++++++++++-----------------------
 include/linux/memblock.h      |    1 +
 3 files changed, 15 insertions(+), 23 deletions(-)

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index b34b5cd..b0ac008 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -197,6 +197,7 @@ config CMA
 	depends on HAVE_DMA_CONTIGUOUS && HAVE_MEMBLOCK && EXPERIMENTAL
 	select MIGRATION
 	select MEMORY_ISOLATION
+	select HAVE_MEMBLOCK_TAG
 	help
 	  This enables the Contiguous Memory Allocator which allows drivers
 	  to allocate big physically-contiguous blocks of memory for use with
diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 612afcc..c092b76 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -190,27 +190,24 @@ no_mem:
 	return ERR_PTR(ret);
 }
 
-static struct cma_reserved {
-	phys_addr_t start;
-	unsigned long size;
-	struct device *dev;
-} cma_reserved[MAX_CMA_AREAS] __initdata;
 static unsigned cma_reserved_count __initdata;
 
 static int __init cma_init_reserved_areas(void)
 {
-	struct cma_reserved *r = cma_reserved;
-	unsigned i = cma_reserved_count;
+	struct memblock_region *reg;
+	struct cma *cma;
 
 	pr_debug("%s()\n", __func__);
 
-	for (; i; --i, ++r) {
-		struct cma *cma;
-		cma = cma_create_area(PFN_DOWN(r->start),
-				      r->size >> PAGE_SHIFT);
-		if (!IS_ERR(cma))
-			dev_set_cma_area(r->dev, cma);
-	}
+	for_each_memblock(memory, reg)
+		if (reg->tag == MEMBLOCK_TAG_CMA) {
+			cma = cma_create_area(PFN_DOWN(reg->base),
+					      reg->size >> PAGE_SHIFT);
+			if (!IS_ERR(cma))
+				dev_set_cma_area(reg->data, cma);
+		}
+	memblock_free_all_with_tag(MEMBLOCK_TAG_CMA);
+
 	return 0;
 }
 core_initcall(cma_init_reserved_areas);
@@ -230,7 +227,6 @@ core_initcall(cma_init_reserved_areas);
 int __init dma_declare_contiguous(struct device *dev, unsigned long size,
 				  phys_addr_t base, phys_addr_t limit)
 {
-	struct cma_reserved *r = &cma_reserved[cma_reserved_count];
 	unsigned long alignment;
 
 	pr_debug("%s(size %lx, base %08lx, limit %08lx)\n", __func__,
@@ -238,7 +234,7 @@ int __init dma_declare_contiguous(struct device *dev, unsigned long size,
 		 (unsigned long)limit);
 
 	/* Sanity checks */
-	if (cma_reserved_count == ARRAY_SIZE(cma_reserved)) {
+	if (cma_reserved_count == MAX_CMA_AREAS) {
 		pr_err("Not enough slots for CMA reserved regions!\n");
 		return -ENOSPC;
 	}
@@ -277,13 +273,7 @@ int __init dma_declare_contiguous(struct device *dev, unsigned long size,
 		}
 	}
 
-	/*
-	 * Each reserved area must be initialised later, when more kernel
-	 * subsystems (like slab allocator) are available.
-	 */
-	r->start = base;
-	r->size = size;
-	r->dev = dev;
+	BUG_ON(memblock_mark_tag(base, size, MEMBLOCK_TAG_CMA, dev));
 	cma_reserved_count++;
 	pr_info("CMA: reserved %ld MiB at %08lx\n", size / SZ_1M,
 		(unsigned long)base);
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 5420ed9..a662c07 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -125,6 +125,7 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
 #ifdef CONFIG_HAVE_MEMBLOCK_TAG
 #define	MEMBLOCK_TAG_DEFAULT	0x0 /* default tag for bootmem allocator */
 #define	MEMBLOCK_TAG_HOTPLUG	0x1 /* reserved for memory hotplug */
+#define	MEMBLOCK_TAG_CMA	0x2 /* reserved for CMA */
 
 int memblock_mark_tag(phys_addr_t base, phys_addr_t size, int tag, void *data);
 void memblock_free_all_with_tag(int tag);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
@ 2012-12-07  0:18                   ` Jiang Liu
  0 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-12-07  0:18 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Wen Congyang, Tang Chen, wujianguo, akpm, rob, isimatu.yasuaki,
	laijs, linfeng, jiang.liu, yinghai, kosaki.motohiro, minchan.kim,
	mgorman, rientjes, rusty, linux-kernel, linux-mm, linux-doc,
	wujianguo, qiuxishi

[-- Attachment #1: Type: text/plain, Size: 363 bytes --]

On 12/07/2012 01:41 AM, H. Peter Anvin wrote:
> On 12/06/2012 09:28 AM, Jiang Liu wrote:
>> Hi hpa and Tang,
>> 	How do you think about the attached patches, which reserves memory
>> for hotplug from memblock/bootmem allocator at early booting stages?
> 
> I don't see any attached patches?
> 
> 	-hpa
> 
Sorry, I was a little sleepy and missed the attachment.



[-- Attachment #2: 0001-memblock-introduce-interfaces-to-assoicate-tag-and-d.patch --]
[-- Type: text/x-patch; name="0001-memblock-introduce-interfaces-to-assoicate-tag-and-d.patch", Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 4/5] page_alloc: Make movablecore_map has higher priority
  2012-12-06  2:26         ` Jiang Liu
@ 2012-12-09  8:10           ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-12-09  8:10 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Jiang Liu, hpa, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

Hi Liu, Wu,

On 12/06/2012 10:26 AM, Jiang Liu wrote:
> On 2012-12-6 9:26, Tang Chen wrote:
>> On 12/05/2012 11:43 PM, Jiang Liu wrote:
>>> If we make "movablecore_map" take precedence over "movablecore/kernelcore",
>>> the logic could be simplified. I think it's not so attractive to support
>>> both "movablecore_map" and "movablecore/kernelcore" at the same time.

Thanks for the advice of removing movablecore/kernelcore. But since we
didn't plan to do this in the beginning, and movablecore/kernelcore are
more user-friendly, I think for now I'll handle the DMA and low memory
address problems as you mentioned, and just keep movablecore/kernelcore
in the next version. :)

And about the SRAT, I think it is necessary for many users, so we should
provide both interfaces. I may give it a try in the next version.

Thanks. :)


^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 4/5] page_alloc: Make movablecore_map has higher priority
  2012-12-09  8:10           ` Tang Chen
@ 2012-12-10  2:15             ` Jiang Liu
  -1 siblings, 0 replies; 170+ messages in thread
From: Jiang Liu @ 2012-12-10  2:15 UTC (permalink / raw)
  To: Tang Chen
  Cc: Jiang Liu, hpa, akpm, rob, isimatu.yasuaki, laijs, wency,
	linfeng, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc

On 2012-12-9 16:10, Tang Chen wrote:
> Hi Liu, Wu,
> 
> On 12/06/2012 10:26 AM, Jiang Liu wrote:
>> On 2012-12-6 9:26, Tang Chen wrote:
>>> On 12/05/2012 11:43 PM, Jiang Liu wrote:
>>>> If we make "movablecore_map" take precedence over "movablecore/kernelcore",
>>>> the logic could be simplified. I think it's not so attractive to support
>>>> both "movablecore_map" and "movablecore/kernelcore" at the same time.
> 
> Thanks for the advice of removing movablecore/kernelcore. But since we
> didn't plan to do this in the beginning, and movablecore/kernelcore are
> more user friendly, I think for now, I'll handle DMA and low memory
> address problems as you mentioned, and just keep movablecore/kernelcore
> in the next version. :)
Hi Tang,
	I mean we could ignore kernelcore/movablecore if the user specifies
both movablecore_map and kernelcore/movablecore on the kernel command
line. I'm not suggesting getting rid of kernelcore/movablecore. :)
	Thanks!

> 
> And about the SRAT, I think it is necessary to many users. I think we
> should provide both interfaces. I may give a try in the next version.
> 
> Thanks. :)
> 
> 
> .
> 



^ permalink raw reply	[flat|nested] 170+ messages in thread

* Re: [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map
  2012-11-26 12:40     ` wujianguo
@ 2012-12-19  9:17       ` Tang Chen
  -1 siblings, 0 replies; 170+ messages in thread
From: Tang Chen @ 2012-12-19  9:17 UTC (permalink / raw)
  To: wujianguo
  Cc: hpa, akpm, rob, isimatu.yasuaki, laijs, wency, linfeng,
	jiang.liu, yinghai, kosaki.motohiro, minchan.kim, mgorman,
	rientjes, rusty, linux-kernel, linux-mm, linux-doc, wujianguo,
	qiuxishi

Hi Wu,

Sorry for such a long delay.

On 11/26/2012 08:40 PM, wujianguo wrote:
> Hi Tang,
> 	I tested this patchset in x86_64, and I found that this patch didn't
> work as expected.
> 	For example, if node2's memory pfn range is [0x680000-0x980000),
> I boot kernel with movablecore_map=4G@0x680000000, all memory in node2 will be
> in ZONE_MOVABLE, but bootmem still can be allocated from [0x780000000-0x980000000),
> that means bootmem *is allocated* from ZONE_MOVABLE. This is because movablecore_map
> only contains [0x680000000-0x780000000). I think we can fix up movablecore_map; how
> about this:
>
> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
> ---
>   arch/x86/mm/srat.c |   15 +++++++++++++++
>   include/linux/mm.h |    3 +++
>   mm/page_alloc.c    |    2 +-
>   3 files changed, 19 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
> index 4ddf497..f1aac08 100644
> --- a/arch/x86/mm/srat.c
> +++ b/arch/x86/mm/srat.c
> @@ -147,6 +147,8 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
>   {
>   	u64 start, end;
>   	int node, pxm;
> +	int i;
> +	unsigned long start_pfn, end_pfn;
>
>   	if (srat_disabled())
>   		return -1;
> @@ -181,6 +183,19 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
>   	printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]\n",
>   	       node, pxm,
>   	       (unsigned long long) start, (unsigned long long) end - 1);
> +
> +	start_pfn = PFN_DOWN(start);
> +	end_pfn = PFN_UP(end);

I think the logic here has some problems.

Let's assume the range here is [3G, 5G), and
movablecore_map.map[] is like: [1G, 2G), [3G, 4G), [7G,8G).

> +	for (i = 0; i < movablecore_map.nr_map; i++) {
> +		if (end_pfn <= movablecore_map.map[i].start)
> +			break;

When i = 0, 5G > 1G, no break.

> +
> +		if (movablecore_map.map[i].end < end_pfn) {
> +			insert_movablecore_map(movablecore_map.map[i].end,
> +						end_pfn);

2G < 5G, so insert [2G, 5G). It's incorrect.
We should insert [4G, 5G).

I got your idea, and I also added SRAT support. So I made a new patch to
do this. Please have a look if you like. :)

Thanks. :)

> +		}
> +	}
> +
>   	return 0;
>   }
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5a65251..7a23403 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1356,6 +1356,9 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn);
>   #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
>   #endif
>
> +extern void insert_movablecore_map(unsigned long start_pfn,
> +					  unsigned long end_pfn);
> +
>   extern void set_dma_reserve(unsigned long new_dma_reserve);
>   extern void memmap_init_zone(unsigned long, int, unsigned long,
>   				unsigned long, enum memmap_context);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 544c829..e6b5090 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5089,7 +5089,7 @@ early_param("movablecore", cmdline_parse_movablecore);
>    * This function will also merge the overlapped ranges, and sort the array
>    * by start_pfn in monotonic increasing order.
>    */
> -static void __init insert_movablecore_map(unsigned long start_pfn,
> +void __init insert_movablecore_map(unsigned long start_pfn,
>   					  unsigned long end_pfn)
>   {
>   	int pos, overlap;
> --
> 1.7.6.1
> .
>
> Thanks,
> Jianguo Wu
>
> On 2012-11-23 18:44, Tang Chen wrote:
>> This patch makes sure bootmem will not allocate memory from areas that
>> may be ZONE_MOVABLE. The map info comes from the movablecore_map boot option.
>>
>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
>> Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
>> Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
>> ---
>>   include/linux/memblock.h |    1 +
>>   mm/memblock.c            |   15 ++++++++++++++-
>>   2 files changed, 15 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>> index d452ee1..6e25597 100644
>> --- a/include/linux/memblock.h
>> +++ b/include/linux/memblock.h
>> @@ -42,6 +42,7 @@ struct memblock {
>>
>>   extern struct memblock memblock;
>>   extern int memblock_debug;
>> +extern struct movablecore_map movablecore_map;
>>
>>   #define memblock_dbg(fmt, ...) \
>>   	if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
>> diff --git a/mm/memblock.c b/mm/memblock.c
>> index 6259055..33b3b4d 100644
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
>> @@ -101,6 +101,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>>   {
>>   	phys_addr_t this_start, this_end, cand;
>>   	u64 i;
>> +	int curr = movablecore_map.nr_map - 1;
>>
>>   	/* pump up @end */
>>   	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
>> @@ -114,13 +115,25 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>>   		this_start = clamp(this_start, start, end);
>>   		this_end = clamp(this_end, start, end);
>>
>> -		if (this_end < size)
>> +restart:
>> +		if (this_end <= this_start || this_end < size)
>>   			continue;
>>
>> +		for (; curr >= 0; curr--) {
>> +			if (movablecore_map.map[curr].start < this_end)
>> +				break;
>> +		}
>> +
>>   		cand = round_down(this_end - size, align);
>> +		if (curr >= 0 && cand < movablecore_map.map[curr].end) {
>> +			this_end = movablecore_map.map[curr].start;
>> +			goto restart;
>> +		}
>> +
>>   		if (cand >= this_start)
>>   			return cand;
>>   	}
>> +
>>   	return 0;
>>   }
>>
>>
>
>


^ permalink raw reply	[flat|nested] 170+ messages in thread

end of thread, other threads:[~2012-12-19  9:18 UTC | newest]

Thread overview: 170+ messages
2012-11-23 10:44 [PATCH v2 0/5] Add movablecore_map boot option Tang Chen
2012-11-23 10:44 ` Tang Chen
2012-11-23 10:44 ` [PATCH v2 1/5] x86: get pg_data_t's memory from other node Tang Chen
2012-11-23 10:44   ` Tang Chen
2012-11-24  1:19   ` Jiang Liu
2012-11-24  1:19     ` Jiang Liu
2012-11-26  1:19     ` Tang Chen
2012-11-26  1:19       ` Tang Chen
2012-12-02 15:11   ` Jiang Liu
2012-12-02 15:11     ` Jiang Liu
2012-11-23 10:44 ` [PATCH v2 2/5] page_alloc: add movable_memmap kernel parameter Tang Chen
2012-11-23 10:44   ` Tang Chen
2012-11-23 10:44 ` [PATCH v2 3/5] page_alloc: Introduce zone_movable_limit[] to keep movable limit for nodes Tang Chen
2012-11-23 10:44   ` Tang Chen
2012-12-05 15:46   ` Jiang Liu
2012-12-05 15:46     ` Jiang Liu
2012-12-06  1:20     ` Tang Chen
2012-12-06  1:20       ` Tang Chen
2012-11-23 10:44 ` [PATCH v2 4/5] page_alloc: Make movablecore_map has higher priority Tang Chen
2012-11-23 10:44   ` Tang Chen
2012-12-05 15:43   ` Jiang Liu
2012-12-05 15:43     ` Jiang Liu
2012-12-06  1:26     ` Tang Chen
2012-12-06  1:26       ` Tang Chen
2012-12-06  2:26       ` Jiang Liu
2012-12-06  2:26         ` Jiang Liu
2012-12-06  2:51         ` Jianguo Wu
2012-12-06  2:51           ` Jianguo Wu
2012-12-06  2:57           ` Tang Chen
2012-12-06  2:57             ` Tang Chen
2012-12-09  8:10         ` Tang Chen
2012-12-09  8:10           ` Tang Chen
2012-12-10  2:15           ` Jiang Liu
2012-12-10  2:15             ` Jiang Liu
2012-11-23 10:44 ` [PATCH v2 5/5] page_alloc: Bootmem limit with movablecore_map Tang Chen
2012-11-23 10:44   ` Tang Chen
2012-11-26 12:22   ` wujianguo
2012-11-26 12:22     ` wujianguo
2012-11-26 12:53     ` Tang Chen
2012-11-26 12:53       ` Tang Chen
2012-11-26 12:40   ` wujianguo
2012-11-26 12:40     ` wujianguo
2012-11-26 13:15     ` Tang Chen
2012-11-26 13:15       ` Tang Chen
2012-11-26 15:48       ` H. Peter Anvin
2012-11-26 15:48         ` H. Peter Anvin
2012-11-27  0:58         ` Jianguo Wu
2012-11-27  0:58           ` Jianguo Wu
2012-11-27  3:19           ` Wen Congyang
2012-11-27  3:19             ` Wen Congyang
2012-11-27  3:22             ` Jianguo Wu
2012-11-27  3:22               ` Jianguo Wu
2012-11-27  3:34               ` Wen Congyang
2012-11-27  3:34                 ` Wen Congyang
2012-11-27  1:12         ` Jiang Liu
2012-11-27  1:12           ` Jiang Liu
2012-11-27  1:20           ` H. Peter Anvin
2012-11-27  1:20             ` H. Peter Anvin
2012-11-27  3:15         ` Wen Congyang
2012-11-27  3:15           ` Wen Congyang
2012-11-27  5:31           ` H. Peter Anvin
2012-11-27  5:31             ` H. Peter Anvin
2012-12-06 17:28             ` Jiang Liu
2012-12-06 17:28               ` Jiang Liu
2012-12-06 17:41               ` H. Peter Anvin
2012-12-06 17:41                 ` H. Peter Anvin
2012-12-07  0:18                 ` Jiang Liu
2012-12-07  0:18                   ` Jiang Liu
2012-12-19  9:17     ` Tang Chen
2012-12-19  9:17       ` Tang Chen
2012-11-27  3:10 ` [PATCH v2 0/5] Add movablecore_map boot option wujianguo
2012-11-27  3:10   ` wujianguo
2012-11-27  5:43   ` Tang Chen
2012-11-27  5:43     ` Tang Chen
2012-11-27  6:20     ` H. Peter Anvin
2012-11-27  6:20       ` H. Peter Anvin
2012-11-27  6:47     ` Jianguo Wu
2012-11-27  6:47       ` Jianguo Wu
2012-11-28  3:47   ` Tang Chen
2012-11-28  3:47     ` Tang Chen
2012-11-28  4:01     ` Jiang Liu
2012-11-28  4:01       ` Jiang Liu
2012-11-28  5:21       ` Wen Congyang
2012-11-28  5:21         ` Wen Congyang
2012-11-28  5:17         ` Jiang Liu
2012-11-28  5:17           ` Jiang Liu
2012-11-28  4:53     ` Jianguo Wu
2012-11-28  4:53       ` Jianguo Wu
2012-11-27  8:00 ` Bob Liu
2012-11-27  8:00   ` Bob Liu
2012-11-27  8:29   ` Tang Chen
2012-11-27  8:29     ` Tang Chen
2012-11-27  8:49     ` H. Peter Anvin
2012-11-27  8:49       ` H. Peter Anvin
2012-11-27  9:47       ` Wen Congyang
2012-11-27  9:47         ` Wen Congyang
2012-11-27  9:53         ` H. Peter Anvin
2012-11-27  9:53           ` H. Peter Anvin
2012-11-27  9:59       ` Yasuaki Ishimatsu
2012-11-27  9:59         ` Yasuaki Ishimatsu
2012-11-27 12:09     ` Bob Liu
2012-11-27 12:09       ` Bob Liu
2012-11-27 12:49       ` Tang Chen
2012-11-27 12:49         ` Tang Chen
2012-11-28  3:24         ` Bob Liu
2012-11-28  3:24           ` Bob Liu
2012-11-28  4:08           ` Jiang Liu
2012-11-28  4:08             ` Jiang Liu
2012-11-28  6:16             ` Tang Chen
2012-11-28  6:16               ` Tang Chen
2012-11-28  7:03               ` Jiang Liu
2012-11-28  7:03                 ` Jiang Liu
2012-11-28  8:29             ` Wen Congyang
2012-11-28  8:29               ` Wen Congyang
2012-11-28  8:28               ` Jiang Liu
2012-11-28  8:28                 ` Jiang Liu
2012-11-28  8:38                 ` Wen Congyang
2012-11-28  8:38                   ` Wen Congyang
2012-11-29  0:43               ` Jaegeuk Hanse
2012-11-29  0:43                 ` Jaegeuk Hanse
2012-11-29  1:24                 ` Tang Chen
2012-11-29  1:24                   ` Tang Chen
2012-11-30  9:20             ` Lai Jiangshan
2012-11-30  9:20               ` Lai Jiangshan
2012-11-28  8:47 ` Jiang Liu
2012-11-28  8:47   ` Jiang Liu
2012-11-28 21:34   ` Luck, Tony
2012-11-28 21:34     ` Luck, Tony
2012-11-28 21:38     ` H. Peter Anvin
2012-11-28 21:38       ` H. Peter Anvin
2012-11-29 11:00       ` Mel Gorman
2012-11-29 11:00         ` Mel Gorman
2012-11-29 16:07         ` H. Peter Anvin
2012-11-29 16:07           ` H. Peter Anvin
2012-11-29 22:41           ` Luck, Tony
2012-11-29 22:41             ` Luck, Tony
2012-11-29 22:45             ` H. Peter Anvin
2012-11-29 22:45               ` H. Peter Anvin
2012-11-30  2:56         ` Jiang Liu
2012-11-30  2:56           ` Jiang Liu
2012-11-30  3:15           ` Yasuaki Ishimatsu
2012-11-30  3:15             ` Yasuaki Ishimatsu
2012-11-30 15:36             ` Jiang Liu
2012-11-30 15:36               ` Jiang Liu
2012-11-30  2:58         ` Luck, Tony
2012-11-30  2:58           ` Luck, Tony
2012-11-30  3:28           ` H. Peter Anvin
2012-11-30  3:28             ` H. Peter Anvin
2012-11-30 10:19           ` Glauber Costa
2012-11-30 10:19             ` Glauber Costa
2012-11-30 10:52           ` Mel Gorman
2012-11-30 10:52             ` Mel Gorman
2012-11-29 10:38     ` Yasuaki Ishimatsu
2012-11-29 10:38       ` Yasuaki Ishimatsu
2012-11-29 11:05       ` Mel Gorman
2012-11-29 11:05         ` Mel Gorman
2012-11-29 15:47       ` Jiang Liu
2012-11-29 15:47         ` Jiang Liu
2012-11-29 15:53       ` Jiang Liu
2012-11-29 15:53         ` Jiang Liu
2012-11-29  1:42   ` Jaegeuk Hanse
2012-11-29  1:42     ` Jaegeuk Hanse
2012-11-29  2:25     ` Jiang Liu
2012-11-29  2:25       ` Jiang Liu
2012-11-29  2:49       ` Wanpeng Li
2012-11-29  2:49       ` Wanpeng Li
2012-11-29  2:59         ` Jiang Liu
2012-11-29  2:59           ` Jiang Liu
2012-11-30 22:27       ` Toshi Kani
2012-11-30 22:27         ` Toshi Kani
