linux-mm.kvack.org archive mirror
* [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE
@ 2013-12-03  2:19 Zhang Yanfei
  2013-12-03  2:22 ` [PATCH RESEND part2 v2 1/8] x86: get pg_data_t's memory from other node Zhang Yanfei
                   ` (9 more replies)
  0 siblings, 10 replies; 24+ messages in thread
From: Zhang Yanfei @ 2013-12-03  2:19 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Rafael J . Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	Yasuaki Ishimatsu, Taku Izumi, Mel Gorman, Minchan Kim, mina86,
	gong.chen, Vasilis Liaskovitis, lwoodman, Rik van Riel, jweiner,
	Prarit Bhargava, linux-kernel, Linux MM, Chen Tang, Tang Chen,
	Zhang Yanfei

[Problem]

Linux currently cannot migrate pages used by the kernel because of the
kernel direct mapping: in kernel space, va = pa + PAGE_OFFSET. When the
pa changes, we cannot simply update the page table and keep the va
unmodified. So kernel pages are not migratable.

There are also other issues that make kernel pages non-migratable. For
example, a physical address may be cached somewhere and used later, and
it is not feasible to update all such caches.

When doing memory hotplug in Linux, we first migrate all the pages in one
memory device somewhere else, and then remove the device. But if pages are
used by the kernel, they are not migratable. As a result, memory used by
the kernel cannot be hot-removed.

Modifying the kernel direct mapping mechanism would be too difficult, and
it could degrade kernel performance and stability. So we use the following
approach to do memory hotplug.


[What we are doing]

In Linux, memory in one NUMA node is divided into several zones. One of
the zones is ZONE_MOVABLE, which the kernel won't use.

In order to implement memory hotplug in Linux, we are going to arrange all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this memory.

To do this, we need ACPI's help.


[How we do this]

In ACPI, the SRAT (System Resource Affinity Table) contains NUMA info. The
memory affinity entries in the SRAT record every memory range in the system,
along with flags specifying whether a memory range is hotpluggable.
(Please refer to ACPI spec 5.0, section 5.2.16.)

With the help of SRAT, we have to do the following two things to achieve our
goal:

1. When doing memory hot-add, allow users to arrange hotpluggable memory
   as ZONE_MOVABLE.
   (This has been done by the MOVABLE_NODE functionality in Linux.)

2. When the system is booting, prevent the bootmem allocator from
   allocating hotpluggable memory for the kernel before memory
   initialization finishes.
   (This is what we are going to do. See below.)


[About this patch-set]

In the previous part's patches, we made the kernel allocate memory near the
kernel image before the SRAT is parsed, to avoid allocating hotpluggable
memory for the kernel. So this patch-set does the following things:

1. Improve memblock to support flags, which are used to indicate different
   memory types.

2. Mark all hotpluggable memory in memblock.memory[].

3. Make the default memblock allocator skip hotpluggable memory.

4. Improve the "movable_node" boot option to have higher priority than the
   movablecore and kernelcore boot options.
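As an illustration of the movable_node priority change, a boot command line
might look like the following (a hypothetical example; exact option semantics
as described in the kernel's boot-parameter documentation):

```
# movable_node takes priority when present; otherwise kernelcore/movablecore
# decide how much memory ends up in ZONE_MOVABLE.
linux /boot/vmlinuz root=/dev/sda1 kernelcore=4G movable_node
```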

Change log v1 -> v2:
1. Rebase this part on the v7 version of part1
2. Fix bug: if the movable_node boot option was not specified, memblock
   still checked for hotpluggable memory when allocating memory.

Tang Chen (7):
  memblock, numa: Introduce flag into memblock
  memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark
    hotpluggable regions
  memblock: Make memblock_set_node() support different memblock_type
  acpi, numa, mem_hotplug: Mark hotpluggable memory in memblock
  acpi, numa, mem_hotplug: Mark all nodes the kernel resides
    un-hotpluggable
  memblock, mem_hotplug: Make memblock skip hotpluggable regions if
    needed
  x86, numa, acpi, memory-hotplug: Make movable_node have higher
    priority

Yasuaki Ishimatsu (1):
  x86: get pg_data_t's memory from other node

 arch/metag/mm/init.c      |    3 +-
 arch/metag/mm/numa.c      |    3 +-
 arch/microblaze/mm/init.c |    3 +-
 arch/powerpc/mm/mem.c     |    2 +-
 arch/powerpc/mm/numa.c    |    8 ++-
 arch/sh/kernel/setup.c    |    4 +-
 arch/sparc/mm/init_64.c   |    5 +-
 arch/x86/mm/init_32.c     |    2 +-
 arch/x86/mm/init_64.c     |    2 +-
 arch/x86/mm/numa.c        |   63 +++++++++++++++++++++--
 arch/x86/mm/srat.c        |    5 ++
 include/linux/memblock.h  |   39 ++++++++++++++-
 mm/memblock.c             |  123 ++++++++++++++++++++++++++++++++++++++-------
 mm/memory_hotplug.c       |    1 +
 mm/page_alloc.c           |   28 ++++++++++-
 15 files changed, 252 insertions(+), 39 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH RESEND part2 v2 1/8] x86: get pg_data_t's memory from other node
  2013-12-03  2:19 [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
@ 2013-12-03  2:22 ` Zhang Yanfei
  2014-01-16 17:11   ` Mel Gorman
  2013-12-03  2:24 ` [PATCH RESEND part2 v2 2/8] memblock, numa: Introduce flag into memblock Zhang Yanfei
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 24+ messages in thread
From: Zhang Yanfei @ 2013-12-03  2:22 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Rafael J . Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	Yasuaki Ishimatsu, Taku Izumi, Mel Gorman, Minchan Kim, mina86,
	gong.chen, Vasilis Liaskovitis, lwoodman, Rik van Riel, jweiner,
	Prarit Bhargava, linux-kernel, Linux MM, Chen Tang, Tang Chen,
	Zhang Yanfei

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

If the system can create a movable node, in which all of the node's memory
is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
the node's pg_data_t. So, invoke memblock_alloc_nid(...MAX_NUMNODES) again
to retry when the first allocation fails. Otherwise, the system could fail
to boot. (We don't use memblock_alloc_try_nid() to retry because that
function panics the system if the allocation fails.)

The node_data could reside on a hotpluggable node, and so could the page
tables and vmemmap. But for now, doing so would break the memory hot-remove
path.

A node could have several memory devices, and the device that holds the
node data should be hot-removed last. But at the NUMA level, we don't know
which memory_block (/sys/devices/system/node/nodeX/memoryXXX) belongs to
which memory device; we only have the node. So we can only do whole-node
hotplug.

But in virtualization, developers are now implementing memory hotplug in
qemu, which supports hotplugging a single memory device. So whole-node
hotplug will not satisfy virtualization users.

So at last, we concluded that we'd better do memory hotplug and the local
node things (local node data, page tables, vmemmap, ...) in two steps.
Please refer to https://lkml.org/lkml/2013/6/19/73

For now, we put the node_data of a movable node on another node, and will
improve this in the future.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
---
 arch/x86/mm/numa.c |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 24aec58..e17db5d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -211,9 +211,14 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
 	 */
 	nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
 	if (!nd_pa) {
-		pr_err("Cannot find %zu bytes in node %d\n",
-		       nd_size, nid);
-		return;
+		pr_warn("Cannot find %zu bytes in node %d, so try other nodes",
+			nd_size, nid);
+		nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES,
+					   MAX_NUMNODES);
+		if (!nd_pa) {
+			pr_err("Cannot find %zu bytes in any node\n", nd_size);
+			return;
+		}
 	}
 	nd = __va(nd_pa);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH RESEND part2 v2 2/8] memblock, numa: Introduce flag into memblock
  2013-12-03  2:19 [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
  2013-12-03  2:22 ` [PATCH RESEND part2 v2 1/8] x86: get pg_data_t's memory from other node Zhang Yanfei
@ 2013-12-03  2:24 ` Zhang Yanfei
  2013-12-03  2:25 ` [PATCH RESEND part2 v2 3/8] memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions Zhang Yanfei
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Zhang Yanfei @ 2013-12-03  2:24 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Rafael J . Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	Yasuaki Ishimatsu, Taku Izumi, Mel Gorman, Minchan Kim, mina86,
	gong.chen, Vasilis Liaskovitis, lwoodman, Rik van Riel, jweiner,
	Prarit Bhargava, linux-kernel, Linux MM, Chen Tang, Tang Chen,
	Zhang Yanfei

From: Tang Chen <tangchen@cn.fujitsu.com>

There is no flag in memblock to describe what type a memory region is.
Sometimes we use memblock to reserve memory for special usage and want to
know what kind of memory it is, so we need a way to differentiate memory
by usage.

In a hotplug environment, we want to reserve hotpluggable memory so the
kernel won't be able to use it. And when the system is up, we have to
free this hotpluggable memory back to the buddy allocator. So we need to
mark this memory first.

In order to do so, we need to mark out this special memory in memblock.
In this patch, we introduce a new "flags" member into memblock_region:
   struct memblock_region {
           phys_addr_t base;
           phys_addr_t size;
           unsigned long flags;		/* This is new. */
   #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
           int nid;
   #endif
   };

This patch does the following things:
1) Add "flags" member to memblock_region.
2) Modify the following APIs' prototype:
	memblock_add_region()
	memblock_insert_region()
3) Add memblock_reserve_region() to support reserving memory with flags, and
   keep memblock_reserve()'s prototype unmodified.
4) Modify other APIs to support flags, but keep their prototype unmodified.

The idea is from Wen Congyang <wency@cn.fujitsu.com> and Liu Jiang <jiang.liu@huawei.com>.

v1 -> v2:
As tj suggested, a zero flag MEMBLK_DEFAULT would confuse users: when
specifying another flag, such as MEMBLK_HOTPLUG, they wouldn't know whether
to use MEMBLK_DEFAULT | MEMBLK_HOTPLUG or just MEMBLK_HOTPLUG. So remove
MEMBLK_DEFAULT (which is 0), and just use 0 by default to avoid confusing
users.

Suggested-by: Wen Congyang <wency@cn.fujitsu.com>
Suggested-by: Liu Jiang <jiang.liu@huawei.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 include/linux/memblock.h |    1 +
 mm/memblock.c            |   53 +++++++++++++++++++++++++++++++++-------------
 2 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 77c60e5..9a805ec 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -22,6 +22,7 @@
 struct memblock_region {
 	phys_addr_t base;
 	phys_addr_t size;
+	unsigned long flags;
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	int nid;
 #endif
diff --git a/mm/memblock.c b/mm/memblock.c
index 53e477b..877973e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -255,6 +255,7 @@ static void __init_memblock memblock_remove_region(struct memblock_type *type, u
 		type->cnt = 1;
 		type->regions[0].base = 0;
 		type->regions[0].size = 0;
+		type->regions[0].flags = 0;
 		memblock_set_region_node(&type->regions[0], MAX_NUMNODES);
 	}
 }
@@ -405,7 +406,8 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)
 
 		if (this->base + this->size != next->base ||
 		    memblock_get_region_node(this) !=
-		    memblock_get_region_node(next)) {
+		    memblock_get_region_node(next) ||
+		    this->flags != next->flags) {
 			BUG_ON(this->base + this->size > next->base);
 			i++;
 			continue;
@@ -425,13 +427,15 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)
  * @base:	base address of the new region
  * @size:	size of the new region
  * @nid:	node id of the new region
+ * @flags:	flags of the new region
  *
  * Insert new memblock region [@base,@base+@size) into @type at @idx.
  * @type must already have extra room to accomodate the new region.
  */
 static void __init_memblock memblock_insert_region(struct memblock_type *type,
 						   int idx, phys_addr_t base,
-						   phys_addr_t size, int nid)
+						   phys_addr_t size,
+						   int nid, unsigned long flags)
 {
 	struct memblock_region *rgn = &type->regions[idx];
 
@@ -439,6 +443,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
 	memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));
 	rgn->base = base;
 	rgn->size = size;
+	rgn->flags = flags;
 	memblock_set_region_node(rgn, nid);
 	type->cnt++;
 	type->total_size += size;
@@ -450,6 +455,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
  * @base: base address of the new region
  * @size: size of the new region
  * @nid: nid of the new region
+ * @flags: flags of the new region
  *
  * Add new memblock region [@base,@base+@size) into @type.  The new region
  * is allowed to overlap with existing ones - overlaps don't affect already
@@ -460,7 +466,8 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
  * 0 on success, -errno on failure.
  */
 static int __init_memblock memblock_add_region(struct memblock_type *type,
-				phys_addr_t base, phys_addr_t size, int nid)
+				phys_addr_t base, phys_addr_t size,
+				int nid, unsigned long flags)
 {
 	bool insert = false;
 	phys_addr_t obase = base;
@@ -475,6 +482,7 @@ static int __init_memblock memblock_add_region(struct memblock_type *type,
 		WARN_ON(type->cnt != 1 || type->total_size);
 		type->regions[0].base = base;
 		type->regions[0].size = size;
+		type->regions[0].flags = flags;
 		memblock_set_region_node(&type->regions[0], nid);
 		type->total_size = size;
 		return 0;
@@ -505,7 +513,8 @@ repeat:
 			nr_new++;
 			if (insert)
 				memblock_insert_region(type, i++, base,
-						       rbase - base, nid);
+						       rbase - base, nid,
+						       flags);
 		}
 		/* area below @rend is dealt with, forget about it */
 		base = min(rend, end);
@@ -515,7 +524,8 @@ repeat:
 	if (base < end) {
 		nr_new++;
 		if (insert)
-			memblock_insert_region(type, i, base, end - base, nid);
+			memblock_insert_region(type, i, base, end - base,
+					       nid, flags);
 	}
 
 	/*
@@ -537,12 +547,13 @@ repeat:
 int __init_memblock memblock_add_node(phys_addr_t base, phys_addr_t size,
 				       int nid)
 {
-	return memblock_add_region(&memblock.memory, base, size, nid);
+	return memblock_add_region(&memblock.memory, base, size, nid, 0);
 }
 
 int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size)
 {
-	return memblock_add_region(&memblock.memory, base, size, MAX_NUMNODES);
+	return memblock_add_region(&memblock.memory, base, size,
+				   MAX_NUMNODES, 0);
 }
 
 /**
@@ -597,7 +608,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
 			rgn->size -= base - rbase;
 			type->total_size -= base - rbase;
 			memblock_insert_region(type, i, rbase, base - rbase,
-					       memblock_get_region_node(rgn));
+					       memblock_get_region_node(rgn),
+					       rgn->flags);
 		} else if (rend > end) {
 			/*
 			 * @rgn intersects from above.  Split and redo the
@@ -607,7 +619,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
 			rgn->size -= end - rbase;
 			type->total_size -= end - rbase;
 			memblock_insert_region(type, i--, rbase, end - rbase,
-					       memblock_get_region_node(rgn));
+					       memblock_get_region_node(rgn),
+					       rgn->flags);
 		} else {
 			/* @rgn is fully contained, record it */
 			if (!*end_rgn)
@@ -649,16 +662,24 @@ int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size)
 	return __memblock_remove(&memblock.reserved, base, size);
 }
 
-int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+static int __init_memblock memblock_reserve_region(phys_addr_t base,
+						   phys_addr_t size,
+						   int nid,
+						   unsigned long flags)
 {
 	struct memblock_type *_rgn = &memblock.reserved;
 
-	memblock_dbg("memblock_reserve: [%#016llx-%#016llx] %pF\n",
+	memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
 		     (unsigned long long)base,
 		     (unsigned long long)base + size,
-		     (void *)_RET_IP_);
+		     flags, (void *)_RET_IP_);
+
+	return memblock_add_region(_rgn, base, size, nid, flags);
+}
 
-	return memblock_add_region(_rgn, base, size, MAX_NUMNODES);
+int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+{
+	return memblock_reserve_region(base, size, MAX_NUMNODES, 0);
 }
 
 /**
@@ -1101,6 +1122,7 @@ void __init_memblock memblock_set_current_limit(phys_addr_t limit)
 static void __init_memblock memblock_dump(struct memblock_type *type, char *name)
 {
 	unsigned long long base, size;
+	unsigned long flags;
 	int i;
 
 	pr_info(" %s.cnt  = 0x%lx\n", name, type->cnt);
@@ -1111,13 +1133,14 @@ static void __init_memblock memblock_dump(struct memblock_type *type, char *name
 
 		base = rgn->base;
 		size = rgn->size;
+		flags = rgn->flags;
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 		if (memblock_get_region_node(rgn) != MAX_NUMNODES)
 			snprintf(nid_buf, sizeof(nid_buf), " on node %d",
 				 memblock_get_region_node(rgn));
 #endif
-		pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s\n",
-			name, i, base, base + size - 1, size, nid_buf);
+		pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s flags: %#lx\n",
+			name, i, base, base + size - 1, size, nid_buf, flags);
 	}
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH RESEND part2 v2 3/8] memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions
  2013-12-03  2:19 [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
  2013-12-03  2:22 ` [PATCH RESEND part2 v2 1/8] x86: get pg_data_t's memory from other node Zhang Yanfei
  2013-12-03  2:24 ` [PATCH RESEND part2 v2 2/8] memblock, numa: Introduce flag into memblock Zhang Yanfei
@ 2013-12-03  2:25 ` Zhang Yanfei
  2013-12-03  2:25 ` [PATCH RESEND part2 v2 4/8] memblock: Make memblock_set_node() support different memblock_type Zhang Yanfei
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Zhang Yanfei @ 2013-12-03  2:25 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Rafael J . Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	Yasuaki Ishimatsu, Taku Izumi, Mel Gorman, Minchan Kim, mina86,
	gong.chen, Vasilis Liaskovitis, lwoodman, Rik van Riel, jweiner,
	Prarit Bhargava, linux-kernel, Linux MM, Chen Tang, Tang Chen,
	Zhang Yanfei

From: Tang Chen <tangchen@cn.fujitsu.com>

In find_hotpluggable_memory(), once we find a memory region that is
hotpluggable, we want to mark it in memblock.memory so that we can later
keep the memblock allocator from allocating hotpluggable memory for the
kernel.

To achieve this goal, we introduce a MEMBLOCK_HOTPLUG flag to indicate
hotpluggable memory regions in memblock, and a function
memblock_mark_hotplug() to mark hotpluggable memory when we find it.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 include/linux/memblock.h |   17 +++++++++++++++
 mm/memblock.c            |   52 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 69 insertions(+), 0 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 9a805ec..b788faa 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -19,6 +19,9 @@
 
 #define INIT_MEMBLOCK_REGIONS	128
 
+/* Definition of memblock flags. */
+#define MEMBLOCK_HOTPLUG	0x1	/* hotpluggable region */
+
 struct memblock_region {
 	phys_addr_t base;
 	phys_addr_t size;
@@ -60,6 +63,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
 int memblock_free(phys_addr_t base, phys_addr_t size);
 int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
+int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
+int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
@@ -122,6 +127,18 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
 	     i != (u64)ULLONG_MAX;					\
 	     __next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid))
 
+static inline void memblock_set_region_flags(struct memblock_region *r,
+					     unsigned long flags)
+{
+	r->flags |= flags;
+}
+
+static inline void memblock_clear_region_flags(struct memblock_region *r,
+					       unsigned long flags)
+{
+	r->flags &= ~flags;
+}
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);
 
diff --git a/mm/memblock.c b/mm/memblock.c
index 877973e..5bea331 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -683,6 +683,58 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
 }
 
 /**
+ * memblock_mark_hotplug - Mark hotpluggable memory with flag MEMBLOCK_HOTPLUG.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * This function isolates region [@base, @base + @size), and mark it with flag
+ * MEMBLOCK_HOTPLUG.
+ *
+ * Return 0 on succees, -errno on failure.
+ */
+int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
+{
+	struct memblock_type *type = &memblock.memory;
+	int i, ret, start_rgn, end_rgn;
+
+	ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
+	if (ret)
+		return ret;
+
+	for (i = start_rgn; i < end_rgn; i++)
+		memblock_set_region_flags(&type->regions[i], MEMBLOCK_HOTPLUG);
+
+	memblock_merge_regions(type);
+	return 0;
+}
+
+/**
+ * memblock_clear_hotplug - Clear flag MEMBLOCK_HOTPLUG for a specified region.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * This function isolates region [@base, @base + @size), and clear flag
+ * MEMBLOCK_HOTPLUG for the isolated regions.
+ *
+ * Return 0 on succees, -errno on failure.
+ */
+int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
+{
+	struct memblock_type *type = &memblock.memory;
+	int i, ret, start_rgn, end_rgn;
+
+	ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
+	if (ret)
+		return ret;
+
+	for (i = start_rgn; i < end_rgn; i++)
+		memblock_clear_region_flags(&type->regions[i], MEMBLOCK_HOTPLUG);
+
+	memblock_merge_regions(type);
+	return 0;
+}
+
+/**
  * __next_free_mem_range - next function for for_each_free_mem_range()
  * @idx: pointer to u64 loop variable
  * @nid: node selector, %MAX_NUMNODES for all nodes
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH RESEND part2 v2 4/8] memblock: Make memblock_set_node() support different memblock_type
  2013-12-03  2:19 [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
                   ` (2 preceding siblings ...)
  2013-12-03  2:25 ` [PATCH RESEND part2 v2 3/8] memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions Zhang Yanfei
@ 2013-12-03  2:25 ` Zhang Yanfei
  2013-12-03  2:27 ` [PATCH RESEND part2 v2 5/8] acpi, numa, mem_hotplug: Mark hotpluggable memory in memblock Zhang Yanfei
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Zhang Yanfei @ 2013-12-03  2:25 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Rafael J . Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	Yasuaki Ishimatsu, Taku Izumi, Mel Gorman, Minchan Kim, mina86,
	gong.chen, Vasilis Liaskovitis, lwoodman, Rik van Riel, jweiner,
	Prarit Bhargava, linux-kernel, Linux MM, Chen Tang, Tang Chen,
	Zhang Yanfei

From: Tang Chen <tangchen@cn.fujitsu.com>

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 arch/metag/mm/init.c      |    3 ++-
 arch/metag/mm/numa.c      |    3 ++-
 arch/microblaze/mm/init.c |    3 ++-
 arch/powerpc/mm/mem.c     |    2 +-
 arch/powerpc/mm/numa.c    |    8 +++++---
 arch/sh/kernel/setup.c    |    4 ++--
 arch/sparc/mm/init_64.c   |    5 +++--
 arch/x86/mm/init_32.c     |    2 +-
 arch/x86/mm/init_64.c     |    2 +-
 arch/x86/mm/numa.c        |    6 ++++--
 include/linux/memblock.h  |    3 ++-
 mm/memblock.c             |    6 +++---
 12 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/arch/metag/mm/init.c b/arch/metag/mm/init.c
index 1239195..d94a58f 100644
--- a/arch/metag/mm/init.c
+++ b/arch/metag/mm/init.c
@@ -205,7 +205,8 @@ static void __init do_init_bootmem(void)
 		start_pfn = memblock_region_memory_base_pfn(reg);
 		end_pfn = memblock_region_memory_end_pfn(reg);
 		memblock_set_node(PFN_PHYS(start_pfn),
-				  PFN_PHYS(end_pfn - start_pfn), 0);
+				  PFN_PHYS(end_pfn - start_pfn),
+				  &memblock.memory, 0);
 	}
 
 	/* All of system RAM sits in node 0 for the non-NUMA case */
diff --git a/arch/metag/mm/numa.c b/arch/metag/mm/numa.c
index 9ae578c..229407f 100644
--- a/arch/metag/mm/numa.c
+++ b/arch/metag/mm/numa.c
@@ -42,7 +42,8 @@ void __init setup_bootmem_node(int nid, unsigned long start, unsigned long end)
 	memblock_add(start, end - start);
 
 	memblock_set_node(PFN_PHYS(start_pfn),
-			  PFN_PHYS(end_pfn - start_pfn), nid);
+			  PFN_PHYS(end_pfn - start_pfn),
+			  &memblock.memory, nid);
 
 	/* Node-local pgdat */
 	pgdat_paddr = memblock_alloc_base(sizeof(struct pglist_data),
diff --git a/arch/microblaze/mm/init.c b/arch/microblaze/mm/init.c
index 74c7bcc..89077d3 100644
--- a/arch/microblaze/mm/init.c
+++ b/arch/microblaze/mm/init.c
@@ -192,7 +192,8 @@ void __init setup_memory(void)
 		start_pfn = memblock_region_memory_base_pfn(reg);
 		end_pfn = memblock_region_memory_end_pfn(reg);
 		memblock_set_node(start_pfn << PAGE_SHIFT,
-					(end_pfn - start_pfn) << PAGE_SHIFT, 0);
+				  (end_pfn - start_pfn) << PAGE_SHIFT,
+				  &memblock.memory, 0);
 	}
 
 	/* free bootmem is whole main memory */
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 3fa93dc..231b785 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -209,7 +209,7 @@ void __init do_init_bootmem(void)
 	/* Place all memblock_regions in the same node and merge contiguous
 	 * memblock_regions
 	 */
-	memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
+	memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock.memory, 0);
 
 	/* Add all physical memory to the bootmem map, mark each area
 	 * present.
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c916127..f82f2ea 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -670,7 +670,8 @@ static void __init parse_drconf_memory(struct device_node *memory)
 			node_set_online(nid);
 			sz = numa_enforce_memory_limit(base, size);
 			if (sz)
-				memblock_set_node(base, sz, nid);
+				memblock_set_node(base, sz,
+						  &memblock.memory, nid);
 		} while (--ranges);
 	}
 }
@@ -760,7 +761,7 @@ new_range:
 				continue;
 		}
 
-		memblock_set_node(start, size, nid);
+		memblock_set_node(start, size, &memblock.memory, nid);
 
 		if (--ranges)
 			goto new_range;
@@ -797,7 +798,8 @@ static void __init setup_nonnuma(void)
 
 		fake_numa_create_new_node(end_pfn, &nid);
 		memblock_set_node(PFN_PHYS(start_pfn),
-				  PFN_PHYS(end_pfn - start_pfn), nid);
+				  PFN_PHYS(end_pfn - start_pfn),
+				  &memblock.memory, nid);
 		node_set_online(nid);
 	}
 }
diff --git a/arch/sh/kernel/setup.c b/arch/sh/kernel/setup.c
index 1cf90e9..de19cfa 100644
--- a/arch/sh/kernel/setup.c
+++ b/arch/sh/kernel/setup.c
@@ -230,8 +230,8 @@ void __init __add_active_range(unsigned int nid, unsigned long start_pfn,
 	pmb_bolt_mapping((unsigned long)__va(start), start, end - start,
 			 PAGE_KERNEL);
 
-	memblock_set_node(PFN_PHYS(start_pfn),
-			  PFN_PHYS(end_pfn - start_pfn), nid);
+	memblock_set_node(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - start_pfn),
+			  &memblock.memory, nid);
 }
 
 void __init __weak plat_early_device_setup(void)
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index ed82eda..31beb53 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -1021,7 +1021,8 @@ static void __init add_node_ranges(void)
 				"start[%lx] end[%lx]\n",
 				nid, start, this_end);
 
-			memblock_set_node(start, this_end - start, nid);
+			memblock_set_node(start, this_end - start,
+					  &memblock.memory, nid);
 			start = this_end;
 		}
 	}
@@ -1325,7 +1326,7 @@ static void __init bootmem_init_nonnuma(void)
 	       (top_of_ram - total_ram) >> 20);
 
 	init_node_masks_nonnuma();
-	memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
+	memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock.memory, 0);
 	allocate_node_data(0);
 	node_set_online(0);
 }
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 4287f1f..d9685b6 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -665,7 +665,7 @@ void __init initmem_init(void)
 	high_memory = (void *) __va(max_low_pfn * PAGE_SIZE - 1) + 1;
 #endif
 
-	memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
+	memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock.memory, 0);
 	sparse_memory_present_with_active_regions(0);
 
 #ifdef CONFIG_FLATMEM
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 104d56a..f35c66c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -643,7 +643,7 @@ kernel_physical_mapping_init(unsigned long start,
 #ifndef CONFIG_NUMA
 void __init initmem_init(void)
 {
-	memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
+	memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock.memory, 0);
 }
 #endif
 
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e17db5d..ab69e1d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -492,7 +492,8 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 
 	for (i = 0; i < mi->nr_blks; i++) {
 		struct numa_memblk *mb = &mi->blk[i];
-		memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
+		memblock_set_node(mb->start, mb->end - mb->start,
+				  &memblock.memory, mb->nid);
 	}
 
 	/*
@@ -566,7 +567,8 @@ static int __init numa_init(int (*init_func)(void))
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
-	WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
+	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
+				  MAX_NUMNODES));
 	numa_reset_distance();
 
 	ret = init_func();
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index b788faa..97480d3 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -140,7 +140,8 @@ static inline void memblock_clear_region_flags(struct memblock_region *r,
 }
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);
+int memblock_set_node(phys_addr_t base, phys_addr_t size,
+		      struct memblock_type *type, int nid);
 
 static inline void memblock_set_region_node(struct memblock_region *r, int nid)
 {
diff --git a/mm/memblock.c b/mm/memblock.c
index 5bea331..7de9c76 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -910,18 +910,18 @@ void __init_memblock __next_mem_pfn_range(int *idx, int nid,
  * memblock_set_node - set node ID on memblock regions
  * @base: base of area to set node ID for
  * @size: size of area to set node ID for
+ * @type: memblock type to set node ID for
  * @nid: node ID to set
  *
- * Set the nid of memblock memory regions in [@base,@base+@size) to @nid.
+ * Set the nid of memblock @type regions in [@base,@base+@size) to @nid.
  * Regions which cross the area boundaries are split as necessary.
  *
  * RETURNS:
  * 0 on success, -errno on failure.
  */
 int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
-				      int nid)
+				      struct memblock_type *type, int nid)
 {
-	struct memblock_type *type = &memblock.memory;
 	int start_rgn, end_rgn;
 	int i, ret;
 
-- 
1.7.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH RESEND part2 v2 5/8] acpi, numa, mem_hotplug: Mark hotpluggable memory in memblock
  2013-12-03  2:19 [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
                   ` (3 preceding siblings ...)
  2013-12-03  2:25 ` [PATCH RESEND part2 v2 4/8] memblock: Make memblock_set_node() support different memblock_type Zhang Yanfei
@ 2013-12-03  2:27 ` Zhang Yanfei
  2013-12-03  2:28 ` [PATCH RESEND part2 v2 6/8] acpi, numa, mem_hotplug: Mark all nodes the kernel resides un-hotpluggable Zhang Yanfei
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Zhang Yanfei @ 2013-12-03  2:27 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Rafael J . Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	Yasuaki Ishimatsu, Taku Izumi, Mel Gorman, Minchan Kim, mina86,
	gong.chen, Vasilis Liaskovitis, lwoodman, Rik van Riel, jweiner,
	Prarit Bhargava, linux-kernel, Linux MM, Chen Tang, Tang Chen,
	Zhang Yanfei

From: Tang Chen <tangchen@cn.fujitsu.com>

When parsing the SRAT, we learn which memory areas are hotpluggable.
So we invoke memblock_mark_hotplug(), introduced by the previous
patch, to mark hotpluggable memory in memblock.
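
For illustration, the marking step can be sketched outside the kernel with a
toy region table. The names below (toy_region, toy_mark_hotplug, TOY_HOTPLUG)
are hypothetical stand-ins for memblock's real structures, and the region
splitting that memblock does at range boundaries is omitted:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_HOTPLUG 0x1  /* stand-in for MEMBLOCK_HOTPLUG */

struct toy_region {
	unsigned long long base;
	unsigned long long size;
	unsigned long flags;
};

/* A tiny fixed "memblock.memory" table for illustration. */
static struct toy_region toy_memory[] = {
	{ 0x000000000ULL, 0x080000000ULL, 0 },  /* 0-2G: boot memory        */
	{ 0x100000000ULL, 0x100000000ULL, 0 },  /* 4G-8G: hotplug per SRAT  */
};

/*
 * Set TOY_HOTPLUG on every region fully inside [base, base + size).
 * The real memblock_mark_hotplug() also splits regions that cross
 * the range boundaries; that is omitted here. Returns the number of
 * regions marked.
 */
static int toy_mark_hotplug(unsigned long long base, unsigned long long size)
{
	size_t i;
	int marked = 0;

	for (i = 0; i < sizeof(toy_memory) / sizeof(toy_memory[0]); i++) {
		struct toy_region *r = &toy_memory[i];

		if (r->base >= base && r->base + r->size <= base + size) {
			r->flags |= TOY_HOTPLUG;
			marked++;
		}
	}
	return marked;
}
```

In this sketch, marking [4G, 8G) flags only the second region, which is
roughly what a hotpluggable SRAT entry would request.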

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 arch/x86/mm/numa.c |    2 ++
 arch/x86/mm/srat.c |    5 +++++
 2 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index ab69e1d..408c02d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -569,6 +569,8 @@ static int __init numa_init(int (*init_func)(void))
 	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
 	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
 				  MAX_NUMNODES));
+	/* In case that parsing SRAT failed. */
+	WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
 	numa_reset_distance();
 
 	ret = init_func();
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 266ca91..ca7c484 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -181,6 +181,11 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 		(unsigned long long) start, (unsigned long long) end - 1,
 		hotpluggable ? " hotplug" : "");
 
+	/* Mark hotplug range in memblock. */
+	if (hotpluggable && memblock_mark_hotplug(start, ma->length))
+		pr_warn("SRAT: Failed to mark hotplug range [mem %#010Lx-%#010Lx] in memblock\n",
+			(unsigned long long) start, (unsigned long long) end - 1);
+
 	return 0;
 out_err_bad_srat:
 	bad_srat();
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH RESEND part2 v2 6/8] acpi, numa, mem_hotplug: Mark all nodes the kernel resides un-hotpluggable
  2013-12-03  2:19 [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
                   ` (4 preceding siblings ...)
  2013-12-03  2:27 ` [PATCH RESEND part2 v2 5/8] acpi, numa, mem_hotplug: Mark hotpluggable memory in memblock Zhang Yanfei
@ 2013-12-03  2:28 ` Zhang Yanfei
  2013-12-03 23:44   ` Andrew Morton
  2013-12-03  2:29 ` [PATCH RESEND part2 v2 7/8] memblock, mem_hotplug: Make memblock skip hotpluggable regions if needed Zhang Yanfei
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 24+ messages in thread
From: Zhang Yanfei @ 2013-12-03  2:28 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Rafael J . Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	Yasuaki Ishimatsu, Taku Izumi, Mel Gorman, Minchan Kim, mina86,
	gong.chen, Vasilis Liaskovitis, lwoodman, Rik van Riel, jweiner,
	Prarit Bhargava, linux-kernel, Linux MM, Chen Tang, Tang Chen,
	Zhang Yanfei

From: Tang Chen <tangchen@cn.fujitsu.com>

Very early during boot, the kernel has to use some memory, for
example to load the kernel image. This cannot be avoided, so any
node the kernel resides in must be marked un-hotpluggable.
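
The approach can be sketched in plain C under the assumption that node IDs fit
in one machine word; toy_blk and toy_clear_kernel_hotplug below are
hypothetical stand-ins for numa_meminfo and numa_clear_kernel_node_hotplug(),
with memblock.reserved reduced to a list of node IDs:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_HOTPLUG 0x1  /* stand-in for MEMBLOCK_HOTPLUG */

struct toy_blk {
	int nid;
	unsigned long flags;
};

/*
 * Every node that owns a reserved (kernel-used) region is a "kernel
 * node"; the hotplug marking is then dropped from all memory blocks on
 * those nodes. Assumes nid < bits-per-long; the kernel uses nodemask_t
 * instead of a single word.
 */
static void toy_clear_kernel_hotplug(const int *reserved_nids, size_t nres,
				     struct toy_blk *mem, size_t nmem)
{
	unsigned long kernel_nodes = 0;  /* explicitly initialized bitmask */
	size_t i;

	/* Mark all kernel nodes. */
	for (i = 0; i < nres; i++)
		kernel_nodes |= 1UL << reserved_nids[i];

	/* Clear the hotplug flag for memory on kernel nodes. */
	for (i = 0; i < nmem; i++)
		if (kernel_nodes & (1UL << mem[i].nid))
			mem[i].flags &= ~TOY_HOTPLUG;
}
```

With the kernel image on node 0, only node 1's memory stays hotpluggable.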

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 arch/x86/mm/numa.c |   44 ++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 408c02d..f26b16f 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -494,6 +494,14 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 		struct numa_memblk *mb = &mi->blk[i];
 		memblock_set_node(mb->start, mb->end - mb->start,
 				  &memblock.memory, mb->nid);
+
+		/*
+		 * At this time, all memory regions reserved by memblock are
+		 * used by the kernel. Set the nid in memblock.reserved will
+		 * mark out all the nodes the kernel resides in.
+		 */
+		memblock_set_node(mb->start, mb->end - mb->start,
+				  &memblock.reserved, mb->nid);
 	}
 
 	/*
@@ -555,6 +563,30 @@ static void __init numa_init_array(void)
 	}
 }
 
+static void __init numa_clear_kernel_node_hotplug(void)
+{
+	int i, nid;
+	nodemask_t numa_kernel_nodes;
+	unsigned long start, end;
+	struct memblock_type *type = &memblock.reserved;
+
+	/* Mark all kernel nodes. */
+	for (i = 0; i < type->cnt; i++)
+		node_set(type->regions[i].nid, numa_kernel_nodes);
+
+	/* Clear MEMBLOCK_HOTPLUG flag for memory in kernel nodes. */
+	for (i = 0; i < numa_meminfo.nr_blks; i++) {
+		nid = numa_meminfo.blk[i].nid;
+		if (!node_isset(nid, numa_kernel_nodes))
+			continue;
+
+		start = numa_meminfo.blk[i].start;
+		end = numa_meminfo.blk[i].end;
+
+		memblock_clear_hotplug(start, end - start);
+	}
+}
+
 static int __init numa_init(int (*init_func)(void))
 {
 	int i;
@@ -569,6 +601,8 @@ static int __init numa_init(int (*init_func)(void))
 	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
 	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
 				  MAX_NUMNODES));
+	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
+				  MAX_NUMNODES));
 	/* In case that parsing SRAT failed. */
 	WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
 	numa_reset_distance();
@@ -606,6 +640,16 @@ static int __init numa_init(int (*init_func)(void))
 			numa_clear_node(i);
 	}
 	numa_init_array();
+
+	/*
+	 * At very early time, the kernel have to use some memory such as
+	 * loading the kernel image. We cannot prevent this anyway. So any
+	 * node the kernel resides in should be un-hotpluggable.
+	 *
+	 * And when we come here, numa_init() won't fail.
+	 */
+	numa_clear_kernel_node_hotplug();
+
 	return 0;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH RESEND part2 v2 7/8] memblock, mem_hotplug: Make memblock skip hotpluggable regions if needed
  2013-12-03  2:19 [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
                   ` (5 preceding siblings ...)
  2013-12-03  2:28 ` [PATCH RESEND part2 v2 6/8] acpi, numa, mem_hotplug: Mark all nodes the kernel resides un-hotpluggable Zhang Yanfei
@ 2013-12-03  2:29 ` Zhang Yanfei
  2013-12-03  2:30 ` [PATCH RESEND part2 v2 8/8] x86, numa, acpi, memory-hotplug: Make movable_node have higher priority Zhang Yanfei
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Zhang Yanfei @ 2013-12-03  2:29 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Rafael J . Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	Yasuaki Ishimatsu, Taku Izumi, Mel Gorman, Minchan Kim, mina86,
	gong.chen, Vasilis Liaskovitis, lwoodman, Rik van Riel, jweiner,
	Prarit Bhargava, linux-kernel, Linux MM, Chen Tang, Tang Chen,
	Zhang Yanfei

From: Tang Chen <tangchen@cn.fujitsu.com>

The Linux kernel cannot migrate pages it uses itself. As a result,
hotpluggable memory used by the kernel cannot be hot-removed. To solve
this problem, the basic idea is to prevent memblock from allocating
hotpluggable memory for the kernel early during boot, and to arrange all
memory marked hotpluggable in the ACPI SRAT (System Resource Affinity
Table) as ZONE_MOVABLE when initializing zones.

In the previous patches, we marked hotpluggable memory regions with the
MEMBLOCK_HOTPLUG flag in memblock.memory.

In this patch, we make memblock skip these hotpluggable memory regions
in the default top-down allocation function if the movable_node boot
option is specified.
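
As a rough stand-alone sketch (not the kernel API), the top-down filter looks
like this; toy_region and toy_topdown_pick are made-up names, and reservations
are ignored:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_HOTPLUG 0x1  /* stand-in for MEMBLOCK_HOTPLUG */

struct toy_region {
	unsigned long long base;
	unsigned long long size;
	unsigned long flags;
};

/* Highest-address-first region table, a stand-in for memblock.memory. */
static const struct toy_region toy_memory[] = {
	{ 0x100000000ULL, 0x100000000ULL, TOY_HOTPLUG }, /* 4G-8G, hotplug */
	{ 0x000000000ULL, 0x080000000ULL, 0 },           /* 0-2G           */
};

/*
 * Pick the top-most region to allocate from, optionally skipping
 * hotpluggable regions -- the same filter __next_free_mem_range_rev()
 * applies when movable_node is enabled. Returns the region base, or
 * ~0ULL if nothing qualifies.
 */
static unsigned long long toy_topdown_pick(int movable_node)
{
	size_t i;

	for (i = 0; i < sizeof(toy_memory) / sizeof(toy_memory[0]); i++) {
		const struct toy_region *r = &toy_memory[i];

		/* skip hotpluggable memory regions if needed */
		if (movable_node && (r->flags & TOY_HOTPLUG))
			continue;
		return r->base;
	}
	return ~0ULL;
}
```

Without movable_node the top (hotpluggable) region is used; with it, the
allocator falls back to the lower, non-hotpluggable region.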

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 include/linux/memblock.h |   18 ++++++++++++++++++
 mm/memblock.c            |   12 ++++++++++++
 mm/memory_hotplug.c      |    1 +
 3 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 97480d3..bfc1dba 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -47,6 +47,10 @@ struct memblock {
 
 extern struct memblock memblock;
 extern int memblock_debug;
+#ifdef CONFIG_MOVABLE_NODE
+/* If movable_node boot option specified */
+extern bool movable_node_enabled;
+#endif /* CONFIG_MOVABLE_NODE */
 
 #define memblock_dbg(fmt, ...) \
 	if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
@@ -65,6 +69,20 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
 int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
+#ifdef CONFIG_MOVABLE_NODE
+static inline bool memblock_is_hotpluggable(struct memblock_region *m)
+{
+	return m->flags & MEMBLOCK_HOTPLUG;
+}
+
+static inline bool movable_node_is_enabled(void)
+{
+	return movable_node_enabled;
+}
+#else
+static inline bool memblock_is_hotpluggable(struct memblock_region *m){ return false; }
+static inline bool movable_node_is_enabled(void) { return false; }
+#endif
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
diff --git a/mm/memblock.c b/mm/memblock.c
index 7de9c76..7f69012 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -39,6 +39,9 @@ struct memblock memblock __initdata_memblock = {
 };
 
 int memblock_debug __initdata_memblock;
+#ifdef CONFIG_MOVABLE_NODE
+bool movable_node_enabled __initdata_memblock = false;
+#endif
 static int memblock_can_resize __initdata_memblock;
 static int memblock_memory_in_slab __initdata_memblock = 0;
 static int memblock_reserved_in_slab __initdata_memblock = 0;
@@ -819,6 +822,11 @@ void __init_memblock __next_free_mem_range(u64 *idx, int nid,
  * @out_nid: ptr to int for nid of the range, can be %NULL
  *
  * Reverse of __next_free_mem_range().
+ *
+ * Linux kernel cannot migrate pages used by itself. Memory hotplug users won't
+ * be able to hot-remove hotpluggable memory used by the kernel. So this
+ * function skip hotpluggable regions if needed when allocating memory for the
+ * kernel.
  */
 void __init_memblock __next_free_mem_range_rev(u64 *idx, int nid,
 					   phys_addr_t *out_start,
@@ -843,6 +851,10 @@ void __init_memblock __next_free_mem_range_rev(u64 *idx, int nid,
 		if (nid != MAX_NUMNODES && nid != memblock_get_region_node(m))
 			continue;
 
+		/* skip hotpluggable memory regions if needed */
+		if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
+			continue;
+
 		/* scan areas before each reservation for intersection */
 		for ( ; ri >= 0; ri--) {
 			struct memblock_region *r = &rsv->regions[ri];
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8c91d0a..729a2d8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1436,6 +1436,7 @@ static int __init cmdline_parse_movable_node(char *p)
 	 * the kernel away from hotpluggable memory.
 	 */
 	memblock_set_bottom_up(true);
+	movable_node_enabled = true;
 #else
 	pr_warn("movable_node option not supported\n");
 #endif
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH RESEND part2 v2 8/8] x86, numa, acpi, memory-hotplug: Make movable_node have higher priority
  2013-12-03  2:19 [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
                   ` (6 preceding siblings ...)
  2013-12-03  2:29 ` [PATCH RESEND part2 v2 7/8] memblock, mem_hotplug: Make memblock skip hotpluggable regions if needed Zhang Yanfei
@ 2013-12-03  2:30 ` Zhang Yanfei
  2014-01-16 17:03   ` Mel Gorman
  2013-12-03  2:45 ` [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
  2013-12-03 23:48 ` Andrew Morton
  9 siblings, 1 reply; 24+ messages in thread
From: Zhang Yanfei @ 2013-12-03  2:30 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Rafael J . Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	Yasuaki Ishimatsu, Taku Izumi, Mel Gorman, Minchan Kim, mina86,
	gong.chen, Vasilis Liaskovitis, lwoodman, Rik van Riel, jweiner,
	Prarit Bhargava, linux-kernel, Linux MM, Chen Tang, Tang Chen,
	Zhang Yanfei

From: Tang Chen <tangchen@cn.fujitsu.com>

If users specify the original movablecore=nn@ss boot option, the kernel
will arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot
option is similar, except that it specifies ZONE_NORMAL ranges.

Now, if users specify "movable_node" on the kernel command line, the
kernel will arrange hotpluggable memory reported by the SRAT as
ZONE_MOVABLE. In that case, any movablecore=nn@ss and kernelcore=nn@ss
options are ignored.

Users who don't want this behavior simply specify nothing, and the
kernel will act as before.
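
The per-node computation of the ZONE_MOVABLE start PFN can be sketched
stand-alone; toy_region and toy_find_movable_pfns are hypothetical names, and
pfn == 0 stands for "no ZONE_MOVABLE on this node":

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NUMNODES 4
#define TOY_PAGE_SHIFT 12

struct toy_region {
	unsigned long long base;
	int nid;
	int hotpluggable;
};

/*
 * For each node, ZONE_MOVABLE starts at the lowest PFN of any
 * hotpluggable region on that node. Mirrors the min()-accumulation
 * loop this patch adds to find_zone_movable_pfns_for_nodes().
 */
static void toy_find_movable_pfns(const struct toy_region *regions, size_t n,
				  unsigned long pfn[TOY_MAX_NUMNODES])
{
	size_t i;

	for (i = 0; i < TOY_MAX_NUMNODES; i++)
		pfn[i] = 0;

	for (i = 0; i < n; i++) {
		unsigned long start;

		if (!regions[i].hotpluggable)
			continue;

		start = (unsigned long)(regions[i].base >> TOY_PAGE_SHIFT);
		pfn[regions[i].nid] = pfn[regions[i].nid] ?
			(start < pfn[regions[i].nid] ? start
						     : pfn[regions[i].nid]) :
			start;
	}
}
```

With two hotpluggable regions on one node, the lower start PFN wins, so the
whole hotpluggable span of that node lands in ZONE_MOVABLE.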

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
---
 mm/page_alloc.c |   28 ++++++++++++++++++++++++++--
 1 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd886fa..768ea0e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5021,9 +5021,33 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 	nodemask_t saved_node_state = node_states[N_MEMORY];
 	unsigned long totalpages = early_calculate_totalpages();
 	int usable_nodes = nodes_weight(node_states[N_MEMORY]);
+	struct memblock_type *type = &memblock.memory;
+
+	/* Need to find movable_zone earlier when movable_node is specified. */
+	find_usable_zone_for_movable();
+
+	/*
+	 * If movable_node is specified, ignore kernelcore and movablecore
+	 * options.
+	 */
+	if (movable_node_is_enabled()) {
+		for (i = 0; i < type->cnt; i++) {
+			if (!memblock_is_hotpluggable(&type->regions[i]))
+				continue;
+
+			nid = type->regions[i].nid;
+
+			usable_startpfn = PFN_DOWN(type->regions[i].base);
+			zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
+				min(usable_startpfn, zone_movable_pfn[nid]) :
+				usable_startpfn;
+		}
+
+		goto out2;
+	}
 
 	/*
-	 * If movablecore was specified, calculate what size of
+	 * If movablecore=nn[KMG] was specified, calculate what size of
 	 * kernelcore that corresponds so that memory usable for
 	 * any allocation type is evenly spread. If both kernelcore
 	 * and movablecore are specified, then the value of kernelcore
@@ -5049,7 +5073,6 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 		goto out;
 
 	/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
-	find_usable_zone_for_movable();
 	usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
 
 restart:
@@ -5140,6 +5163,7 @@ restart:
 	if (usable_nodes && required_kernelcore > usable_nodes)
 		goto restart;
 
+out2:
 	/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
 	for (nid = 0; nid < MAX_NUMNODES; nid++)
 		zone_movable_pfn[nid] =
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE
  2013-12-03  2:19 [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
                   ` (7 preceding siblings ...)
  2013-12-03  2:30 ` [PATCH RESEND part2 v2 8/8] x86, numa, acpi, memory-hotplug: Make movable_node have higher priority Zhang Yanfei
@ 2013-12-03  2:45 ` Zhang Yanfei
  2013-12-03 23:48 ` Andrew Morton
  9 siblings, 0 replies; 24+ messages in thread
From: Zhang Yanfei @ 2013-12-03  2:45 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Rafael J . Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	Yasuaki Ishimatsu, Taku Izumi, Mel Gorman, Minchan Kim, mina86,
	gong.chen, Vasilis Liaskovitis, lwoodman, Rik van Riel, jweiner,
	Prarit Bhargava, linux-kernel, Linux MM, Chen Tang, Tang Chen,
	Zhang Yanfei

Hello Andrew
CC: tejun

Now that 3.13-rc2 is out, it would be appreciated if you could take
these patches into the -mm tree so that they start appearing in -next
to catch any regressions, issues, etc. That will give us some time to
fix any issues that arise from -next.

This is the remaining part of the memory-hotplug work. The first part
was merged in 3.12, so we hope this part will make v3.13 so that the
functionality works as soon as possible.

I tested these patches on top of 3.13-rc2 and they work well.

Thank you very much!
Zhang

On 12/03/2013 10:19 AM, Zhang Yanfei wrote:
> [Problem]
> 
> The current Linux kernel cannot migrate pages it uses itself because
> of the kernel direct mapping: in Linux kernel space, va = pa + PAGE_OFFSET.
> When the pa is changed, we cannot simply update the page table and
> keep the va unmodified. So kernel pages are not migratable.
> 
> There are also other issues that make kernel pages unmigratable.
> For example, a physical address may be cached somewhere and used later;
> it is not easy to update all such caches.
> 
> When doing memory hotplug in Linux, we first migrate all the pages in one
> memory device somewhere else, and then remove the device. But pages
> used by the kernel are not migratable. As a result, memory used by
> the kernel cannot be hot-removed.
> 
> Modifying the kernel direct mapping mechanism is too difficult, and it
> may make the kernel slower and less stable. So we use the following
> approach to do memory hotplug instead.
> 
> 
> [What we are doing]
> 
> In Linux, memory in one numa node is divided into several zones. One of the
> zones is ZONE_MOVABLE, which the kernel won't use.
> 
> In order to implement memory hotplug in Linux, we are going to arrange all
> hotpluggable memory in ZONE_MOVABLE so that the kernel won't use that memory.
> 
> To do this, we need ACPI's help.
> 
> 
> [How we do this]
> 
> In ACPI, SRAT(System Resource Affinity Table) contains NUMA info. The memory
> affinities in SRAT record every memory range in the system, and also, flags
> specifying if the memory range is hotpluggable.
> (Please refer to ACPI spec 5.0 5.2.16)
> 
> With the help of SRAT, we have to do the following two things to achieve our
> goal:
> 
> 1. When doing memory hot-add, allow the users arranging hotpluggable as
>    ZONE_MOVABLE.
>    (This has been done by the MOVABLE_NODE functionality in Linux.)
> 
> 2. when the system is booting, prevent bootmem allocator from allocating
>    hotpluggable memory for the kernel before the memory initialization
>    finishes.
>    (This is what we are going to do. See below.)
> 
> 
> [About this patch-set]
> 
> In previous part's patches, we have made the kernel allocate memory near
> kernel image before SRAT parsed to avoid allocating hotpluggable memory
> for kernel. So this patch-set does the following things:
> 
> 1. Improve memblock to support flags, which are used to indicate different 
>    memory type.
> 
> 2. Mark all hotpluggable memory in memblock.memory[].
> 
> 3. Make the default memblock allocator skip hotpluggable memory.
> 
> 4. Improve "movable_node" boot option to have higher priority of movablecore
>    and kernelcore boot option.
> 
> Change log v1 -> v2:
> 1. Rebase this part on the v7 version of part1
> 2. Fix bug: If movable_node boot option not specified, memblock still
>    checks hotpluggable memory when allocating memory. 
> 
> Tang Chen (7):
>   memblock, numa: Introduce flag into memblock
>   memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark
>     hotpluggable regions
>   memblock: Make memblock_set_node() support different memblock_type
>   acpi, numa, mem_hotplug: Mark hotpluggable memory in memblock
>   acpi, numa, mem_hotplug: Mark all nodes the kernel resides
>     un-hotpluggable
>   memblock, mem_hotplug: Make memblock skip hotpluggable regions if
>     needed
>   x86, numa, acpi, memory-hotplug: Make movable_node have higher
>     priority
> 
> Yasuaki Ishimatsu (1):
>   x86: get pg_data_t's memory from other node
> 
>  arch/metag/mm/init.c      |    3 +-
>  arch/metag/mm/numa.c      |    3 +-
>  arch/microblaze/mm/init.c |    3 +-
>  arch/powerpc/mm/mem.c     |    2 +-
>  arch/powerpc/mm/numa.c    |    8 ++-
>  arch/sh/kernel/setup.c    |    4 +-
>  arch/sparc/mm/init_64.c   |    5 +-
>  arch/x86/mm/init_32.c     |    2 +-
>  arch/x86/mm/init_64.c     |    2 +-
>  arch/x86/mm/numa.c        |   63 +++++++++++++++++++++--
>  arch/x86/mm/srat.c        |    5 ++
>  include/linux/memblock.h  |   39 ++++++++++++++-
>  mm/memblock.c             |  123 ++++++++++++++++++++++++++++++++++++++-------
>  mm/memory_hotplug.c       |    1 +
>  mm/page_alloc.c           |   28 ++++++++++-
>  15 files changed, 252 insertions(+), 39 deletions(-)
> 


-- 
Thanks.
Zhang Yanfei


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RESEND part2 v2 6/8] acpi, numa, mem_hotplug: Mark all nodes the kernel resides un-hotpluggable
  2013-12-03  2:28 ` [PATCH RESEND part2 v2 6/8] acpi, numa, mem_hotplug: Mark all nodes the kernel resides un-hotpluggable Zhang Yanfei
@ 2013-12-03 23:44   ` Andrew Morton
  2013-12-04  2:09     ` [PATCH update " Zhang Yanfei
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2013-12-03 23:44 UTC (permalink / raw)
  To: Zhang Yanfei
  Cc: Tejun Heo, Rafael J . Wysocki, Len Brown, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel, jweiner, Prarit Bhargava, linux-kernel, Linux MM,
	Chen Tang, Tang Chen, Zhang Yanfei

On Tue, 03 Dec 2013 10:28:13 +0800 Zhang Yanfei <zhangyanfei@cn.fujitsu.com> wrote:

> From: Tang Chen <tangchen@cn.fujitsu.com>
> 
> At very early time, the kernel have to use some memory such as
> loading the kernel image. We cannot prevent this anyway. So any
> node the kernel resides in should be un-hotpluggable.
> 
> @@ -555,6 +563,30 @@ static void __init numa_init_array(void)
>  	}
>  }
>  
> +static void __init numa_clear_kernel_node_hotplug(void)
> +{
> +	int i, nid;
> +	nodemask_t numa_kernel_nodes;
> +	unsigned long start, end;
> +	struct memblock_type *type = &memblock.reserved;
> +
> +	/* Mark all kernel nodes. */
> +	for (i = 0; i < type->cnt; i++)
> +		node_set(type->regions[i].nid, numa_kernel_nodes);
> +
> +	/* Clear MEMBLOCK_HOTPLUG flag for memory in kernel nodes. */
> +	for (i = 0; i < numa_meminfo.nr_blks; i++) {
> +		nid = numa_meminfo.blk[i].nid;
> +		if (!node_isset(nid, numa_kernel_nodes))
> +			continue;
> +
> +		start = numa_meminfo.blk[i].start;
> +		end = numa_meminfo.blk[i].end;
> +
> +		memblock_clear_hotplug(start, end - start);
> +	}
> +}

Shouldn't numa_kernel_nodes be initialized?
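
For context: nodemask_t is a fixed-size bitmap, and as an automatic variable
its bits start out indeterminate, so stray bits could make random nodes look
like kernel nodes. A minimal stand-alone sketch of the bitmap pattern with
explicit initialization (toy names, not the kernel's nodemask API):

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_NUMNODES 64
#define TOY_BITS_PER_LONG (8 * sizeof(unsigned long))

/* Minimal stand-in for the kernel's nodemask_t bitmap. */
struct toy_nodemask {
	unsigned long bits[TOY_MAX_NUMNODES / (8 * sizeof(unsigned long))];
};

/* Zero every bit, like NODE_MASK_NONE / nodes_clear(). */
static void toy_nodes_clear(struct toy_nodemask *m)
{
	memset(m->bits, 0, sizeof(m->bits));
}

static void toy_node_set(int nid, struct toy_nodemask *m)
{
	m->bits[nid / TOY_BITS_PER_LONG] |= 1UL << (nid % TOY_BITS_PER_LONG);
}

static int toy_node_isset(int nid, const struct toy_nodemask *m)
{
	return !!(m->bits[nid / TOY_BITS_PER_LONG] &
		  (1UL << (nid % TOY_BITS_PER_LONG)));
}
```

Clearing the mask first guarantees that only explicitly set node IDs test
true, which is the point of the review comment above.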



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE
  2013-12-03  2:19 [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
                   ` (8 preceding siblings ...)
  2013-12-03  2:45 ` [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
@ 2013-12-03 23:48 ` Andrew Morton
  2013-12-04  0:02   ` Zhang Yanfei
  9 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2013-12-03 23:48 UTC (permalink / raw)
  To: Zhang Yanfei
  Cc: Tejun Heo, Rafael J . Wysocki, Len Brown, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel, jweiner, Prarit Bhargava, linux-kernel, Linux MM,
	Chen Tang, Tang Chen, Zhang Yanfei

On Tue, 03 Dec 2013 10:19:44 +0800 Zhang Yanfei <zhangyanfei@cn.fujitsu.com> wrote:

> The current Linux kernel cannot migrate pages it uses itself because
> of the kernel direct mapping: in Linux kernel space, va = pa + PAGE_OFFSET.
> When the pa is changed, we cannot simply update the page table and
> keep the va unmodified. So kernel pages are not migratable.
> 
> There are also other issues that make kernel pages unmigratable.
> For example, a physical address may be cached somewhere and used later;
> it is not easy to update all such caches.
> 
> When doing memory hotplug in Linux, we first migrate all the pages in one
> memory device somewhere else, and then remove the device. But pages
> used by the kernel are not migratable. As a result, memory used by
> the kernel cannot be hot-removed.
> 
> Modifying the kernel direct mapping mechanism is too difficult, and it
> may make the kernel slower and less stable. So we use the following
> approach to do memory hotplug instead.
> 
> 
> [What we are doing]
> 
> In Linux, memory in one numa node is divided into several zones. One of the
> zones is ZONE_MOVABLE, which the kernel won't use.
> 
> In order to implement memory hotplug in Linux, we are going to arrange all
> hotpluggable memory in ZONE_MOVABLE so that the kernel won't use that memory.

How does the user enable this?  I didn't spot a Kconfig variable which
enables it.  Is there a boot option?

Or is it always enabled?  If so, that seems incautious - if it breaks
in horrid ways we want people to be able to go back to the usual
behavior.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE
  2013-12-03 23:48 ` Andrew Morton
@ 2013-12-04  0:02   ` Zhang Yanfei
  2013-12-04  9:53     ` Ingo Molnar
  0 siblings, 1 reply; 24+ messages in thread
From: Zhang Yanfei @ 2013-12-04  0:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Zhang Yanfei, Tejun Heo, Rafael J . Wysocki, Len Brown,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Toshi Kani,
	Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi,
	Mel Gorman, Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis,
	lwoodman, Rik van Riel, jweiner, Prarit Bhargava, linux-kernel,
	Linux MM, Chen Tang, Tang Chen

Hello Andrew

On 12/04/2013 07:48 AM, Andrew Morton wrote:
> On Tue, 03 Dec 2013 10:19:44 +0800 Zhang Yanfei <zhangyanfei@cn.fujitsu.com> wrote:
> 
>> The current Linux cannot migrate pages used by the kerenl because
>> of the kernel direct mapping. In Linux kernel space, va = pa + PAGE_OFFSET.
>> When the pa is changed, we cannot simply update the pagetable and
>> keep the va unmodified. So the kernel pages are not migratable.
>>
>> There are also some other issues will cause the kernel pages not migratable.
>> For example, the physical address may be cached somewhere and will be used.
>> It is not to update all the caches.
>>
>> When doing memory hotplug in Linux, we first migrate all the pages in one
>> memory device somewhere else, and then remove the device. But if pages are
>> used by the kernel, they are not migratable. As a result, memory used by
>> the kernel cannot be hot-removed.
>>
>> Modifying the kernel direct mapping mechanism is too difficult, and it
>> could hurt kernel performance and stability. So we take the following
>> approach to memory hotplug instead.
>>
>>
>> [What we are doing]
>>
>> In Linux, memory in one NUMA node is divided into several zones. One of
>> the zones is ZONE_MOVABLE, which the kernel won't use.
>>
>> In order to implement memory hotplug in Linux, we are going to arrange all
>> hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this memory.
> 
> How does the user enable this?  I didn't spot a Kconfig variable which
> enables it.  Is there a boot option?

Yeah, there is a Kconfig variable "MOVABLE_NODE" and a boot option "movable_node".

mm/Kconfig

config MOVABLE_NODE
        boolean "Enable to assign a node which has only movable memory"
        ......
        default n
        help
          Allow a node to have only movable memory.  Pages used by the kernel,
          such as direct mapping pages cannot be migrated.  So the corresponding
          memory device cannot be hotplugged.  This option allows the following
          two things:
          - When the system is booting, node full of hotpluggable memory can 
          be arranged to have only movable memory so that the whole node can 
          be hot-removed. (need movable_node boot option specified).
          - After the system is up, the option allows users to online all the 
          memory of a node as movable memory so that the whole node can be
          hot-removed.

          Users who don't use the memory hotplug feature are fine with this
          option on since they don't specify movable_node boot option or they
          don't online memory as movable.

          Say Y here if you want to hotplug a whole node.
          Say N here if you want kernel to use memory on all nodes evenly.

And the movable_node boot option in DOC:

Documentation/kernel-parameters.txt

        movable_node    [KNL,X86] Boot-time switch to *enable* the effects
                        of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details.


> 
> Or is it always enabled?  If so, that seems incautious - if it breaks
> in horrid ways we want people to be able to go back to the usual
> behavior.
> 

-- 
Thanks.
Zhang Yanfei


* [PATCH update part2 v2 6/8] acpi, numa, mem_hotplug: Mark all nodes the kernel resides un-hotpluggable
  2013-12-03 23:44   ` Andrew Morton
@ 2013-12-04  2:09     ` Zhang Yanfei
  0 siblings, 0 replies; 24+ messages in thread
From: Zhang Yanfei @ 2013-12-04  2:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Rafael J . Wysocki, Len Brown, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel, jweiner, Prarit Bhargava, linux-kernel, Linux MM,
	Chen Tang, Tang Chen, Zhang Yanfei

On 12/04/2013 07:44 AM, Andrew Morton wrote:
> On Tue, 03 Dec 2013 10:28:13 +0800 Zhang Yanfei <zhangyanfei@cn.fujitsu.com> wrote:
> 
>> From: Tang Chen <tangchen@cn.fujitsu.com>
>>
>> Very early during boot, the kernel has to use some memory, for example
>> to load the kernel image. We cannot prevent this. So any node the
>> kernel resides in should be made un-hotpluggable.
>>
>> @@ -555,6 +563,30 @@ static void __init numa_init_array(void)
>>  	}
>>  }
>>  
>> +static void __init numa_clear_kernel_node_hotplug(void)
>> +{
>> +	int i, nid;
>> +	nodemask_t numa_kernel_nodes;
>> +	unsigned long start, end;
>> +	struct memblock_type *type = &memblock.reserved;
>> +
>> +	/* Mark all kernel nodes. */
>> +	for (i = 0; i < type->cnt; i++)
>> +		node_set(type->regions[i].nid, numa_kernel_nodes);
>> +
>> +	/* Clear MEMBLOCK_HOTPLUG flag for memory in kernel nodes. */
>> +	for (i = 0; i < numa_meminfo.nr_blks; i++) {
>> +		nid = numa_meminfo.blk[i].nid;
>> +		if (!node_isset(nid, numa_kernel_nodes))
>> +			continue;
>> +
>> +		start = numa_meminfo.blk[i].start;
>> +		end = numa_meminfo.blk[i].end;
>> +
>> +		memblock_clear_hotplug(start, end - start);
>> +	}
>> +}
> 
> Shouldn't numa_kernel_nodes be initialized?
> 

Ah, sorry for the mistake. Please use the updated patch below:

--------------------------------------------------
From: Tang Chen <tangchen@cn.fujitsu.com>
Date: Wed, 4 Dec 2013 09:37:26 +0800
Subject: [PATCH 6/8] acpi, numa, mem_hotplug: Mark all nodes the kernel resides un-hotpluggable

Very early during boot, the kernel has to use some memory, for example
to load the kernel image. We cannot prevent this. So any node the
kernel resides in should be made un-hotpluggable.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 arch/x86/mm/numa.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 408c02d..43eb7d4 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -494,6 +494,14 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 		struct numa_memblk *mb = &mi->blk[i];
 		memblock_set_node(mb->start, mb->end - mb->start,
 				  &memblock.memory, mb->nid);
+
+		/*
+		 * At this time, all memory regions reserved by memblock are
+		 * used by the kernel. Setting the nid in memblock.reserved
+		 * marks all the nodes the kernel resides in.
+		 */
+		memblock_set_node(mb->start, mb->end - mb->start,
+				  &memblock.reserved, mb->nid);
 	}
 
 	/*
@@ -555,6 +563,31 @@ static void __init numa_init_array(void)
 	}
 }
 
+static void __init numa_clear_kernel_node_hotplug(void)
+{
+	int i, nid;
+	nodemask_t numa_kernel_nodes;
+	unsigned long start, end;
+	struct memblock_type *type = &memblock.reserved;
+
+	nodes_clear(numa_kernel_nodes);
+	/* Mark all kernel nodes. */
+	for (i = 0; i < type->cnt; i++)
+		node_set(type->regions[i].nid, numa_kernel_nodes);
+
+	/* Clear MEMBLOCK_HOTPLUG flag for memory in kernel nodes. */
+	for (i = 0; i < numa_meminfo.nr_blks; i++) {
+		nid = numa_meminfo.blk[i].nid;
+		if (!node_isset(nid, numa_kernel_nodes))
+			continue;
+
+		start = numa_meminfo.blk[i].start;
+		end = numa_meminfo.blk[i].end;
+
+		memblock_clear_hotplug(start, end - start);
+	}
+}
+
 static int __init numa_init(int (*init_func)(void))
 {
 	int i;
@@ -569,6 +602,8 @@ static int __init numa_init(int (*init_func)(void))
 	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
 	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
 				  MAX_NUMNODES));
+	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
+				  MAX_NUMNODES));
 	/* In case that parsing SRAT failed. */
 	WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
 	numa_reset_distance();
@@ -606,6 +641,16 @@ static int __init numa_init(int (*init_func)(void))
 			numa_clear_node(i);
 	}
 	numa_init_array();
+
+	/*
+	 * Very early during boot, the kernel has to use some memory, for
+	 * example to load the kernel image. We cannot prevent this. So any
+	 * node the kernel resides in should be made un-hotpluggable.
+	 *
+	 * And by the time we get here, numa_init() cannot fail anymore.
+	 */
+	numa_clear_kernel_node_hotplug();
+
 	return 0;
 }
 
-- 
1.7.1


* Re: [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE
  2013-12-04  0:02   ` Zhang Yanfei
@ 2013-12-04  9:53     ` Ingo Molnar
  0 siblings, 0 replies; 24+ messages in thread
From: Ingo Molnar @ 2013-12-04  9:53 UTC (permalink / raw)
  To: Zhang Yanfei
  Cc: Andrew Morton, Zhang Yanfei, Tejun Heo, Rafael J . Wysocki,
	Len Brown, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Toshi Kani, Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi,
	Mel Gorman, Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis,
	lwoodman, Rik van Riel, jweiner, Prarit Bhargava, linux-kernel,
	Linux MM, Chen Tang, Tang Chen


* Zhang Yanfei <zhangyanfei.yes@gmail.com> wrote:

> Hello Andrew
> 
> On 12/04/2013 07:48 AM, Andrew Morton wrote:
> > On Tue, 03 Dec 2013 10:19:44 +0800 Zhang Yanfei <zhangyanfei@cn.fujitsu.com> wrote:
> > 
> >> Linux currently cannot migrate pages used by the kernel because
> >> of the kernel direct mapping: in kernel space, va = pa + PAGE_OFFSET.
> >> When the pa changes, we cannot simply update the page table and
> >> keep the va unmodified, so kernel pages are not migratable.
> >>
> >> There are also other issues that make kernel pages non-migratable.
> >> For example, a physical address may be cached somewhere for later use,
> >> and it is not feasible to update all such cached copies.
> >>
> >> When doing memory hotplug in Linux, we first migrate all the pages in one
> >> memory device somewhere else, and then remove the device. But if pages are
> >> used by the kernel, they are not migratable. As a result, memory used by
> >> the kernel cannot be hot-removed.
> >>
> >> Modifying the kernel direct mapping mechanism is too difficult, and it
> >> could hurt kernel performance and stability. So we take the following
> >> approach to memory hotplug instead.
> >>
> >>
> >> [What we are doing]
> >>
> >> In Linux, memory in one NUMA node is divided into several zones. One of
> >> the zones is ZONE_MOVABLE, which the kernel won't use.
> >>
> >> In order to implement memory hotplug in Linux, we are going to arrange all
> >> hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this memory.
> > 
> > How does the user enable this?  I didn't spot a Kconfig variable which
> > enables it.  Is there a boot option?
> 
> Yeah, there is a Kconfig variable "MOVABLE_NODE" and a boot option "movable_node".
> 
> mm/Kconfig
> 
> config MOVABLE_NODE

Some bikeshedding: I suspect 'movable nodes' is the right idiom to use 
here, unless the feature is restricted to a single node only.

So the option should be 'CONFIG_MOVABLE_NODES=y' and 
'movable_nodes=...'.

Thanks,

	Ingo


* Re: [PATCH RESEND part2 v2 8/8] x86, numa, acpi, memory-hotplug: Make movable_node have higher priority
  2013-12-03  2:30 ` [PATCH RESEND part2 v2 8/8] x86, numa, acpi, memory-hotplug: Make movable_node have higher priority Zhang Yanfei
@ 2014-01-16 17:03   ` Mel Gorman
  0 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2014-01-16 17:03 UTC (permalink / raw)
  To: Zhang Yanfei
  Cc: Andrew Morton, Tejun Heo, Len Brown, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Minchan Kim,
	mina86, gong.chen, Vasilis Liaskovitis, lwoodman, Rik van Riel,
	jweiner, Prarit Bhargava, linux-kernel, Linux MM, Chen Tang,
	Tang Chen, Zhang Yanfei

On Tue, Dec 03, 2013 at 10:30:23AM +0800, Zhang Yanfei wrote:
> From: Tang Chen <tangchen@cn.fujitsu.com>
> 
> If users specify the original movablecore=nn@ss boot option, the kernel will
> arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot option is similar
> except it specifies ZONE_NORMAL ranges.
> 
> Now, if users specify "movable_node" on the kernel command line, the kernel
> will arrange hotpluggable memory in SRAT as ZONE_MOVABLE. And if users do
> this, all the other movablecore=nn@ss and kernelcore=nn@ss options should
> be ignored.
> 
> For those who don't want this, just specify nothing. The kernel will act as
> before.
> 
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
> ---
>  mm/page_alloc.c |   28 ++++++++++++++++++++++++++--
>  1 files changed, 26 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dd886fa..768ea0e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5021,9 +5021,33 @@ static void __init find_zone_movable_pfns_for_nodes(void)
>  	nodemask_t saved_node_state = node_states[N_MEMORY];
>  	unsigned long totalpages = early_calculate_totalpages();
>  	int usable_nodes = nodes_weight(node_states[N_MEMORY]);
> +	struct memblock_type *type = &memblock.memory;
> +
> +	/* Need to find movable_zone earlier when movable_node is specified. */
> +	find_usable_zone_for_movable();
> +
> +	/*
> +	 * If movable_node is specified, ignore kernelcore and movablecore
> +	 * options.
> +	 */
> +	if (movable_node_is_enabled()) {
> +		for (i = 0; i < type->cnt; i++) {
> +			if (!memblock_is_hotpluggable(&type->regions[i]))
> +				continue;
> +
> +			nid = type->regions[i].nid;
> +
> +			usable_startpfn = PFN_DOWN(type->regions[i].base);
> +			zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
> +				min(usable_startpfn, zone_movable_pfn[nid]) :
> +				usable_startpfn;
> +		}
> +
> +		goto out2;

out2 is not the most descriptive label that ever existed. out_align?

There is an assumption here that the hot-pluggable regions of memory
are always at the upper end of the physical address space for that NUMA
node. What prevents the hardware from having something like

node0:	0-4G	Not removable
node0:	4-8G	Removable
node0:	8-12G	Not removable

?

By the looks of things, the current code would make ZONE_MOVABLE cover the
whole 4-12G range of memory even though the 8-12G region cannot be
hot-removed. That would compound any problems related to lowmem-like
pressure, as the 8-12G region cannot be used for kernel allocations like
inodes.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH RESEND part2 v2 1/8] x86: get pg_data_t's memory from other node
  2013-12-03  2:22 ` [PATCH RESEND part2 v2 1/8] x86: get pg_data_t's memory from other node Zhang Yanfei
@ 2014-01-16 17:11   ` Mel Gorman
  2014-01-17  0:15     ` H. Peter Anvin
  2014-01-20  7:29     ` Tang Chen
  0 siblings, 2 replies; 24+ messages in thread
From: Mel Gorman @ 2014-01-16 17:11 UTC (permalink / raw)
  To: Zhang Yanfei
  Cc: Andrew Morton, Tejun Heo, Len Brown, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Minchan Kim,
	mina86, gong.chen, Vasilis Liaskovitis, lwoodman, Rik van Riel,
	jweiner, Prarit Bhargava, linux-kernel, Linux MM, Chen Tang,
	Tang Chen, Zhang Yanfei

On Tue, Dec 03, 2013 at 10:22:00AM +0800, Zhang Yanfei wrote:
> From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> 
> If the system can create a movable node, in which all of the node's memory
> is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
> that node's pg_data_t from the node itself. So, invoke
> memblock_alloc_nid(...MAX_NUMNODES) again to retry when the first
> allocation fails. Otherwise, the system could fail to boot.
> (We don't use memblock_alloc_try_nid() to retry because that function
> panics the system if the allocation fails.)
> 

This implies that it is possible to have a configuration with a big ratio
difference between Normal:Movable memory. In such configurations there
would be a risk that the system will reclaim heavily or go OOM because
the kernel cannot allocate memory due to a relatively small Normal
zone. What protects against that? Is the user ever warned if the ratio
between Normal:Movable is very high? The movable_node boot parameter still
turns the feature on and off, but there appears to be no way of controlling
the ratio of memory other than booting with the minimum amount of memory
and manually hot-adding the sections to set the appropriate ratio.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH RESEND part2 v2 1/8] x86: get pg_data_t's memory from other node
  2014-01-16 17:11   ` Mel Gorman
@ 2014-01-17  0:15     ` H. Peter Anvin
  2014-01-20  7:29     ` Tang Chen
  1 sibling, 0 replies; 24+ messages in thread
From: H. Peter Anvin @ 2014-01-17  0:15 UTC (permalink / raw)
  To: Mel Gorman, Zhang Yanfei
  Cc: Andrew Morton, Tejun Heo, Len Brown, Thomas Gleixner,
	Ingo Molnar, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	Yasuaki Ishimatsu, Taku Izumi, Minchan Kim, mina86, gong.chen,
	Vasilis Liaskovitis, lwoodman, Rik van Riel, jweiner,
	Prarit Bhargava, linux-kernel, Linux MM, Chen Tang, Tang Chen,
	Zhang Yanfei

On 01/16/2014 09:11 AM, Mel Gorman wrote:
> This implies that it is possible to have a configuration with a big ratio
> difference between Normal:Movable memory.

In fact, one would expect that would be the norm.

> In such configurations there
> would be a risk that the system will reclaim heavily or go OOM because
> the kernel cannot allocate memory due to a relatively small Normal
> zone. What protects against that? Is the user ever warned if the ratio
> between Normal:Movable is very high? The movable_node boot parameter still
> turns the feature on and off, but there appears to be no way of controlling
> the ratio of memory other than booting with the minimum amount of memory
> and manually hot-adding the sections to set the appropriate ratio.

This is really the fundamental problem with this particular approach to
hotswap memory.

	-hpa




* Re: [PATCH RESEND part2 v2 1/8] x86: get pg_data_t's memory from other node
  2014-01-16 17:11   ` Mel Gorman
  2014-01-17  0:15     ` H. Peter Anvin
@ 2014-01-20  7:29     ` Tang Chen
  2014-01-20 15:14       ` Mel Gorman
  1 sibling, 1 reply; 24+ messages in thread
From: Tang Chen @ 2014-01-20  7:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Zhang Yanfei, Andrew Morton, Tejun Heo, Len Brown,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Toshi Kani,
	Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel, jweiner, Prarit Bhargava, linux-kernel, Linux MM,
	Chen Tang, Zhang Yanfei

Hi Mel,

On 01/17/2014 01:11 AM, Mel Gorman wrote:
> On Tue, Dec 03, 2013 at 10:22:00AM +0800, Zhang Yanfei wrote:
>> From: Yasuaki Ishimatsu<isimatu.yasuaki@jp.fujitsu.com>
>>
>> If the system can create a movable node, in which all of the node's memory
>> is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
>> that node's pg_data_t from the node itself. So, invoke
>> memblock_alloc_nid(...MAX_NUMNODES) again to retry when the first
>> allocation fails. Otherwise, the system could fail to boot.
>> (We don't use memblock_alloc_try_nid() to retry because that function
>> panics the system if the allocation fails.)
>>
>
> This implies that it is possible to have a configuration with a big ratio
> difference between Normal:Movable memory. In such configurations there
> would be a risk that the system will reclaim heavily or go OOM because
> the kernel cannot allocate memory due to a relatively small Normal
> zone. What protects against that? Is the user ever warned if the ratio
> between Normal:Movable is very high?

For now, there is no way to protect against this. But on a modern server,
it won't be that easy to run out of memory when booting, I think.

The current implementation will set any node the kernel resides in as
un-hotpluggable, which means it stays in the normal zone. And today's
servers, especially memory-hotplug-capable ones, would have at least 16GB
of memory per node, which is enough for the kernel to boot.

We can add a patch to fall back to the original path if we run out of
memory, which means turning off the functionality and warning users in
the log.

What do you think?

>  The movable_node boot parameter still
> turns the feature on and off, but there appears to be no way of controlling
> the ratio of memory other than booting with the minimum amount of memory
> and manually hot-adding the sections to set the appropriate ratio.

For now, yes. We expect firmware and hardware to give the basic ratio
(how much memory is hotpluggable), and the user decides how to arrange
the memory (i.e., the sizes of the normal zone and the movable zone).



* Re: [PATCH RESEND part2 v2 1/8] x86: get pg_data_t's memory from other node
  2014-01-20  7:29     ` Tang Chen
@ 2014-01-20 15:14       ` Mel Gorman
  2014-02-06 10:12         ` Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2014-01-20 15:14 UTC (permalink / raw)
  To: Tang Chen
  Cc: Zhang Yanfei, Andrew Morton, Tejun Heo, Len Brown,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Toshi Kani,
	Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel, jweiner, Prarit Bhargava, linux-kernel, Linux MM,
	Chen Tang, Zhang Yanfei

On Mon, Jan 20, 2014 at 03:29:41PM +0800, Tang Chen wrote:
> Hi Mel,
> 
> On 01/17/2014 01:11 AM, Mel Gorman wrote:
> >On Tue, Dec 03, 2013 at 10:22:00AM +0800, Zhang Yanfei wrote:
> >>From: Yasuaki Ishimatsu<isimatu.yasuaki@jp.fujitsu.com>
> >>
> >>If the system can create a movable node, in which all of the node's memory
> >>is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
> >>that node's pg_data_t from the node itself. So, invoke
> >>memblock_alloc_nid(...MAX_NUMNODES) again to retry when the first
> >>allocation fails. Otherwise, the system could fail to boot.
> >>(We don't use memblock_alloc_try_nid() to retry because that function
> >>panics the system if the allocation fails.)
> >>
> >
> >This implies that it is possible to have a configuration with a big ratio
> >difference between Normal:Movable memory. In such configurations there
> >would be a risk that the system will reclaim heavily or go OOM because
> >the kernel cannot allocate memory due to a relatively small Normal
> >zone. What protects against that? Is the user ever warned if the ratio
> >between Normal:Movable is very high?
> 
> For now, there is no way to protect against this. But on a modern
> server, it won't be that easy to run out of memory when booting, I think.
> 


Booting is a basic functional requirement and I'm more concerned about the
behaviour of the kernel when the machine is running.  If the kernel thrashes
heavily or goes OOM when a workload starts then the fact that the machine
booted is not much comfort.

> The current implementation will set any node the kernel resides in as
> un-hotpluggable, which means it stays in the normal zone. And today's
> servers, especially memory-hotplug-capable ones, would have at least
> 16GB of memory per node, which is enough for the kernel to boot.
> 

Again, booting is fine, but let's say it's an 8-node machine; then that
implies the Normal:Movable ratio will be 1:8. All page table pages, inodes,
dentries etc. will have to fit in that 1/8th of memory with all the associated
costs, including remote access penalties.  In extreme cases it may not be
possible to use all of memory because the management structures cannot be
allocated. Users may want the option of adjusting what this ratio is so
they can unplug some memory while not completely sacrificing performance.

Minimally, the kernel should print a big fat warning if the ratio is equal
to or worse than 1:3 Normal:Movable. That ratio selection is arbitrary. I do
not recall ever seeing any major Normal:Highmem bugs on 4G 32-bit machines,
so it is a conservative choice. The last Normal:Highmem bug I remember was
related to a 16G 32-bit machine
(https://bugzilla.kernel.org/show_bug.cgi?id=42578);
a 1:15 ratio feels very optimistic for a very large machine.

> We can add a patch to fall back to the original path if we run out of
> memory, which means turning off the functionality and warning users in
> the log.
> 
> What do you think?
> 

I think that will allow the machine to boot but that there still will be a
large number of bugs filed with these machines due to high Normal:Movable
ratios. The shape of the bug reports will be similar to the Normal:Highmem
ratio bugs that existed years ago.

> > The movable_node boot parameter still
> >turns the feature on and off, but there appears to be no way of controlling
> >the ratio of memory other than booting with the minimum amount of memory
> >and manually hot-adding the sections to set the appropriate ratio.
> 
> For now, yes. We expect firmware and hardware to give the basic ratio
> (how much memory is hotpluggable), and the user decides how to arrange
> the memory (i.e., the sizes of the normal zone and the movable zone).
> 

There seem to be big gaps in the configuration options here. The user
can either ask for the memory to be automatically assigned, with no control
over the ratio, or manually hot-add the memory, which is a relatively heavy
administrative burden.

I think they should be warned if the ratio is high and have an option of
specifying a ratio manually even if that means that additional nodes
will not be hot-removable.

This is all still a kludge around the fact that node memory hot-remove
did not try to cope with full migration by breaking some of the 1:1
virt:phys mapping assumptions when hot-remove was enabled.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH RESEND part2 v2 1/8] x86: get pg_data_t's memory from other node
  2014-01-20 15:14       ` Mel Gorman
@ 2014-02-06 10:12         ` Mel Gorman
  2014-02-10  5:44           ` Tang Chen
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2014-02-06 10:12 UTC (permalink / raw)
  To: Tang Chen
  Cc: Zhang Yanfei, Andrew Morton, Tejun Heo, Len Brown,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Toshi Kani,
	Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel, jweiner, Prarit Bhargava, linux-kernel, Linux MM,
	Chen Tang, Zhang Yanfei

Any comment on this or are the issues just going to be waved away?

On Mon, Jan 20, 2014 at 03:14:09PM +0000, Mel Gorman wrote:
> On Mon, Jan 20, 2014 at 03:29:41PM +0800, Tang Chen wrote:
> > Hi Mel,
> > 
> > On 01/17/2014 01:11 AM, Mel Gorman wrote:
> > >On Tue, Dec 03, 2013 at 10:22:00AM +0800, Zhang Yanfei wrote:
> > >>From: Yasuaki Ishimatsu<isimatu.yasuaki@jp.fujitsu.com>
> > >>
> > >>If the system can create a movable node, in which all of the node's memory
> > >>is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
> > >>that node's pg_data_t from the node itself. So, invoke
> > >>memblock_alloc_nid(...MAX_NUMNODES) again to retry when the first
> > >>allocation fails. Otherwise, the system could fail to boot.
> > >>(We don't use memblock_alloc_try_nid() to retry because that function
> > >>panics the system if the allocation fails.)
> > >>
> > >
> > >This implies that it is possible to have a configuration with a big ratio
> > >difference between Normal:Movable memory. In such configurations there
> > >would be a risk that the system will reclaim heavily or go OOM because
> > >the kernel cannot allocate memory due to a relatively small Normal
> > >zone. What protects against that? Is the user ever warned if the ratio
> > >between Normal:Movable is very high?
> > 
> > For now, there is no way to protect against this. But on a modern
> > server, it won't be that easy to run out of memory when booting, I think.
> > 
> 
> 
> Booting is a basic functional requirement and I'm more concerned about the
> behaviour of the kernel when the machine is running.  If the kernel thrashes
> heavily or goes OOM when a workload starts then the fact that the machine
> booted is not much comfort.
> 
> > The current implementation will set any node the kernel resides in as
> > un-hotpluggable, which means it stays in the normal zone. And today's
> > servers, especially memory-hotplug-capable ones, would have at least
> > 16GB of memory per node, which is enough for the kernel to boot.
> > 
> 
> Again, booting is fine, but let's say it's an 8-node machine; then that
> implies the Normal:Movable ratio will be 1:8. All page table pages, inodes,
> dentries etc. will have to fit in that 1/8th of memory with all the associated
> costs, including remote access penalties.  In extreme cases it may not be
> possible to use all of memory because the management structures cannot be
> allocated. Users may want the option of adjusting what this ratio is so
> they can unplug some memory while not completely sacrificing performance.
> 
> Minimally, the kernel should print a big fat warning if the ratio is equal
> to or worse than 1:3 Normal:Movable. That ratio selection is arbitrary. I do
> not recall ever seeing any major Normal:Highmem bugs on 4G 32-bit machines,
> so it is a conservative choice. The last Normal:Highmem bug I remember was
> related to a 16G 32-bit machine
> (https://bugzilla.kernel.org/show_bug.cgi?id=42578);
> a 1:15 ratio feels very optimistic for a very large machine.
> 
> > We can add a patch to fall back to the original path if we run out of
> > memory, which means turning off the functionality and warning users in
> > the log.
> > 
> > What do you think?
> > 
> 
> I think that will allow the machine to boot but that there still will be a
> large number of bugs filed with these machines due to high Normal:Movable
> ratios. The shape of the bug reports will be similar to the Normal:Highmem
> ratio bugs that existed years ago.
> 
> > > The movable_node boot parameter still
> > >turns the feature on and off, but there appears to be no way of controlling
> > >the ratio of memory other than booting with the minimum amount of memory
> > >and manually hot-adding the sections to set the appropriate ratio.
> > 
> > For now, yes. We expect firmware and hardware to give the basic ratio
> > (how much memory is hotpluggable), and the user decides how to arrange
> > the memory (i.e., the sizes of the normal zone and the movable zone).
> > 
> 
> There seem to be big gaps in the configuration options here. The user
> can either ask for the memory to be automatically assigned, with no control
> over the ratio, or manually hot-add the memory, which is a relatively heavy
> administrative burden.
> 
> I think they should be warned if the ratio is high and have an option of
> specifying a ratio manually even if that means that additional nodes
> will not be hot-removable.
> 
> This is all still a kludge around the fact that node memory hot-remove
> did not try and cope with full migration by breaking some of the 1:1
> virt:phys mapping assumptions when hot-remove was enabled.
> 

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: [PATCH RESEND part2 v2 1/8] x86: get pg_data_t's memory from other node
  2014-02-06 10:12         ` Mel Gorman
@ 2014-02-10  5:44           ` Tang Chen
  2014-02-11 11:08             ` Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: Tang Chen @ 2014-02-10  5:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Zhang Yanfei, Andrew Morton, Tejun Heo, Len Brown,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Toshi Kani,
	Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel, jweiner, Prarit Bhargava, linux-kernel, Linux MM,
	Chen Tang, Zhang Yanfei, Tang Chen

Hi Mel,

On 02/06/2014 06:12 PM, Mel Gorman wrote:
> Any comment on this or are the issues just going to be waved away?

Sorry for the delay.

>
......
>> Again, booting is fine but least say it's an 8-node machine then that
>> implies the Normal:Movable ratio will be 1:8. All page table pages, inode,
>> dentries etc will have to fit in that 1/8th of memory with all the associated
>> costs including remote access penalties.  In extreme cases it may not be
>> possible to use all of memory because the management structures cannot be
>> allocated. Users may want the option of adjusting what this ratio is so
>> they can unplug some memory while not completely sacrificing performance.
>>
>> Minimally, the kernel should print a big fat warning if the ratio is equal
>> or more than 1:3 Normal:Movable. That ratio selection is arbitrary. I do not
>> recall ever seeing any major Normal:Highmem bugs on 4G 32-bit machines so it
>> is a conservative choice. The last Normal:Highmem bug I remember was related
>> to a 16G 32-bit machine (https://bugzilla.kernel.org/show_bug.cgi?id=42578)
>> a 1:15 ratio feels very optimistic for a very large machine.
......
>>>
>>> For now, yes. We expect firmware and hardware to give the basic
>>> ratio (how much memory
>>> is hotpluggable), and the user decides how to arrange the memory
>>> (decide the size of
>>> normal zone and movable zone).
>>>
>>
>> There seems to be big gaps in the configuration options here. The user
>> can either ask it to be automatically assigned and have no control of
>> the ratio or manually hot-add the memory which is a relatively heavy
>> administrative burden.

Yes.

1. Automatic assignment is done by the movable_node boot option, which is
    the main work of this patch-set. It depends on SRAT (firmware).

2. Manual assignment has been possible since 2012, via the following patch-set:

    https://lkml.org/lkml/2012/8/6//113

    That patch-set allowed users to online memory as normal or movable, but it
    is not that easy to use. So I also think a user-space tool is needed,
    and I'm planning to work on one soon.
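As a concrete illustration of that interface, onlining a memory block into ZONE_MOVABLE from user space looks roughly like this (a sketch: the block number is made up for the example, and the write requires root on a kernel with memory hotplug enabled):

```shell
# Online memory block 32 as movable via the sysfs interface from the
# patch-set above; skips cleanly on machines without that block.
MEMBLK=/sys/devices/system/memory/memory32
if [ -w "$MEMBLK/state" ]; then
    echo online_movable > "$MEMBLK/state"
    cat "$MEMBLK/state"
else
    echo "memory block not present or not writable; skipping"
fi
```

A user-space tool would mainly automate walking all such memory blocks and choosing online_movable vs. online_kernel for each.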

>>
>> I think they should be warned if the ratio is high and have an option of
>> specifying a ratio manually even if that means that additional nodes
>> will not be hot-removable.

I think this is easy to do: provide an option for users to specify a
Normal:Movable ratio. Since that is a ratio rather than a physical address,
it would be easy to use.
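A minimal user-space sketch of such a check, assuming a hypothetical ratio string in normal:movable form (the option name does not exist; the warning threshold is the 1:3 value Mel suggested above):

```shell
# Parse a hypothetical "normal:movable" ratio string and warn when the
# Movable side exceeds 3x the Normal side (the 1:3 threshold discussed
# in this thread).
ratio="1:8"
normal=${ratio%%:*}
movable=${ratio##*:}
if [ "$movable" -ge $((3 * normal)) ]; then
    echo "WARNING: high Normal:Movable ratio ${normal}:${movable}"
fi
```

In the kernel the same comparison would sit next to wherever the ratio option is parsed, printing the "big fat warning" at boot.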

>>
>> This is all still a kludge around the fact that node memory hot-remove
>> did not try and cope with full migration by breaking some of the 1:1
>> virt:phys mapping assumptions when hot-remove was enabled.

As I said before, the current implementation can only be a temporary
solution for memory hotplug, since it would take us a lot of time to
deal with the 1:1 mapping.

But about "breaking some of the 1:1 mapping", could you please give me
a hint? I want to do it too, but I cannot see where to start.

Thanks.



* Re: [PATCH RESEND part2 v2 1/8] x86: get pg_data_t's memory from other node
  2014-02-10  5:44           ` Tang Chen
@ 2014-02-11 11:08             ` Mel Gorman
  2014-02-12  7:11               ` Tang Chen
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2014-02-11 11:08 UTC (permalink / raw)
  To: Tang Chen
  Cc: Zhang Yanfei, Andrew Morton, Tejun Heo, Len Brown,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Toshi Kani,
	Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel, jweiner, Prarit Bhargava, linux-kernel, Linux MM,
	Chen Tang, Zhang Yanfei

On Mon, Feb 10, 2014 at 01:44:37PM +0800, Tang Chen wrote:
> Hi Mel,
> 
> On 02/06/2014 06:12 PM, Mel Gorman wrote:
> >Any comment on this or are the issues just going to be waved away?
> 
> Sorry for the delay.
> 
> >
> ......
> >>Again, booting is fine but least say it's an 8-node machine then that
> >>implies the Normal:Movable ratio will be 1:8. All page table pages, inode,
> >>dentries etc will have to fit in that 1/8th of memory with all the associated
> >>costs including remote access penalties.  In extreme cases it may not be
> >>possible to use all of memory because the management structures cannot be
> >>allocated. Users may want the option of adjusting what this ratio is so
> >>they can unplug some memory while not completely sacrificing performance.
> >>
> >>Minimally, the kernel should print a big fat warning if the ratio is equal
> >>or more than 1:3 Normal:Movable. That ratio selection is arbitrary. I do not
> >>recall ever seeing any major Normal:Highmem bugs on 4G 32-bit machines so it
> >>is a conservative choice. The last Normal:Highmem bug I remember was related
> >>to a 16G 32-bit machine (https://bugzilla.kernel.org/show_bug.cgi?id=42578)
> >>a 1:15 ratio feels very optimistic for a very large machine.
> ......
> >>>
> >>>For now, yes. We expect firmware and hardware to give the basic
> >>>ratio (how much memory
> >>>is hotpluggable), and the user decides how to arrange the memory
> >>>(decide the size of
> >>>normal zone and movable zone).
> >>>
> >>
> >>There seems to be big gaps in the configuration options here. The user
> >>can either ask it to be automatically assigned and have no control of
> >>the ratio or manually hot-add the memory which is a relatively heavy
> >>administrative burden.
> 
> Yes.
> 
> 1. Automatically assigning is done by movable_node boot option,
> which is the
>    main work of this patch-set. It depends on SRAT (firmware).
> 

I know but I'm concerned that this means that the firmware can request a
setup with an insane Normal:Movable ratio.

> 2. Manually assigning has been done since 2012, by the following patch-set.
> 
>    https://lkml.org/lkml/2012/8/6//113
> 
>    This patch-set allowed users to online memory as normal or
> movable. But it
>    is not that easy to use. So, I also think an user space tool is needed.
>    And I'm planing to do this recently.
> 

Ok.

> >>
> >>I think they should be warned if the ratio is high and have an option of
> >>specifying a ratio manually even if that means that additional nodes
> >>will not be hot-removable.
> 
> I think this is easy to do, provide an option for users to specify a
> Normal:Movable ratio. This is not phys addr, and it is easy to use.
> 

Yes. It would even be some help if the parameter forced some NUMA nodes
to be Normal instead of Movable regardless of what SRAT says. There
still would be an administrative burden in discovering what nodes are
now pluggable but they must have been dealing with this already.
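For reference, mainline already has boot parameters that bound the Normal/Movable split independent of SRAT: kernelcore= caps how much memory is treated as non-movable (the remainder becomes ZONE_MOVABLE) and movablecore= is the inverse. A command-line fragment (the sizes are illustrative only):

```shell
# Example kernel command line combining automatic SRAT-based placement
# with an explicit cap on non-movable memory.  Whether kernelcore=
# interacts sanely with movable_node would need checking.
CMDLINE="movable_node kernelcore=4G"
echo "$CMDLINE"
```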

> >>
> >>This is all still a kludge around the fact that node memory hot-remove
> >>did not try and cope with full migration by breaking some of the 1:1
> >>virt:phys mapping assumptions when hot-remove was enabled.
> 
> I also said before, the implementation now can only be a temporary
> solution for memory hotplug since it would take us a lot of time to
> deal with 1:1 mapping thing.
> 
> But about "breaking some of the 1:1 mapping", would you please give me
> any hint of it ?  I want to do it too, but I cannot see where to start.
> 

Some hints on how it might be tackled were given back in November 2012
https://lkml.org/lkml/2012/11/29/190 but I never researched it in
detail.

-- 
Mel Gorman
SUSE Labs



* Re: [PATCH RESEND part2 v2 1/8] x86: get pg_data_t's memory from other node
  2014-02-11 11:08             ` Mel Gorman
@ 2014-02-12  7:11               ` Tang Chen
  0 siblings, 0 replies; 24+ messages in thread
From: Tang Chen @ 2014-02-12  7:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Zhang Yanfei, Andrew Morton, Tejun Heo, Len Brown,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Toshi Kani,
	Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel, jweiner, Prarit Bhargava, linux-kernel, Linux MM,
	Chen Tang, Zhang Yanfei

Hi Mel,

On 02/11/2014 07:08 PM, Mel Gorman wrote:
......
>>>> I think they should be warned if the ratio is high and have an option of
>>>> specifying a ratio manually even if that means that additional nodes
>>>> will not be hot-removable.
>>
>> I think this is easy to do, provide an option for users to specify a
>> Normal:Movable ratio. This is not phys addr, and it is easy to use.
>>
>
> Yes. It would even be some help if the parameter forced some NUMA nodes
> to be Normal instead of Movable regardless of what SRAT says. There
> still would be an administrative burden in discovering what nodes are
> now pluggable but they must have been dealing with this already.
>

OK, I will start this work, and send patches soon.

>>>>
>>>> This is all still a kludge around the fact that node memory hot-remove
>>>> did not try and cope with full migration by breaking some of the 1:1
>>>> virt:phys mapping assumptions when hot-remove was enabled.
>>
>> I also said before, the implementation now can only be a temporary
>> solution for memory hotplug since it would take us a lot of time to
>> deal with 1:1 mapping thing.
>>
>> But about "breaking some of the 1:1 mapping", would you please give me
>> any hint of it ?  I want to do it too, but I cannot see where to start.
>>
>
> Some hints on how it might be tackled were given back in November 2012
> https://lkml.org/lkml/2012/11/29/190 but I never researched it in
> detail.
>

Thank you very much. I will read it once more, and start by trying to
migrate some of the kernel pages.

Thanks.



end of thread, other threads:[~2014-02-12  7:08 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-03  2:19 [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
2013-12-03  2:22 ` [PATCH RESEND part2 v2 1/8] x86: get pg_data_t's memory from other node Zhang Yanfei
2014-01-16 17:11   ` Mel Gorman
2014-01-17  0:15     ` H. Peter Anvin
2014-01-20  7:29     ` Tang Chen
2014-01-20 15:14       ` Mel Gorman
2014-02-06 10:12         ` Mel Gorman
2014-02-10  5:44           ` Tang Chen
2014-02-11 11:08             ` Mel Gorman
2014-02-12  7:11               ` Tang Chen
2013-12-03  2:24 ` [PATCH RESEND part2 v2 2/8] memblock, numa: Introduce flag into memblock Zhang Yanfei
2013-12-03  2:25 ` [PATCH RESEND part2 v2 3/8] memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions Zhang Yanfei
2013-12-03  2:25 ` [PATCH RESEND part2 v2 4/8] memblock: Make memblock_set_node() support different memblock_type Zhang Yanfei
2013-12-03  2:27 ` [PATCH RESEND part2 v2 5/8] acpi, numa, mem_hotplug: Mark hotpluggable memory in memblock Zhang Yanfei
2013-12-03  2:28 ` [PATCH RESEND part2 v2 6/8] acpi, numa, mem_hotplug: Mark all nodes the kernel resides un-hotpluggable Zhang Yanfei
2013-12-03 23:44   ` Andrew Morton
2013-12-04  2:09     ` [PATCH update " Zhang Yanfei
2013-12-03  2:29 ` [PATCH RESEND part2 v2 7/8] memblock, mem_hotplug: Make memblock skip hotpluggable regions if needed Zhang Yanfei
2013-12-03  2:30 ` [PATCH RESEND part2 v2 8/8] x86, numa, acpi, memory-hotplug: Make movable_node have higher priority Zhang Yanfei
2014-01-16 17:03   ` Mel Gorman
2013-12-03  2:45 ` [PATCH RESEND part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE Zhang Yanfei
2013-12-03 23:48 ` Andrew Morton
2013-12-04  0:02   ` Zhang Yanfei
2013-12-04  9:53     ` Ingo Molnar
