linux-kernel.vger.kernel.org archive mirror
* [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
@ 2013-08-08 10:16 Tang Chen
  2013-08-08 10:16 ` [PATCH part5 1/7] x86: get pg_data_t's memory from other node Tang Chen
                   ` (8 more replies)
  0 siblings, 9 replies; 48+ messages in thread
From: Tang Chen @ 2013-08-08 10:16 UTC (permalink / raw)
  To: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

[Problem]

Currently, Linux cannot migrate pages used by the kernel because of the
kernel direct mapping. In kernel space, va = pa + PAGE_OFFSET, so when
the pa changes we cannot simply update the page table and keep the va
unmodified. As a result, kernel pages are not migratable.
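
As a rough illustration of the constraint above, the direct-mapping
arithmetic can be sketched in user space. The offset value and the helper
names (sketch_va, sketch_pa) are assumptions made for this example only;
they are not the kernel's real definitions.

```c
#include <stdint.h>

/* Assumed base of the direct mapping; the real value is arch-specific. */
#define SKETCH_PAGE_OFFSET 0xffff880000000000ULL

/* Direct mapping: the virtual address is a fixed offset from the physical one. */
static inline uint64_t sketch_va(uint64_t pa)
{
	return pa + SKETCH_PAGE_OFFSET;
}

/* Inverse of sketch_va(). */
static inline uint64_t sketch_pa(uint64_t va)
{
	return va - SKETCH_PAGE_OFFSET;
}
```

Because the va is computed from the pa, moving the data to a different
physical page necessarily changes the va that every user of the old
address holds.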

Other issues also make kernel pages unmigratable. For example, a physical
address may be cached somewhere and used later, and it is not practical to
update all such caches.

When doing memory hotplug in Linux, we first migrate all the pages in one
memory device somewhere else, and then remove the device. But if pages are
used by the kernel, they are not migratable. As a result, memory used by
the kernel cannot be hot-removed.

Modifying the kernel direct mapping mechanism is too difficult, and it may
degrade kernel performance and stability. So we take the following approach
to memory hotplug.


[What we are doing]

In Linux, memory in one NUMA node is divided into several zones. One of the
zones is ZONE_MOVABLE, which the kernel won't use.

In order to implement memory hotplug in Linux, we are going to arrange all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this memory.

To do this, we need ACPI's help.


[How we do this]

In ACPI, the SRAT (System Resource Affinity Table) contains NUMA info. The
memory affinity structures in the SRAT record every memory range in the
system, along with flags specifying whether each range is hotpluggable.
(Please refer to ACPI spec 5.0, section 5.2.16.)
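
A minimal sketch of how such a flags word could be tested is shown below.
The bit positions follow the SRAT Memory Affinity Structure (bit 0 is
Enabled, bit 1 is Hot Pluggable); the macro and function names are
hypothetical and are not the identifiers used by the kernel or ACPICA.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits of the SRAT Memory Affinity Structure (names are illustrative). */
#define SKETCH_SRAT_MEM_ENABLED       (1u << 0)
#define SKETCH_SRAT_MEM_HOT_PLUGGABLE (1u << 1)

/* True if an SRAT memory range is both enabled and marked hot-pluggable. */
static bool sketch_range_is_hotpluggable(uint32_t flags)
{
	/* A disabled entry describes no usable memory, so ignore it. */
	if (!(flags & SKETCH_SRAT_MEM_ENABLED))
		return false;
	return (flags & SKETCH_SRAT_MEM_HOT_PLUGGABLE) != 0;
}
```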

With the help of SRAT, we have to do the following two things to achieve our
goal:

1. When doing memory hot-add, allow users to arrange hotpluggable memory as
   ZONE_MOVABLE.
   (This has been done by the MOVABLE_NODE functionality in Linux.)

2. When the system is booting, prevent the bootmem allocator from allocating
   hotpluggable memory for the kernel before the memory initialization
   finishes.
   (This is what we are going to do. See below.)


[About this patch-set]

In the patches of the previous parts, we obtained the SRAT early enough, right
after memblock is ready. So this patch-set does the following things:

1. Improve memblock to support flags, which are used to indicate different
   memory types.

2. Mark all hotpluggable memory in memblock.memory[].

3. Make the default memblock allocator skip hotpluggable memory.

4. Introduce "movablenode" boot option to allow users to enable/disable this
   functionality.


Tang Chen (6):
  x86, numa, mem_hotplug: Skip all the regions the kernel resides in.
  memblock, numa: Introduce flag into memblock.
  memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark
    hotpluggable regions.
  memblock, mem_hotplug: Make memblock skip hotpluggable regions by
    default.
  mem-hotplug: Introduce movablenode boot option to {en|dis}able using
    SRAT.
  x86, numa, acpi, memory-hotplug: Make movablenode have higher
    priority.

Yasuaki Ishimatsu (1):
  x86: get pg_data_t's memory from other node

 Documentation/kernel-parameters.txt |   15 ++++++
 arch/x86/kernel/setup.c             |   10 +++-
 arch/x86/mm/numa.c                  |    5 +-
 include/linux/memblock.h            |   13 +++++
 include/linux/memory_hotplug.h      |    3 +
 mm/memblock.c                       |   92 +++++++++++++++++++++++++++++------
 mm/memory_hotplug.c                 |   56 +++++++++++++++++++++-
 mm/page_alloc.c                     |   31 +++++++++++-
 8 files changed, 201 insertions(+), 24 deletions(-)



* [PATCH part5 1/7] x86: get pg_data_t's memory from other node
  2013-08-08 10:16 [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE Tang Chen
@ 2013-08-08 10:16 ` Tang Chen
  2013-08-12 14:39   ` Tejun Heo
  2013-08-08 10:16 ` [PATCH part5 2/7] x86, numa, mem_hotplug: Skip all the regions the kernel resides in Tang Chen
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Tang Chen @ 2013-08-08 10:16 UTC (permalink / raw)
  To: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

If the system creates a movable node, in which all memory of the node is
allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for the
node's pg_data_t. So, use memblock_alloc_try_nid() instead of
memblock_alloc_nid() to retry when the first allocation fails. Otherwise,
the system could fail to boot.
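
The fallback idea can be sketched with a toy per-node pool. This is only
an illustration of "try the requested node, then any node"; the pool, the
constants, and the sketch_* helpers are assumptions and do not reproduce
the memblock implementation.

```c
#include <stddef.h>

#define SKETCH_NR_NODES 2
#define SKETCH_ANY_NODE (-1)

/* Toy model: bytes the early allocator may still hand out on each node. */
static size_t sketch_free_bytes[SKETCH_NR_NODES];

/* Returns the node the allocation came from, or -1 if nothing fits. */
static int sketch_alloc_nid(size_t size, int nid)
{
	for (int n = 0; n < SKETCH_NR_NODES; n++) {
		if (nid != SKETCH_ANY_NODE && nid != n)
			continue;
		if (sketch_free_bytes[n] >= size) {
			sketch_free_bytes[n] -= size;
			return n;
		}
	}
	return -1;
}

/* Prefer the requested node, then fall back to any node. */
static int sketch_alloc_try_nid(size_t size, int nid)
{
	int got = sketch_alloc_nid(size, nid);

	if (got < 0)
		got = sketch_alloc_nid(size, SKETCH_ANY_NODE);
	return got;
}
```

With a node that has no allocatable memory (as a movable node would look
to the early allocator), the strict variant fails while the "try" variant
places the data on another node.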

The node_data could be on a hotpluggable node, and so could the page tables
and vmemmap. But for now, doing so would break the memory hot-remove path.

A node could have several memory devices, and the device that holds the node
data should be hot-removed last. But at the NUMA level, we don't know which
memory_block (/sys/devices/system/node/nodeX/memoryXXX) belongs to which
memory device. We only have the node, so we can only do node hotplug.

But in virtualization, developers are now implementing memory hotplug in
qemu, which supports hotplug of a single memory device. So whole-node hotplug
will not satisfy virtualization users.

So in the end, we concluded that we'd better handle memory hotplug and the
node-local data (node data, page tables, vmemmap, ...) in two steps.
Please refer to https://lkml.org/lkml/2013/6/19/73

For now, we put the node_data of a movable node on another node, and will
improve this in the future.

In later patches, a boot option will be introduced to enable/disable this
functionality. If users disable it, the node_data will still be put on the
local node.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
---
 arch/x86/mm/numa.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8bf93ba..d532b6d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -209,10 +209,9 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
 	 * Allocate node data.  Try node-local memory and then any node.
 	 * Never allocate in DMA zone.
 	 */
-	nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
+	nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
 	if (!nd_pa) {
-		pr_err("Cannot find %zu bytes in node %d\n",
-		       nd_size, nid);
+		pr_err("Cannot find %zu bytes in any node\n", nd_size);
 		return;
 	}
 	nd = __va(nd_pa);
-- 
1.7.1



* [PATCH part5 2/7] x86, numa, mem_hotplug: Skip all the regions the kernel resides in.
  2013-08-08 10:16 [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE Tang Chen
  2013-08-08 10:16 ` [PATCH part5 1/7] x86: get pg_data_t's memory from other node Tang Chen
@ 2013-08-08 10:16 ` Tang Chen
  2013-08-08 10:16 ` [PATCH part5 3/7] memblock, numa: Introduce flag into memblock Tang Chen
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 48+ messages in thread
From: Tang Chen @ 2013-08-08 10:16 UTC (permalink / raw)
  To: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

Early in boot, memblock reserves some memory for the kernel, such as the
kernel code and data segments, the initrd file, and so on, which means the
kernel resides in these memory regions.

Even if these memory regions are hotpluggable, we should not mark them as
hotpluggable. Otherwise the kernel won't have enough memory to boot.

This patch finds out which memory regions the kernel resides in, and skips
them when finding all hotpluggable memory regions.
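
The check described above is a half-open interval overlap test against
every reserved region. A self-contained sketch follows; the struct and
function names are hypothetical stand-ins rather than the memblock types.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_region {
	uint64_t base;
	uint64_t size;
};

/* True if any reserved region intersects [base, base + length). */
static bool sketch_overlaps_reserved(const struct sketch_region *rsv,
				     size_t cnt, uint64_t base,
				     uint64_t length)
{
	for (size_t i = 0; i < cnt; i++) {
		uint64_t start = rsv[i].base;
		uint64_t end = rsv[i].base + rsv[i].size;

		/* Disjoint when one range ends before the other starts. */
		if (end <= base || start >= base + length)
			continue;
		return true;
	}
	return false;
}
```

Ranges that merely touch (one ends exactly where the other begins) do not
count as overlapping, which matches the [@base, @base + @length)
convention used in the patch.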

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 mm/memory_hotplug.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ef9ccf8..e63f947 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -31,6 +31,7 @@
 #include <linux/firmware-map.h>
 #include <linux/stop_machine.h>
 #include <linux/acpi.h>
+#include <linux/memblock.h>
 
 #include <asm/tlbflush.h>
 
@@ -93,6 +94,37 @@ static void release_memory_resource(struct resource *res)
 
 #ifdef CONFIG_ACPI_NUMA
 /**
+ * kernel_resides_in_region - Check if kernel resides in a memory region.
+ * @base: The base address of the memory region.
+ * @length: The length of the memory region.
+ *
+ * This function is used early in boot. It iterates memblock.reserved and
+ * checks if the kernel has used any memory in [@base, @base + @length).
+ *
+ * Return true if the kernel resides in the memory region, false otherwise.
+ */
+static bool __init kernel_resides_in_region(phys_addr_t base, u64 length)
+{
+	int i;
+	phys_addr_t start, end;
+	struct memblock_region *region;
+	struct memblock_type *reserved = &memblock.reserved;
+
+	for (i = 0; i < reserved->cnt; i++) {
+		region = &reserved->regions[i];
+
+		start = region->base;
+		end = region->base + region->size;
+		if (end <= base || start >= base + length)
+			continue;
+
+		return true;
+	}
+
+	return false;
+}
+
+/**
  * find_hotpluggable_memory - Find out hotpluggable memory from ACPI SRAT.
  *
  * This function did the following:
@@ -129,6 +161,16 @@ void __init find_hotpluggable_memory(void)
 
 	while (ACPI_SUCCESS(acpi_hotplug_mem_affinity(srat_vaddr, &base,
 						      &size, &offset))) {
+		/*
+		 * At early time, memblock will reserve some memory for the
+		 * kernel, such as the kernel code and data segments, initrd
+		 * file, and so on, which means the kernel resides in these
+		 * memory regions. These regions should not be hotpluggable.
+		 * So do not mark them as hotpluggable.
+		 */
+		if (kernel_resides_in_region(base, size))
+			continue;
+
 		/* Will mark hotpluggable memory regions here */
 	}
 
-- 
1.7.1



* [PATCH part5 3/7] memblock, numa: Introduce flag into memblock.
  2013-08-08 10:16 [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE Tang Chen
  2013-08-08 10:16 ` [PATCH part5 1/7] x86: get pg_data_t's memory from other node Tang Chen
  2013-08-08 10:16 ` [PATCH part5 2/7] x86, numa, mem_hotplug: Skip all the regions the kernel resides in Tang Chen
@ 2013-08-08 10:16 ` Tang Chen
  2013-08-08 10:16 ` [PATCH part5 4/7] memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions Tang Chen
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 48+ messages in thread
From: Tang Chen @ 2013-08-08 10:16 UTC (permalink / raw)
  To: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

There is no flag in memblock to describe what type the memory is.
Sometimes, we may use memblock to reserve some memory for special usage,
and we want to know what kind of memory it is. So we need a way to
differentiate memory for different usages.

In a hotplug environment, we want to reserve hotpluggable memory so the
kernel won't be able to use it. And when the system is up, we have to
free this hotpluggable memory to the buddy allocator. So we need to mark
this memory first.

In order to do so, we need to mark out this special memory in memblock.
In this patch, we introduce a new "flags" member into memblock_region:
   struct memblock_region {
           phys_addr_t base;
           phys_addr_t size;
           unsigned long flags;		/* This is new. */
   #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
           int nid;
   #endif
   };

This patch does the following things:
1) Add "flags" member to memblock_region.
2) Modify the prototypes of the following APIs:
	memblock_add_region()
	memblock_insert_region()
3) Add memblock_reserve_region() to support reserving memory with flags, and
   keep memblock_reserve()'s prototype unmodified.
4) Modify other APIs to support flags, but keep their prototypes unmodified.
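
One consequence of carrying flags per region is that neighbouring regions
may only be merged when their flags match. The sketch below illustrates
that rule on a plain sorted array; the types and the sketch_merge() name
are assumptions for illustration, not the memblock code itself.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sketch_region {
	uint64_t base;
	uint64_t size;
	unsigned long flags;
	int nid;
};

/* Merge neighbouring regions in a sorted array; returns the new count. */
static size_t sketch_merge(struct sketch_region *r, size_t cnt)
{
	size_t i = 0;

	while (i + 1 < cnt) {
		struct sketch_region *cur = &r[i];
		struct sketch_region *next = &r[i + 1];

		/* Keep regions apart unless contiguous, same node, same flags. */
		if (cur->base + cur->size != next->base ||
		    cur->nid != next->nid ||
		    cur->flags != next->flags) {
			i++;
			continue;
		}
		cur->size += next->size;
		/* Close the gap left by the absorbed region. */
		memmove(next, next + 1, (cnt - (i + 2)) * sizeof(*r));
		cnt--;
	}
	return cnt;
}
```

Without the flags comparison, a flagged region adjacent to an unflagged
one on the same node would be folded together and the marking would be
lost.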

The idea is from Wen Congyang <wency@cn.fujitsu.com> and Liu Jiang <jiang.liu@huawei.com>.

v1 -> v2:
As tj suggested, a zero flag MEMBLK_DEFAULT would confuse users. If we
want to specify any other flag, such as MEMBLK_HOTPLUG, users don't know
whether to use MEMBLK_DEFAULT | MEMBLK_HOTPLUG or just MEMBLK_HOTPLUG. So
remove MEMBLK_DEFAULT (which is 0), and just use 0 by default to avoid
confusing users.

Suggested-by: Wen Congyang <wency@cn.fujitsu.com>
Suggested-by: Liu Jiang <jiang.liu@huawei.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 include/linux/memblock.h |    1 +
 mm/memblock.c            |   53 +++++++++++++++++++++++++++++++++-------------
 2 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f388203..e89e0cd 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -22,6 +22,7 @@
 struct memblock_region {
 	phys_addr_t base;
 	phys_addr_t size;
+	unsigned long flags;
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	int nid;
 #endif
diff --git a/mm/memblock.c b/mm/memblock.c
index a847bfe..0841a6e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -157,6 +157,7 @@ static void __init_memblock memblock_remove_region(struct memblock_type *type, u
 		type->cnt = 1;
 		type->regions[0].base = 0;
 		type->regions[0].size = 0;
+		type->regions[0].flags = 0;
 		memblock_set_region_node(&type->regions[0], MAX_NUMNODES);
 	}
 }
@@ -307,7 +308,8 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)
 
 		if (this->base + this->size != next->base ||
 		    memblock_get_region_node(this) !=
-		    memblock_get_region_node(next)) {
+		    memblock_get_region_node(next) ||
+		    this->flags != next->flags) {
 			BUG_ON(this->base + this->size > next->base);
 			i++;
 			continue;
@@ -327,13 +329,15 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)
  * @base:	base address of the new region
  * @size:	size of the new region
  * @nid:	node id of the new region
+ * @flags:	flags of the new region
  *
  * Insert new memblock region [@base,@base+@size) into @type at @idx.
  * @type must already have extra room to accomodate the new region.
  */
 static void __init_memblock memblock_insert_region(struct memblock_type *type,
 						   int idx, phys_addr_t base,
-						   phys_addr_t size, int nid)
+						   phys_addr_t size,
+						   int nid, unsigned long flags)
 {
 	struct memblock_region *rgn = &type->regions[idx];
 
@@ -341,6 +345,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
 	memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));
 	rgn->base = base;
 	rgn->size = size;
+	rgn->flags = flags;
 	memblock_set_region_node(rgn, nid);
 	type->cnt++;
 	type->total_size += size;
@@ -352,6 +357,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
  * @base: base address of the new region
  * @size: size of the new region
  * @nid: nid of the new region
+ * @flags: flags of the new region
  *
  * Add new memblock region [@base,@base+@size) into @type.  The new region
  * is allowed to overlap with existing ones - overlaps don't affect already
@@ -362,7 +368,8 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
  * 0 on success, -errno on failure.
  */
 static int __init_memblock memblock_add_region(struct memblock_type *type,
-				phys_addr_t base, phys_addr_t size, int nid)
+				phys_addr_t base, phys_addr_t size,
+				int nid, unsigned long flags)
 {
 	bool insert = false;
 	phys_addr_t obase = base;
@@ -377,6 +384,7 @@ static int __init_memblock memblock_add_region(struct memblock_type *type,
 		WARN_ON(type->cnt != 1 || type->total_size);
 		type->regions[0].base = base;
 		type->regions[0].size = size;
+		type->regions[0].flags = flags;
 		memblock_set_region_node(&type->regions[0], nid);
 		type->total_size = size;
 		return 0;
@@ -407,7 +415,8 @@ repeat:
 			nr_new++;
 			if (insert)
 				memblock_insert_region(type, i++, base,
-						       rbase - base, nid);
+						       rbase - base, nid,
+						       flags);
 		}
 		/* area below @rend is dealt with, forget about it */
 		base = min(rend, end);
@@ -417,7 +426,8 @@ repeat:
 	if (base < end) {
 		nr_new++;
 		if (insert)
-			memblock_insert_region(type, i, base, end - base, nid);
+			memblock_insert_region(type, i, base, end - base,
+					       nid, flags);
 	}
 
 	/*
@@ -439,12 +449,13 @@ repeat:
 int __init_memblock memblock_add_node(phys_addr_t base, phys_addr_t size,
 				       int nid)
 {
-	return memblock_add_region(&memblock.memory, base, size, nid);
+	return memblock_add_region(&memblock.memory, base, size, nid, 0);
 }
 
 int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size)
 {
-	return memblock_add_region(&memblock.memory, base, size, MAX_NUMNODES);
+	return memblock_add_region(&memblock.memory, base, size,
+				   MAX_NUMNODES, 0);
 }
 
 /**
@@ -499,7 +510,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
 			rgn->size -= base - rbase;
 			type->total_size -= base - rbase;
 			memblock_insert_region(type, i, rbase, base - rbase,
-					       memblock_get_region_node(rgn));
+					       memblock_get_region_node(rgn),
+					       rgn->flags);
 		} else if (rend > end) {
 			/*
 			 * @rgn intersects from above.  Split and redo the
@@ -509,7 +521,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
 			rgn->size -= end - rbase;
 			type->total_size -= end - rbase;
 			memblock_insert_region(type, i--, rbase, end - rbase,
-					       memblock_get_region_node(rgn));
+					       memblock_get_region_node(rgn),
+					       rgn->flags);
 		} else {
 			/* @rgn is fully contained, record it */
 			if (!*end_rgn)
@@ -551,16 +564,24 @@ int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size)
 	return __memblock_remove(&memblock.reserved, base, size);
 }
 
-int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+static int __init_memblock memblock_reserve_region(phys_addr_t base,
+						   phys_addr_t size,
+						   int nid,
+						   unsigned long flags)
 {
 	struct memblock_type *_rgn = &memblock.reserved;
 
-	memblock_dbg("memblock_reserve: [%#016llx-%#016llx] %pF\n",
+	memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
 		     (unsigned long long)base,
 		     (unsigned long long)base + size,
-		     (void *)_RET_IP_);
+		     flags, (void *)_RET_IP_);
+
+	return memblock_add_region(_rgn, base, size, nid, flags);
+}
 
-	return memblock_add_region(_rgn, base, size, MAX_NUMNODES);
+int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+{
+	return memblock_reserve_region(base, size, MAX_NUMNODES, 0);
 }
 
 /**
@@ -985,6 +1006,7 @@ void __init_memblock memblock_set_current_limit(phys_addr_t limit)
 static void __init_memblock memblock_dump(struct memblock_type *type, char *name)
 {
 	unsigned long long base, size;
+	unsigned long flags;
 	int i;
 
 	pr_info(" %s.cnt  = 0x%lx\n", name, type->cnt);
@@ -995,13 +1017,14 @@ static void __init_memblock memblock_dump(struct memblock_type *type, char *name
 
 		base = rgn->base;
 		size = rgn->size;
+		flags = rgn->flags;
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 		if (memblock_get_region_node(rgn) != MAX_NUMNODES)
 			snprintf(nid_buf, sizeof(nid_buf), " on node %d",
 				 memblock_get_region_node(rgn));
 #endif
-		pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s\n",
-			name, i, base, base + size - 1, size, nid_buf);
+		pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s flags: %#lx\n",
+			name, i, base, base + size - 1, size, nid_buf, flags);
 	}
 }
 
-- 
1.7.1



* [PATCH part5 4/7] memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions.
  2013-08-08 10:16 [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE Tang Chen
                   ` (2 preceding siblings ...)
  2013-08-08 10:16 ` [PATCH part5 3/7] memblock, numa: Introduce flag into memblock Tang Chen
@ 2013-08-08 10:16 ` Tang Chen
  2013-08-08 10:16 ` [PATCH part5 5/7] memblock, mem_hotplug: Make memblock skip hotpluggable regions by default Tang Chen
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 48+ messages in thread
From: Tang Chen @ 2013-08-08 10:16 UTC (permalink / raw)
  To: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

In find_hotpluggable_memory(), once we find a memory region that is
hotpluggable, we want to mark it in memblock.memory, so that we can later
keep the memblock allocator from allocating hotpluggable memory for the
kernel.

To achieve this goal, we introduce the MEMBLOCK_HOTPLUG flag to indicate the
hotpluggable memory regions in memblock, and a function
memblock_mark_hotplug() to mark hotpluggable memory when we find it.
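
Marking a range generally means splitting existing regions at the range
boundaries first, so that the flag covers exactly the requested bytes. The
following is a simplified sketch of that idea under stated assumptions: a
fixed-size array, hypothetical sketch_* names, flags OR-ed in rather than
overwritten, and no final merge step.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_HOTPLUG 0x1UL
#define SKETCH_MAX_REGIONS 16

struct sketch_region {
	uint64_t base;
	uint64_t size;
	unsigned long flags;
};

struct sketch_type {
	size_t cnt;
	struct sketch_region regions[SKETCH_MAX_REGIONS];
};

/* Split the region at index idx so that a new region starts at addr. */
static int sketch_split(struct sketch_type *t, size_t idx, uint64_t addr)
{
	struct sketch_region *r = &t->regions[idx];

	if (t->cnt == SKETCH_MAX_REGIONS)
		return -1;
	/* Duplicate the region, then trim both copies around addr. */
	memmove(r + 1, r, (t->cnt - idx) * sizeof(*r));
	r[0].size = addr - r[0].base;
	r[1].size -= r[0].size;
	r[1].base = addr;
	t->cnt++;
	return 0;
}

/* Mark [base, base + size) as hotpluggable, splitting at the edges. */
static int sketch_mark_hotplug(struct sketch_type *t, uint64_t base,
			       uint64_t size)
{
	uint64_t end = base + size;

	for (size_t i = 0; i < t->cnt; i++) {
		struct sketch_region *r = &t->regions[i];
		uint64_t rend = r->base + r->size;

		if (rend <= base || r->base >= end)
			continue;	/* no intersection */
		if (r->base < base) {
			/* Cut off the lower part; the upper half is visited next. */
			if (sketch_split(t, i, base))
				return -1;
			continue;
		}
		if (rend > end && sketch_split(t, i, end))
			return -1;
		t->regions[i].flags |= SKETCH_HOTPLUG;
	}
	return 0;
}
```

Marking the middle of a single region therefore leaves three regions, of
which only the middle one carries the flag.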

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 include/linux/memblock.h |   11 +++++++++++
 mm/memblock.c            |   26 ++++++++++++++++++++++++++
 mm/memory_hotplug.c      |    3 ++-
 3 files changed, 39 insertions(+), 1 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index e89e0cd..c0bd31c 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -19,6 +19,9 @@
 
 #define INIT_MEMBLOCK_REGIONS	128
 
+/* Definition of memblock flags. */
+#define MEMBLOCK_HOTPLUG	0x1	/* hotpluggable region */
+
 struct memblock_region {
 	phys_addr_t base;
 	phys_addr_t size;
@@ -60,6 +63,8 @@ int memblock_free(phys_addr_t base, phys_addr_t size);
 int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
 
+int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
 			  unsigned long *out_end_pfn, int *out_nid);
@@ -119,6 +124,12 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
 	     i != (u64)ULLONG_MAX;					\
 	     __next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid))
 
+static inline void memblock_set_region_flags(struct memblock_region *r,
+					     unsigned long flags)
+{
+	r->flags = flags;
+}
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);
 
diff --git a/mm/memblock.c b/mm/memblock.c
index 0841a6e..ecd8568 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -585,6 +585,32 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
 }
 
 /**
+ * memblock_mark_hotplug - Mark hotpluggable memory with flag MEMBLOCK_HOTPLUG.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * This function isolates region [@base, @base + @size), and marks it with flag
+ * MEMBLOCK_HOTPLUG.
+ *
+ * Return 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
+{
+	struct memblock_type *type = &memblock.memory;
+	int i, ret, start_rgn, end_rgn;
+
+	ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
+	if (ret)
+		return ret;
+
+	for (i = start_rgn; i < end_rgn; i++)
+		memblock_set_region_flags(&type->regions[i], MEMBLOCK_HOTPLUG);
+
+	memblock_merge_regions(type);
+	return 0;
+}
+
+/**
  * __next_free_mem_range - next function for for_each_free_mem_range()
  * @idx: pointer to u64 loop variable
  * @nid: node selector, %MAX_NUMNODES for all nodes
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e63f947..e4db758 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -171,7 +171,8 @@ void __init find_hotpluggable_memory(void)
 		if (kernel_resides_in_region(base, size))
 			continue;
 
-		/* Will mark hotpluggable memory regions here */
+		/* Mark hotpluggable memory regions in memblock.memory */
+		memblock_mark_hotplug(base, size);
 	}
 
 	early_iounmap(srat_vaddr, length);
-- 
1.7.1



* [PATCH part5 5/7] memblock, mem_hotplug: Make memblock skip hotpluggable regions by default.
  2013-08-08 10:16 [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE Tang Chen
                   ` (3 preceding siblings ...)
  2013-08-08 10:16 ` [PATCH part5 4/7] memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions Tang Chen
@ 2013-08-08 10:16 ` Tang Chen
  2013-08-14 21:54   ` Naoya Horiguchi
  2013-08-08 10:16 ` [PATCH part5 6/7] mem-hotplug: Introduce movablenode boot option to {en|dis}able using SRAT Tang Chen
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Tang Chen @ 2013-08-08 10:16 UTC (permalink / raw)
  To: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

The Linux kernel cannot migrate pages used by the kernel itself. As a result,
hotpluggable memory used by the kernel cannot be hot-removed. To solve this
problem, the basic idea is to prevent memblock from allocating hotpluggable
memory for the kernel early in boot, and to arrange all hotpluggable memory
recorded in the ACPI SRAT (System Resource Affinity Table) as ZONE_MOVABLE
when initializing zones.

In the previous patches, we marked hotpluggable memory regions with the
MEMBLOCK_HOTPLUG flag in memblock.memory.

In this patch, we make memblock skip these hotpluggable memory regions in
the default allocation function.

memblock_find_in_range_node()
  |-->for_each_free_mem_range_reverse()
        |-->__next_free_mem_range_rev()

The above is the only place where __next_free_mem_range_rev() is used. So
skip hotpluggable memory regions when iterating memblock.memory to find
free memory.

In later patches, a boot option named "movablenode" will be introduced
to enable/disable using the SRAT to arrange ZONE_MOVABLE.

NOTE: This check is always done. That is OK because if users don't specify
      the movablenode option, no hotpluggable memory is marked. So this
      check won't skip any memory, which means the kernel will act as before.
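
The effect of the check can be sketched with a simplified top-down search
over a sorted region array. This sketch ignores reserved regions and
alignment, and all names in it are hypothetical; it only shows that
flagged regions are passed over and that nothing changes when no region
is flagged.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HOTPLUG 0x1UL

struct sketch_region {
	uint64_t base;
	uint64_t size;
	unsigned long flags;
};

/*
 * Walk the regions from the highest address down and return the base of a
 * block of `size` bytes at the top of the first usable region, or 0 if
 * no region qualifies.
 */
static uint64_t sketch_find_top_down(const struct sketch_region *mem,
				     size_t cnt, uint64_t size)
{
	for (size_t i = cnt; i-- > 0; ) {
		const struct sketch_region *m = &mem[i];

		/* skip hotpluggable memory regions */
		if (m->flags & SKETCH_HOTPLUG)
			continue;
		if (m->size >= size)
			return m->base + m->size - size;
	}
	return 0;
}
```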

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 mm/memblock.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index ecd8568..3ea4301 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -695,6 +695,10 @@ void __init_memblock __next_free_mem_range(u64 *idx, int nid,
  * @out_nid: ptr to int for nid of the range, can be %NULL
  *
  * Reverse of __next_free_mem_range().
+ *
+ * Linux kernel cannot migrate pages used by itself. Memory hotplug users won't
+ * be able to hot-remove hotpluggable memory used by the kernel. So this
+ * function skips hotpluggable regions when allocating memory for the kernel.
  */
 void __init_memblock __next_free_mem_range_rev(u64 *idx, int nid,
 					   phys_addr_t *out_start,
@@ -719,6 +723,10 @@ void __init_memblock __next_free_mem_range_rev(u64 *idx, int nid,
 		if (nid != MAX_NUMNODES && nid != memblock_get_region_node(m))
 			continue;
 
+		/* skip hotpluggable memory regions */
+		if (m->flags & MEMBLOCK_HOTPLUG)
+			continue;
+
 		/* scan areas before each reservation for intersection */
 		for ( ; ri >= 0; ri--) {
 			struct memblock_region *r = &rsv->regions[ri];
-- 
1.7.1



* [PATCH part5 6/7] mem-hotplug: Introduce movablenode boot option to {en|dis}able using SRAT.
  2013-08-08 10:16 [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE Tang Chen
                   ` (4 preceding siblings ...)
  2013-08-08 10:16 ` [PATCH part5 5/7] memblock, mem_hotplug: Make memblock skip hotpluggable regions by default Tang Chen
@ 2013-08-08 10:16 ` Tang Chen
  2013-08-08 10:16 ` [PATCH part5 7/7] x86, numa, acpi, memory-hotplug: Make movablenode have higher priority Tang Chen
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 48+ messages in thread
From: Tang Chen @ 2013-08-08 10:16 UTC (permalink / raw)
  To: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

The Hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel,
it cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE, so by doing this, the
kernel cannot use memory in movable nodes. This will degrade NUMA
performance, and other users may be unhappy.

So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movablenode boot option to allow users to
choose to reserve hotpluggable memory and set it as ZONE_MOVABLE or not.

Users can specify "movablenode" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.
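
For illustration only, detecting a bare option such as "movablenode" on a
command line can be sketched in user space as a whole-word search. This
is not how the kernel parses its parameters; the function name and the
space-separated format handled here are assumptions.

```c
#include <stdbool.h>
#include <string.h>

/* True if the bare token `opt` appears as a whole word in the command line. */
static bool sketch_cmdline_has(const char *cmdline, const char *opt)
{
	size_t len = strlen(opt);
	const char *p = cmdline;

	if (len == 0)
		return false;

	while ((p = strstr(p, opt)) != NULL) {
		/* Accept only matches delimited by spaces or string ends. */
		bool starts = (p == cmdline) || (p[-1] == ' ');
		bool ends = (p[len] == '\0') || (p[len] == ' ');

		if (starts && ends)
			return true;
		p += len;
	}
	return false;
}
```

When the token is absent, the result is false and, in the scheme
described above, no hotpluggable memory would be marked.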

Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 Documentation/kernel-parameters.txt |   15 +++++++++++++++
 arch/x86/kernel/setup.c             |   10 ++++++++--
 include/linux/memory_hotplug.h      |    3 +++
 mm/memory_hotplug.c                 |   11 +++++++++++
 4 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 15356ac..7349d1f 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1718,6 +1718,21 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			that the amount of memory usable for all allocations
 			is not too small.
 
+	movablenode		[KNL,X86] This parameter enables/disables the
+			kernel to arrange hotpluggable memory ranges recorded
+			in ACPI SRAT(System Resource Affinity Table) as
+			ZONE_MOVABLE. And these memory can be hot-removed when
+			the system is up.
+			By specifying this option, all the hotpluggable memory
+			will be in ZONE_MOVABLE, which the kernel cannot use.
+			This will cause NUMA performance down. For users who
+			care about NUMA performance, just don't use it.
+			If all the memory ranges in the system are hotpluggable,
+			then the ones used by the kernel at early time, such as
+			kernel code and data segments, initrd file and so on,
+			won't be set as ZONE_MOVABLE, and won't be hotpluggable.
+			Otherwise the kernel won't have enough memory to boot.
+
 	MTD_Partition=	[MTD]
 			Format: <name>,<region-number>,<size>,<offset>
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 36d7fe8..abdfed7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1061,14 +1061,20 @@ void __init setup_arch(char **cmdline_p)
 	 */
 	early_acpi_boot_table_init();
 
-#ifdef CONFIG_ACPI_NUMA
+#if defined(CONFIG_ACPI_NUMA) && defined(CONFIG_MOVABLE_NODE)
 	/*
 	 * Linux kernel cannot migrate kernel pages, as a result, memory used
 	 * by the kernel cannot be hot-removed. Find and mark hotpluggable
 	 * memory in memblock to prevent memblock from allocating hotpluggable
 	 * memory for the kernel.
+	 *
+	 * If all the memory in a node is hotpluggable, then the kernel won't
+	 * be able to use memory on that node. This will cause NUMA performance
+	 * down. So by default, we don't reserve any hotpluggable memory. Users
+	 * may use "movablenode" boot option to enable this functionality.
 	 */
-	find_hotpluggable_memory();
+	if (movablenode_enable_srat)
+		find_hotpluggable_memory();
 #endif
 
 	/*
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 463efa9..43eb373 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -33,6 +33,9 @@ enum {
 	ONLINE_MOVABLE,
 };
 
+/* Enable/disable SRAT in movablenode boot option */
+extern bool movablenode_enable_srat;
+
 /*
  * pgdat resizing functions
  */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e4db758..65d7156 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -93,6 +93,17 @@ static void release_memory_resource(struct resource *res)
 }
 
 #ifdef CONFIG_ACPI_NUMA
+#ifdef CONFIG_MOVABLE_NODE
+bool __initdata movablenode_enable_srat;
+
+static int __init cmdline_parse_movablenode(char *p)
+{
+	movablenode_enable_srat = true;
+	return 0;
+}
+early_param("movablenode", cmdline_parse_movablenode);
+#endif	/* CONFIG_MOVABLE_NODE */
+
 /**
  * kernel_resides_in_range - Check if kernel resides in a memory region.
  * @base: The base address of the memory region.
-- 
1.7.1



* [PATCH part5 7/7] x86, numa, acpi, memory-hotplug: Make movablenode have higher priority.
  2013-08-08 10:16 [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE Tang Chen
                   ` (5 preceding siblings ...)
  2013-08-08 10:16 ` [PATCH part5 6/7] mem-hotplug: Introduce movablenode boot option to {en|dis}able using SRAT Tang Chen
@ 2013-08-08 10:16 ` Tang Chen
  2013-08-09 16:32 ` [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE Tejun Heo
  2013-08-12 14:50 ` Tejun Heo
  8 siblings, 0 replies; 48+ messages in thread
From: Tang Chen @ 2013-08-08 10:16 UTC (permalink / raw)
  To: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

Arranging hotpluggable memory as ZONE_MOVABLE degrades NUMA performance
because the kernel cannot use movable memory. Users who don't use memory
hotplug and don't want to lose NUMA performance need a way to disable this
functionality. So we improved the movablecore boot option.

If users specify the original movablecore=nn@ss boot option, the kernel will
arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot option is similar
except it specifies ZONE_NORMAL ranges.

Now, if users specify "movablenode" in kernel commandline, the kernel will
arrange hotpluggable memory in SRAT as ZONE_MOVABLE. And if users do this, all
the other movablecore=nn@ss and kernelcore=nn@ss options should be ignored.

For those who don't want this, just specify nothing. The kernel will act as
before.
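
The per-node calculation this patch adds to
find_zone_movable_pfns_for_nodes() can be sketched in user space as
follows. This is a simplified model, not the kernel code: struct region,
REGION_HOTPLUG, a PAGE_SHIFT of 12 and MAX_NODES are stand-ins assumed for
illustration. As in the patch, 0 means "no ZONE_MOVABLE start recorded for
this node yet".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT	12	/* assumed 4 KiB pages */
#define MAX_NODES	4	/* illustrative limit */
#define REGION_HOTPLUG	0x1	/* stand-in for MEMBLOCK_HOTPLUG */

struct region {
	uint64_t base;		/* physical start address */
	uint64_t size;		/* length in bytes */
	unsigned int flags;
	int nid;		/* NUMA node id */
};

/*
 * For each node, record the lowest page frame number of any hotpluggable
 * region; everything from that pfn upward would become ZONE_MOVABLE.
 * 'movable_pfn' must hold MAX_NODES entries and be zeroed by the caller.
 */
void find_movable_start(const struct region *regions, size_t cnt,
			uint64_t *movable_pfn)
{
	for (size_t i = 0; i < cnt; i++) {
		const struct region *r = &regions[i];
		uint64_t start_pfn;

		/* Non-hotpluggable memory stays usable by the kernel. */
		if (!(r->flags & REGION_HOTPLUG))
			continue;

		assert(r->nid >= 0 && r->nid < MAX_NODES);

		start_pfn = r->base >> PAGE_SHIFT;
		if (movable_pfn[r->nid] == 0 ||
		    start_pfn < movable_pfn[r->nid])
			movable_pfn[r->nid] = start_pfn;
	}
}
```

In the patch the result is then rounded up to MAX_ORDER_NR_PAGES at the
"out:" label; the sketch leaves that step out.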

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 include/linux/memblock.h |    1 +
 mm/memblock.c            |    5 +++++
 mm/page_alloc.c          |   31 ++++++++++++++++++++++++++++---
 3 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index c0bd31c..e78e32f 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -64,6 +64,7 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
 
 int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
+bool memblock_is_hotpluggable(struct memblock_region *region);
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
diff --git a/mm/memblock.c b/mm/memblock.c
index 3ea4301..c8eb5d2 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -610,6 +610,11 @@ int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
 	return 0;
 }
 
+bool __init_memblock memblock_is_hotpluggable(struct memblock_region *region)
+{
+	return region->flags & MEMBLOCK_HOTPLUG;
+}
+
 /**
  * __next_free_mem_range - next function for for_each_free_mem_range()
  * @idx: pointer to u64 loop variable
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b100255..86d4381 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4948,9 +4948,35 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 	nodemask_t saved_node_state = node_states[N_MEMORY];
 	unsigned long totalpages = early_calculate_totalpages();
 	int usable_nodes = nodes_weight(node_states[N_MEMORY]);
+	struct memblock_type *type = &memblock.memory;
 
+	/* Need to find movable_zone earlier when movablenode is specified. */
+	find_usable_zone_for_movable();
+
+#ifdef CONFIG_MOVABLE_NODE
 	/*
-	 * If movablecore was specified, calculate what size of
+	 * If movablenode is specified, ignore kernelcore and movablecore
+	 * options.
+	 */
+	if (movablenode_enable_srat) {
+		for (i = 0; i < type->cnt; i++) {
+			if (!memblock_is_hotpluggable(&type->regions[i]))
+				continue;
+
+			nid = type->regions[i].nid;
+
+			usable_startpfn = PFN_DOWN(type->regions[i].base);
+			zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
+				min(usable_startpfn, zone_movable_pfn[nid]) :
+				usable_startpfn;
+		}
+
+		goto out;
+	}
+#endif
+
+	/*
+	 * If movablecore=nn[KMG] was specified, calculate what size of
 	 * kernelcore that corresponds so that memory usable for
 	 * any allocation type is evenly spread. If both kernelcore
 	 * and movablecore are specified, then the value of kernelcore
@@ -4976,7 +5002,6 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 		goto out;
 
 	/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
-	find_usable_zone_for_movable();
 	usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
 
 restart:
@@ -5067,12 +5092,12 @@ restart:
 	if (usable_nodes && required_kernelcore > usable_nodes)
 		goto restart;
 
+out:
 	/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
 	for (nid = 0; nid < MAX_NUMNODES; nid++)
 		zone_movable_pfn[nid] =
 			roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
 
-out:
 	/* restore the node_state */
 	node_states[N_MEMORY] = saved_node_state;
 }
-- 
1.7.1



* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-08 10:16 [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE Tang Chen
                   ` (6 preceding siblings ...)
  2013-08-08 10:16 ` [PATCH part5 7/7] x86, numa, acpi, memory-hotplug: Make movablenode have higher priority Tang Chen
@ 2013-08-09 16:32 ` Tejun Heo
  2013-08-12  8:54   ` Tang Chen
  2013-08-12 14:50 ` Tejun Heo
  8 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2013-08-09 16:32 UTC (permalink / raw)
  To: Tang Chen
  Cc: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, trenn,
	yinghai, jiang.liu, wency, laijs, isimatu.yasuaki, izumi.taku,
	mgorman, minchan, mina86, gong.chen, vasilis.liaskovitis,
	lwoodman, riel, jweiner, prarit, zhangyanfei, yanghy, x86,
	linux-doc, linux-kernel, linux-mm, linux-acpi

Hello,

On Thu, Aug 08, 2013 at 06:16:12PM +0800, Tang Chen wrote:
> In previous parts' patches, we have obtained SRAT earlier enough, right after
> memblock is ready. So this patch-set does the following things:

Can you please set up a git branch with all patches?

Thanks.

-- 
tejun


* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-09 16:32 ` [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE Tejun Heo
@ 2013-08-12  8:54   ` Tang Chen
  0 siblings, 0 replies; 48+ messages in thread
From: Tang Chen @ 2013-08-12  8:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa,
	akpm, trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, imtangchen

On 08/10/2013 12:32 AM, Tejun Heo wrote:
> Hello,
>
> On Thu, Aug 08, 2013 at 06:16:12PM +0800, Tang Chen wrote:
>> In previous parts' patches, we have obtained SRAT earlier enough, right after
>> memblock is ready. So this patch-set does the following things:
> Can you please set up a git branch with all patches?
>
> Thanks.

Please refer to :

https://github.com/imtangchen/linux movablenode-boot-option

It contains the patches from all 5 parts.

Thanks.


* Re: [PATCH part5 1/7] x86: get pg_data_t's memory from other node
  2013-08-08 10:16 ` [PATCH part5 1/7] x86: get pg_data_t's memory from other node Tang Chen
@ 2013-08-12 14:39   ` Tejun Heo
  2013-08-12 15:12     ` Tang Chen
  0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2013-08-12 14:39 UTC (permalink / raw)
  To: Tang Chen
  Cc: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, trenn,
	yinghai, jiang.liu, wency, laijs, isimatu.yasuaki, izumi.taku,
	mgorman, minchan, mina86, gong.chen, vasilis.liaskovitis,
	lwoodman, riel, jweiner, prarit, zhangyanfei, yanghy, x86,
	linux-doc, linux-kernel, linux-mm, linux-acpi

Hello,

The subject is a bit misleading.  Maybe it should say "allow getting
..." rather than "get ..."?

On Thu, Aug 08, 2013 at 06:16:13PM +0800, Tang Chen wrote:
....
> A node could have several memory devices. And the device who holds node
> data should be hot-removed in the last place. But in NUMA level, we don't
> know which memory_block (/sys/devices/system/node/nodeX/memoryXXX) belongs
> to which memory device. We only have node. So we can only do node hotplug.
> 
> But in virtualization, developers are now developing memory hotplug in qemu,
> which support a single memory device hotplug. So a whole node hotplug will
> not satisfy virtualization users.
> 
> So at last, we concluded that we'd better do memory hotplug and local node
> things (local node node data, pagetable, vmemmap, ...) in two steps.
> Please refer to https://lkml.org/lkml/2013/6/19/73

I suppose the above three paragraphs are trying to say

* A hotpluggable NUMA node may be composed of multiple memory devices
  which individually are hot-pluggable.

* pg_data_t and page tables serving a NUMA node may be located in
  the same node they're serving; however, if the node is composed of
  multiple hotpluggable memory devices, the device containing them
  should be the last one to be removed.

* For physical memory hotplug, whole NUMA node hotunplugging is fine;
  however, in virtualized environments, finer grained hotunplugging
  is desirable; unfortunately, there currently is no way to tell which
  specific memory device pg_data_t and page tables are allocated
  inside making it impossible to order unpluggings of memory devices
  of a NUMA node.  To avoid the ordering problem while allowing
  removal of a subset of a NUMA node, it has been decided that pg_data_t
  and page tables should be allocated on a different non-hotpluggable
  NUMA node.

Am I following it correctly?  If so, can you please update the
description?  It's quite confusing.  Also, the decision seems rather
poorly made.  It should be trivial to allocate memory for pg_data_t
and page tables in one end of the NUMA node and just record the
boundary to distinguish between the area which can be removed any time
and the other which can only be removed as a unit as the last step.

Thanks.

-- 
tejun


* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-08 10:16 [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE Tang Chen
                   ` (7 preceding siblings ...)
  2013-08-09 16:32 ` [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE Tejun Heo
@ 2013-08-12 14:50 ` Tejun Heo
  2013-08-12 15:14   ` H. Peter Anvin
  2013-08-12 15:41   ` Tang Chen
  8 siblings, 2 replies; 48+ messages in thread
From: Tejun Heo @ 2013-08-12 14:50 UTC (permalink / raw)
  To: Tang Chen
  Cc: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, trenn,
	yinghai, jiang.liu, wency, laijs, isimatu.yasuaki, izumi.taku,
	mgorman, minchan, mina86, gong.chen, vasilis.liaskovitis,
	lwoodman, riel, jweiner, prarit, zhangyanfei, yanghy, x86,
	linux-doc, linux-kernel, linux-mm, linux-acpi

Hello,

On Thu, Aug 08, 2013 at 06:16:12PM +0800, Tang Chen wrote:
> [How we do this]
> 
> In ACPI, SRAT(System Resource Affinity Table) contains NUMA info. The memory
> affinities in SRAT record every memory range in the system, and also, flags
> specifying if the memory range is hotpluggable.
> (Please refer to ACPI spec 5.0 5.2.16)
> 
> With the help of SRAT, we have to do the following two things to achieve our
> goal:
> 
> 1. When doing memory hot-add, allow the users arranging hotpluggable as
>    ZONE_MOVABLE.
>    (This has been done by the MOVABLE_NODE functionality in Linux.)
> 
> 2. when the system is booting, prevent bootmem allocator from allocating
>    hotpluggable memory for the kernel before the memory initialization
>    finishes.
>    (This is what we are going to do. See below.)

I think it's in a much better shape than before but there still are a
couple things bothering me.

* Why can't it be opportunistic?  It's silly, for example, to fail
  boot because ACPI tells the kernel that all memory is hotpluggable
  especially as there'd be plenty of memory sitting around doing
  nothing and failing to boot is one of the most grave failure modes.
  The HOTPLUG flag can be advisory, right?  Try to allocate
  !hotpluggable memory first, but if that fails, ignore it and
  allocate from anywhere, much like the try_nid allocations.

* Similar to the point hpa raised.  If this can be made opportunistic,
  do we need the strict reordering to discover things earlier?
  Shouldn't it be possible to configure memblock to allocate close to
  the kernel image until hotplug and numa information is available?
  For most sane cases, the memory allocated will be contained in
  non-hotpluggable node anyway and in case they aren't hotplug
  wouldn't work but the system will boot and function perfectly fine.
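
The opportunistic policy described in the first point could look roughly
like this. It is only a user-space sketch under assumed names (struct
region, REGION_HOTPLUG, pick_region()); it is not memblock code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REGION_HOTPLUG	0x1	/* stand-in for a "hotpluggable" flag */

struct region {
	uint64_t base;		/* physical start address */
	uint64_t free;		/* bytes still available */
	unsigned int flags;
};

/*
 * Return the index of a region that can satisfy 'size' bytes, or -1.
 * Pass 0 only considers non-hotpluggable regions; pass 1 accepts any
 * region, so a request fails only when no region at all is large enough.
 */
int pick_region(const struct region *regions, size_t cnt, uint64_t size)
{
	assert(regions || cnt == 0);

	for (int pass = 0; pass < 2; pass++) {
		for (size_t i = 0; i < cnt; i++) {
			int hotplug = regions[i].flags & REGION_HOTPLUG;

			/* First pass: treat the hotplug flag as binding. */
			if (pass == 0 && hotplug)
				continue;
			if (regions[i].free >= size)
				return (int)i;
		}
	}
	return -1;
}
```

A caller that only succeeds in the second pass knows the chosen region can
no longer be hot-removed, which is the point where a warning could be
emitted.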

Thanks.

-- 
tejun


* Re: [PATCH part5 1/7] x86: get pg_data_t's memory from other node
  2013-08-12 14:39   ` Tejun Heo
@ 2013-08-12 15:12     ` Tang Chen
  0 siblings, 0 replies; 48+ messages in thread
From: Tang Chen @ 2013-08-12 15:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa,
	akpm, trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, imtangchen

On 08/12/2013 10:39 PM, Tejun Heo wrote:
> Hello,
>
> The subject is a bit misleading.  Maybe it should say "allow getting
> ..." rather than "get ..."?

Ok, followed.

>
> On Thu, Aug 08, 2013 at 06:16:13PM +0800, Tang Chen wrote:
......
>
> I suppose the above three paragraphs are trying to say
>
> * A hotpluggable NUMA node may be composed of multiple memory devices
>    which individually are hot-pluggable.
>
> * pg_data_t and page tables serving a NUMA node may be located in
>    the same node they're serving; however, if the node is composed of
>    multiple hotpluggable memory devices, the device containing them
>    should be the last one to be removed.
>
> * For physical memory hotplug, whole NUMA node hotunplugging is fine;
>    however, in virtualized environments, finer grained hotunplugging
>    is desirable; unfortunately, there currently is no way to tell which
>    specific memory device pg_data_t and page tables are allocated
>    inside making it impossible to order unpluggings of memory devices
>    of a NUMA node.  To avoid the ordering problem while allowing
>    removal of a subset of a NUMA node, it has been decided that pg_data_t
>    and page tables should be allocated on a different non-hotpluggable
>    NUMA node.
>
> Am I following it correctly?  If so, can you please update the
> description?  It's quite confusing.

Yes, you are right. I'll update the description.

> Also, the decision seems rather
> poorly made.  It should be trivial to allocate memory for pg_data_t
> and page tables in one end of the NUMA node and just record the
> boundary to distinguish between the area which can be removed any time
> and the other which can only be removed as a unit as the last step.

We have tried, but the hot-remove path is difficult to fix.

Please refer to:
https://lkml.org/lkml/2013/6/13/249

Actually, the above patch-set can achieve movable node in the way you described.
But we have the following problems:

1. The device holding pagetable cannot be removed before other devices.
    In a virtualization environment, it could be problematic.
    (https://lkml.org/lkml/2013/6/18/527)

2. It will break the semantics of memory_block online/offline. If part
    of the memory_block is pagetable, and it is offlined, what status
    should it have? My patches set it to offline, but the kernel
    is still using the memory.


I'm not saying it is not fixable. But we finally concluded that we
can implement movable node the current way and then improve it,
including local pgdat and pagetable. We need more discussion on that.
But it should not block memory hotplug development.

I suggest doing movable node the current way first, and improving
it after this is done.

Thanks.




* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 14:50 ` Tejun Heo
@ 2013-08-12 15:14   ` H. Peter Anvin
  2013-08-12 15:23     ` Tejun Heo
  2013-08-12 15:41   ` Tang Chen
  1 sibling, 1 reply; 48+ messages in thread
From: H. Peter Anvin @ 2013-08-12 15:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, akpm,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

On 08/12/2013 07:50 AM, Tejun Heo wrote:
> 
> * Why can't it be opportunistic?  It's silly, for example, to fail
>   boot because ACPI tells the kernel that all memory is hotpluggable
>   especially as there'd be plenty of memory sitting around doing
>   nothing and failing to boot is one of the most grave failure modes.
>   The HOTPLUG flag can be advisory, right?  Try to allocate
>   !hotpluggable memory first, but if that fails, ignore it and
>   allocate from anywhere, much like the try_nid allocations.
> 
> * Similar to the point hpa raised.  If this can be made opportunistic,
>   do we need the strict reordering to discover things earlier?
>   Shouldn't it be possible to configure memblock to allocate close to
>   the kernel image until hotplug and numa information is available?
>   For most sane cases, the memory allocated will be contained in
>   non-hotpluggable node anyway and in case they aren't hotplug
>   wouldn't work but the system will boot and function perfectly fine.
> 

It gets really messy if it is advisory.  Suddenly you have the user
thinking they can hotswap a memory bank and they just can't.

Overall, I'm getting convinced that this whole approach is just doomed
to failure -- it will not provide the user what they expect and what
they need, which is to be able to hotswap any particular chunk of
memory.  This means that there has to be a remapping layer, either using
the TLBs (perhaps leveraging the Xen machine page number) or using
things like QPI memory routing.

	-hpa




* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 15:14   ` H. Peter Anvin
@ 2013-08-12 15:23     ` Tejun Heo
  2013-08-12 16:29       ` Tang Chen
  0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2013-08-12 15:23 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, akpm,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

Hello,

On Mon, Aug 12, 2013 at 08:14:04AM -0700, H. Peter Anvin wrote:
> It gets really messy if it is advisory.  Suddenly you have the user
> thinking they can hotswap a memory bank and they just can't.

I'm very skeptical that not doing the strict re-ordering would
increase the chance of reaching memory allocation where hot unplug
would be impossible by much.  Given that, it'd be much better to be
able to boot w/o hotunplug capability than to fail boot.  The kernel
can whine loudly when hotunplug conditions aren't met but I think that
really is as far as that should go.

> Overall, I'm getting convinced that this whole approach is just doomed
> to failure -- it will not provide the user what they expect and what
> they need, which is to be able to hotswap any particular chunk of
> memory.  This means that there has to be a remapping layer, either using
> the TLBs (perhaps leveraging the Xen machine page number) or using
> things like QPI memory routing.

For hot unplug to work in completely generic manner, yeah, there
probably needs to be an extra layer of indirection.  Have no idea what
the correct way to achieve that would be tho.  I'm also not sure how
practical memory hot unplug is for physical machines and improving
ballooning could be a better approach for vms.

Thanks.

-- 
tejun


* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 14:50 ` Tejun Heo
  2013-08-12 15:14   ` H. Peter Anvin
@ 2013-08-12 15:41   ` Tang Chen
  2013-08-12 15:46     ` Tejun Heo
  1 sibling, 1 reply; 48+ messages in thread
From: Tang Chen @ 2013-08-12 15:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa,
	akpm, trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, imtangchen

On 08/12/2013 10:50 PM, Tejun Heo wrote:
> Hello,
......
>
> I think it's in a much better shape than before but there still are a
> couple things bothering me.
>
> * Why can't it be opportunistic?  It's silly, for example, to fail
>    boot because ACPI tells the kernel that all memory is hotpluggable
>    especially as there'd be plenty of memory sitting around doing
>    nothing and failing to boot is one of the most grave failure modes.
>    The HOTPLUG flag can be advisory, right?  Try to allocate
>    !hotpluggable memory first, but if that fails, ignore it and
>    allocate from anywhere, much like the try_nid allocations.
>

Then there is no way to tell the users which memory is hotpluggable.

phys addr is not user friendly. For users, node or memory device is the
best. The firmware should arrange the hotpluggable ranges well.

In my opinion, maybe some application layer tools may use SRAT to show
the users which memory is hotpluggable. I just think both of the kernel
and the application layer should obey the same rule.

> * Similar to the point hpa raised.  If this can be made opportunistic,
>    do we need the strict reordering to discover things earlier?
>    Shouldn't it be possible to configure memblock to allocate close to
>    the kernel image until hotplug and numa information is available?
>    For most sane cases, the memory allocated will be contained in
>    non-hotpluggable node anyway and in case they aren't hotplug
>    wouldn't work but the system will boot and function perfectly fine.

So far as I know, the kernel image and related data can be loaded
anywhere, above 4GB. I just can't make any assumption.

Thanks.



* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 15:41   ` Tang Chen
@ 2013-08-12 15:46     ` Tejun Heo
  2013-08-12 16:19       ` Tang Chen
  0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2013-08-12 15:46 UTC (permalink / raw)
  To: Tang Chen
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa,
	akpm, trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hello,

On Mon, Aug 12, 2013 at 11:41:25PM +0800, Tang Chen wrote:
> Then there is no way to tell the users which memory is hotpluggable.
> 
> phys addr is not user friendly. For users, node or memory device is the
> best. The firmware should arrange the hotpluggable ranges well.

I don't follow.  Why can't the kernel export that information to
userland after boot is complete via printk / sysfs / proc / whatever?
The admin can "request" hotplug by boot param and the kernel would try
to honor that and return the result on boot completion.  I don't
understand why that wouldn't work.

> In my opinion, maybe some application layer tools may use SRAT to show
> the users which memory is hotpluggable. I just think both of the kernel
> and the application layer should obey the same rule.

Sure, just let the kernel tell the user which memory node ended up
hotpluggable after booting.

> >* Similar to the point hpa raised.  If this can be made opportunistic,
> >   do we need the strict reordering to discover things earlier?
> >   Shouldn't it be possible to configure memblock to allocate close to
> >   the kernel image until hotplug and numa information is available?
> >   For most sane cases, the memory allocated will be contained in
> >   non-hotpluggable node anyway and in case they aren't hotplug
> >   wouldn't work but the system will boot and function perfectly fine.
> 
> So far as I know, the kernel image and related data can be loaded
> anywhere, above 4GB. I just can't make any assumption.

I don't follow why that would be problematic.  Wouldn't finding out
which node the kernel image is located in and preferring to allocate
from that node before hotplug info is available be enough?

Thanks.

-- 
tejun


* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 15:46     ` Tejun Heo
@ 2013-08-12 16:19       ` Tang Chen
  2013-08-12 16:22         ` Tejun Heo
  0 siblings, 1 reply; 48+ messages in thread
From: Tang Chen @ 2013-08-12 16:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa,
	akpm, trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, imtangchen

On 08/12/2013 11:46 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, Aug 12, 2013 at 11:41:25PM +0800, Tang Chen wrote:
>> Then there is no way to tell the users which memory is hotpluggable.
>>
>> phys addr is not user friendly. For users, node or memory device is the
>> best. The firmware should arrange the hotpluggable ranges well.
>
> I don't follow.  Why can't the kernel export that information to
> userland after boot is complete via printk / sysfs / proc / whatever?
> The admin can "request" hotplug by boot param and the kernel would try
> to honor that and return the result on boot completion.  I don't
> understand why that wouldn't work.

Sorry, I was in such a hurry that I didn't make myself clear...

The kernel can export info to users. The point is what kind of info.
Exporting phys addr is meaningless, of course. Now in /sys, we only
have memory_block and node. memory_block is only 128M on x86, and
hotplugging a memory_block means nothing. So actually we only have node.

So it is reasonable for users to want to hotplug a node, I think. In the
beginning, we set the hotplug unit to a node. That is also why we
did the movable node.

In summary, node hotplug is much more meaningful and usable for users.
So it is best that we arrange a whole node to be a movable
node, not opportunistically.

>
>> In my opinion, maybe some application layer tools may use SRAT to show
>> the users which memory is hotpluggable. I just think both of the kernel
>> and the application layer should obey the same rule.
>
> Sure, just let the kernel tell the user which memory node ended up
> hotpluggable after booting.
>
>>> * Similar to the point hpa raised.  If this can be made opportunistic,
>>>    do we need the strict reordering to discover things earlier?
>>>    Shouldn't it be possible to configure memblock to allocate close to
>>>    the kernel image until hotplug and numa information is available?
>>>    For most sane cases, the memory allocated will be contained in
>>>    non-hotpluggable node anyway and in case they aren't hotplug
>>>    wouldn't work but the system will boot and function perfectly fine.
>>
>> So far as I know, the kernel image and related data can be loaded
>> anywhere, above 4GB. I just can't make any assumption.
>
> I don't follow why that would be problematic.  Wouldn't finding out
> which node the kernel image is located in and preferring to allocate
> from that node before hotplug info is available be enough?

I'm just thinking of a more extreme case. For example, if a machine
has only one node hotpluggable, and the kernel resides in that node.
Then the system has no hotpluggable node.

If we can prevent the kernel from using hotpluggable memory, in such
a machine, users can still do memory hotplug.

I wanted to make it as generic as possible. But yes, finding out the
nodes the kernel resides in and making them unhotpluggable can work.

Thanks.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 16:19       ` Tang Chen
@ 2013-08-12 16:22         ` Tejun Heo
  2013-08-12 17:01           ` Tang Chen
  0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2013-08-12 16:22 UTC (permalink / raw)
  To: Tang Chen
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa,
	akpm, trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hello, Tang.

On Tue, Aug 13, 2013 at 12:19:02AM +0800, Tang Chen wrote:
> The kernel can export info to users. The point is what kind of info.
> Exporting phys addr is meaningless, of course. Now in /sys, we only
> have memory_block and node. memory_block is only 128M on x86, and
> hotplugging a single memory_block means nothing. So actually we only have node.
> 
> So I think it is reasonable for users to want to hotplug a node. From
> the beginning, we set the hotplug unit to a node. That is also why we
> did the movable node.
> 
> In summary, node hotplug is much more meaningful and usable for users.
> So it is best if we can arrange for a whole node to be a movable
> node, rather than doing it opportunistically.

Still not following.  Yeah, sure, you can tell the userland that node
X is hotpluggable or not hotpluggable after boot is complete.  Why is
that relevant?

> I'm just thinking of a more extreme case. For example, if a machine
> has only one node hotpluggable, and the kernel resides in that node.
> Then the system has no hotpluggable node.

Yeah, sure, then there's no way that node can be hotpluggable and the
right thing to do is booting up the machine and informing the userland
that memory is not hotpluggable.

> If we can prevent the kernel from using hotpluggable memory, in such
> a machine, users can still do memory hotplug.
> 
> I wanted to make it as generic as possible. But yes, finding out the
> nodes the kernel resides in and making them unhotpluggable can work.

Short of being able to remap memory under the kernel, I don't think
this can be very generic and as a compromise trying to keep as many
hotpluggable nodes as possible doesn't sound too bad.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 15:23     ` Tejun Heo
@ 2013-08-12 16:29       ` Tang Chen
  2013-08-12 16:46         ` Tejun Heo
  0 siblings, 1 reply; 48+ messages in thread
From: Tang Chen @ 2013-08-12 16:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: H. Peter Anvin, Tang Chen, robert.moore, lv.zheng, rjw, lenb,
	tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

On 08/12/2013 11:23 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, Aug 12, 2013 at 08:14:04AM -0700, H. Peter Anvin wrote:
>> It gets really messy if it is advisory.  Suddenly you have the user
>> thinking they can hotswap a memory bank and they just can't.
>
> I'm very skeptical that not doing the strict re-ordering would
> increase the chance of reaching memory allocation where hot unplug
> would be impossible by much.  Given that, it'd be much better to be
> able to boot w/o hotunplug capability than to fail boot.  The kernel
> can whine loudly when hotunplug conditions aren't met but I think that
> really is as far as that should go.

As you said, we can ensure that at least one node is not hotpluggable.
Then the kernel will boot anyway. Just like CPU0. But we risk losing
one movable node.

The best way is for firmware and software to cooperate. SRAT provides
several movable nodes and enough non-movable memory for the kernel to
boot. The hotplug users only use movable nodes.

>
>> Overall, I'm getting convinced that this whole approach is just doomed
>> to failure -- it will not provide the user what they expect and what
>> they need, which is to be able to hotswap any particular chunk of
>> memory.  This means that there has to be a remapping layer, either using
>> the TLBs (perhaps leveraging the Xen machine page number) or using
>> things like QPI memory routing.
>
> For hot unplug to work in completely generic manner, yeah, there
> probably needs to be an extra layer of indirection.

I agree too.

> Have no idea what
> the correct way to achieve that would be tho.  I'm also not sure how
> practical memory hot unplug is for physical machines and improving
> ballooning could be a better approach for vms.

But, different users have different ways to use memory hotplug.

Hotswapping any particular chunk of memory is the goal we will reach
eventually. But it depends on specific hardware. On most current
machines, we can use movable nodes to manage resources in units of nodes.

And also, without this movablenode boot option, the MOVABLE_NODE
functionality, which is already in the kernel, will not be able to
work. If all nodes have kernel memory, there is no movable node.

So, how about this: Just like MOVABLE_NODE functionality, introduce
a new config option. When we have better solutions for memory hotplug,
we shut off or remove the config and related code.

For now, at least make movable node work.

Thanks.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 16:29       ` Tang Chen
@ 2013-08-12 16:46         ` Tejun Heo
  2013-08-12 18:23           ` Tang Chen
  2013-08-13  6:14           ` Tang Chen
  0 siblings, 2 replies; 48+ messages in thread
From: Tejun Heo @ 2013-08-12 16:46 UTC (permalink / raw)
  To: Tang Chen
  Cc: H. Peter Anvin, Tang Chen, robert.moore, lv.zheng, rjw, lenb,
	tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

Hello, Tang.

On Tue, Aug 13, 2013 at 12:29:51AM +0800, Tang Chen wrote:
> As you said, we can ensure that at least one node is not hotpluggable.
> Then the kernel will boot anyway. Just like CPU0. But we risk losing
> one movable node.
> 
> The best way is for firmware and software to cooperate. SRAT provides
> several movable nodes and enough non-movable memory for the kernel to
> boot. The hotplug users only use movable nodes.

I'm really lost on this conversation and have no idea what you're
arguing.  My point was simple - let the kernel do its best during boot
and report the result to userland on what nodes are hotpluggable or
not.  Can you please elaborate what your point is from the ground up?
Unfortunately, I currently have no idea what you're saying.

> But, different users have different ways to use memory hotplug.
> 
> Hotswapping any particular chunk of memory is the goal we will reach
> eventually. But it depends on specific hardware. On most current
> machines, we can use movable nodes to manage resources in units of nodes.
> 
> And also, without this movablenode boot option, the MOVABLE_NODE
> functionality, which is already in the kernel, will not be able to
> work. If all nodes have kernel memory, there is no movable node.
> 
> So, how about this: Just like MOVABLE_NODE functionality, introduce
> a new config option. When we have better solutions for memory hotplug,
> we shut off or remove the config and related code.
> 
> For now, at least make movable node work.

We are talking completely past each other.  I'll just try to clarify
what I was saying.  Can you please do the same?  Let's re-sync on the
discussion.

* Adding an option to tell the kernel to try to stay away from
  hotpluggable nodes is fine.  I have no problem with that at all.

* The patchsets up to this point have been trying to reorder
  operations somehow such that *no* memory allocation happens before
  memblock is populated with hotplug information.

* However, we already *know* that the memory the kernel image is
  occupying won't be removeable.  It's highly likely that the amount
  of memory allocation before NUMA / hotplug information is fully
  populated is pretty small.  Also, it's highly likely that small
  amount of memory right after the kernel image is contained in the
  same NUMA node, so if we allocate memory close to the kernel image,
  it's likely that we don't contaminate hotpluggable node.  We're
  talking about few megs at most right after the kernel image.  I
  can't see how that would make any noticeable difference.

* Once hotplug information is available, allocation can happen as
  usual and the kernel can report the nodes which are actually
  hotpluggable - marked as hotpluggable by the firmware && didn't get
  contaminated during early alloc && didn't get overflow allocations
  afterwards.  Note that we need such mechanism no matter what as the
  kernel image can be loaded into hotpluggable nodes and reporting
  that to userland is the only thing the kernel can do for cases like
  that short of denying memory unplug on such nodes.

The whole thing would be a lot simpler and more generic.  It doesn't even
have to care about which mechanism is being used to acquire all that
information.  What am I missing here?
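
The reporting rule in the last bullet can be sketched as below. This is
only an illustration: struct toy_node, its flags, and the two helper
functions are hypothetical names invented for the example, not kernel
structures or APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node bookkeeping; not a real kernel structure. */
struct toy_node {
	bool fw_hotpluggable;	/* marked hotpluggable by the firmware */
	bool has_kernel_image;	/* the kernel image resides in this node */
	bool early_alloc;	/* contaminated before hotplug info was known */
	bool overflow_alloc;	/* received kernel allocations afterwards */
};

/* A node is reported as removable only if every condition holds. */
static bool toy_node_removable(const struct toy_node *n)
{
	return n->fw_hotpluggable && !n->has_kernel_image &&
	       !n->early_alloc && !n->overflow_alloc;
}

/* Count the nodes that would be reported to userland as removable. */
static size_t toy_count_removable(const struct toy_node *nodes, size_t nr)
{
	size_t i, count = 0;

	assert(nodes != NULL || nr == 0);
	for (i = 0; i < nr; i++)
		if (toy_node_removable(&nodes[i]))
			count++;
	return count;
}
```

In this model the system always boots; a node that the firmware marked
hotpluggable but that holds the kernel image or any early allocation is
simply reported as not removable.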

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 16:22         ` Tejun Heo
@ 2013-08-12 17:01           ` Tang Chen
  2013-08-12 17:23             ` H. Peter Anvin
  2013-08-12 18:07             ` Tejun Heo
  0 siblings, 2 replies; 48+ messages in thread
From: Tang Chen @ 2013-08-12 17:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa,
	akpm, trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hi tj,

On 08/13/2013 12:22 AM, Tejun Heo wrote:
> Hello, Tang.
>
> On Tue, Aug 13, 2013 at 12:19:02AM +0800, Tang Chen wrote:
>> The kernel can export info to users. The point is what kind of info.
>> Exporting phys addr is meaningless, of course. Now in /sys, we only
>> have memory_block and node. memory_block is only 128M on x86, and
>> hotplugging a single memory_block means nothing. So actually we only have node.
>>
>> So I think it is reasonable for users to want to hotplug a node. From
>> the beginning, we set the hotplug unit to a node. That is also why we
>> did the movable node.
>>
>> In summary, node hotplug is much more meaningful and usable for users.
>> So it is best if we can arrange for a whole node to be a movable
>> node, rather than doing it opportunistically.
>
> Still not following.  Yeah, sure, you can tell the userland that node
> X is hotpluggable or not hotpluggable after boot is complete.  Why is
> that relevant?

Sorry for the misunderstanding.

I was trying to answer your question: "Why can't the kernel allocate
hotpluggable memory opportunistically?".

If the kernel has any opportunity to allocate hotpluggable memory in
SRAT, then the kernel should tell users which memory is hotpluggable.

But in what way ?  I think node is the best for now. But a node could
have a lot of memory. If the kernel uses only a little memory, we will
lose the whole movable node, which I don't want to do.

So, I don't want to allow the kernel to allocate hotpluggable memory
opportunistically.


>
>> I'm just thinking of a more extreme case. For example, if a machine
>> has only one node hotpluggable, and the kernel resides in that node.
>> Then the system has no hotpluggable node.
>
> Yeah, sure, then there's no way that node can be hotpluggable and the
> right thing to do is booting up the machine and informing the userland
> that memory is not hotpluggable.
>
>> If we can prevent the kernel from using hotpluggable memory, in such
>> a machine, users can still do memory hotplug.
>>
>> I wanted to make it as generic as possible. But yes, finding out the
>> nodes the kernel resides in and making them unhotpluggable can work.
>
> Short of being able to remap memory under the kernel, I don't think
> this can be very generic and as a compromise trying to keep as many
> hotpluggable nodes as possible doesn't sound too bad.

I think making one of the nodes hotpluggable is better. But OK, it is
no big deal. There won't be such a machine in reality, I think. :)

Thanks. :)






^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 17:01           ` Tang Chen
@ 2013-08-12 17:23             ` H. Peter Anvin
  2013-08-14 18:22               ` KOSAKI Motohiro
  2013-08-12 18:07             ` Tejun Heo
  1 sibling, 1 reply; 48+ messages in thread
From: H. Peter Anvin @ 2013-08-12 17:23 UTC (permalink / raw)
  To: Tang Chen
  Cc: Tejun Heo, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On 08/12/2013 10:01 AM, Tang Chen wrote:
>>
>>> I'm just thinking of a more extreme case. For example, if a machine
>>> has only one node hotpluggable, and the kernel resides in that node.
>>> Then the system has no hotpluggable node.
>>
>> Yeah, sure, then there's no way that node can be hotpluggable and the
>> right thing to do is booting up the machine and informing the userland
>> that memory is not hotpluggable.
>>
>>> If we can prevent the kernel from using hotpluggable memory, in such
>>> a machine, users can still do memory hotplug.
>>>
>>> I wanted to make it as generic as possible. But yes, finding out the
>>> nodes the kernel resides in and making them unhotpluggable can work.
>>
>> Short of being able to remap memory under the kernel, I don't think
>> this can be very generic and as a compromise trying to keep as many
>> hotpluggable nodes as possible doesn't sound too bad.
> 
> I think making one of the nodes hotpluggable is better. But OK, it is
> no big deal. There won't be such a machine in reality, I think. :)
> 

The user may very well have configured a system with mirrored memory for
the kernel node as that will be non-hotpluggable, but not for the
others.  One can wonder how much that actually buys in real life, but
still...

	-hpa



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 17:01           ` Tang Chen
  2013-08-12 17:23             ` H. Peter Anvin
@ 2013-08-12 18:07             ` Tejun Heo
  2013-08-14 18:15               ` KOSAKI Motohiro
  1 sibling, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2013-08-12 18:07 UTC (permalink / raw)
  To: Tang Chen
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa,
	akpm, trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hey,

On Tue, Aug 13, 2013 at 01:01:09AM +0800, Tang Chen wrote:
> Sorry for the misunderstanding.
> 
> I was trying to answer your question: "Why can't the kernel allocate
> hotpluggable memory opportunistically?".

I've used the wrong word, I was meaning best-effort, which is the only
thing we can do anyway given that we have no control over where the
kernel image is linked in relation to NUMA nodes.

> If the kernel has any opportunity to allocate hotpluggable memory in
> SRAT, then the kernel should tell users which memory is hotpluggable.
> 
> But in what way ?  I think node is the best for now. But a node could
> have a lot of memory. If the kernel uses only a little memory, we will
> lose the whole movable node, which I don't want to do.
> 
> So, I don't want to allow the kernel to allocate hotpluggable memory
> opportunistically.

What I was saying was that the kernel should try !hotpluggable memory
first then fall back to hotpluggable memory instead of failing boot as
nothing really is worse than failing to boot.

> >Short of being able to remap memory under the kernel, I don't think
> >this can be very generic and as a compromise trying to keep as many
> >hotpluggable nodes as possible doesn't sound too bad.
> 
> I think making one of the nodes hotpluggable is better. But OK, it is
> no big deal. There won't be such a machine in reality, I think. :)

Hmmm... but allocating close to kernel image will keep the number of
nodes which are made un-removeable via permanent allocation to
minimum.  In most configurations that I can recall, I don't think we'd
lose anything really and the code will be much simpler and generic.
It seems like a good trade-off to me given that we need to report
which nodes are hot unpluggable no matter what.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 16:46         ` Tejun Heo
@ 2013-08-12 18:23           ` Tang Chen
  2013-08-12 20:20             ` Tejun Heo
  2013-08-13  6:14           ` Tang Chen
  1 sibling, 1 reply; 48+ messages in thread
From: Tang Chen @ 2013-08-12 18:23 UTC (permalink / raw)
  To: Tejun Heo, H. Peter Anvin
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, akpm,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

On 08/13/2013 12:46 AM, Tejun Heo wrote:
> Hello, Tang.
......
>
>> But, different users have different ways to use memory hotplug.
>>
>> Hotswapping any particular chunk of memory is the goal we will reach
>> eventually. But it depends on specific hardware. On most current
>> machines, we can use movable nodes to manage resources in units of nodes.
>>
>> And also, without this movablenode boot option, the MOVABLE_NODE
>> functionality, which is already in the kernel, will not be able to
>> work. If all nodes have kernel memory, there is no movable node.
>>
>> So, how about this: Just like MOVABLE_NODE functionality, introduce
>> a new config option. When we have better solutions for memory hotplug,
>> we shut off or remove the config and related code.
>>
>> For now, at least make movable node work.

Hi tj,
cc hpa,

I explained the above because hpa said he thought the whole approach is
wrong. I think node hotplug is meaningful for users. And without this
patch-set, MOVABLE_NODE means nothing. That is all I meant above.

Since you replied to his email earlier, I just replied to
answer both of you. Sorry for the misunderstanding. :)

>
> We are talking completely past each other.  I'll just try to clarify
> what I was saying.  Can you please do the same?  Let's re-sync on the
> discussion.
>
> * Adding an option to tell the kernel to try to stay away from
>    hotpluggable nodes is fine.  I have no problem with that at all.

Agreed.

>
> * The patchsets up to this point have been trying to reorder
>    operations somehow such that *no* memory allocation happens before
>    memblock is populated with hotplug information.

Yes, this is exactly what I want to do.

>
> * However, we already *know* that the memory the kernel image is
>    occupying won't be removeable.  It's highly likely that the amount
>    of memory allocation before NUMA / hotplug information is fully
>    populated is pretty small.  Also, it's highly likely that small
>    amount of memory right after the kernel image is contained in the
>    same NUMA node, so if we allocate memory close to the kernel image,
>    it's likely that we don't contaminate hotpluggable node.  We're
>    talking about few megs at most right after the kernel image.  I
>    can't see how that would make any noticeable difference.

On this point, I don't quite agree. What you said is highly likely, but
not certain. Users may find they lost hotpluggable memory.

The node the kernel resides in won't be removable. This is agreed.
But I still want SRAT earlier for the following reasons:

1. For a product shipped to users, the firmware specifies how
    many nodes are hotpluggable. When the system is up, if users
    find they have lost movable nodes, I think it could be messy.

2. Moving SRAT parsing earlier is not that difficult to do. The
    only procedures reordered are acpi tables initialization and
    acpi_initrd_override. The acpi part patches are being reviewed.
    And it is a better solution. If possible, I think we should do it.

In summary, I don't want early memory allocation with hotpluggable
memory to be opportunistic.

>
> * Once hotplug information is available, allocation can happen as
>    usual and the kernel can report the nodes which are actually
>    hotpluggable - marked as hotpluggable by the firmware && didn't get
>    contaminated during early alloc && didn't get overflow allocations
>    afterwards.  Note that we need such mechanism no matter what as the
>    kernel image can be loaded into hotpluggable nodes and reporting
>    that to userland is the only thing the kernel can do for cases like
>    that short of denying memory unplug on such nodes.

Agreed.

>
> The whole thing would be a lot simpler and generic.  It doesn't even
> have to care about which mechanism is being used to acquire all those
> information.  What am I missing here?

Sorry for the misunderstanding.

Thanks. :)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 18:23           ` Tang Chen
@ 2013-08-12 20:20             ` Tejun Heo
  0 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2013-08-12 20:20 UTC (permalink / raw)
  To: Tang Chen
  Cc: H. Peter Anvin, Tang Chen, robert.moore, lv.zheng, rjw, lenb,
	tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

Hello,

On Tue, Aug 13, 2013 at 02:23:13AM +0800, Tang Chen wrote:
> >* However, we already *know* that the memory the kernel image is
> >   occupying won't be removeable.  It's highly likely that the amount
> >   of memory allocation before NUMA / hotplug information is fully
> >   populated is pretty small.  Also, it's highly likely that small
> >   amount of memory right after the kernel image is contained in the
> >   same NUMA node, so if we allocate memory close to the kernel image,
> >   it's likely that we don't contaminate hotpluggable node.  We're
> >   talking about few megs at most right after the kernel image.  I
> >   can't see how that would make any noticeable difference.
> 
> On this point, I don't quite agree. What you said is highly likely, but
> not certain. Users may find they lost hotpluggable memory.

I'm having a difficult time buying that.  NUMA node granularity is
usually pretty large - it's in the range of gigabytes.  By comparison,
the area occupied by the kernel image is *tiny* and it's just highly
unlikely that allocating a bit more memory afterwards would lead to
any meaningful difference in hotunplug support.  The amount of memory
we're talking about is likely to be less than a meg, right?

> The node the kernel resides in won't be removable. This is agreed.
> But I still want SRAT earlier for the following reasons:
> 
> 1. For a product shipped to users, the firmware specifies how
>    many nodes are hotpluggable. When the system is up, if users
>    find they have lost movable nodes, I think it could be messy.

How is that different from the memory occupied by kernel image?
Simply allocating early memory near kernel image is extremely unlikely
to change the situation.  Again, we're talking about tiny allocation
here.  It should be no different from having *slightly* larger kernel
image.  How is that material in any way?

> 2. Moving SRAT parsing earlier is not that difficult to do. The
>    only procedures reordered are acpi tables initialization and
>    acpi_initrd_override. The acpi part patches are being reviewed.
>    And it is a better solution. If possible, I think we should do it.

I don't think it's a better solution.  It's fragile and fiddly and
without much, if any, additional benefit.  Why should we do that when
we can almost trivially solve the problem almost in memblock proper in
a way which is completely firmware-agnostic?

But, what's the extra benefit of doing that?  Why would reserving less
than a megabyte after the kernel be so problematic to require this
invasive solution?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 16:46         ` Tejun Heo
  2013-08-12 18:23           ` Tang Chen
@ 2013-08-13  6:14           ` Tang Chen
  2013-08-13  9:56             ` Tang Chen
  1 sibling, 1 reply; 48+ messages in thread
From: Tang Chen @ 2013-08-13  6:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, H. Peter Anvin, robert.moore, lv.zheng, rjw, lenb,
	tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

On 08/13/2013 12:46 AM, Tejun Heo wrote:
......
>
> * Adding an option to tell the kernel to try to stay away from
>    hotpluggable nodes is fine.  I have no problem with that at all.
>
> * The patchsets up to this point have been trying to reorder
>    operations somehow such that *no* memory allocation happens before
>    memblock is populated with hotplug information.
>
> * However, we already *know* that the memory the kernel image is
>    occupying won't be removeable.  It's highly likely that the amount
>    of memory allocation before NUMA / hotplug information is fully
>    populated is pretty small.  Also, it's highly likely that small
>    amount of memory right after the kernel image is contained in the
>    same NUMA node, so if we allocate memory close to the kernel image,
>    it's likely that we don't contaminate hotpluggable node.  We're
>    talking about few megs at most right after the kernel image.  I
>    can't see how that would make any noticeable difference.
>
> * Once hotplug information is available, allocation can happen as
>    usual and the kernel can report the nodes which are actually
>    hotpluggable - marked as hotpluggable by the firmware && didn't get
>    contaminated during early alloc && didn't get overflow allocations
>    afterwards.  Note that we need such mechanism no matter what as the
>    kernel image can be loaded into hotpluggable nodes and reporting
>    that to userland is the only thing the kernel can do for cases like
>    that short of denying memory unplug on such nodes.
>

Hi tj, hpa, luck, yinghai,

So if all of you agree on the idea above from tj, I think
we can do it in this way. Will update the patches to allocate
memory near kernel image before SRAT is parsed.

Thanks.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-13  6:14           ` Tang Chen
@ 2013-08-13  9:56             ` Tang Chen
  2013-08-13 14:38               ` Tejun Heo
  0 siblings, 1 reply; 48+ messages in thread
From: Tang Chen @ 2013-08-13  9:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, H. Peter Anvin, robert.moore, lv.zheng, rjw, lenb,
	tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

Hi tj,

While working on the "near kernel memory allocation", I have some
questions about memblock that I need you to confirm.

1. First of all, memblock is platform independent. Different platforms
    have different ways to store kernel image address. So I don't think
    we can obtain the kernel image address on memblock side, right ?

    If so, then we need to pass kernel image address to memblock. But...

2. There are several places calling memblock_find_in_range_node() to
    allocate memory before SRAT is parsed.

    early_reserve_e820_mpc_new()
    reserve_real_mode()
    init_mem_mapping()
    setup_log_buf()
    relocate_initrd()
    acpi_initrd_override()
    reserve_crashkernel()

    Maybe more that I didn't find.

    And in the future, maybe someone will add code to allocate memory
    before SRAT is parsed. So I don't think we should pass the kernel
    image address to them one by one. It would modify a lot of things.

So I think we need a generic way to tell memblock to allocate memory
from the kernel image end address to higher memory.


My idea is:

1. Introduce a memblock.current_limit_low to limit the lowest address
    that memblock can use.

2. Make memblock be able to allocate memory from low to high.

3. Get kernel image address on x86, and set memblock.current_limit_low
    to it before SRAT is parsed. Then we achieve the goal.

4. Reset it to 0, and make memblock allocate memory from high to low.


How do you think of this, or do you have any better idea ?
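
The four steps above can be sketched as a toy allocator. This is only
an illustration under stated assumptions: struct toy_memblock, its
limit_low and bottom_up fields, and toy_alloc() are hypothetical names
that model memblock.current_limit_low and the low-to-high direction
from the list. They are not existing memblock code, and the model
tracks a single usable range instead of real memblock regions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_RSV 16

struct toy_memblock {
	uint64_t start, end;	/* usable physical range [start, end) */
	uint64_t limit_low;	/* hypothetical current_limit_low */
	bool bottom_up;		/* true: allocate from low to high */
	struct { uint64_t base, size; } rsv[TOY_MAX_RSV];
	int nr_rsv;
};

static uint64_t toy_align_up(uint64_t x, uint64_t a)   { return (x + a - 1) / a * a; }
static uint64_t toy_align_down(uint64_t x, uint64_t a) { return x / a * a; }

/* Return the index of a reservation overlapping [base, base + size), or -1. */
static int toy_overlap(const struct toy_memblock *mb, uint64_t base, uint64_t size)
{
	for (int i = 0; i < mb->nr_rsv; i++)
		if (base < mb->rsv[i].base + mb->rsv[i].size &&
		    mb->rsv[i].base < base + size)
			return i;
	return -1;
}

/* Reserve size bytes at or above limit_low; the direction follows bottom_up. */
static bool toy_alloc(struct toy_memblock *mb, uint64_t size, uint64_t align,
		      uint64_t *out)
{
	uint64_t lo = mb->limit_low > mb->start ? mb->limit_low : mb->start;
	uint64_t cand;
	int i;

	assert(size > 0 && align > 0);
	if (mb->nr_rsv == TOY_MAX_RSV || mb->end < lo || mb->end - lo < size)
		return false;

	cand = mb->bottom_up ? toy_align_up(lo, align)
			     : toy_align_down(mb->end - size, align);
	for (;;) {
		if (cand < lo || cand > mb->end - size)
			return false;
		i = toy_overlap(mb, cand, size);
		if (i < 0)
			break;
		if (mb->bottom_up) {
			/* Skip past the conflicting reservation. */
			cand = toy_align_up(mb->rsv[i].base + mb->rsv[i].size, align);
		} else {
			if (mb->rsv[i].base < size)
				return false;
			/* Retry just below the conflicting reservation. */
			cand = toy_align_down(mb->rsv[i].base - size, align);
		}
	}
	mb->rsv[mb->nr_rsv].base = cand;
	mb->rsv[mb->nr_rsv].size = size;
	mb->nr_rsv++;
	*out = cand;
	return true;
}
```

With limit_low set to the end of the kernel image and bottom_up set to
true, successive calls return addresses right after the image. Setting
limit_low back to 0 and bottom_up to false restores allocation from the
top of the range.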


Thanks for your patience and help. :)


On 08/13/2013 02:14 PM, Tang Chen wrote:
> On 08/13/2013 12:46 AM, Tejun Heo wrote:
> ......
>>
>> * Adding an option to tell the kernel to try to stay away from
>> hotpluggable nodes is fine. I have no problem with that at all.
>>
>> * The patchsets up to this point have been trying to reorder
>> operations somehow such that *no* memory allocation happens before
>> memblock is populated with hotplug information.
>>
>> * However, we already *know* that the memory the kernel image is
>> occupying won't be removeable. It's highly likely that the amount
>> of memory allocation before NUMA / hotplug information is fully
>> populated is pretty small. Also, it's highly likely that small
>> amount of memory right after the kernel image is contained in the
>> same NUMA node, so if we allocate memory close to the kernel image,
>> it's likely that we don't contaminate hotpluggable node. We're
>> talking about few megs at most right after the kernel image. I
>> can't see how that would make any noticeable difference.
>>
>> * Once hotplug information is available, allocation can happen as
>> usual and the kernel can report the nodes which are actually
>> hotpluggable - marked as hotpluggable by the firmware && didn't get
>> contaminated during early alloc && didn't get overflow allocations
>> afterwards. Note that we need such mechanism no matter what as the
>> kernel image can be loaded into hotpluggable nodes and reporting
>> that to userland is the only thing the kernel can do for cases like
>> that short of denying memory unplug on such nodes.
>>
>
> Hi tj, hpa, luck, yinghai,
>
> So if all of you agree on the idea above from tj, I think
> we can do it in this way. Will update the patches to allocate
> memory near kernel image before SRAT is parsed.
>
> Thanks.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-13  9:56             ` Tang Chen
@ 2013-08-13 14:38               ` Tejun Heo
  0 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2013-08-13 14:38 UTC (permalink / raw)
  To: Tang Chen
  Cc: Tang Chen, H. Peter Anvin, robert.moore, lv.zheng, rjw, lenb,
	tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

Hello, Tang.

On Tue, Aug 13, 2013 at 05:56:46PM +0800, Tang Chen wrote:
> 1. Introduce a memblock.current_limit_low to limit the lowest address
>    that memblock can use.
> 
> 2. Make memblock able to allocate memory from low to high.
> 
> 3. Get kernel image address on x86, and set memblock.current_limit_low
>    to it before SRAT is parsed. Then we achieve the goal.
> 
> 4. Reset it to 0, and make memblock allocate memory from high to low.
> 
> What do you think of this, or do you have a better idea?

Yes, something like that.  Maybe have something like
memblock_set_alloc_range(low, high, low_to_high) in memblock?  Once
NUMA info is available arch code can call memblock_set_alloc_range(0,
0, false) to reset it to the default behavior.
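
A rough userspace sketch of what such an interface could look like. This
is not existing memblock code: toy_alloc_policy, toy_set_alloc_range and
toy_policy_is_default are hypothetical names, and treating high == 0 as
"no upper limit" is an assumption made here so that (0, 0, false) means
the default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state; not fields of the real struct memblock. */
struct toy_alloc_policy {
	uint64_t low;		/* lowest address allocations may use */
	uint64_t high;		/* highest address, 0 meaning "no limit" (assumed) */
	bool low_to_high;	/* search direction */
};

static struct toy_alloc_policy toy_policy;	/* zeroed: default behavior */

/* Models the suggested memblock_set_alloc_range(low, high, low_to_high). */
static void toy_set_alloc_range(uint64_t low, uint64_t high, bool low_to_high)
{
	assert(high == 0 || low < high);
	toy_policy.low = low;
	toy_policy.high = high;
	toy_policy.low_to_high = low_to_high;
}

/* True when the policy matches the default top-down, unrestricted search. */
static bool toy_policy_is_default(void)
{
	return toy_policy.low == 0 && toy_policy.high == 0 &&
	       !toy_policy.low_to_high;
}
```

Arch code would call the setter once with the kernel image end as low
and low_to_high set, then call it again with (0, 0, false) once NUMA
info has been parsed.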

> Thanks for your patience and help. :)

Heh, sorry about all the roundabouts.  Your persistence is much
appreciated. :)

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 18:07             ` Tejun Heo
@ 2013-08-14 18:15               ` KOSAKI Motohiro
  2013-08-14 18:23                 ` Tejun Heo
  0 siblings, 1 reply; 48+ messages in thread
From: KOSAKI Motohiro @ 2013-08-14 18:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, kosaki.motohiro

(8/12/13 2:07 PM), Tejun Heo wrote:
> Hey,
>
> On Tue, Aug 13, 2013 at 01:01:09AM +0800, Tang Chen wrote:
>> Sorry for the misunderstanding.
>>
>> I was trying to answer your question: "Why can't the kernel allocate
>> hotpluggable memory opportunistically ?".
>
> I've used the wrong word, I was meaning best-effort, which is the only
> thing we can do anyway given that we have no control over where the
> kernel image is linked in relation to NUMA nodes.
>
>> If the kernel has any opportunity to allocate hotpluggable memory in
>> SRAT, then the kernel should tell users which memory is hotpluggable.
>>
>> But in what way ?  I think node is the best for now. But a node could
>> have a lot of memory. If the kernel uses only a little memory, we will
>> lose the whole movable node, which I don't want to do.
>>
>> So, I don't want to allow the kernel to allocate hotpluggable memory
>> opportunistically.
>
> What I was saying was that the kernel should try !hotpluggable memory
> first then fall back to hotpluggable memory instead of failing boot as
> nothing really is worse than failing to boot.

I don't follow this. We need to think about why memory hotplug is necessary:
because a system reboot is unacceptable for several critical services. So,
if someone sets a wrong boot option, the system SHOULD fail to boot. At that time,
the admin has a chance to fix the mistake. On the other hand, once a production
service is running, they have no chance to fix the mistake. In general, a default
boot option should have a fallback and a non-default option should not have a
fallback. That's a fundamental rule.



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 17:23             ` H. Peter Anvin
@ 2013-08-14 18:22               ` KOSAKI Motohiro
  0 siblings, 0 replies; 48+ messages in thread
From: KOSAKI Motohiro @ 2013-08-14 18:22 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Tang Chen, Tejun Heo, Tang Chen, robert.moore, lv.zheng, rjw,
	lenb, tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, kosaki.motohiro

(8/12/13 1:23 PM), H. Peter Anvin wrote:
> On 08/12/2013 10:01 AM, Tang Chen wrote:
>>>
>>>> I'm just thinking of a more extreme case. For example, if a machine
>>>> has only one node hotpluggable, and the kernel resides in that node.
>>>> Then the system has no hotpluggable node.
>>>
>>> Yeah, sure, then there's no way that node can be hotpluggable and the
>>> right thing to do is booting up the machine and informing the userland
>>> that memory is not hotpluggable.
>>>
>>>> If we can prevent the kernel from using hotpluggable memory, in such
>>>> a machine, users can still do memory hotplug.
>>>>
>>>> I wanted to make it as generic as possible. But yes, finding out the
>>>> nodes the kernel resides in and making them unhotpluggable can work.
>>>
>>> Short of being able to remap memory under the kernel, I don't think
>>> this can be very generic and as a compromise trying to keep as many
>>> hotpluggable nodes as possible doesn't sound too bad.
>>
>> I think making one of the nodes hotpluggable is better. But OK, it is
>> no big deal. There won't be such a machine in reality, I think. :)
>>
>
> The user may very well have configured a system with mirrored memory for
> the kernel node as that will be non-hotpluggable, but not for the
> others.  One can wonder how much that actually buys in real life, but
> still...

Note: such a system is much cheaper than a full memory mirroring system. That's
one of the reasons why server vendors are interested in hot plugging.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 18:15               ` KOSAKI Motohiro
@ 2013-08-14 18:23                 ` Tejun Heo
  2013-08-14 19:40                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2013-08-14 18:23 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hello,

On Wed, Aug 14, 2013 at 02:15:44PM -0400, KOSAKI Motohiro wrote:
> I don't follow this. We need to think about why memory hotplug is necessary:
> because a system reboot is unacceptable for several critical services. So,
> if someone sets a wrong boot option, the system SHOULD fail to boot. At that time,
> the admin has a chance to fix the mistake. On the other hand, once a production
> service is running, they have no chance to fix the mistake. In general, a default
> boot option should have a fallback and a non-default option should not have a
> fallback. That's a fundamental rule.

The fundamental rule is that the system has to boot.  Your argument is
pointless as the kernel has no control over where its own image is
placed w.r.t. hotpluggable nodes.  So, are we gonna fail boot if
kernel image intersects hotpluggable node and the option is specified
even if memory hotplug can be used on other nodes?  That doesn't make
any sense.

Failing to boot is a *way* worse reporting mechanism than almost
everything else.  If the sysadmin is willing to risk machines failing
to come up, she would definitely be willing to check which
memory areas are actually hotpluggable too, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 18:23                 ` Tejun Heo
@ 2013-08-14 19:40                   ` KOSAKI Motohiro
  2013-08-14 19:55                     ` Tejun Heo
  0 siblings, 1 reply; 48+ messages in thread
From: KOSAKI Motohiro @ 2013-08-14 19:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KOSAKI Motohiro, Tang Chen, Tang Chen, robert.moore, lv.zheng,
	rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, yanghy, x86, linux-doc, linux-kernel,
	linux-mm, linux-acpi

(8/14/13 2:23 PM), Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 14, 2013 at 02:15:44PM -0400, KOSAKI Motohiro wrote:
>> I don't follow this. We need to think about why memory hotplug is necessary:
>> because a system reboot is unacceptable for several critical services. So,
>> if someone sets a wrong boot option, the system SHOULD fail to boot. At that time,
>> the admin has a chance to fix the mistake. On the other hand, once a production
>> service is running, they have no chance to fix the mistake. In general, a default
>> boot option should have a fallback and a non-default option should not have a
>> fallback. That's a fundamental rule.
>
> The fundamental rule is that the system has to boot.

I don't agree with it. Please look at other kernel options. A lot of them don't
follow your rule. They behave as directives, not advice.

I mean the fallback should be implemented when the feature is turned on by default.


>  Your argument is
> pointless as the kernel has no control over where its own image is
> placed w.r.t. hotpluggable nodes.  So, are we gonna fail boot if
> kernel image intersects hotpluggable node and the option is specified
> even if memory hotplug can be used on other nodes?  That doesn't make
> any sense.

I haven't read the whole discussion, and I don't quite understand why the lack of
control over kernel placement is relevant. Every non-hotpluggable node is suitable
for the kernel. If you mean the current kernel placement logic doesn't care about
hotplug, that's a bug.

If we aim for hot remove, we have to have either kernel relocation or
hotplug-aware kernel placement at boot time.

> Failing to boot is a *way* worse reporting mechanism than almost
> everything else.  If the sysadmin is willing to risk machines failing
> to come up, she would definitely be willing to check which
> memory areas are actually hotpluggable too, right?

No, see above. Your opinion is not pragmatically useful.



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 19:40                   ` KOSAKI Motohiro
@ 2013-08-14 19:55                     ` Tejun Heo
  2013-08-14 20:29                       ` KOSAKI Motohiro
  0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2013-08-14 19:55 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hello,

On Wed, Aug 14, 2013 at 03:40:31PM -0400, KOSAKI Motohiro wrote:
> I don't agree with it. Please look at other kernel options. A lot of them don't
> follow your rule. They behave as directives, not advice.
> 
> I mean the fallback should be implemented when the feature is turned on by default.

Yeah, some options are "please try this" and others "do this or fail".
There's no frigging fundamental rule there.

> I haven't read the whole discussion, and I don't quite understand why the lack of
> control over kernel placement is relevant. Every non-hotpluggable node is suitable
> for the kernel. If you mean the current kernel placement logic doesn't care about
> hotplug, that's a bug.
> 
> If we aim for hot remove, we have to have either kernel relocation or
> hotplug-aware kernel placement at boot time.

What if all nodes are hot pluggable?  Are we moving the kernel
dynamically then?

> >Failing to boot is a *way* worse reporting mechanism than almost
> >everything else.  If the sysadmin is willing to risk machines failing
> >to come up, she would definitely be willing to check which
> >memory areas are actually hotpluggable too, right?
> 
> No, see above. Your opinion is not pragmatically useful.

No, what you're saying doesn't make any sense.  There are multiple
ways to report when something doesn't work.  Failing to boot is *one*
of them and not a very good one.  Here, for practical reasons, the end
result may differ depending on the specifics of the configuration, so
more detailed reporting is necessary anyway, so why do you insist on
failing the boot?  In what world is it a good thing for the machine to
fail boot after bios or kernel update?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 19:55                     ` Tejun Heo
@ 2013-08-14 20:29                       ` KOSAKI Motohiro
  2013-08-14 20:30                         ` H. Peter Anvin
  2013-08-14 20:35                         ` Tejun Heo
  0 siblings, 2 replies; 48+ messages in thread
From: KOSAKI Motohiro @ 2013-08-14 20:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KOSAKI Motohiro, Tang Chen, Tang Chen, robert.moore, lv.zheng,
	rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, yanghy, x86, linux-doc, linux-kernel,
	linux-mm, linux-acpi

(8/14/13 3:55 PM), Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 14, 2013 at 03:40:31PM -0400, KOSAKI Motohiro wrote:
>> I don't agree with it. Please look at other kernel options. A lot of them don't
>> follow your rule. They behave as directives, not advice.
>>
>> I mean the fallback should be implemented when the feature is turned on by default.
>
> Yeah, some options are "please try this" and others "do this or fail".
> There's no frigging fundamental rule there.

In this case, a fallback has zero value, right?


>> I haven't read the whole discussion, and I don't quite understand why the lack of
>> control over kernel placement is relevant. Every non-hotpluggable node is suitable
>> for the kernel. If you mean the current kernel placement logic doesn't care about
>> hotplug, that's a bug.
>>
>> If we aim for hot remove, we have to have either kernel relocation or
>> hotplug-aware kernel placement at boot time.
>
> What if all nodes are hot pluggable?  Are we moving the kernel
> dynamically then?

Intel folks already told us that no such system exists in practice.


>>> Failing to boot is a *way* worse reporting mechanism than almost
>>> everything else.  If the sysadmin is willing to risk machines failing
>>> to come up, she would definitely be willing to check which
>>> memory areas are actually hotpluggable too, right?
>>
>> No, see above. Your opinion is not pragmatically useful.
>
> No, what you're saying doesn't make any sense.  There are multiple
> ways to report when something doesn't work.  Failing to boot is *one*
> of them and not a very good one.  Here, for practical reasons, the end
> result may differ depending on the specifics of the configuration, so
> more detailed reporting is necessary anyway, so why do you insist on
> failing the boot?  In what world is it a good thing for the machine to
> fail boot after bios or kernel update?

Because a boot failure cannot be overlooked, and that is the better way in practice.




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 20:29                       ` KOSAKI Motohiro
@ 2013-08-14 20:30                         ` H. Peter Anvin
  2013-08-14 20:35                         ` Tejun Heo
  1 sibling, 0 replies; 48+ messages in thread
From: H. Peter Anvin @ 2013-08-14 20:30 UTC (permalink / raw)
  To: KOSAKI Motohiro, Tejun Heo
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

There are systems which can.  They have the ability to remap in hardware.

KOSAKI Motohiro <kosaki.motohiro@gmail.com> wrote:
>(8/14/13 3:55 PM), Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Aug 14, 2013 at 03:40:31PM -0400, KOSAKI Motohiro wrote:
>>> I don't agree with it. Please look at other kernel options. A lot of them don't
>>> follow your rule. They behave as directives, not advice.
>>>
>>> I mean the fallback should be implemented when the feature is turned on by default.
>>
>> Yeah, some options are "please try this" and others "do this or fail".
>> There's no frigging fundamental rule there.
>
>In this case, a fallback has zero value, right?
>
>
>>> I haven't read the whole discussion, and I don't quite understand why the lack of
>>> control over kernel placement is relevant. Every non-hotpluggable node is suitable
>>> for the kernel. If you mean the current kernel placement logic doesn't care about
>>> hotplug, that's a bug.
>>>
>>> If we aim for hot remove, we have to have either kernel relocation or
>>> hotplug-aware kernel placement at boot time.
>>
>> What if all nodes are hot pluggable?  Are we moving the kernel
>> dynamically then?
>
>Intel folks already told us that no such system exists in practice.
>
>
>>>> Failing to boot is a *way* worse reporting mechanism than almost
>>>> everything else.  If the sysadmin is willing to risk machines failing
>>>> to come up, she would definitely be willing to check which
>>>> memory areas are actually hotpluggable too, right?
>>>
>>> No, see above. Your opinion is not pragmatically useful.
>>
>> No, what you're saying doesn't make any sense.  There are multiple
>> ways to report when something doesn't work.  Failing to boot is *one*
>> of them and not a very good one.  Here, for practical reasons, the end
>> result may differ depending on the specifics of the configuration, so
>> more detailed reporting is necessary anyway, so why do you insist on
>> failing the boot?  In what world is it a good thing for the machine to
>> fail boot after bios or kernel update?
>
>Because a boot failure cannot be overlooked, and that is the better way in practice.

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 20:29                       ` KOSAKI Motohiro
  2013-08-14 20:30                         ` H. Peter Anvin
@ 2013-08-14 20:35                         ` Tejun Heo
  2013-08-14 21:17                           ` KOSAKI Motohiro
  1 sibling, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2013-08-14 20:35 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Wed, Aug 14, 2013 at 04:29:05PM -0400, KOSAKI Motohiro wrote:
> Because a boot failure cannot be overlooked, and that is the better way in practice.

That's an extremely poor excuse.  We favor WARNs over BUGs for good
reasons.  If a sysadmin cares about hotplug and can't deal with the
system successfully booting, it's *trivial* to make the system behave
in a way which has no chance of being overlooked.  What's next?
Panicking if somebody echoes invalid value to an important knob file?
We sure don't want that to be overlooked either, right?

This discussion is so dumb.  Please stop.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 20:35                         ` Tejun Heo
@ 2013-08-14 21:17                           ` KOSAKI Motohiro
  2013-08-14 21:36                             ` Tejun Heo
  0 siblings, 1 reply; 48+ messages in thread
From: KOSAKI Motohiro @ 2013-08-14 21:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KOSAKI Motohiro, Tang Chen, Tang Chen, robert.moore, lv.zheng,
	rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, yanghy, x86, linux-doc, linux-kernel,
	linux-mm, linux-acpi

(8/14/13 4:35 PM), Tejun Heo wrote:
> On Wed, Aug 14, 2013 at 04:29:05PM -0400, KOSAKI Motohiro wrote:
>> Because a boot failure cannot be overlooked, and that is the better way in practice.
>
> That's an extremely poor excuse.  We favor WARNs over BUGs for good
> reasons.  If a sysadmin cares about hotplug and can't deal with the
> system successfully booting, it's *trivial* to make the system behave
> in a way which has no chance of being overlooked.  What's next?
> Panicking if somebody echoes invalid value to an important knob file?
> We sure don't want that to be overlooked either, right?
>
> This discussion is so dumb.  Please stop.

You haven't explained the practical benefit of your opinion. As long as users get
no benefit, I will never agree. Sorry.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 21:17                           ` KOSAKI Motohiro
@ 2013-08-14 21:36                             ` Tejun Heo
  2013-08-15  1:08                               ` KOSAKI Motohiro
  0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2013-08-14 21:36 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Wed, Aug 14, 2013 at 05:17:23PM -0400, KOSAKI Motohiro wrote:
> You haven't explained the practical benefit of your opinion. As long as users get
> no benefit, I will never agree. Sorry.

Umm... how about being more robust and actually useable to begin with?
What's the benefit of panicking?  Are you seriously saying that the
admin / boot script can use the kernel boot param to tell the kernel
to enable hotplug but can't check what nodes are hot unpluggable
afterwards?  The admin *needs* to check which nodes are hotpluggable
no matter how this part is handled.  How else is it gonna know which
nodes are hotpluggable?  Magic?

There's no such rule as kernel param should make the kernel panic if
it's not happy, so please take that out of your brain.  It of course
should be clear what the result of the kernel parameter is and
panicking is the crudest way to do that which is good enough or even
desirable in *some* cases.  It is not the required behavior by any
stretch of imagination, especially when the result of the parameter may
change due to changing circumstances.  That's an outright idiotic
thing to do.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 5/7] memblock, mem_hotplug: Make memblock skip hotpluggable regions by default.
  2013-08-08 10:16 ` [PATCH part5 5/7] memblock, mem_hotplug: Make memblock skip hotpluggable regions by default Tang Chen
@ 2013-08-14 21:54   ` Naoya Horiguchi
  2013-08-15  5:15     ` Tang Chen
  0 siblings, 1 reply; 48+ messages in thread
From: Naoya Horiguchi @ 2013-08-14 21:54 UTC (permalink / raw)
  To: Tang Chen
  Cc: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Thu, Aug 08, 2013 at 06:16:17PM +0800, Tang Chen wrote:
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
...
> @@ -719,6 +723,10 @@ void __init_memblock __next_free_mem_range_rev(u64 *idx, int nid,
>  		if (nid != MAX_NUMNODES && nid != memblock_get_region_node(m))
>  			continue;
>  
> +		/* skip hotpluggable memory regions */
> +		if (m->flags & MEMBLOCK_HOTPLUG)
> +			continue;
> +
>  		/* scan areas before each reservation for intersection */
>  		for ( ; ri >= 0; ri--) {
>  			struct memblock_region *r = &rsv->regions[ri];
> -- 

Why don't you add this also in __next_free_mem_range()?
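
To make the question concrete, here is a toy userspace sketch, not the
memblock code, of one walker that applies the same skip in both
directions. toy_region, TOY_HOTPLUG and toy_next_region are hypothetical
names; TOY_HOTPLUG only stands in for MEMBLOCK_HOTPLUG.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_HOTPLUG 0x1u	/* stands in for MEMBLOCK_HOTPLUG */

struct toy_region {
	uint64_t base;
	uint64_t size;
	unsigned int flags;
};

/*
 * Return the index of the next usable region at or after `idx` when
 * step is +1, or at or before `idx` when step is -1; -1 if none is left.
 * Both directions apply the same hotpluggable check.
 */
static int toy_next_region(const struct toy_region *r, int nr, int idx, int step)
{
	assert(step == 1 || step == -1);
	for (; idx >= 0 && idx < nr; idx += step) {
		if (r[idx].flags & TOY_HOTPLUG)
			continue;	/* skip hotpluggable memory regions */
		return idx;
	}
	return -1;
}
```

If only the reverse walk had the check, a forward walk over the same
regions could still hand out a hotpluggable region.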

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 21:36                             ` Tejun Heo
@ 2013-08-15  1:08                               ` KOSAKI Motohiro
  2013-08-15  1:21                                 ` Tejun Heo
  0 siblings, 1 reply; 48+ messages in thread
From: KOSAKI Motohiro @ 2013-08-15  1:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KOSAKI Motohiro, Tang Chen, Tang Chen, robert.moore, lv.zheng,
	rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, yanghy, x86, linux-doc, linux-kernel,
	linux-mm, linux-acpi

(8/14/13 5:36 PM), Tejun Heo wrote:
> On Wed, Aug 14, 2013 at 05:17:23PM -0400, KOSAKI Motohiro wrote:
>> You haven't explained the practical benefit of your opinion. As long as users get
>> no benefit, I will never agree. Sorry.
> 
> Umm... how about being more robust and actually useable to begin with?
> What's the benefit of panicking?  Are you seriously saying that the
> admin / boot script can use the kernel boot param to tell the kernel
> to enable hotplug but can't check what nodes are hot unpluggable
> afterwards?  The admin *needs* to check which nodes are hotpluggable
> no matter how this part is handled.  How else is it gonna know which
> nodes are hotpluggable?  Magic?
> 
> There's no such rule as kernel param should make the kernel panic if
> it's not happy, so please take that out of your brain.  It of course
> should be clear what the result of the kernel parameter is and
> panicking is the crudest way to do that which is good enough or even
> desirable in *some* cases.  It is not the required behavior by any
> stretch of imagination, especially when the result of the parameter may
> change due to changing circumstances.  That's an outright idiotic
> thing to do.

Sigh, I'd like to point to a link to the past discussion, but I can't find it now.
Let me summarize the past discussion as far as possible.

Firstly, technically you can't implement a correct fallback. You used the term
"when can't allocate memory", but it's not so simple. Think of the following scenario:
memory is enough for the kernel image, but the kernel will load memory-hogging drivers.
The system will crash within 1 minute after boot. So the MM subsystem doesn't believe in
a fallback. A bogus and misguided fallback gives users false relief, and they don't
notice their mistake quickly. The answer is that there is a fundamental rule.
We have always said, "measure your system carefully, and set options carefully too".
I have not seen any reason to make an exception in this case.

Secondly, memory hotplug is now maintained by me and kamezawa-san. So I am very likely
to get hotplug-related bug reports. To protect my life, I don't want to get
false bug claims, so I wouldn't like to aim for an incomplete fallback. When
an admin makes a mistake, they should shoot their own foot, not mine!

Thirdly, I haven't insist to aim verbose and kind messages as last breath. It much
likely help users. 

Last, we are now discussing the hotplug feature, so we can assume a hotpluggable machine.
Such machines have a hotplug interface in firmware by definition. So, you need to aim a magic.




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15  1:08                               ` KOSAKI Motohiro
@ 2013-08-15  1:21                                 ` Tejun Heo
  2013-08-15  1:33                                   ` Tejun Heo
  2013-08-15  1:38                                   ` KOSAKI Motohiro
  0 siblings, 2 replies; 48+ messages in thread
From: Tejun Heo @ 2013-08-15  1:21 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hello, KOSAKI.

On Wed, Aug 14, 2013 at 09:08:22PM -0400, KOSAKI Motohiro wrote:
...
> a fallback. A bogus and misguided fallback gives users false relief, and they don't
> notice their mistake quickly. The answer is that there is a fundamental rule.
> We have always said, "measure your system carefully, and set options carefully too".
> I have not seen any reason to make an exception in this case.

Ugh... that is one stupid rule.  Sure, there are cases when those
aren't avoidable but sticking to that when there are better ways to do
it is stupid.  Why would you make it finicky when you don't have to?
That makes no sense.

> Secondly, memory hotplug is now maintained by me and kamezawa-san. So I am very likely
> to get hotplug-related bug reports. To protect my life, I don't want to get
> false bug claims, so I wouldn't like to aim for an incomplete fallback. When
> an admin makes a mistake, they should shoot their own foot, not mine!

Dude, it's not cool to cause users' machines to fail boot because you
want bug reports.  You don't do that.  There are other ways to achieve
that.  When the kernel can't make all hotpluggable nodes hotpluggable
(I mean, it's not necessarily node aligned to begin with), generate
a warning and a debug dump with appropriate log levels.
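
Something along these lines, as a toy userspace sketch rather than
kernel code. toy_node, toy_node_removable and toy_report_nodes are
hypothetical names, and the three flags just restate the conditions
discussed earlier in the thread (marked by firmware, not touched by
early allocations, no overflow allocations).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-node bookkeeping; not an existing kernel structure. */
struct toy_node {
	bool fw_hotpluggable;	/* marked hotpluggable by firmware (SRAT) */
	bool early_alloc;	/* kernel allocated from it before SRAT parsing */
	bool overflow_alloc;	/* kernel fell back to it later */
};

static bool toy_node_removable(const struct toy_node *n)
{
	return n->fw_hotpluggable && !n->early_alloc && !n->overflow_alloc;
}

/* Warn instead of failing; return how many nodes stayed removable. */
static int toy_report_nodes(const struct toy_node *nodes, int nr)
{
	int removable = 0;

	assert(nr >= 0);
	for (int i = 0; i < nr; i++) {
		if (toy_node_removable(&nodes[i]))
			removable++;
		else if (nodes[i].fw_hotpluggable)
			fprintf(stderr,
				"warning: node %d holds kernel memory, not removable\n",
				i);
	}
	return removable;
}
```

The machine still comes up, and whoever cares about hotplug can read
the result instead of staring at a dead console.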

If you think causing users' machines to fail boot nondeterministically is
acceptable, you really shouldn't be maintaining anything.  What is
this?  Are you nuts?

> Thirdly, I haven't insist to aim verbose and kind messages as last breath. It much
> likely help users. 

I have no idea what you're trying to say.

> Last, we are now discussing the hotplug feature, so we can assume a hotpluggable machine.
> Such machines have a hotplug interface in firmware by definition. So, you need to aim a magic.

This is in no way magic.  It's a band-aid feature which aims to
achieve some portion of functionality with minimal impact on the rest
of code / runtime overhead.  If you wanna nack the whole thing, be my
guest.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15  1:21                                 ` Tejun Heo
@ 2013-08-15  1:33                                   ` Tejun Heo
  2013-08-15  1:44                                     ` KOSAKI Motohiro
  2013-08-15  1:38                                   ` KOSAKI Motohiro
  1 sibling, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2013-08-15  1:33 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Wed, Aug 14, 2013 at 09:21:33PM -0400, Tejun Heo wrote:
> > Secondly, memory hotplug is now maintained by me and kamezawa-san, so I am very
> > likely to receive hotplug-related bug reports. To protect myself, I don't
> > want to get false bug claims, so I don't want to aim for an incomplete fallback. When
> > an admin makes a mistake, they should shoot their own foot, not mine!
> 
> Dude, it's not cool to cause users' machines to fail boot because you
> want a bug report.  You don't do that.  There are other ways to achieve
> that.  When the kernel can't make all hotpluggable nodes hotpluggable
> (I mean, it's not necessarily node aligned to begin with), generate a
> warning and a debug dump with appropriate log levels.
> 
> If you think causing users' machines to fail boot nondeterministically is
> acceptable, you really shouldn't be maintaining anything.  What is
> this?  Are you nuts?

This is doubly idiotic because this is all early boot.  Most users
don't even have a way to access the debug info if the machine crashes
that early.  Development convenience is something that we consider
too but, seriously, users come first.  This is not your personal
playground.  Don't frigging crash if you have any other option.

-- 
tejun


* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15  1:21                                 ` Tejun Heo
  2013-08-15  1:33                                   ` Tejun Heo
@ 2013-08-15  1:38                                   ` KOSAKI Motohiro
  2013-08-15  1:51                                     ` Tejun Heo
  1 sibling, 1 reply; 48+ messages in thread
From: KOSAKI Motohiro @ 2013-08-15  1:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KOSAKI Motohiro, Tang Chen, Tang Chen, robert.moore, lv.zheng,
	rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, yanghy, x86, linux-doc, linux-kernel,
	linux-mm, linux-acpi

(8/14/13 9:21 PM), Tejun Heo wrote:
> Hello, KOSAKI.
>
> On Wed, Aug 14, 2013 at 09:08:22PM -0400, KOSAKI Motohiro wrote:
> ...
>> a fallback. A bogus and misguided fallback gives users false relief, and they don't
>> notice their mistake quickly. The answer is that there is a fundamental rule.
>> We have always said, "measure your system carefully, and set options carefully too".
>> I have not seen any reason to make an exception in this case.
>
> Ugh... that is one stupid rule.  Sure, there are cases when those
> aren't avoidable but sticking to that when there are better ways to do
> it is stupid.  Why would you make it finicky when you don't have to?
> That makes no sense.

Just as you think my position makes no sense, I think yours makes no sense. So please
stop using emotional words. They don't help the discussion progress.


>> Secondly, memory hotplug is now maintained by me and kamezawa-san, so I am very
>> likely to receive hotplug-related bug reports. To protect myself, I don't
>> want to get false bug claims, so I don't want to aim for an incomplete fallback. When
>> an admin makes a mistake, they should shoot their own foot, not mine!
>
> Dude, it's not cool to cause users' machines to fail boot because you
> want a bug report.  You don't do that.  There are other ways to achieve
> that.  When the kernel can't make all hotpluggable nodes hotpluggable
> (I mean, it's not necessarily node aligned to begin with), generate a
> warning and a debug dump with appropriate log levels.

If the user were you, I would agree. But I know that users don't react that way.

> If you think causing users' machines to fail boot nondeterministically is
> acceptable, you really shouldn't be maintaining anything.  What is
> this?  Are you nuts?

Again, there is no perfect solution if an admin is truly stupid. We can only
tell them "you are wrong, not the kernel", and nothing more. I'm sure kernel
logging alone doesn't help, because they don't read it, and they say nobody reads such
plentiful, developer-oriented messages. I might accept some kind of strong notification, but
I still don't think it's worth it. The only sane way is for admins to realize their mistake
and fix it themselves.


>> Thirdly, I haven't insist to aim verbose and kind messages as last breath. It much
>> likely help users.
>
> I have no idea what you're trying to say.

I meant that the "which is verbose" objection makes no sense. I don't accept it.


>> Last, we are now discussing the hotplug feature, so we can assume a hotpluggable machine.
>> They have a hotplug interface in firmware by definition. So, you need to aim a magic.
>
> This is in no way magic.  It's a band-aid feature which aims to
> achieve some portion of functionality with minimal impact on the rest
> of code / runtime overhead.  If you wanna nack the whole thing, be my
> guest.

Huh? No fallback means no additional code. I can't imagine how no code could cause runtime overhead.




* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15  1:33                                   ` Tejun Heo
@ 2013-08-15  1:44                                     ` KOSAKI Motohiro
  2013-08-15  2:22                                       ` Tejun Heo
  0 siblings, 1 reply; 48+ messages in thread
From: KOSAKI Motohiro @ 2013-08-15  1:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KOSAKI Motohiro, Tang Chen, Tang Chen, robert.moore, lv.zheng,
	rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, yanghy, x86, linux-doc, linux-kernel,
	linux-mm, linux-acpi

(8/14/13 9:33 PM), Tejun Heo wrote:
> On Wed, Aug 14, 2013 at 09:21:33PM -0400, Tejun Heo wrote:
>>> Secondly, memory hotplug is now maintained by me and kamezawa-san, so I am very
>>> likely to receive hotplug-related bug reports. To protect myself, I don't
>>> want to get false bug claims, so I don't want to aim for an incomplete fallback. When
>>> an admin makes a mistake, they should shoot their own foot, not mine!
>>
>> Dude, it's not cool to cause users' machines to fail boot because you
>> want a bug report.  You don't do that.  There are other ways to achieve
>> that.  When the kernel can't make all hotpluggable nodes hotpluggable
>> (I mean, it's not necessarily node aligned to begin with), generate a
>> warning and a debug dump with appropriate log levels.
>>
>> If you think causing users' machines to fail boot nondeterministically is
>> acceptable, you really shouldn't be maintaining anything.  What is
>> this?  Are you nuts?
>
> This is doubly idiotic because this is all early boot.  Most users
> don't even have a way to access the debug info if the machine crashes
> that early.  Development convenience is something that we consider
> too but, seriously, users come first.  This is not your personal
> playground.  Don't frigging crash if you have any other option.

Again, what is best depends on the purpose and the goal. If someone specifies
that hotplugging be enabled, they are sure they need it. No fallback
achieves their goal. Their goal is not merely booting. If they don't have a
machine sufficient to achieve their goal, we have only one way: tell them that.
If we had an alternative way, I might give a different answer.


* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15  1:38                                   ` KOSAKI Motohiro
@ 2013-08-15  1:51                                     ` Tejun Heo
  0 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2013-08-15  1:51 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Wed, Aug 14, 2013 at 09:38:12PM -0400, KOSAKI Motohiro wrote:
> Just as you think my position makes no sense, I think yours makes no sense. So please
> stop using emotional words. They don't help the discussion progress.

Would you then please stop making nonsense assertions like "the
fundamental rule here is to crash"?  You could have started the whole
thread with "I'm not sure about the failure mode, it can be better to
hard fail because ..." and we could have debated on the details.
Instead I now have to break the nonsense assertion.  Of course the
tension is way higher.

> If the user were you, I would agree. But I know that users don't react that way.

Yeah, users react super well to machines failing boot without any way
to know what's going on.  How is that a good idea?

> Again, there is no perfect solution if an admin is truly stupid. We can only
> tell them "you are wrong, not the kernel", and nothing more. I'm sure kernel
> logging alone doesn't help, because they don't read it, and they say nobody reads such

There are things like automated reporting.  The system is trying to
use hotplug, right?  It would have associated tools to do that, won't
it?  If you want to support it, build sensible tools and conventions
around it and given how specialized / highend the whole thing is, it
shouldn't be hard either.

> plentiful, developer-oriented messages. I might accept some kind of strong notification, but
> I still don't think it's worth it. The only sane way is for admins to realize their mistake
> and fix it themselves.

Yes, we'll show them who's the boss.  No, this is not how things are
done in the kernel.  We don't crash to give admins a lesson.  Do you even
realize that this isn't completely deterministic?  The machine might
boot fine one time and fail the next time.  What lesson would that
teach the admin?  Stay away from Linux?

> Huh? No fallback means no additional code. I can't imagine how no code could cause runtime overhead.

What fallback are you talking about?  You need to report hotpluggable
nodes somehow anyway.

-- 
tejun


* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15  1:44                                     ` KOSAKI Motohiro
@ 2013-08-15  2:22                                       ` Tejun Heo
  0 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2013-08-15  2:22 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Wed, Aug 14, 2013 at 09:44:19PM -0400, KOSAKI Motohiro wrote:
> >This is doubly idiotic because this is all early boot.  Most users
> >don't even have a way to access the debug info if the machine crashes
> >that early.  Development convenience is something that we consider
> >too but, seriously, users come first.  This is not your personal
> >playground.  Don't frigging crash if you have any other option.
> 
> Again, what is best depends on the purpose and the goal. If someone specifies
> that hotplugging be enabled, they are sure they need it. No fallback
> achieves their goal. Their goal is not merely booting. If they don't have a
> machine sufficient to achieve their goal, we have only one way: tell them that.

Yes, you go and tell them with the blank screen.

-- 
tejun


* Re: [PATCH part5 5/7] memblock, mem_hotplug: Make memblock skip hotpluggable regions by default.
  2013-08-14 21:54   ` Naoya Horiguchi
@ 2013-08-15  5:15     ` Tang Chen
  0 siblings, 0 replies; 48+ messages in thread
From: Tang Chen @ 2013-08-15  5:15 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On 08/15/2013 05:54 AM, Naoya Horiguchi wrote:
> On Thu, Aug 08, 2013 at 06:16:17PM +0800, Tang Chen wrote:
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
> ...
>> @@ -719,6 +723,10 @@ void __init_memblock __next_free_mem_range_rev(u64 *idx, int nid,
>>   		if (nid != MAX_NUMNODES && nid != memblock_get_region_node(m))
>>   			continue;
>>
>> +		/* skip hotpluggable memory regions */
>> +		if (m->flags & MEMBLOCK_HOTPLUG)
>> +			continue;
>> +
>>   		/* scan areas before each reservation for intersection */
>>   		for ( ; ri >= 0; ri--) {
>>   			struct memblock_region *r = &rsv->regions[ri];
>> -- 
> 
> Why don't you add this also in __next_free_mem_range()?

Hi Naoya,

__next_free_mem_range_rev() is for for_each_free_mem_range_reverse(), which is
only called in memblock_find_in_range_node().

But I think __next_free_mem_range() is for for_each_free_mem_range(), which is
called by many others. These callers may have nothing to do with memory
hotplug, so I didn't add the check there.

Maybe adding the check here is not good. I'm trying to find somewhere to
check MEMBLOCK_HOTPLUG.
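
As a rough illustration of the distinction described above, here is a minimal userspace sketch, not the kernel code. `struct region`, `REGION_HOTPLUG`, `find_top_down()` and `total_bytes()` are hypothetical stand-ins for `struct memblock_region`, `MEMBLOCK_HOTPLUG`, the reverse (allocation) walk and a generic forward walk. It assumes the flag check lives only in the reverse walk, as in the patch quoted above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag, modelled on MEMBLOCK_HOTPLUG from the patch. */
#define REGION_HOTPLUG 0x1u

/* Simplified, hypothetical stand-in for struct memblock_region. */
struct region {
	uint64_t base;
	uint64_t size;
	unsigned int flags;
};

/*
 * Reverse (top-down) walk used for allocation: return the index of the
 * highest region that is large enough and not hotpluggable, or -1 if
 * no such region exists.
 */
int find_top_down(const struct region *regs, size_t n, uint64_t size)
{
	for (size_t i = n; i-- > 0; ) {
		/* skip hotpluggable memory regions */
		if (regs[i].flags & REGION_HOTPLUG)
			continue;
		if (regs[i].size >= size)
			return (int)i;
	}
	return -1;
}

/*
 * Forward walk with no flag check: callers unrelated to memory hotplug
 * still see every region, hotpluggable or not.
 */
uint64_t total_bytes(const struct region *regs, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += regs[i].size;
	return sum;
}
```

With three regions where only the highest one is flagged hotpluggable, `find_top_down()` falls back to a lower region or fails, while `total_bytes()` still accounts for all three. That is the behaviour difference between the two iterators that the reply above is concerned about.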

Thanks. :)


Thread overview: 48+ messages
2013-08-08 10:16 [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE Tang Chen
2013-08-08 10:16 ` [PATCH part5 1/7] x86: get pg_data_t's memory from other node Tang Chen
2013-08-12 14:39   ` Tejun Heo
2013-08-12 15:12     ` Tang Chen
2013-08-08 10:16 ` [PATCH part5 2/7] x86, numa, mem_hotplug: Skip all the regions the kernel resides in Tang Chen
2013-08-08 10:16 ` [PATCH part5 3/7] memblock, numa: Introduce flag into memblock Tang Chen
2013-08-08 10:16 ` [PATCH part5 4/7] memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions Tang Chen
2013-08-08 10:16 ` [PATCH part5 5/7] memblock, mem_hotplug: Make memblock skip hotpluggable regions by default Tang Chen
2013-08-14 21:54   ` Naoya Horiguchi
2013-08-15  5:15     ` Tang Chen
2013-08-08 10:16 ` [PATCH part5 6/7] mem-hotplug: Introduce movablenode boot option to {en|dis}able using SRAT Tang Chen
2013-08-08 10:16 ` [PATCH part5 7/7] x86, numa, acpi, memory-hotplug: Make movablenode have higher priority Tang Chen
2013-08-09 16:32 ` [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE Tejun Heo
2013-08-12  8:54   ` Tang Chen
2013-08-12 14:50 ` Tejun Heo
2013-08-12 15:14   ` H. Peter Anvin
2013-08-12 15:23     ` Tejun Heo
2013-08-12 16:29       ` Tang Chen
2013-08-12 16:46         ` Tejun Heo
2013-08-12 18:23           ` Tang Chen
2013-08-12 20:20             ` Tejun Heo
2013-08-13  6:14           ` Tang Chen
2013-08-13  9:56             ` Tang Chen
2013-08-13 14:38               ` Tejun Heo
2013-08-12 15:41   ` Tang Chen
2013-08-12 15:46     ` Tejun Heo
2013-08-12 16:19       ` Tang Chen
2013-08-12 16:22         ` Tejun Heo
2013-08-12 17:01           ` Tang Chen
2013-08-12 17:23             ` H. Peter Anvin
2013-08-14 18:22               ` KOSAKI Motohiro
2013-08-12 18:07             ` Tejun Heo
2013-08-14 18:15               ` KOSAKI Motohiro
2013-08-14 18:23                 ` Tejun Heo
2013-08-14 19:40                   ` KOSAKI Motohiro
2013-08-14 19:55                     ` Tejun Heo
2013-08-14 20:29                       ` KOSAKI Motohiro
2013-08-14 20:30                         ` H. Peter Anvin
2013-08-14 20:35                         ` Tejun Heo
2013-08-14 21:17                           ` KOSAKI Motohiro
2013-08-14 21:36                             ` Tejun Heo
2013-08-15  1:08                               ` KOSAKI Motohiro
2013-08-15  1:21                                 ` Tejun Heo
2013-08-15  1:33                                   ` Tejun Heo
2013-08-15  1:44                                     ` KOSAKI Motohiro
2013-08-15  2:22                                       ` Tejun Heo
2013-08-15  1:38                                   ` KOSAKI Motohiro
2013-08-15  1:51                                     ` Tejun Heo
