* [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
@ 2013-08-08 10:16 ` Tang Chen
  0 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-08 10:16 UTC (permalink / raw)
  To: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

[Problem]

Linux currently cannot migrate pages used by the kernel because
of the kernel direct mapping. In Linux kernel space, va = pa + PAGE_OFFSET.
When the pa changes, we cannot simply update the page table and
keep the va unmodified. So kernel pages are not migratable.
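
The direct-mapping constraint can be illustrated with a small user-space
sketch. All demo_* names are made up for illustration, and DEMO_PAGE_OFFSET
is only an example value, not something taken from this patch-set.

```c
#include <stdint.h>

/* Hypothetical base of the direct mapping; an example value only. */
#define DEMO_PAGE_OFFSET 0xffff880000000000ULL

/* In the direct mapping, va is a fixed offset from pa. */
static uint64_t demo_pa_to_va(uint64_t pa)
{
	return pa + DEMO_PAGE_OFFSET;
}

static uint64_t demo_va_to_pa(uint64_t va)
{
	return va - DEMO_PAGE_OFFSET;
}

/*
 * Returns 1 if a page moved from old_pa to new_pa could keep its
 * direct-mapped va, 0 otherwise. Because the offset is fixed, this
 * only holds when the page does not actually move.
 */
static int demo_va_survives_move(uint64_t old_pa, uint64_t new_pa)
{
	return demo_pa_to_va(old_pa) == demo_pa_to_va(new_pa);
}
```

Any pointer the kernel holds into the old page would therefore have to be
found and rewritten, which is what makes such pages unmigratable in practice.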

There are also other issues that make kernel pages unmigratable.
For example, a physical address may be cached somewhere and used later.
It is not feasible to update all such caches.

When hot-removing memory in Linux, we first migrate all the pages on one
memory device somewhere else, and then remove the device. But if pages are
used by the kernel, they are not migratable. As a result, memory used by
the kernel cannot be hot-removed.

Modifying the kernel direct mapping mechanism is too difficult, and it
may make the kernel slower and less stable. So we use the following
approach to support memory hotplug.


[What we are doing]

In Linux, the memory in one NUMA node is divided into several zones. One of
the zones is ZONE_MOVABLE, from which the kernel does not allocate memory
for its own use.

In order to implement memory hotplug in Linux, we are going to arrange all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this memory.

To do this, we need ACPI's help.


[How we do this]

In ACPI, the SRAT (System Resource Affinity Table) contains NUMA info. The
memory affinity structures in the SRAT record every memory range in the
system, along with flags specifying whether the range is hotpluggable.
(Please refer to ACPI spec 5.0, section 5.2.16.)

With the help of SRAT, we have to do the following two things to achieve our
goal:

1. When doing memory hot-add, allow users to arrange hotpluggable memory
   as ZONE_MOVABLE.
   (This has been done by the MOVABLE_NODE functionality in Linux.)

2. When the system is booting, prevent the bootmem allocator from allocating
   hotpluggable memory for the kernel before memory initialization
   finishes.
   (This is what we are going to do. See below.)
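
To make the SRAT part concrete, here is a sketch of filtering memory
affinity entries by their hotpluggable flag. The struct is a simplified
stand-in and not the real ACPI table layout. The bit positions (bit 0 for
enabled, bit 1 for hot-pluggable) are an assumption based on the Memory
Affinity Structure in the ACPI spec and should be checked against it. All
demo_* and DEMO_* names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag bits of a memory affinity entry (assumed from the ACPI spec). */
#define DEMO_SRAT_MEM_ENABLED		(1u << 0)
#define DEMO_SRAT_MEM_HOT_PLUGGABLE	(1u << 1)

/* Simplified stand-in for one SRAT memory affinity entry. */
struct demo_mem_affinity {
	uint32_t proximity_domain;
	uint64_t base;
	uint64_t length;
	uint32_t flags;
};

/* An entry matters here only if it is enabled and marked hot-pluggable. */
static int demo_is_hotpluggable(const struct demo_mem_affinity *ma)
{
	return (ma->flags & DEMO_SRAT_MEM_ENABLED) &&
	       (ma->flags & DEMO_SRAT_MEM_HOT_PLUGGABLE);
}

/* Sum the bytes of all hotpluggable ranges in a table of n entries. */
static uint64_t demo_hotpluggable_bytes(const struct demo_mem_affinity *tbl,
					size_t n)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (demo_is_hotpluggable(&tbl[i]))
			total += tbl[i].length;
	return total;
}
```

The ranges selected this way are the ones that should end up in ZONE_MOVABLE.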


[About this patch-set]

In the previous parts of this series, we arranged to obtain the SRAT early
enough, right after memblock is ready. So this patch-set does the following
things:

1. Improve memblock to support flags, which are used to indicate different
   memory types.

2. Mark all hotpluggable memory in memblock.memory[].

3. Make the default memblock allocator skip hotpluggable memory.

4. Introduce the "movablenode" boot option to allow users to enable/disable
   this functionality.
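
Steps 1 to 4 can be pictured with a toy model of a region table. This is
only a sketch under simplifying assumptions: it does a plain first-fit scan
and does not track what has already been allocated, unlike the real
memblock allocator. DEMO_MEMBLOCK_HOTPLUG mirrors the MEMBLOCK_HOTPLUG flag
named in this series; every other name is made up.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MEMBLOCK_HOTPLUG	0x1UL	/* region may be hot-removed later */

struct demo_region {
	uint64_t base;
	uint64_t size;
	unsigned long flags;
};

/*
 * Return the base of the first region that is large enough and is not
 * marked hotpluggable, or 0 if there is none. When skip_hotplug is 0
 * (the functionality is disabled), hotpluggable regions are eligible too.
 */
static uint64_t demo_find_region(const struct demo_region *rgn, size_t n,
				 uint64_t size, int skip_hotplug)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (skip_hotplug && (rgn[i].flags & DEMO_MEMBLOCK_HOTPLUG))
			continue;
		if (rgn[i].size >= size)
			return rgn[i].base;
	}
	return 0;
}
```

The skip_hotplug argument plays the role of the boot option: with it set,
early allocations never land in a region that may be removed later.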


Tang Chen (6):
  x86, numa, mem_hotplug: Skip all the regions the kernel resides in.
  memblock, numa: Introduce flag into memblock.
  memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark
    hotpluggable regions.
  memblock, mem_hotplug: Make memblock skip hotpluggable regions by
    default.
  mem-hotplug: Introduce movablenode boot option to {en|dis}able using
    SRAT.
  x86, numa, acpi, memory-hotplug: Make movablenode have higher
    priority.

Yasuaki Ishimatsu (1):
  x86: get pg_data_t's memory from other node

 Documentation/kernel-parameters.txt |   15 ++++++
 arch/x86/kernel/setup.c             |   10 +++-
 arch/x86/mm/numa.c                  |    5 +-
 include/linux/memblock.h            |   13 +++++
 include/linux/memory_hotplug.h      |    3 +
 mm/memblock.c                       |   92 +++++++++++++++++++++++++++++------
 mm/memory_hotplug.c                 |   56 +++++++++++++++++++++-
 mm/page_alloc.c                     |   31 +++++++++++-
 8 files changed, 201 insertions(+), 24 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH part5 1/7] x86: get pg_data_t's memory from other node
  2013-08-08 10:16 ` Tang Chen
@ 2013-08-08 10:16   ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-08 10:16 UTC (permalink / raw)
  To: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

If the system can create a movable node, in which all of the node's memory
is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
the node's pg_data_t from that node. So use memblock_alloc_try_nid() instead
of memblock_alloc_nid() to retry on other nodes when the first allocation
fails. Otherwise, the system could fail to boot.
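
The difference between the strict and the fallback allocation can be
sketched with a toy model. This is not the memblock implementation: it
only tracks a byte budget per node, and every name in it is made up.

```c
#include <stddef.h>

#define DEMO_NR_NODES 4

/* Bytes the early allocator may still hand out on each node (toy model). */
struct demo_nodes {
	size_t usable[DEMO_NR_NODES];
};

/* Strict variant: allocate on node nid or fail. Returns nid or -1. */
static int demo_alloc_nid(struct demo_nodes *n, size_t size, int nid)
{
	if (n->usable[nid] < size)
		return -1;
	n->usable[nid] -= size;
	return nid;
}

/* Fallback variant: prefer nid, then retry on any other node. */
static int demo_alloc_try_nid(struct demo_nodes *n, size_t size, int nid)
{
	int i;

	if (demo_alloc_nid(n, size, nid) == nid)
		return nid;
	for (i = 0; i < DEMO_NR_NODES; i++)
		if (i != nid && demo_alloc_nid(n, size, i) == i)
			return i;
	return -1;
}
```

A movable node corresponds to a node whose usable budget is zero at boot:
the strict variant fails there, while the fallback variant places the
pg_data_t on another node.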

The node_data could be on a hotpluggable node, and so could the page tables
and vmemmap. But for now, doing so would break the memory hot-remove path.

A node could have several memory devices, and the device that holds the node
data should be hot-removed last. But at the NUMA level, we don't know which
memory_block (/sys/devices/system/node/nodeX/memoryXXX) belongs to which
memory device. We only have the node, so we can only do node hotplug.

But in virtualization, developers are now developing memory hotplug in qemu,
which supports hotplug of a single memory device. So whole-node hotplug will
not satisfy virtualization users.

So in the end, we concluded that we'd better do memory hotplug and the
node-local things (node-local node data, page tables, vmemmap, ...) in two
steps. Please refer to https://lkml.org/lkml/2013/6/19/73

For now, we put the node_data of a movable node on another node, and will
improve this in the future.

In later patches, a boot option will be introduced to enable/disable this
functionality. If users disable it, the node_data will still be put on the
local node.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
---
 arch/x86/mm/numa.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8bf93ba..d532b6d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -209,10 +209,9 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
 	 * Allocate node data.  Try node-local memory and then any node.
 	 * Never allocate in DMA zone.
 	 */
-	nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
+	nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
 	if (!nd_pa) {
-		pr_err("Cannot find %zu bytes in node %d\n",
-		       nd_size, nid);
+		pr_err("Cannot find %zu bytes in any node\n", nd_size);
 		return;
 	}
 	nd = __va(nd_pa);
-- 
1.7.1


* [PATCH part5 2/7] x86, numa, mem_hotplug: Skip all the regions the kernel resides in.
  2013-08-08 10:16 ` Tang Chen
@ 2013-08-08 10:16   ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-08 10:16 UTC (permalink / raw)
  To: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

Early in boot, memblock reserves some memory for the kernel, such as
the kernel code and data segments, the initrd file, and so on,
which means the kernel resides in these memory regions.

Even if these memory regions are hotpluggable, we should not
mark them as hotpluggable. Otherwise the kernel won't have enough
memory to boot.

This patch finds out which memory regions the kernel resides in,
and skips them when finding all hotpluggable memory regions.
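
The check comes down to an overlap test on half-open ranges. The following
user-space sketch shows the idea with made-up names (demo_range,
demo_ranges_overlap, demo_reserved_in_range); it is an illustration, not
the code added by this patch.

```c
#include <stddef.h>
#include <stdint.h>

/* A physical range [base, base + size). */
struct demo_range {
	uint64_t base;
	uint64_t size;
};

/* Two half-open ranges overlap iff each one starts before the other ends. */
static int demo_ranges_overlap(uint64_t a_start, uint64_t a_end,
			       uint64_t b_start, uint64_t b_end)
{
	return a_start < b_end && b_start < a_end;
}

/* Returns 1 if any reserved range intersects [base, base + length). */
static int demo_reserved_in_range(const struct demo_range *resv, size_t n,
				  uint64_t base, uint64_t length)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (demo_ranges_overlap(resv[i].base,
					resv[i].base + resv[i].size,
					base, base + length))
			return 1;
	return 0;
}
```

Note the granularity: a whole SRAT range is treated as not hotpluggable as
soon as a single reserved byte falls inside it, while ranges that merely
touch a reserved range are unaffected.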

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 mm/memory_hotplug.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ef9ccf8..e63f947 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -31,6 +31,7 @@
 #include <linux/firmware-map.h>
 #include <linux/stop_machine.h>
 #include <linux/acpi.h>
+#include <linux/memblock.h>
 
 #include <asm/tlbflush.h>
 
@@ -93,6 +94,37 @@ static void release_memory_resource(struct resource *res)
 
 #ifdef CONFIG_ACPI_NUMA
 /**
+ * kernel_resides_in_region - Check if kernel resides in a memory region.
+ * @base: The base address of the memory region.
+ * @length: The length of the memory region.
+ *
+ * This function is used early in boot. It iterates memblock.reserved and
+ * checks if the kernel has used any memory in [@base, @base + @length).
+ *
+ * Return true if the kernel resides in the memory region, false otherwise.
+ */
+static bool __init kernel_resides_in_region(phys_addr_t base, u64 length)
+{
+	int i;
+	phys_addr_t start, end;
+	struct memblock_region *region;
+	struct memblock_type *reserved = &memblock.reserved;
+
+	for (i = 0; i < reserved->cnt; i++) {
+		region = &reserved->regions[i];
+
+		start = region->base;
+		end = region->base + region->size;
+		if (end <= base || start >= base + length)
+			continue;
+
+		return true;
+	}
+
+	return false;
+}
+
+/**
  * find_hotpluggable_memory - Find out hotpluggable memory from ACPI SRAT.
  *
  * This function did the following:
@@ -129,6 +161,16 @@ void __init find_hotpluggable_memory(void)
 
 	while (ACPI_SUCCESS(acpi_hotplug_mem_affinity(srat_vaddr, &base,
 						      &size, &offset))) {
+		/*
+		 * At early time, memblock will reserve some memory for the
+		 * kernel, such as the kernel code and data segments, initrd
+		 * file, and so on, which means the kernel resides in these
+		 * memory regions. These regions should not be hotpluggable.
+		 * So do not mark them as hotpluggable.
+		 */
+		if (kernel_resides_in_region(base, size))
+			continue;
+
 		/* Will mark hotpluggable memory regions here */
 	}
 
-- 
1.7.1


* [PATCH part5 3/7] memblock, numa: Introduce flag into memblock.
  2013-08-08 10:16 ` Tang Chen
@ 2013-08-08 10:16   ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-08 10:16 UTC (permalink / raw)
  To: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

There is no flag in memblock to describe what type the memory is.
Sometimes we may use memblock to reserve some memory for a special usage,
and we want to know what kind of memory it is. So we need a way to
differentiate memory for different usages.

In a hotplug environment, we want to reserve hotpluggable memory so the
kernel won't be able to use it. And when the system is up, we have to
free this hotpluggable memory to the buddy allocator. So we need to mark
this memory first.

In order to do so, we need to mark this special memory in memblock.
In this patch, we introduce a new "flags" member into memblock_region:
   struct memblock_region {
           phys_addr_t base;
           phys_addr_t size;
           unsigned long flags;		/* This is new. */
   #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
           int nid;
   #endif
   };

This patch does the following things:
1) Add a "flags" member to memblock_region.
2) Modify the prototypes of the following APIs:
	memblock_add_region()
	memblock_insert_region()
3) Add memblock_reserve_region() to support reserving memory with flags, and
   keep memblock_reserve()'s prototype unmodified.
4) Modify other APIs to support flags, but keep their prototypes unmodified.
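
One consequence of carrying flags per region is that neighbouring regions
may only be merged when their flags agree. The sketch below models that
rule in user space. The names are made up, and the array handling is a
simplification rather than the memblock code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct demo_region {
	uint64_t base;
	uint64_t size;
	unsigned long flags;
	int nid;
};

/* Neighbours may be merged only if they touch and agree on nid and flags. */
static int demo_can_merge(const struct demo_region *this,
			  const struct demo_region *next)
{
	return this->base + this->size == next->base &&
	       this->nid == next->nid &&
	       this->flags == next->flags;
}

/* Merge compatible neighbours in a sorted array; returns the new count. */
static size_t demo_merge(struct demo_region *r, size_t cnt)
{
	size_t i = 0;

	while (cnt > 1 && i < cnt - 1) {
		if (!demo_can_merge(&r[i], &r[i + 1])) {
			i++;
			continue;
		}
		/* Absorb r[i + 1] into r[i] and close the gap. */
		r[i].size += r[i + 1].size;
		memmove(&r[i + 1], &r[i + 2], (cnt - (i + 2)) * sizeof(*r));
		cnt--;
	}
	return cnt;
}
```

Without the flags comparison, a hotpluggable region would silently be
absorbed into an adjacent ordinary region and its marking would be lost.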

The idea is from Wen Congyang <wency@cn.fujitsu.com> and Liu Jiang <jiang.liu@huawei.com>.

v1 -> v2:
As tj suggested, a zero flag MEMBLK_DEFAULT would confuse users. If
we want to specify any other flag, such as MEMBLK_HOTPLUG, users don't know
whether to use MEMBLK_DEFAULT | MEMBLK_HOTPLUG or just MEMBLK_HOTPLUG. So
remove MEMBLK_DEFAULT (which is 0), and just use 0 by default to avoid
confusing users.
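
The ambiguity is easy to see in a few lines. The DEMO_* values below are
illustrative assumptions, and demo_has_flag() is a made-up helper.

```c
#define DEMO_MEMBLK_DEFAULT	0x0UL	/* the zero-valued flag that was dropped */
#define DEMO_MEMBLK_HOTPLUG	0x1UL

/* Returns 1 if every bit of flag is set in flags. */
static int demo_has_flag(unsigned long flags, unsigned long flag)
{
	return (flags & flag) == flag;
}
```

Since OR-ing in a zero-valued flag changes nothing, both spellings yield
the same value, and a test for the zero flag succeeds on every region. A
named zero flag therefore carries no information.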

Suggested-by: Wen Congyang <wency@cn.fujitsu.com>
Suggested-by: Liu Jiang <jiang.liu@huawei.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 include/linux/memblock.h |    1 +
 mm/memblock.c            |   53 +++++++++++++++++++++++++++++++++-------------
 2 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f388203..e89e0cd 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -22,6 +22,7 @@
 struct memblock_region {
 	phys_addr_t base;
 	phys_addr_t size;
+	unsigned long flags;
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	int nid;
 #endif
diff --git a/mm/memblock.c b/mm/memblock.c
index a847bfe..0841a6e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -157,6 +157,7 @@ static void __init_memblock memblock_remove_region(struct memblock_type *type, u
 		type->cnt = 1;
 		type->regions[0].base = 0;
 		type->regions[0].size = 0;
+		type->regions[0].flags = 0;
 		memblock_set_region_node(&type->regions[0], MAX_NUMNODES);
 	}
 }
@@ -307,7 +308,8 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)
 
 		if (this->base + this->size != next->base ||
 		    memblock_get_region_node(this) !=
-		    memblock_get_region_node(next)) {
+		    memblock_get_region_node(next) ||
+		    this->flags != next->flags) {
 			BUG_ON(this->base + this->size > next->base);
 			i++;
 			continue;
@@ -327,13 +329,15 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)
  * @base:	base address of the new region
  * @size:	size of the new region
  * @nid:	node id of the new region
+ * @flags:	flags of the new region
  *
  * Insert new memblock region [@base,@base+@size) into @type at @idx.
  * @type must already have extra room to accomodate the new region.
  */
 static void __init_memblock memblock_insert_region(struct memblock_type *type,
 						   int idx, phys_addr_t base,
-						   phys_addr_t size, int nid)
+						   phys_addr_t size,
+						   int nid, unsigned long flags)
 {
 	struct memblock_region *rgn = &type->regions[idx];
 
@@ -341,6 +345,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
 	memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));
 	rgn->base = base;
 	rgn->size = size;
+	rgn->flags = flags;
 	memblock_set_region_node(rgn, nid);
 	type->cnt++;
 	type->total_size += size;
@@ -352,6 +357,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
  * @base: base address of the new region
  * @size: size of the new region
  * @nid: nid of the new region
+ * @flags: flags of the new region
  *
  * Add new memblock region [@base,@base+@size) into @type.  The new region
  * is allowed to overlap with existing ones - overlaps don't affect already
@@ -362,7 +368,8 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
  * 0 on success, -errno on failure.
  */
 static int __init_memblock memblock_add_region(struct memblock_type *type,
-				phys_addr_t base, phys_addr_t size, int nid)
+				phys_addr_t base, phys_addr_t size,
+				int nid, unsigned long flags)
 {
 	bool insert = false;
 	phys_addr_t obase = base;
@@ -377,6 +384,7 @@ static int __init_memblock memblock_add_region(struct memblock_type *type,
 		WARN_ON(type->cnt != 1 || type->total_size);
 		type->regions[0].base = base;
 		type->regions[0].size = size;
+		type->regions[0].flags = flags;
 		memblock_set_region_node(&type->regions[0], nid);
 		type->total_size = size;
 		return 0;
@@ -407,7 +415,8 @@ repeat:
 			nr_new++;
 			if (insert)
 				memblock_insert_region(type, i++, base,
-						       rbase - base, nid);
+						       rbase - base, nid,
+						       flags);
 		}
 		/* area below @rend is dealt with, forget about it */
 		base = min(rend, end);
@@ -417,7 +426,8 @@ repeat:
 	if (base < end) {
 		nr_new++;
 		if (insert)
-			memblock_insert_region(type, i, base, end - base, nid);
+			memblock_insert_region(type, i, base, end - base,
+					       nid, flags);
 	}
 
 	/*
@@ -439,12 +449,13 @@ repeat:
 int __init_memblock memblock_add_node(phys_addr_t base, phys_addr_t size,
 				       int nid)
 {
-	return memblock_add_region(&memblock.memory, base, size, nid);
+	return memblock_add_region(&memblock.memory, base, size, nid, 0);
 }
 
 int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size)
 {
-	return memblock_add_region(&memblock.memory, base, size, MAX_NUMNODES);
+	return memblock_add_region(&memblock.memory, base, size,
+				   MAX_NUMNODES, 0);
 }
 
 /**
@@ -499,7 +510,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
 			rgn->size -= base - rbase;
 			type->total_size -= base - rbase;
 			memblock_insert_region(type, i, rbase, base - rbase,
-					       memblock_get_region_node(rgn));
+					       memblock_get_region_node(rgn),
+					       rgn->flags);
 		} else if (rend > end) {
 			/*
 			 * @rgn intersects from above.  Split and redo the
@@ -509,7 +521,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
 			rgn->size -= end - rbase;
 			type->total_size -= end - rbase;
 			memblock_insert_region(type, i--, rbase, end - rbase,
-					       memblock_get_region_node(rgn));
+					       memblock_get_region_node(rgn),
+					       rgn->flags);
 		} else {
 			/* @rgn is fully contained, record it */
 			if (!*end_rgn)
@@ -551,16 +564,24 @@ int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size)
 	return __memblock_remove(&memblock.reserved, base, size);
 }
 
-int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+static int __init_memblock memblock_reserve_region(phys_addr_t base,
+						   phys_addr_t size,
+						   int nid,
+						   unsigned long flags)
 {
 	struct memblock_type *_rgn = &memblock.reserved;
 
-	memblock_dbg("memblock_reserve: [%#016llx-%#016llx] %pF\n",
+	memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
 		     (unsigned long long)base,
 		     (unsigned long long)base + size,
-		     (void *)_RET_IP_);
+		     flags, (void *)_RET_IP_);
+
+	return memblock_add_region(_rgn, base, size, nid, flags);
+}
 
-	return memblock_add_region(_rgn, base, size, MAX_NUMNODES);
+int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+{
+	return memblock_reserve_region(base, size, MAX_NUMNODES, 0);
 }
 
 /**
@@ -985,6 +1006,7 @@ void __init_memblock memblock_set_current_limit(phys_addr_t limit)
 static void __init_memblock memblock_dump(struct memblock_type *type, char *name)
 {
 	unsigned long long base, size;
+	unsigned long flags;
 	int i;
 
 	pr_info(" %s.cnt  = 0x%lx\n", name, type->cnt);
@@ -995,13 +1017,14 @@ static void __init_memblock memblock_dump(struct memblock_type *type, char *name
 
 		base = rgn->base;
 		size = rgn->size;
+		flags = rgn->flags;
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 		if (memblock_get_region_node(rgn) != MAX_NUMNODES)
 			snprintf(nid_buf, sizeof(nid_buf), " on node %d",
 				 memblock_get_region_node(rgn));
 #endif
-		pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s\n",
-			name, i, base, base + size - 1, size, nid_buf);
+		pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s flags: %#lx\n",
+			name, i, base, base + size - 1, size, nid_buf, flags);
 	}
 }
 
-- 
1.7.1


* [PATCH part5 4/7] memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions.
  2013-08-08 10:16 ` Tang Chen
@ 2013-08-08 10:16   ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-08 10:16 UTC (permalink / raw)
  To: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

In find_hotpluggable_memory(), once we find a memory region that is
hotpluggable, we want to mark it in memblock.memory, so that we can later
prevent the memblock allocator from allocating hotpluggable memory for the
kernel.

To achieve this, we introduce the MEMBLOCK_HOTPLUG flag to indicate
hotpluggable memory regions in memblock, and a function,
memblock_mark_hotplug(), to mark a hotpluggable region when we find one.
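
For illustration only, the marking step can be modelled in userspace as
below. This is a simplified sketch, not the kernel code: struct model_type,
MODEL_HOTPLUG, MODEL_MAX_REGIONS and the model_*() helpers are hypothetical
names, the region array has a fixed capacity, and node ids are left out. It
follows the same three steps as memblock_mark_hotplug(): isolate the range,
set the flag, then merge neighbours whose flags match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_HOTPLUG		0x1UL	/* mirrors MEMBLOCK_HOTPLUG */
#define MODEL_MAX_REGIONS	16	/* hypothetical fixed capacity */

struct model_region {
	unsigned long long base;
	unsigned long long size;
	unsigned long flags;
};

struct model_type {
	size_t cnt;
	struct model_region regions[MODEL_MAX_REGIONS];
};

/* Split the region containing @addr so that @addr becomes a boundary. */
static void model_split_at(struct model_type *t, unsigned long long addr)
{
	for (size_t i = 0; i < t->cnt; i++) {
		struct model_region *r = &t->regions[i];

		if (addr <= r->base || addr >= r->base + r->size)
			continue;
		assert(t->cnt < MODEL_MAX_REGIONS);
		/* Open a slot after @r and move the upper part into it. */
		memmove(r + 2, r + 1, (t->cnt - i - 1) * sizeof(*r));
		r[1].base = addr;
		r[1].size = r->base + r->size - addr;
		r[1].flags = r->flags;
		r->size = addr - r->base;
		t->cnt++;
		return;
	}
}

/* Merge neighbours that are contiguous and carry identical flags. */
static void model_merge(struct model_type *t)
{
	size_t i = 0;

	while (i + 1 < t->cnt) {
		struct model_region *a = &t->regions[i], *b = a + 1;

		if (a->base + a->size != b->base || a->flags != b->flags) {
			i++;
			continue;
		}
		a->size += b->size;
		memmove(b, b + 1, (t->cnt - i - 2) * sizeof(*b));
		t->cnt--;
	}
}

/* Mark [@base, @base + @size) as hotpluggable. */
static void model_mark_hotplug(struct model_type *t,
			       unsigned long long base,
			       unsigned long long size)
{
	unsigned long long end = base + size;

	model_split_at(t, base);
	model_split_at(t, end);
	for (size_t i = 0; i < t->cnt; i++) {
		struct model_region *r = &t->regions[i];

		/* Only regions fully inside the range get the flag. */
		if (r->base >= base && r->base + r->size <= end)
			r->flags = MODEL_HOTPLUG;
	}
	model_merge(t);
}
```

Marking the middle of a single unflagged region splits it into three
regions, and marking the remaining upper part afterwards merges the two
flagged regions back into one, because merging requires equal flags.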

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 include/linux/memblock.h |   11 +++++++++++
 mm/memblock.c            |   26 ++++++++++++++++++++++++++
 mm/memory_hotplug.c      |    3 ++-
 3 files changed, 39 insertions(+), 1 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index e89e0cd..c0bd31c 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -19,6 +19,9 @@
 
 #define INIT_MEMBLOCK_REGIONS	128
 
+/* Definition of memblock flags. */
+#define MEMBLOCK_HOTPLUG	0x1	/* hotpluggable region */
+
 struct memblock_region {
 	phys_addr_t base;
 	phys_addr_t size;
@@ -60,6 +63,8 @@ int memblock_free(phys_addr_t base, phys_addr_t size);
 int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
 
+int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
 			  unsigned long *out_end_pfn, int *out_nid);
@@ -119,6 +124,12 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
 	     i != (u64)ULLONG_MAX;					\
 	     __next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid))
 
+static inline void memblock_set_region_flags(struct memblock_region *r,
+					     unsigned long flags)
+{
+	r->flags = flags;
+}
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);
 
diff --git a/mm/memblock.c b/mm/memblock.c
index 0841a6e..ecd8568 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -585,6 +585,32 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
 }
 
 /**
+ * memblock_mark_hotplug - Mark hotpluggable memory with flag MEMBLOCK_HOTPLUG.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * This function isolates region [@base, @base + @size), and marks it with flag
+ * MEMBLOCK_HOTPLUG.
+ *
+ * Return 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
+{
+	struct memblock_type *type = &memblock.memory;
+	int i, ret, start_rgn, end_rgn;
+
+	ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
+	if (ret)
+		return ret;
+
+	for (i = start_rgn; i < end_rgn; i++)
+		memblock_set_region_flags(&type->regions[i], MEMBLOCK_HOTPLUG);
+
+	memblock_merge_regions(type);
+	return 0;
+}
+
+/**
  * __next_free_mem_range - next function for for_each_free_mem_range()
  * @idx: pointer to u64 loop variable
  * @nid: node selector, %MAX_NUMNODES for all nodes
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e63f947..e4db758 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -171,7 +171,8 @@ void __init find_hotpluggable_memory(void)
 		if (kernel_resides_in_region(base, size))
 			continue;
 
-		/* Will mark hotpluggable memory regions here */
+		/* Mark hotpluggable memory regions in memblock.memory */
+		memblock_mark_hotplug(base, size);
 	}
 
 	early_iounmap(srat_vaddr, length);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH part5 5/7] memblock, mem_hotplug: Make memblock skip hotpluggable regions by default.
  2013-08-08 10:16 ` Tang Chen
@ 2013-08-08 10:16   ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-08 10:16 UTC (permalink / raw)
  To: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

The Linux kernel cannot migrate pages that it uses itself. As a result,
hotpluggable memory used by the kernel cannot be hot-removed. To solve this
problem, the basic idea is to prevent memblock from allocating hotpluggable
memory for the kernel during early boot, and to arrange all hotpluggable
memory listed in the ACPI SRAT (System Resource Affinity Table) as
ZONE_MOVABLE when initializing zones.

In the previous patches, we marked hotpluggable memory regions with the
MEMBLOCK_HOTPLUG flag in memblock.memory.

In this patch, we make memblock skip these hotpluggable memory regions in
the default allocation function.

memblock_find_in_range_node()
  |-->for_each_free_mem_range_reverse()
        |-->__next_free_mem_range_rev()

The above is the only place where __next_free_mem_range_rev() is used, so
we skip hotpluggable memory regions there when iterating memblock.memory to
find free memory.

In later patches, a boot option named "movablenode" will be introduced
to enable/disable using SRAT to arrange ZONE_MOVABLE.

NOTE: This check is always performed. That is OK because if users do not
      specify the movablenode option, no hotpluggable memory is marked. The
      check then skips no memory, and the kernel behaves as before.
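
For illustration only, the skip can be modelled in userspace as below. This
is a simplified sketch, not the kernel code: struct model_region,
MODEL_HOTPLUG and model_find_top_down() are hypothetical names, and the real
__next_free_mem_range_rev() additionally subtracts memblock.reserved from
each memory region, which is left out here.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_HOTPLUG	0x1UL	/* mirrors MEMBLOCK_HOTPLUG */

struct model_region {
	unsigned long long base;
	unsigned long long size;
	unsigned long flags;
};

/*
 * Walk @regions from the highest index down and return the index of the
 * first region that is large enough and not marked hotpluggable, or -1 if
 * there is none.
 */
static int model_find_top_down(const struct model_region *regions,
			       size_t cnt, unsigned long long want)
{
	assert(regions != NULL || cnt == 0);

	for (size_t i = cnt; i-- > 0; ) {
		const struct model_region *m = &regions[i];

		/* skip hotpluggable memory regions, as the patch does */
		if (m->flags & MODEL_HOTPLUG)
			continue;
		if (m->size >= want)
			return (int)i;
	}
	return -1;
}
```

With no region flagged, the search returns the highest region that fits, as
it would without the check. Once the top region carries MODEL_HOTPLUG, the
same request falls through to the next lower region.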

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 mm/memblock.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index ecd8568..3ea4301 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -695,6 +695,10 @@ void __init_memblock __next_free_mem_range(u64 *idx, int nid,
  * @out_nid: ptr to int for nid of the range, can be %NULL
  *
  * Reverse of __next_free_mem_range().
+ *
+ * Linux kernel cannot migrate pages used by itself. Memory hotplug users won't
+ * be able to hot-remove hotpluggable memory used by the kernel. So this
+ * function skips hotpluggable regions when allocating memory for the kernel.
  */
 void __init_memblock __next_free_mem_range_rev(u64 *idx, int nid,
 					   phys_addr_t *out_start,
@@ -719,6 +723,10 @@ void __init_memblock __next_free_mem_range_rev(u64 *idx, int nid,
 		if (nid != MAX_NUMNODES && nid != memblock_get_region_node(m))
 			continue;
 
+		/* skip hotpluggable memory regions */
+		if (m->flags & MEMBLOCK_HOTPLUG)
+			continue;
+
 		/* scan areas before each reservation for intersection */
 		for ( ; ri >= 0; ri--) {
 			struct memblock_region *r = &rsv->regions[ri];
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH part5 6/7] mem-hotplug: Introduce movablenode boot option to {en|dis}able using SRAT.
  2013-08-08 10:16 ` Tang Chen
@ 2013-08-08 10:16   ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-08 10:16 UTC (permalink / raw)
  To: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

The Hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As mentioned before, if hotpluggable memory is used by the kernel,
it cannot be hot-removed. So memory hotplug users may want to place all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set up a node as a movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE, so it cannot use
memory in movable nodes. This degrades NUMA performance, which other
users may not accept.

So we need a way for users to enable and disable this functionality.
In this patch, we introduce the movablenode boot option, which lets users
choose whether to reserve hotpluggable memory and set it as ZONE_MOVABLE.

Users can specify "movablenode" on the kernel command line to enable this
functionality. Those who don't use memory hotplug, or who don't want
to lose NUMA performance, should simply not specify it. The kernel
will then work as before.
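
For illustration only, the opt-in behaviour can be modelled in userspace as
below. This is a simplified sketch, not the kernel code: in the kernel the
option is registered with early_param() and the command line is parsed by
the generic parameter code. cmdline_has_flag(), model_parse_cmdline() and
model_enable_srat are hypothetical names standing in for that machinery and
for movablenode_enable_srat.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true if @cmdline contains @opt as a whole space-separated word. */
static bool cmdline_has_flag(const char *cmdline, const char *opt)
{
	size_t len = strlen(opt);
	const char *p = cmdline;

	assert(len > 0);

	while ((p = strstr(p, opt)) != NULL) {
		bool starts = (p == cmdline) || (p[-1] == ' ');
		bool ends = (p[len] == '\0') || (p[len] == ' ');

		if (starts && ends)
			return true;
		p += len;
	}
	return false;
}

/* Off by default, so the kernel "works as before" unless asked. */
static bool model_enable_srat;

static void model_parse_cmdline(const char *cmdline)
{
	model_enable_srat = cmdline_has_flag(cmdline, "movablenode");
}
```

The flag stays false for a command line without the option, so the
hotpluggable-memory scan would not run, and becomes true only when
"movablenode" appears as a separate word.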

Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 Documentation/kernel-parameters.txt |   15 +++++++++++++++
 arch/x86/kernel/setup.c             |   10 ++++++++--
 include/linux/memory_hotplug.h      |    3 +++
 mm/memory_hotplug.c                 |   11 +++++++++++
 4 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 15356ac..7349d1f 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1718,6 +1718,21 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			that the amount of memory usable for all allocations
 			is not too small.
 
+	movablenode		[KNL,X86] This parameter makes the kernel
+			arrange hotpluggable memory ranges recorded
+			in ACPI SRAT(System Resource Affinity Table) as
+			ZONE_MOVABLE, so that this memory can be hot-removed
+			while the system is up.
+			When this option is specified, all hotpluggable memory
+			is placed in ZONE_MOVABLE, which the kernel cannot use.
+			This degrades NUMA performance. Users who care about
+			NUMA performance should not use it.
+			If all the memory ranges in the system are hotpluggable,
+			then the ones used by the kernel during early boot, such
+			as kernel code and data segments, the initrd and so on,
+			are not set as ZONE_MOVABLE and are not hotpluggable.
+			Otherwise the kernel won't have enough memory to boot.
+
 	MTD_Partition=	[MTD]
 			Format: <name>,<region-number>,<size>,<offset>
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 36d7fe8..abdfed7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1061,14 +1061,20 @@ void __init setup_arch(char **cmdline_p)
 	 */
 	early_acpi_boot_table_init();
 
-#ifdef CONFIG_ACPI_NUMA
+#if defined(CONFIG_ACPI_NUMA) && defined(CONFIG_MOVABLE_NODE)
 	/*
 	 * Linux kernel cannot migrate kernel pages, as a result, memory used
 	 * by the kernel cannot be hot-removed. Find and mark hotpluggable
 	 * memory in memblock to prevent memblock from allocating hotpluggable
 	 * memory for the kernel.
+	 *
+	 * If all the memory in a node is hotpluggable, then the kernel won't
+	 * be able to use memory on that node. This degrades NUMA performance,
+	 * so by default we don't reserve any hotpluggable memory. Users
+	 * may use the "movablenode" boot option to enable this functionality.
 	 */
-	find_hotpluggable_memory();
+	if (movablenode_enable_srat)
+		find_hotpluggable_memory();
 #endif
 
 	/*
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 463efa9..43eb373 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -33,6 +33,9 @@ enum {
 	ONLINE_MOVABLE,
 };
 
+/* Enable/disable SRAT in movablenode boot option */
+extern bool movablenode_enable_srat;
+
 /*
  * pgdat resizing functions
  */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e4db758..65d7156 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -93,6 +93,17 @@ static void release_memory_resource(struct resource *res)
 }
 
 #ifdef CONFIG_ACPI_NUMA
+#ifdef CONFIG_MOVABLE_NODE
+bool __initdata movablenode_enable_srat;
+
+static int __init cmdline_parse_movablenode(char *p)
+{
+	movablenode_enable_srat = true;
+	return 0;
+}
+early_param("movablenode", cmdline_parse_movablenode);
+#endif	/* CONFIG_MOVABLE_NODE */
+
 /**
  * kernel_resides_in_range - Check if kernel resides in a memory region.
  * @base: The base address of the memory region.
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH part5 6/7] mem-hotplug: Introduce movablenode boot option to {en|dis}able using SRAT.
@ 2013-08-08 10:16   ` Tang Chen
  0 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-08 10:16 UTC (permalink / raw)
  To: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

The Hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As mentioned before, if hotpluggable memory is used by the kernel,
it cannot be hot-removed. So memory hotplug users may want to place all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set up a node as a movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE, so it cannot use
memory in movable nodes. This degrades NUMA performance, which other
users may not accept.

So we need a way for users to enable and disable this functionality.
In this patch, we introduce the movablenode boot option, which lets users
choose whether to reserve hotpluggable memory and set it as ZONE_MOVABLE.

Users can specify "movablenode" on the kernel command line to enable this
functionality. Those who don't use memory hotplug, or who don't want
to lose NUMA performance, should simply not specify it. The kernel
will then work as before.

Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 Documentation/kernel-parameters.txt |   15 +++++++++++++++
 arch/x86/kernel/setup.c             |   10 ++++++++--
 include/linux/memory_hotplug.h      |    3 +++
 mm/memory_hotplug.c                 |   11 +++++++++++
 4 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 15356ac..7349d1f 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1718,6 +1718,21 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			that the amount of memory usable for all allocations
 			is not too small.
 
+	movablenode		[KNL,X86] This parameter makes the kernel arrange
+			hotpluggable memory ranges recorded in the ACPI SRAT
+			(System Resource Affinity Table) as ZONE_MOVABLE, so
+			that this memory can be hot-removed while the system
+			is up.
+			When this option is specified, all hotpluggable memory
+			is placed in ZONE_MOVABLE, which the kernel cannot use.
+			This degrades NUMA performance, so users who care about
+			NUMA performance should not use it.
+			If all the memory ranges in the system are hotpluggable,
+			then the ones used by the kernel early in boot, such as
+			the kernel code and data segments, the initrd and so on,
+			won't be set as ZONE_MOVABLE, and won't be hotpluggable.
+			Otherwise the kernel won't have enough memory to boot.
+
 	MTD_Partition=	[MTD]
 			Format: <name>,<region-number>,<size>,<offset>
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 36d7fe8..abdfed7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1061,14 +1061,20 @@ void __init setup_arch(char **cmdline_p)
 	 */
 	early_acpi_boot_table_init();
 
-#ifdef CONFIG_ACPI_NUMA
+#if defined(CONFIG_ACPI_NUMA) && defined(CONFIG_MOVABLE_NODE)
 	/*
 	 * Linux kernel cannot migrate kernel pages, as a result, memory used
 	 * by the kernel cannot be hot-removed. Find and mark hotpluggable
 	 * memory in memblock to prevent memblock from allocating hotpluggable
 	 * memory for the kernel.
+	 *
+	 * If all the memory in a node is hotpluggable, then the kernel won't
+	 * be able to use memory on that node. This degrades NUMA performance,
+	 * so by default we don't reserve any hotpluggable memory. Users
+	 * may use the "movablenode" boot option to enable this functionality.
 	 */
-	find_hotpluggable_memory();
+	if (movablenode_enable_srat)
+		find_hotpluggable_memory();
 #endif
 
 	/*
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 463efa9..43eb373 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -33,6 +33,9 @@ enum {
 	ONLINE_MOVABLE,
 };
 
+/* Enable/disable SRAT in movablenode boot option */
+extern bool movablenode_enable_srat;
+
 /*
  * pgdat resizing functions
  */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e4db758..65d7156 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -93,6 +93,17 @@ static void release_memory_resource(struct resource *res)
 }
 
 #ifdef CONFIG_ACPI_NUMA
+#ifdef CONFIG_MOVABLE_NODE
+bool __initdata movablenode_enable_srat;
+
+static int __init cmdline_parse_movablenode(char *p)
+{
+	movablenode_enable_srat = true;
+	return 0;
+}
+early_param("movablenode", cmdline_parse_movablenode);
+#endif	/* CONFIG_MOVABLE_NODE */
+
 /**
  * kernel_resides_in_range - Check if kernel resides in a memory region.
  * @base: The base address of the memory region.
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 165+ messages in thread

* [PATCH part5 7/7] x86, numa, acpi, memory-hotplug: Make movablenode have higher priority.
  2013-08-08 10:16 ` Tang Chen
@ 2013-08-08 10:16   ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-08 10:16 UTC (permalink / raw)
  To: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

Arranging hotpluggable memory as ZONE_MOVABLE degrades NUMA performance
because the kernel cannot use movable memory. Users who don't use memory
hotplug and don't want to lose NUMA performance need a way to disable this
functionality. So we improved the movablecore boot option.

If users specify the original movablecore=nn@ss boot option, the kernel will
arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot option is similar
except it specifies ZONE_NORMAL ranges.

Now, if users specify "movablenode" on the kernel command line, the kernel will
arrange hotpluggable memory in SRAT as ZONE_MOVABLE. In that case, all
other movablecore=nn@ss and kernelcore=nn@ss options are ignored.

Those who don't want this simply specify nothing, and the kernel acts as
before.
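
The per-node calculation this patch performs when "movablenode" is given
can be sketched in userspace as follows. This is only an illustration under
stated assumptions: struct sketch_region, SKETCH_MAX_NODES and
SKETCH_PAGE_SHIFT are hypothetical stand-ins for the memblock region array,
MAX_NUMNODES and PAGE_SHIFT, and the later alignment to MAX_ORDER_NR_PAGES
is left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_NODES  4	/* hypothetical; the kernel uses MAX_NUMNODES */
#define SKETCH_PAGE_SHIFT 12	/* assumes 4 KiB pages */

struct sketch_region {
	unsigned long long base;	/* physical start address */
	int nid;			/* NUMA node the region belongs to */
	bool hotplug;			/* region is marked hotpluggable */
};

/*
 * For each node, record the lowest PFN of any hotpluggable region.
 * As in the patch, 0 means "no ZONE_MOVABLE start found for this node".
 */
static void sketch_find_movable_pfns(const struct sketch_region *regions,
				     size_t cnt,
				     unsigned long movable_pfn[SKETCH_MAX_NODES])
{
	for (size_t n = 0; n < SKETCH_MAX_NODES; n++)
		movable_pfn[n] = 0;

	for (size_t i = 0; i < cnt; i++) {
		unsigned long start_pfn;
		int nid = regions[i].nid;

		if (!regions[i].hotplug)
			continue;

		start_pfn = (unsigned long)(regions[i].base >> SKETCH_PAGE_SHIFT);
		if (!movable_pfn[nid] || start_pfn < movable_pfn[nid])
			movable_pfn[nid] = start_pfn;
	}
}
```

Nodes without any hotpluggable region keep a start of 0, so no ZONE_MOVABLE
boundary is set for them in this sketch.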

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 include/linux/memblock.h |    1 +
 mm/memblock.c            |    5 +++++
 mm/page_alloc.c          |   31 ++++++++++++++++++++++++++++---
 3 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index c0bd31c..e78e32f 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -64,6 +64,7 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
 
 int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
+bool memblock_is_hotpluggable(struct memblock_region *region);
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
diff --git a/mm/memblock.c b/mm/memblock.c
index 3ea4301..c8eb5d2 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -610,6 +610,11 @@ int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
 	return 0;
 }
 
+bool __init_memblock memblock_is_hotpluggable(struct memblock_region *region)
+{
+	return region->flags & MEMBLOCK_HOTPLUG;
+}
+
 /**
  * __next_free_mem_range - next function for for_each_free_mem_range()
  * @idx: pointer to u64 loop variable
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b100255..86d4381 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4948,9 +4948,35 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 	nodemask_t saved_node_state = node_states[N_MEMORY];
 	unsigned long totalpages = early_calculate_totalpages();
 	int usable_nodes = nodes_weight(node_states[N_MEMORY]);
+	struct memblock_type *type = &memblock.memory;
 
+	/* Need to find movable_zone earlier when movablenode is specified. */
+	find_usable_zone_for_movable();
+
+#ifdef CONFIG_MOVABLE_NODE
 	/*
-	 * If movablecore was specified, calculate what size of
+	 * If movablenode is specified, ignore kernelcore and movablecore
+	 * options.
+	 */
+	if (movablenode_enable_srat) {
+		for (i = 0; i < type->cnt; i++) {
+			if (!memblock_is_hotpluggable(&type->regions[i]))
+				continue;
+
+			nid = type->regions[i].nid;
+
+			usable_startpfn = PFN_DOWN(type->regions[i].base);
+			zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
+				min(usable_startpfn, zone_movable_pfn[nid]) :
+				usable_startpfn;
+		}
+
+		goto out;
+	}
+#endif
+
+	/*
+	 * If movablecore=nn[KMG] was specified, calculate what size of
 	 * kernelcore that corresponds so that memory usable for
 	 * any allocation type is evenly spread. If both kernelcore
 	 * and movablecore are specified, then the value of kernelcore
@@ -4976,7 +5002,6 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 		goto out;
 
 	/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
-	find_usable_zone_for_movable();
 	usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
 
 restart:
@@ -5067,12 +5092,12 @@ restart:
 	if (usable_nodes && required_kernelcore > usable_nodes)
 		goto restart;
 
+out:
 	/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
 	for (nid = 0; nid < MAX_NUMNODES; nid++)
 		zone_movable_pfn[nid] =
 			roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
 
-out:
 	/* restore the node_state */
 	node_states[N_MEMORY] = saved_node_state;
 }
-- 
1.7.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-08 10:16 ` Tang Chen
@ 2013-08-09 16:32   ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-09 16:32 UTC (permalink / raw)
  To: Tang Chen
  Cc: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, trenn,
	yinghai, jiang.liu, wency, laijs, isimatu.yasuaki, izumi.taku,
	mgorman, minchan, mina86, gong.chen, vasilis.liaskovitis,
	lwoodman, riel, jweiner, prarit, zhangyanfei, yanghy, x86,
	linux-doc, linux-kernel, linux-mm, linux-acpi

Hello,

On Thu, Aug 08, 2013 at 06:16:12PM +0800, Tang Chen wrote:
> In previous parts' patches, we have obtained SRAT earlier enough, right after
> memblock is ready. So this patch-set does the following things:

Can you please set up a git branch with all patches?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-09 16:32   ` Tejun Heo
  (?)
@ 2013-08-12  6:33   ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-12  6:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa,
	akpm, trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On 08/10/2013 12:32 AM, Tejun Heo wrote:
> Hello,
>
> On Thu, Aug 08, 2013 at 06:16:12PM +0800, Tang Chen wrote:
>> In previous parts' patches, we have obtained SRAT earlier enough, right after
>> memblock is ready. So this patch-set does the following things:
> Can you please set up a git branch with all patches?
Hi tj,

Please refer to the following tree:
https://github.com/imtangchen/linux movablenode-boot-option

It contains the patches of all 5 parts.

Thanks.

>
>

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-09 16:32   ` Tejun Heo
@ 2013-08-12  8:54     ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-12  8:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa,
	akpm, trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, imtangchen

On 08/10/2013 12:32 AM, Tejun Heo wrote:
> Hello,
>
> On Thu, Aug 08, 2013 at 06:16:12PM +0800, Tang Chen wrote:
>> In previous parts' patches, we have obtained SRAT earlier enough, right after
>> memblock is ready. So this patch-set does the following things:
> Can you please set up a git branch with all patches?
>
> Thanks.

Please refer to:

https://github.com/imtangchen/linux movablenode-boot-option

It contains the patches of all 5 parts.

Thanks.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 1/7] x86: get pg_data_t's memory from other node
  2013-08-08 10:16   ` Tang Chen
@ 2013-08-12 14:39     ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-12 14:39 UTC (permalink / raw)
  To: Tang Chen
  Cc: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, trenn,
	yinghai, jiang.liu, wency, laijs, isimatu.yasuaki, izumi.taku,
	mgorman, minchan, mina86, gong.chen, vasilis.liaskovitis,
	lwoodman, riel, jweiner, prarit, zhangyanfei, yanghy, x86,
	linux-doc, linux-kernel, linux-mm, linux-acpi

Hello,

The subject is a bit misleading.  Maybe it should say "allow getting
..." rather than "get ..."?

On Thu, Aug 08, 2013 at 06:16:13PM +0800, Tang Chen wrote:
....
> A node could have several memory devices. And the device who holds node
> data should be hot-removed in the last place. But in NUMA level, we don't
> know which memory_block (/sys/devices/system/node/nodeX/memoryXXX) belongs
> to which memory device. We only have node. So we can only do node hotplug.
> 
> But in virtualization, developers are now developing memory hotplug in qemu,
> which support a single memory device hotplug. So a whole node hotplug will
> not satisfy virtualization users.
> 
> So at last, we concluded that we'd better do memory hotplug and local node
> things (local node node data, pagetable, vmemmap, ...) in two steps.
> Please refer to https://lkml.org/lkml/2013/6/19/73

I suppose the above three paragraphs are trying to say

* A hotpluggable NUMA node may be composed of multiple memory devices
  which individually are hot-pluggable.

* pg_data_t and page tables serving a NUMA node may be located in
  the same node they're serving; however, if the node is composed of
  multiple hotpluggable memory devices, the device containing them
  should be the last one to be removed.

* For physical memory hotplug, whole NUMA node hotunplugging is fine;
  however, in virtualized environments, finer grained hotunplugging
  is desirable; unfortunately, there currently is no way to tell which
  specific memory device pg_data_t and page tables are allocated
  inside, making it impossible to order unpluggings of memory devices
  of a NUMA node.  To avoid the ordering problem while allowing
  removal of a subset of a NUMA node, it has been decided that pg_data_t
  and page tables should be allocated on a different non-hotpluggable
  NUMA node.

Am I following it correctly?  If so, can you please update the
description?  It's quite confusing.  Also, the decision seems rather
poorly made.  It should be trivial to allocate memory for pg_data_t
and page tables in one end of the NUMA node and just record the
boundary to distinguish between the area which can be removed any time
and the other which can only be removed as a unit as the last step.
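
The boundary idea in the paragraph above could look roughly like the
following userspace sketch. It is an assumption-laden illustration, not
existing kernel code: struct sketch_node, its pinned_end_pfn field and
sketch_range_removable_anytime() are hypothetical names, and the sketch
assumes pg_data_t and page tables are packed at the low end of the node.

```c
#include <stdbool.h>

struct sketch_node {
	unsigned long start_pfn;	/* first PFN of the node */
	unsigned long end_pfn;		/* one past the last PFN of the node */
	/*
	 * Recorded boundary: [start_pfn, pinned_end_pfn) holds pg_data_t
	 * and page tables and can only go away with the whole node.
	 */
	unsigned long pinned_end_pfn;
};

/*
 * A PFN range [start, end) can be removed at any time only if it lies
 * inside the node and entirely above the recorded boundary.
 */
static bool sketch_range_removable_anytime(const struct sketch_node *node,
					   unsigned long start,
					   unsigned long end)
{
	return start < end &&
	       start >= node->pinned_end_pfn &&
	       end <= node->end_pfn;
}
```

Any range overlapping the pinned area would have to wait until the rest of
the node has been removed, which is the "last step" mentioned above.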

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-08 10:16 ` Tang Chen
@ 2013-08-12 14:50   ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-12 14:50 UTC (permalink / raw)
  To: Tang Chen
  Cc: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, trenn,
	yinghai, jiang.liu, wency, laijs, isimatu.yasuaki, izumi.taku,
	mgorman, minchan, mina86, gong.chen, vasilis.liaskovitis,
	lwoodman, riel, jweiner, prarit, zhangyanfei, yanghy, x86,
	linux-doc, linux-kernel, linux-mm, linux-acpi

Hello,

On Thu, Aug 08, 2013 at 06:16:12PM +0800, Tang Chen wrote:
> [How we do this]
> 
> In ACPI, SRAT(System Resource Affinity Table) contains NUMA info. The memory
> affinities in SRAT record every memory range in the system, and also, flags
> specifying if the memory range is hotpluggable.
> (Please refer to ACPI spec 5.0 5.2.16)
> 
> With the help of SRAT, we have to do the following two things to achieve our
> goal:
> 
> 1. When doing memory hot-add, allow the users arranging hotpluggable as
>    ZONE_MOVABLE.
>    (This has been done by the MOVABLE_NODE functionality in Linux.)
> 
> 2. when the system is booting, prevent bootmem allocator from allocating
>    hotpluggable memory for the kernel before the memory initialization
>    finishes.
>    (This is what we are going to do. See below.)

I think it's in a much better shape than before but there still are a
couple things bothering me.

* Why can't it be opportunistic?  It's silly, for example, to fail
  boot because ACPI tells the kernel that all memory is hotpluggable
  especially as there'd be plenty of memory sitting around doing
  nothing and failing to boot is one of the gravest failure modes.
  The HOTPLUG flag can be advisory, right?  Try to allocate
  !hotpluggable memory first, but if that fails, ignore it and
  allocate from anywhere, much like the try_nid allocations.

* Similar to the point hpa raised.  If this can be made opportunistic,
  do we need the strict reordering to discover things earlier?
  Shouldn't it be possible to configure memblock to allocate close to
  the kernel image until hotplug and numa information is available?
  For most sane cases, the memory allocated will be contained in a
  non-hotpluggable node anyway, and in case it isn't, hotplug
  wouldn't work but the system will boot and function perfectly fine.
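
The opportunistic policy suggested in the first point can be sketched in
userspace as a two-pass allocation. This is a hypothetical model, not the
memblock API: struct sketch_mem_region, sketch_alloc_pass() and
sketch_alloc() are made-up names, each region is treated as a simple bump
allocator, and 0 is used as the failure value.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_mem_region {
	unsigned long base;	/* start address, nonzero in this sketch */
	unsigned long size;	/* total bytes in the region */
	unsigned long used;	/* bytes already handed out */
	bool hotplug;		/* marked hotpluggable by firmware */
};

/* One pass over the regions; returns 0 when nothing fits. */
static unsigned long sketch_alloc_pass(struct sketch_mem_region *regions,
				       size_t cnt, unsigned long size,
				       bool allow_hotplug)
{
	for (size_t i = 0; i < cnt; i++) {
		struct sketch_mem_region *r = &regions[i];
		unsigned long addr;

		if (r->hotplug && !allow_hotplug)
			continue;
		if (r->size - r->used < size)
			continue;

		addr = r->base + r->used;
		r->used += size;
		return addr;
	}
	return 0;
}

/* Prefer non-hotpluggable memory, but fall back instead of failing. */
static unsigned long sketch_alloc(struct sketch_mem_region *regions,
				  size_t cnt, unsigned long size)
{
	unsigned long addr = sketch_alloc_pass(regions, cnt, size, false);

	if (!addr)
		addr = sketch_alloc_pass(regions, cnt, size, true);
	return addr;
}
```

In this model the hotplug flag is advisory: an allocation that lands in a
hotpluggable region makes that region non-removable, but boot proceeds.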

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 1/7] x86: get pg_data_t's memory from other node
  2013-08-12 14:39     ` Tejun Heo
@ 2013-08-12 15:12       ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-12 15:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa,
	akpm, trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, imtangchen

On 08/12/2013 10:39 PM, Tejun Heo wrote:
> Hello,
>
> The subject is a bit misleading.  Maybe it should say "allow getting
> ..." rather than "get ..."?

Ok, followed.

>
> On Thu, Aug 08, 2013 at 06:16:13PM +0800, Tang Chen wrote:
......
>
> I suppose the above three paragraphs are trying to say
>
> * A hotpluggable NUMA node may be composed of multiple memory devices
>    which individually are hot-pluggable.
>
> * pg_data_t and page tables serving a NUMA node may be located in
>    the same node they're serving; however, if the node is composed of
>    multiple hotpluggable memory devices, the device containing them
>    should be the last one to be removed.
>
> * For physical memory hotplug, whole NUMA node hotunplugging is fine;
>    however, in virtualized environments, finer grained hotunplugging
>    is desirable; unfortunately, there currently is no way to tell which
>    specific memory device pg_data_t and page tables are allocated
>    inside, making it impossible to order unpluggings of memory devices
>    of a NUMA node.  To avoid the ordering problem while allowing
>    removal of a subset of a NUMA node, it has been decided that pg_data_t
>    and page tables should be allocated on a different non-hotpluggable
>    NUMA node.
>
> Am I following it correctly?  If so, can you please update the
> description?  It's quite confusing.

Yes, you are right. I'll update the description.

> Also, the decision seems rather
> poorly made.  It should be trivial to allocate memory for pg_data_t
> and page tables in one end of the NUMA node and just record the
> boundary to distinguish between the area which can be removed any time
> and the other which can only be removed as a unit as the last step.

We have tried, but the hot-remove path is difficult to fix.

Please refer to:
https://lkml.org/lkml/2013/6/13/249

Actually, the above patch-set can achieve a movable node in the way you
described. But we have the following problems:

1. The device holding pagetable cannot be removed before other devices.
    In a virtualization environment, it could be problematic.
    (https://lkml.org/lkml/2013/6/18/527)

2. It will break the semantics of memory_block online/offline. If part
    of the memory_block is pagetable, and it is offlined, what status
    should it have? My patches set it to offline, but the kernel
    is still using the memory.
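
The ordering constraint in problem 1 can be illustrated with a small
user-space sketch. This is not kernel code: struct mem_device, holds_pgdat
and can_remove() are hypothetical names, and the sketch assumes a node can
be modelled as a plain array of devices in which the device holding
pg_data_t and page tables must be removed last.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one hot-pluggable memory device in a NUMA node. */
struct mem_device {
	bool present;      /* still plugged in */
	bool holds_pgdat;  /* contains the node's pg_data_t / page tables */
};

/*
 * Return true if device idx may be removed now.  A device holding the
 * node's pg_data_t or page tables may only be removed once every other
 * device in the node is already gone.
 */
bool can_remove(const struct mem_device *devs, size_t n, size_t idx)
{
	if (idx >= n || !devs[idx].present)
		return false;
	if (!devs[idx].holds_pgdat)
		return true;
	for (size_t i = 0; i < n; i++)
		if (i != idx && devs[i].present)
			return false;
	return true;
}
```

In a virtualized guest the management layer would have to know which
device holds_pgdat is set for, which is exactly the information that is
not available today.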


I'm not saying it is not fixable. But we finally concluded that we
can do the movable node in the current way and then improve it,
including local pgdat and pagetable. We need more discussion on that.
But it should not block the memory hotplug development.

I suggest doing movable node in the current way first, and improving
it after this is done.

Thanks.



^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 14:50   ` Tejun Heo
@ 2013-08-12 15:14     ` H. Peter Anvin
  -1 siblings, 0 replies; 165+ messages in thread
From: H. Peter Anvin @ 2013-08-12 15:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, akpm,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

On 08/12/2013 07:50 AM, Tejun Heo wrote:
> 
> * Why can't it be opportunistic?  It's silly, for example, to fail
>   boot because ACPI tells the kernel that all memory is hotpluggable
>   especially as there'd be plenty of memory sitting around doing
>   nothing and failing to boot is one of the most grave failure modes.
>   The HOTPLUG flag can be advisory, right?  Try to allocate
>   !hotpluggable memory first, but if that fails, ignore it and
>   allocate from anywhere, much like the try_nid allocations.
> 
> * Similar to the point hpa raised.  If this can be made opportunistic,
>   do we need the strict reordering to discover things earlier?
>   Shouldn't it be possible to configure memblock to allocate close to
>   the kernel image until hotplug and numa information is available?
>   For most sane cases, the memory allocated will be contained in
>   non-hotpluggable node anyway and in case they aren't hotplug
>   wouldn't work but the system will boot and function perfectly fine.
> 

It gets really messy if it is advisory.  Suddenly you have the user
thinking they can hotswap a memory bank and they just can't.

Overall, I'm getting convinced that this whole approach is just doomed
to failure -- it will not provide the user what they expect and what
they need, which is to be able to hotswap any particular chunk of
memory.  This means that there has to be a remapping layer, either using
the TLBs (perhaps leveraging the Xen machine page number) or using
things like QPI memory routing.

	-hpa


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 15:14     ` H. Peter Anvin
@ 2013-08-12 15:23       ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-12 15:23 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, akpm,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

Hello,

On Mon, Aug 12, 2013 at 08:14:04AM -0700, H. Peter Anvin wrote:
> It gets really messy if it is advisory.  Suddenly you have the user
> thinking they can hotswap a memory bank and they just can't.

I'm very skeptical that not doing the strict re-ordering would
increase the chance of reaching memory allocation where hot unplug
would be impossible by much.  Given that, it'd be much better to be
able to boot w/o hotunplug capability than to fail boot.  The kernel
can whine loudly when hotunplug conditions aren't met but I think that
really is as far as that should go.
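
A minimal user-space sketch of the opportunistic policy argued for above,
under stated assumptions: struct region and alloc_opportunistic() are
hypothetical and are not memblock APIs. The first pass only considers
non-hotpluggable regions; the second pass takes anything that fits and
warns, so the system boots but loses hot-unplug capability for that region.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical free-region descriptor; not the kernel's memblock type. */
struct region {
	size_t free_bytes;
	bool hotpluggable;
};

/*
 * Pick a region able to satisfy size.  Pass 0 looks at non-hotpluggable
 * regions only.  Pass 1 accepts any region; if it was hotpluggable, warn
 * and clear the flag, since the kernel now uses that memory.
 * Returns the region index, or -1 if nothing fits.
 */
int alloc_opportunistic(struct region *r, size_t n, size_t size)
{
	for (int pass = 0; pass < 2; pass++) {
		for (size_t i = 0; i < n; i++) {
			if (pass == 0 && r[i].hotpluggable)
				continue;
			if (r[i].free_bytes < size)
				continue;
			if (r[i].hotpluggable) {
				fprintf(stderr,
					"warning: region %zu lost hot-unplug capability\n",
					i);
				r[i].hotpluggable = false;
			}
			r[i].free_bytes -= size;
			return (int)i;
		}
	}
	return -1;
}
```

The cleared flag is what would later be reported to userland as the set
of regions that ended up hotpluggable.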

> Overall, I'm getting convinced that this whole approach is just doomed
> to failure -- it will not provide the user what they expect and what
> they need, which is to be able to hotswap any particular chunk of
> memory.  This means that there has to be a remapping layer, either using
> the TLBs (perhaps leveraging the Xen machine page number) or using
> things like QPI memory routing.

For hot unplug to work in a completely generic manner, yeah, there
probably needs to be an extra layer of indirection.  Have no idea what
the correct way to achieve that would be tho.  I'm also not sure how
practical memory hot unplug is for physical machines and improving
ballooning could be a better approach for vms.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 14:50   ` Tejun Heo
@ 2013-08-12 15:41     ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-12 15:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa,
	akpm, trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, imtangchen

On 08/12/2013 10:50 PM, Tejun Heo wrote:
> Hello,
......
>
> I think it's in a much better shape than before but there still are a
> couple things bothering me.
>
> * Why can't it be opportunistic?  It's silly, for example, to fail
>    boot because ACPI tells the kernel that all memory is hotpluggable
>    especially as there'd be plenty of memory sitting around doing
>    nothing and failing to boot is one of the most grave failure modes.
>    The HOTPLUG flag can be advisory, right?  Try to allocate
>    !hotpluggable memory first, but if that fails, ignore it and
>    allocate from anywhere, much like the try_nid allocations.
>

Then there is no way to tell the users which memory is hotpluggable.

Physical addresses are not user friendly. For users, a node or a memory
device is the best. The firmware should arrange the hotpluggable ranges well.

In my opinion, some application layer tools may use SRAT to show
the users which memory is hotpluggable. I just think both the kernel
and the application layer should obey the same rule.

> * Similar to the point hpa raised.  If this can be made opportunistic,
>    do we need the strict reordering to discover things earlier?
>    Shouldn't it be possible to configure memblock to allocate close to
>    the kernel image until hotplug and numa information is available?
>    For most sane cases, the memory allocated will be contained in
>    non-hotpluggable node anyway and in case they aren't hotplug
>    wouldn't work but the system will boot and function perfectly fine.

So far as I know, the kernel image and related data can be loaded
anywhere, even above 4GB. I just can't make any assumptions.

Thanks.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 15:41     ` Tang Chen
@ 2013-08-12 15:46       ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-12 15:46 UTC (permalink / raw)
  To: Tang Chen
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa,
	akpm, trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hello,

On Mon, Aug 12, 2013 at 11:41:25PM +0800, Tang Chen wrote:
> Then there is no way to tell the users which memory is hotpluggable.
> 
> phys addr is not user friendly. For users, node or memory device is the
> best. The firmware should arrange the hotpluggable ranges well.

I don't follow.  Why can't the kernel export that information to
userland after boot is complete via printk / sysfs / proc / whatever?
The admin can "request" hotplug by boot param and the kernel would try
to honor that and return the result on boot completion.  I don't
understand why that wouldn't work.

> In my opinion, maybe some application layer tools may use SRAT to show
> the users which memory is hotpluggable. I just think both of the kernel
> and the application layer should obey the same rule.

Sure, just let the kernel tell the user which memory node ended up
hotpluggable after booting.

> >* Similar to the point hpa raised.  If this can be made opportunistic,
> >   do we need the strict reordering to discover things earlier?
> >   Shouldn't it be possible to configure memblock to allocate close to
> >   the kernel image until hotplug and numa information is available?
> >   For most sane cases, the memory allocated will be contained in
> >   non-hotpluggable node anyway and in case they aren't hotplug
> >   wouldn't work but the system will boot and function perfectly fine.
> 
> So far as I know, the kernel image and related data can be loaded
> anywhere, above 4GB. I just can't make any assumption.

I don't follow why that would be problematic.  Wouldn't finding out
which node the kernel image is located in and preferring to allocate
from that node before hotplug info is available be enough?
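
As an illustration of "finding out which node the kernel image is located
in", here is a small sketch. It assumes a table of per-node physical
ranges is already at hand, which is itself an assumption this early in
boot; struct node_range and node_of_paddr() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical physical range owned by one NUMA node: [start, end). */
struct node_range {
	uint64_t start;
	uint64_t end;
	int nid;
};

/* Return the id of the node whose range contains paddr, or -1 if none. */
int node_of_paddr(const struct node_range *ranges, size_t n, uint64_t paddr)
{
	for (size_t i = 0; i < n; i++)
		if (paddr >= ranges[i].start && paddr < ranges[i].end)
			return ranges[i].nid;
	return -1;
}
```

Early allocations would then prefer the node returned for the kernel
image's load address, wherever the image was loaded.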

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 15:46       ` Tejun Heo
@ 2013-08-12 16:19         ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-12 16:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa,
	akpm, trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, imtangchen

On 08/12/2013 11:46 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, Aug 12, 2013 at 11:41:25PM +0800, Tang Chen wrote:
>> Then there is no way to tell the users which memory is hotpluggable.
>>
>> phys addr is not user friendly. For users, node or memory device is the
>> best. The firmware should arrange the hotpluggable ranges well.
>
> I don't follow.  Why can't the kernel export that information to
> userland after boot is complete via printk / sysfs / proc / whatever?
> The admin can "request" hotplug by boot param and the kernel would try
> to honor that and return the result on boot completion.  I don't
> understand why that wouldn't work.

Sorry, I was in such a hurry that I didn't make myself clear...

The kernel can export info to users. The point is what kind of info.
Exporting phys addr is meaningless, of course. Now in /sys, we only
have memory_block and node. memory_block is only 128M on x86, and
hotplugging a single memory_block means nothing. So actually we only have node.
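
To make the granularity argument concrete, a short sketch of the
arithmetic, assuming a fixed 128 MiB block size as stated above (the
macro and function names are made up for illustration):

```c
#include <stdint.h>

/* 128 MiB, the x86 memory_block size mentioned above (assumed fixed). */
#define MEMORY_BLOCK_BYTES (UINT64_C(128) << 20)

/* Index of the memory_block covering physical address paddr. */
uint64_t memory_block_index(uint64_t paddr)
{
	return paddr / MEMORY_BLOCK_BYTES;
}

/* Number of memory_blocks needed to cover a node of node_bytes bytes. */
uint64_t blocks_per_node(uint64_t node_bytes)
{
	return (node_bytes + MEMORY_BLOCK_BYTES - 1) / MEMORY_BLOCK_BYTES;
}
```

A 16 GiB node, for example, spans 128 such blocks, so a single block says
little about what a user can physically remove.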

So it is reasonable for users to want to hotplug a node, I think. In
the beginning, we set the hotplug unit to a node. That is also why we
did the movable node.

In summary, node hotplug is much more meaningful and usable for users.
So it is best to arrange a whole node as a movable node, rather than
doing it opportunistically.

>
>> In my opinion, maybe some application layer tools may use SRAT to show
>> the users which memory is hotpluggable. I just think both of the kernel
>> and the application layer should obey the same rule.
>
> Sure, just let the kernel tell the user which memory node ended up
> hotpluggable after booting.
>
>>> * Similar to the point hpa raised.  If this can be made opportunistic,
>>>    do we need the strict reordering to discover things earlier?
>>>    Shouldn't it be possible to configure memblock to allocate close to
>>>    the kernel image until hotplug and numa information is available?
>>>    For most sane cases, the memory allocated will be contained in
>>>    non-hotpluggable node anyway and in case they aren't hotplug
>>>    wouldn't work but the system will boot and function perfectly fine.
>>
>> So far as I know, the kernel image and related data can be loaded
>> anywhere, above 4GB. I just can't make any assumption.
>
> I don't follow why that would be problematic.  Wouldn't finding out
> which node the kernel image is located in and preferring to allocate
> from that node before hotplug info is available be enough?

I'm just thinking of a more extreme case. For example, if a machine
has only one hotpluggable node, and the kernel resides in that node,
then the system has no hotpluggable node.

If we can prevent the kernel from using hotpluggable memory, in such
a machine, users can still do memory hotplug.

I wanted to make it as generic as possible. But yes, finding out the
nodes the kernel resides in and making them unhotpluggable can work.

Thanks.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 16:19         ` Tang Chen
@ 2013-08-12 16:22           ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-12 16:22 UTC (permalink / raw)
  To: Tang Chen
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa,
	akpm, trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hello, Tang.

On Tue, Aug 13, 2013 at 12:19:02AM +0800, Tang Chen wrote:
> The kernel can export info to users. The point is what kind of info.
> Exporting phys addr is meaningless, of course. Now in /sys, we only
> have memory_block and node. memory_block is only 128M on x86, and
> hotplug a memory_block means nothing. So actually we only have node.
> 
> So users want to hotplug a node is reasonable, I think. In the
> beginning, we set the hotplug unit to a node. That is also why we
> did the movable node.
> 
> In summary, node hotplug is much meaningful and usable for users.
> So it is the best that we can arrange a whole node to be movable
> node, not opportunistic.

Still not following.  Yeah, sure, you can tell the userland that node
X is hotpluggable or not hotpluggable after boot is complete.  Why is
that relevant?

> I'm just thinking of a more extreme case. For example, if a machine
> has only one node hotpluggable, and the kernel resides in that node.
> Then the system has no hotpluggable node.

Yeah, sure, then there's no way that node can be hotpluggable and the
right thing to do is booting up the machine and informing the userland
that memory is not hotpluggable.

> If we can prevent the kernel from using hotpluggable memory, in such
> a machine, users can still do memory hotplug.
> 
> I wanted to do it as generic as possible. But yes, finding out the
> nodes the kernel resides in and make it unhotpluggable can work.

Short of being able to remap memory under the kernel, I don't think
this can be very generic and as a compromise trying to keep as many
hotpluggable nodes as possible doesn't sound too bad.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 15:23       ` Tejun Heo
@ 2013-08-12 16:29         ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-12 16:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: H. Peter Anvin, Tang Chen, robert.moore, lv.zheng, rjw, lenb,
	tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

On 08/12/2013 11:23 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, Aug 12, 2013 at 08:14:04AM -0700, H. Peter Anvin wrote:
>> It gets really messy if it is advisory.  Suddenly you have the user
>> thinking they can hotswap a memory bank and they just can't.
>
> I'm very skeptical that not doing the strict re-ordering would
> increase the chance of reaching memory allocation where hot unplug
> would be impossible by much.  Given that, it'd be much better to be
> able to boot w/o hotunplug capability than to fail boot.  The kernel
> can whine loudly when hotunplug conditions aren't met but I think that
> really is as far as that should go.

As you said, we can ensure that at least one node is not hotpluggable,
so the kernel will boot anyway, just like CPU0. But then we risk losing
one movable node.

The best way is for firmware and software to cooperate: SRAT provides
several movable nodes and enough non-movable memory for the kernel to
boot, and hotplug users only use the movable nodes.

>
>> Overall, I'm getting convinced that this whole approach is just doomed
>> to failure -- it will not provide the user what they expect and what
>> they need, which is to be able to hotswap any particular chunk of
>> memory.  This means that there has to be a remapping layer, either using
>> the TLBs (perhaps leveraging the Xen machine page number) or using
>> things like QPI memory routing.
>
> For hot unplug to work in completely generic manner, yeah, there
> probably needs to be an extra layer of indirection.

I agree too.

> Have no idea what
> the correct way to achieve that would be tho.  I'm also not sure how
> practical memory hot unplug is for physical machines and improving
> ballooning could be a better approach for vms.

But different users have different ways of using memory hotplug.

Hot-swapping any particular chunk of memory is the goal we want to
reach eventually, but it depends on specific hardware. On most current
machines, we can use movable nodes to manage resources in units of a
node.

Also, without this movablenode boot option, the MOVABLE_NODE
functionality, which is already in the kernel, will not be able to
work. If every node has kernel memory, there is no movable node.

So, how about this: just like the MOVABLE_NODE functionality, introduce
a new config option. When we have better solutions for memory hotplug,
we can turn off or remove the config option and the related code.

For now, let's at least make movable node work.

Thanks.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 16:29         ` Tang Chen
@ 2013-08-12 16:46           ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-12 16:46 UTC (permalink / raw)
  To: Tang Chen
  Cc: H. Peter Anvin, Tang Chen, robert.moore, lv.zheng, rjw, lenb,
	tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

Hello, Tang.

On Tue, Aug 13, 2013 at 12:29:51AM +0800, Tang Chen wrote:
> As you said, we can ensure at least one node to be unhotplug. Then the
> kernel will boot anyway. Just like CPU0. But we have the chance to lose
> one movable node.
> 
> The best way is firmware and software cooperate together. SRAT provides
> several movable node and enough non-movable memory for the kernel to
> boot. The hotplug users only use movable node.

I'm really lost on this conversation and have no idea what you're
arguing.  My point was simple - let the kernel do its best during boot
and report the result to userland on what nodes are hotpluggable or
not.  Can you please elaborate what your point is from the ground up?
Unfortunately, I currently have no idea what you're saying.

> But, different users have different ways to use memory hotplug.
> 
> Hot-swapping any particular chunk of memory is the goal we will reach
> finally. But it is on specific hardware. In most current machines, we
> can use movable node to manage resource in node unit.
> 
> And also, without this movablenode boot option, the MOVABLE_NODE
> functionality, which is already in the kernel, will not be able to
> work. All nodes has kernel memory means no movable node.
> 
> So, how about this: Just like MOVABLE_NODE functionality, introduce
> a new config option. When we have better solutions for memory hotplug,
> we shutoff or remove the config and related code.
> 
> For now, at least make movable node work.

We are talking completely past each other.  I'll just try to clarify
what I was saying.  Can you please do the same?  Let's re-sync on the
discussion.

* Adding an option to tell the kernel to try to stay away from
  hotpluggable nodes is fine.  I have no problem with that at all.

* The patchsets up to this point have been trying to reorder
  operations somehow such that *no* memory allocation happens before
  memblock is populated with hotplug information.

* However, we already *know* that the memory the kernel image is
  occupying won't be removable.  It's highly likely that the amount
  of memory allocation before NUMA / hotplug information is fully
  populated is pretty small.  Also, it's highly likely that the small
  amount of memory right after the kernel image is contained in the
  same NUMA node, so if we allocate memory close to the kernel image,
  it's likely that we don't contaminate hotpluggable nodes.  We're
  talking about a few megs at most right after the kernel image.  I
  can't see how that would make any noticeable difference.

* Once hotplug information is available, allocation can happen as
  usual and the kernel can report the nodes which are actually
  hotpluggable - marked as hotpluggable by the firmware && didn't get
  contaminated during early alloc && didn't get overflow allocations
  afterwards.  Note that we need such a mechanism no matter what as the
  kernel image can be loaded into hotpluggable nodes and reporting
  that to userland is the only thing the kernel can do for cases like
  that short of denying memory unplug on such nodes.
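
The reporting condition in the last point can be written down as a
small predicate. The sketch below is only an illustration under assumed
names: struct node_hotplug_state and its fields are hypothetical
bookkeeping, not an existing kernel structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node bookkeeping; not an existing kernel structure. */
struct node_hotplug_state {
	bool fw_hotpluggable;	/* firmware marked the node hotpluggable */
	bool early_alloc;	/* allocated from before hotplug info was known */
	bool overflow_alloc;	/* allocated from afterwards (overflow) */
};

/* A node is actually removable only if all three conditions hold. */
static bool node_is_removable(const struct node_hotplug_state *s)
{
	assert(s != NULL);
	return s->fw_hotpluggable && !s->early_alloc && !s->overflow_alloc;
}

/* Count the nodes that could be reported to userland as removable. */
static size_t count_removable(const struct node_hotplug_state *nodes,
			      size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (node_is_removable(&nodes[i]))
			count++;
	return count;
}
```

The predicate does not care where the hotplug information came from,
which is the point about being independent of the mechanism used to
acquire it.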

The whole thing would be a lot simpler and more generic.  It doesn't
even have to care about which mechanism is being used to acquire all
that information.  What am I missing here?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 16:22           ` Tejun Heo
@ 2013-08-12 17:01             ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-12 17:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa,
	akpm, trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hi tj,

On 08/13/2013 12:22 AM, Tejun Heo wrote:
> Hello, Tang.
>
> On Tue, Aug 13, 2013 at 12:19:02AM +0800, Tang Chen wrote:
>> The kernel can export info to users. The point is what kind of info.
>> Exporting phys addr is meaningless, of course. Now in /sys, we only
>> have memory_block and node. memory_block is only 128M on x86, and
>> hotplug a memory_block means nothing. So actually we only have node.
>>
>> So users want to hotplug a node is reasonable, I think. In the
>> beginning, we set the hotplug unit to a node. That is also why we
>> did the movable node.
>>
>> In summary, node hotplug is much meaningful and usable for users.
>> So it is the best that we can arrange a whole node to be movable
>> node, not opportunistic.
>
> Still not following.  Yeah, sure, you can tell the userland that node
> X is hotpluggable or not hotpluggable after boot is complete.  Why is
> that relevant?

Sorry for the misunderstanding.

I was trying to answer your question: "Why can't the kernel allocate
hotpluggable memory opportunistically?".

If the kernel has any opportunity to allocate hotpluggable memory in
SRAT, then the kernel should tell users which memory is hotpluggable.

But in what way?  I think node is the best unit for now. But a node
could have a lot of memory. If the kernel uses only a little of it, we
will lose the whole movable node, which I don't want to do.

So, I don't want to allow the kernel to allocate hotpluggable memory
opportunistically.


>
>> I'm just thinking of a more extreme case. For example, if a machine
>> has only one node hotpluggable, and the kernel resides in that node.
>> Then the system has no hotpluggable node.
>
> Yeah, sure, then there's no way that node can be hotpluggable and the
> right thing to do is booting up the machine and informing the userland
> that memory is not hotpluggable.
>
>> If we can prevent the kernel from using hotpluggable memory, in such
>> a machine, users can still do memory hotplug.
>>
>> I wanted to do it as generic as possible. But yes, finding out the
>> nodes the kernel resides in and make it unhotpluggable can work.
>
> Short of being able to remap memory under the kernel, I don't think
> this can be very generic and as a compromise trying to keep as many
> hotpluggable nodes as possible doesn't sound too bad.

I think making one of the nodes hotpluggable is better. But OK, it is
no big deal. I don't think such a machine exists in reality. :)

Thanks. :)






^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 17:01             ` Tang Chen
@ 2013-08-12 17:23               ` H. Peter Anvin
  -1 siblings, 0 replies; 165+ messages in thread
From: H. Peter Anvin @ 2013-08-12 17:23 UTC (permalink / raw)
  To: Tang Chen
  Cc: Tejun Heo, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On 08/12/2013 10:01 AM, Tang Chen wrote:
>>
>>> I'm just thinking of a more extreme case. For example, if a machine
>>> has only one node hotpluggable, and the kernel resides in that node.
>>> Then the system has no hotpluggable node.
>>
>> Yeah, sure, then there's no way that node can be hotpluggable and the
>> right thing to do is booting up the machine and informing the userland
>> that memory is not hotpluggable.
>>
>>> If we can prevent the kernel from using hotpluggable memory, in such
>>> a machine, users can still do memory hotplug.
>>>
>>> I wanted to do it as generic as possible. But yes, finding out the
>>> nodes the kernel resides in and make it unhotpluggable can work.
>>
>> Short of being able to remap memory under the kernel, I don't think
>> this can be very generic and as a compromise trying to keep as many
>> hotpluggable nodes as possible doesn't sound too bad.
> 
> I think making one of the node hotpluggable is better. But OK, it is
> no big deal. There won't be such machine in reality, I think. :)
> 

The user may very well have configured a system with mirrored memory for
the kernel node as that will be non-hotpluggable, but not for the
others.  One can wonder how much that actually buys in real life, but
still...

	-hpa


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 17:01             ` Tang Chen
@ 2013-08-12 18:07               ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-12 18:07 UTC (permalink / raw)
  To: Tang Chen
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa,
	akpm, trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hey,

On Tue, Aug 13, 2013 at 01:01:09AM +0800, Tang Chen wrote:
> Sorry for the misunderstanding.
> 
> I was trying to answer your question: "Why can't the kernel allocate
> hotpluggable memory opportunistic ?".

I used the wrong word; I meant best-effort, which is the only
thing we can do anyway given that we have no control over where the
kernel image is linked in relation to NUMA nodes.

> If the kernel has any opportunity to allocate hotpluggable memory in
> SRAT, then the kernel should tell users which memory is hotpluggable.
> 
> But in what way ?  I think node is the best for now. But a node could
> have a lot of memory. If the kernel uses only a little memory, we will
> lose the whole movable node, which I don't want to do.
> 
> So, I don't want to allow the kernel allocating hotpluggable memory
> opportunistic.

What I was saying was that the kernel should try !hotpluggable memory
first then fall back to hotpluggable memory instead of failing boot as
nothing really is worse than failing to boot.
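
As a rough sketch of that fallback order, assuming a toy list of free
regions rather than memblock's real data structures (struct free_region,
alloc_from() and early_alloc() are hypothetical names):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free region; a toy model, not memblock itself. */
struct free_region {
	uint64_t base;		/* physical base, assumed non-zero */
	uint64_t size;		/* bytes still free */
	bool hotpluggable;	/* region lies in hotpluggable memory */
};

/*
 * Carve `size` bytes from the first region whose hotplug flag matches
 * `want_hotplug`.  Returns 0 when no such region is large enough.
 */
static uint64_t alloc_from(struct free_region *r, size_t n,
			   uint64_t size, bool want_hotplug)
{
	for (size_t i = 0; i < n; i++) {
		if (r[i].hotpluggable != want_hotplug || r[i].size < size)
			continue;
		uint64_t addr = r[i].base;
		r[i].base += size;
		r[i].size -= size;
		return addr;
	}
	return 0;
}

/*
 * Best-effort policy: try non-hotpluggable memory first, then fall
 * back to hotpluggable memory instead of failing.  The caller learns
 * through *used_hotpluggable that a hotpluggable region was used.
 */
static uint64_t early_alloc(struct free_region *r, size_t n,
			    uint64_t size, bool *used_hotpluggable)
{
	uint64_t addr;

	assert(size > 0 && used_hotpluggable != NULL);

	*used_hotpluggable = false;
	addr = alloc_from(r, n, size, false);
	if (addr)
		return addr;

	addr = alloc_from(r, n, size, true);
	if (addr)
		*used_hotpluggable = true;
	return addr;
}
```

The flag returned through used_hotpluggable is what would let the
kernel later report that node as no longer removable.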

> >Short of being able to remap memory under the kernel, I don't think
> >this can be very generic and as a compromise trying to keep as many
> >hotpluggable nodes as possible doesn't sound too bad.
> 
> I think making one of the node hotpluggable is better. But OK, it is
> no big deal. There won't be such machine in reality, I think. :)

Hmmm... but allocating close to the kernel image will keep the number
of nodes which are made unremovable via permanent allocation to a
minimum.  In most configurations that I can recall, I don't think we'd
lose anything really and the code will be much simpler and more generic.
It seems like a good trade-off to me given that we need to report
which nodes are hot unpluggable no matter what.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 16:46           ` Tejun Heo
@ 2013-08-12 18:23             ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-12 18:23 UTC (permalink / raw)
  To: Tejun Heo, H. Peter Anvin
  Cc: Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx, mingo, akpm,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

On 08/13/2013 12:46 AM, Tejun Heo wrote:
> Hello, Tang.
......
>
>> But, different users have different ways to use memory hotplug.
>>
>> Hotswapping any particular chunk of memory is the goal we will reach
>> eventually. But it depends on specific hardware. In most current machines,
>> we can use movable nodes to manage resources in units of nodes.
>>
>> And also, without this movablenode boot option, the MOVABLE_NODE
>> functionality, which is already in the kernel, will not be able to
>> work. If all nodes have kernel memory, there is no movable node.
>>
>> So, how about this: Just like MOVABLE_NODE functionality, introduce
>> a new config option. When we have better solutions for memory hotplug,
>> we shut off or remove the config and related code.
>>
>> For now, at least make movable node work.

Hi tj,
cc hpa,

I explained the above because hpa said he thought the whole approach is
wrong. I think node hotplug is meaningful for users. And without this
patch-set, MOVABLE_NODE means nothing. That is all I meant above.

Since you replied to his email earlier, I just replied to
answer both of you. Sorry for the misunderstanding. :)

>
> We are talking completely past each other.  I'll just try to clarify
> what I was saying.  Can you please do the same?  Let's re-sync on the
> discussion.
>
> * Adding an option to tell the kernel to try to stay away from
>    hotpluggable nodes is fine.  I have no problem with that at all.

Agreed.

>
> * The patchsets up to this point have been trying to reorder
>    operations somehow such that *no* memory allocation happens before
>    memblock is populated with hotplug information.

Yes, this is exactly what I want to do.

>
> * However, we already *know* that the memory the kernel image is
>    occupying won't be removeable.  It's highly likely that the amount
>    of memory allocation before NUMA / hotplug information is fully
>    populated is pretty small.  Also, it's highly likely that small
>    amount of memory right after the kernel image is contained in the
>    same NUMA node, so if we allocate memory close to the kernel image,
>    it's likely that we don't contaminate hotpluggable node.  We're
>    talking about few megs at most right after the kernel image.  I
>    can't see how that would make any noticeable difference.

I don't quite agree with this point. What you said is highly likely, but
not certain. Users may find they have lost hotpluggable memory.

The node the kernel resides in won't be removable. This is agreed.
But I still want SRAT earlier for the following reasons:

1. For a product shipped to users, the firmware specifies how
    many nodes are hotpluggable. When the system is up, if users
    find they have lost movable nodes, I think it could be messy.

2. Reordering SRAT parsing to happen earlier is not that difficult. The
    only procedures reordered are ACPI table initialization and
    acpi_initrd_override. The ACPI part of the patches is being reviewed.
    And it is a better solution. If possible, I think we should do it.

In summary, I don't want early memory allocation with hotpluggable
memory to be opportunistic.

>
> * Once hotplug information is available, allocation can happen as
>    usual and the kernel can report the nodes which are actually
>    hotpluggable - marked as hotpluggable by the firmware && didn't get
>    contaminated during early alloc && didn't get overflow allocations
>    afterwards.  Note that we need such mechanism no matter what as the
>    kernel image can be loaded into hotpluggable nodes and reporting
>    that to userland is the only thing the kernel can do for cases like
>    that short of denying memory unplug on such nodes.

Agreed.
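
As a rough illustration of the reporting rule in the quoted point above,
the three conditions can be combined per node as below. This is only a
sketch in plain C: the struct, its field names and both helper functions
are hypothetical and do not come from the kernel or from this patch set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical per-node bookkeeping. The field names are illustrative
 * only; they are not taken from the kernel or from this patch set.
 */
struct node_hotplug_state {
	bool fw_hotpluggable;     /* firmware marked the node hotpluggable */
	bool early_alloc_touched; /* allocation landed here before hotplug info was known */
	bool overflow_touched;    /* a later kernel allocation overflowed into this node */
};

/* A node is reported as removable only if all three conditions hold. */
static bool node_is_removable(const struct node_hotplug_state *n)
{
	assert(n != NULL);
	return n->fw_hotpluggable && !n->early_alloc_touched &&
	       !n->overflow_touched;
}

/* Count the nodes that would be reported to userland as removable. */
static size_t count_removable(const struct node_hotplug_state *nodes, size_t nr)
{
	size_t count = 0;

	for (size_t i = 0; i < nr; i++)
		if (node_is_removable(&nodes[i]))
			count++;
	return count;
}
```

In this sketch a node that the firmware marks as hotpluggable but that
received an early allocation is simply reported as not removable, which
is the "report to userland" behaviour described in the quoted point.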

>
> The whole thing would be a lot simpler and more generic.  It doesn't even
> have to care about which mechanism is being used to acquire all that
> information.  What am I missing here?

Sorry for the misunderstanding.

Thanks. :)

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 18:23             ` Tang Chen
@ 2013-08-12 20:20               ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-12 20:20 UTC (permalink / raw)
  To: Tang Chen
  Cc: H. Peter Anvin, Tang Chen, robert.moore, lv.zheng, rjw, lenb,
	tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

Hello,

On Tue, Aug 13, 2013 at 02:23:13AM +0800, Tang Chen wrote:
> >* However, we already *know* that the memory the kernel image is
> >   occupying won't be removeable.  It's highly likely that the amount
> >   of memory allocation before NUMA / hotplug information is fully
> >   populated is pretty small.  Also, it's highly likely that small
> >   amount of memory right after the kernel image is contained in the
> >   same NUMA node, so if we allocate memory close to the kernel image,
> >   it's likely that we don't contaminate hotpluggable node.  We're
> >   talking about few megs at most right after the kernel image.  I
> >   can't see how that would make any noticeable difference.
> 
> I don't quite agree with this point. What you said is highly likely, but
> not certain. Users may find they have lost hotpluggable memory.

I'm having a difficult time buying that.  NUMA node granularity is
usually pretty large - it's in the range of gigabytes.  By comparison,
the area occupied by the kernel image is *tiny* and it's just highly
unlikely that allocating a bit more memory afterwards would lead to
any meaningful difference in hotunplug support.  The amount of memory
we're talking about is likely to be less than a meg, right?

> The node the kernel resides in won't be removable. This is agreed.
> But I still want SRAT earlier for the following reasons:
> 
> 1. For a product shipped to users, the firmware specifies how
>    many nodes are hotpluggable. When the system is up, if users
>    find they have lost movable nodes, I think it could be messy.

How is that different from the memory occupied by the kernel image?
Simply allocating early memory near the kernel image is extremely unlikely
to change the situation.  Again, we're talking about a tiny allocation
here.  It should be no different from having a *slightly* larger kernel
image.  How is that material in any way?

> 2. Reordering SRAT parsing to happen earlier is not that difficult. The
>    only procedures reordered are ACPI table initialization and
>    acpi_initrd_override. The ACPI part of the patches is being reviewed.
>    And it is a better solution. If possible, I think we should do it.

I don't think it's a better solution.  It's fragile and fiddly and
without much, if any, additional benefit.  Why should we do that when
we can almost trivially solve the problem almost in memblock proper in
a way which is completely firmware-agnostic?

But, what's the extra benefit of doing that?  Why would reserving less
than a megabyte after the kernel be so problematic as to require this
invasive solution?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* RE: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 20:20               ` Tejun Heo
@ 2013-08-12 20:49                 ` Luck, Tony
  -1 siblings, 0 replies; 165+ messages in thread
From: Luck, Tony @ 2013-08-12 20:49 UTC (permalink / raw)
  To: Tejun Heo, Tang Chen
  Cc: H. Peter Anvin, Tang Chen, Moore, Robert, Zheng, Lv, rjw, lenb,
	tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskoviti

>> I don't quite agree with this point. What you said is highly likely, but
>> not certain. Users may find they have lost hotpluggable memory.
>
> I'm having a difficult time buying that.  NUMA node granularity is
> usually pretty large - it's in the range of gigabytes.  By comparison,
> the area occupied by the kernel image is *tiny* and it's just highly
> unlikely that allocating a bit more memory afterwards would lead to
> any meaningful difference in hotunplug support.  The amount of memory
> we're talking about is likely to be less than a meg, right?

Pretty safe to assume double-digit gigabytes for a removable chunk
(8G DIMMs are fast becoming standard, and there are typically 4 channels
to populate with at least one DIMM each). 16G and 32G DIMMs are pricey,
but moving in too.  So I don't think we need to assume that early allocations
are limited to some tiny amount measured in single digit megabytes. We'd
be safe even with some small number of gigabytes.

> I don't think it's a better solution.  It's fragile and fiddly and
> without much, if any, additional benefit.  Why should we do that when
> we can almost trivially solve the problem almost in memblock proper in
> a way which is completely firmware-agnostic?

So we do need to make sure that early memory allocations do happen from
the free areas adjacent to the kernel - and document that as a requirement
so we don't have people coming along later with an "allocate from top of memory
downwards" or other strategy that would break this assumption.  If we do that,
then I think I stand with Tejun that there is little benefit to parsing the SRAT
earlier.
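
A minimal sketch of the "allocate adjacent to the kernel" idea, assuming
a simple bump allocator that starts at the end of the kernel image and
only grows upward. The mem_range layout, addr_to_nid() and
alloc_near_kernel() are hypothetical names for illustration, not memblock
APIs, and the sketch ignores already-reserved regions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one physical memory range and its node. */
struct mem_range {
	uint64_t start; /* inclusive */
	uint64_t end;   /* exclusive */
	int nid;        /* NUMA node that owns the range */
};

/* Return the node id that owns addr, or -1 if no range covers it. */
static int addr_to_nid(const struct mem_range *r, size_t nr, uint64_t addr)
{
	for (size_t i = 0; i < nr; i++)
		if (addr >= r[i].start && addr < r[i].end)
			return r[i].nid;
	return -1;
}

/*
 * Bump allocator growing upward from *cursor, which starts at the end of
 * the kernel image. Returns the allocated base address, or 0 if the
 * request does not fit inside the range that contains the aligned cursor.
 * It never spills into a following range.
 */
static uint64_t alloc_near_kernel(const struct mem_range *r, size_t nr,
				  uint64_t *cursor, uint64_t size,
				  uint64_t align)
{
	uint64_t base;

	assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
	base = (*cursor + align - 1) & ~(align - 1);

	for (size_t i = 0; i < nr; i++) {
		if (base >= r[i].start && base < r[i].end &&
		    size <= r[i].end - base) {
			*cursor = base + size;
			return base;
		}
	}
	return 0;
}
```

With the cursor at 32 MiB inside a node that spans up to 4 GiB, small
requests stay on that node. A request that would cross the end of the
range returns 0 in this sketch instead of landing on the next node.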

The only fly I see in the ointment here is the crazy fragmentation of physical
memory below 4G on X86 systems.  Typically it will all be on the same node.
But I don't know if there is any specification that requires it be that way. If some
"helpful" OEM decided to make some "lowmem" (below 4G) be available on
every node, they might in theory do something truly awesomely strange.  But
even here - the granularity of such mappings tends to be large enough that
the "allocate near where the kernel was loaded" should still work to make those
allocations be on the same node for the "few megabytes" level of allocations.

-Tony

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 20:49                 ` Luck, Tony
@ 2013-08-12 20:54                   ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-12 20:54 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Tang Chen, H. Peter Anvin, Tang Chen, Moore, Robert, Zheng, Lv,
	rjw, lenb, tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86,
	gong.chen@linux.intel.com

Hello, Tony.

On Mon, Aug 12, 2013 at 08:49:42PM +0000, Luck, Tony wrote:
> The only fly I see in the ointment here is the crazy fragmentation of physical
> memory below 4G on X86 systems.  Typically it will all be on the same node.
> But I don't know if there is any specification that requires it be that way. If some
> "helpful" OEM decided to make some "lowmem" (below 4G) be available on
> every node, they might in theory do something truly awesomely strange.  But
> even here - the granularity of such mappings tends to be large enough that
> the "allocate near where the kernel was loaded" should still work to make those
> allocations be on the same node for the "few megabytes" level of allocations.

Yeah, "near kernel" allocations are needed only till SRAT information
is parsed and fed into memblock.  From then on, it'll be the usual
node-affine top-down allocations, so the memory amount of interest
here is inherently tiny; otherwise, we're doing something silly in our
boot sequence.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 20:54                   ` Tejun Heo
@ 2013-08-12 20:57                     ` H. Peter Anvin
  -1 siblings, 0 replies; 165+ messages in thread
From: H. Peter Anvin @ 2013-08-12 20:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Luck, Tony, Tang Chen, Tang Chen, Moore, Robert, Zheng, Lv, rjw,
	lenb, tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan

On 08/12/2013 01:54 PM, Tejun Heo wrote:
> Hello, Tony.
> 
> On Mon, Aug 12, 2013 at 08:49:42PM +0000, Luck, Tony wrote:
>> The only fly I see in the ointment here is the crazy fragmentation of physical
>> memory below 4G on X86 systems.  Typically it will all be on the same node.
>> But I don't know if there is any specification that requires it be that way. If some
>> "helpful" OEM decided to make some "lowmem" (below 4G) be available on
>> every node, they might in theory do something truly awesomely strange.  But
>> even here - the granularity of such mappings tends to be large enough that
>> the "allocate near where the kernel was loaded" should still work to make those
>> allocations be on the same node for the "few megabytes" level of allocations.
> 
> Yeah, "near kernel" allocations are needed only till SRAT information
> is parsed and fed into memblock.  From then on, it'll be the usual
> node-affine top-down allocations, so the memory amount of interest
> here is inherently tiny; otherwise, we're doing something silly in our
> boot sequence.
> 

Again, how much memory are we talking about here?

	-hpa


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 20:57                     ` H. Peter Anvin
@ 2013-08-12 21:06                       ` Yinghai Lu
  -1 siblings, 0 replies; 165+ messages in thread
From: Yinghai Lu @ 2013-08-12 21:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Tejun Heo, Luck, Tony, Tang Chen, Tang Chen, Moore, Robert,
	Zheng, Lv, rjw, lenb, tglx, mingo, akpm, trenn, jiang.liu, wency,
	laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86,
	gong.chen@linux.intel.com

On Mon, Aug 12, 2013 at 1:57 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 08/12/2013 01:54 PM, Tejun Heo wrote:
>> Hello, Tony.
>>
>> On Mon, Aug 12, 2013 at 08:49:42PM +0000, Luck, Tony wrote:
>>> The only fly I see in the ointment here is the crazy fragmentation of physical
>>> memory below 4G on X86 systems.  Typically it will all be on the same node.
>>> But I don't know if there is any specification that requires it be that way. If some
>>> "helpful" OEM decided to make some "lowmem" (below 4G) be available on
>>> every node, they might in theory do something truly awesomely strange.  But
>>> even here - the granularity of such mappings tends to be large enough that
>>> the "allocate near where the kernel was loaded" should still work to make those
>>> allocations be on the same node for the "few megabytes" level of allocations.
>>
>> Yeah, "near kernel" allocations are needed only till SRAT information
>> is parsed and fed into memblock.  From then on, it'll be the usual
>> node-affine top-down allocations, so the memory amount of interest
>> here is inherently tiny; otherwise, we're doing something silly in our
>> boot sequence.

"Near kernel" is not very clear. With a 64-bit boot loader, the
kernel could be loaded anywhere. If the kernel sits near the end of
the first node, "near kernel" memory could end up on the second node.

We should use BRK to be safe if the buffer is not too big. The
bootloader would need to keep the kernel's run-time size range within
one node's RAM.

>>
>
> Again, how much memory are we talking about here?

Page tables, a buffer for the SLIT table, a buffer for doubling
memblock.reserved, and ACPI table overrides.

It looks like this needs several megabytes, especially if someone uses
4k page mappings for debugging.

Yinghai
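
For a rough sense of the page-table share of that estimate, here is a
minimal sketch in C. It assumes x86-64 paging with 8-byte entries and
4 KiB page-table pages, and it counts only the last-level (PTE) tables;
the upper levels add a little more. The helper name pte_table_bytes_4k()
is made up for illustration and is not a kernel function.

```c
#include <stdint.h>

#define PAGE_SIZE_4K   4096ULL
#define PTE_SIZE       8ULL                       /* bytes per x86-64 page-table entry */
#define PTES_PER_TABLE (PAGE_SIZE_4K / PTE_SIZE)  /* 512 entries per table page */

/* Integer division, rounding up. */
static uint64_t div_round_up(uint64_t n, uint64_t d)
{
    return (n + d - 1) / d;
}

/*
 * Hypothetical helper: bytes of last-level page tables needed to map
 * `bytes` of physical memory using only 4 KiB pages.
 */
uint64_t pte_table_bytes_4k(uint64_t bytes)
{
    uint64_t pages  = div_round_up(bytes, PAGE_SIZE_4K);
    uint64_t tables = div_round_up(pages, PTES_PER_TABLE);

    return tables * PAGE_SIZE_4K;
}
```

Under these assumptions, mapping 1 GiB with 4k pages takes 512 table
pages, i.e. 2 MiB of page tables per GiB mapped, which is how a debug
configuration could reach "several megabytes" quickly.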


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 21:06                       ` Yinghai Lu
@ 2013-08-12 21:08                         ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-12 21:08 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: H. Peter Anvin, Luck, Tony, Tang Chen, Tang Chen, Moore, Robert,
	Zheng, Lv, rjw, lenb, tglx, mingo, akpm, trenn, jiang.liu, wency,
	laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86,
	gong.chen@linux.intel.com

On Mon, Aug 12, 2013 at 02:06:07PM -0700, Yinghai Lu wrote:
> "Near kernel" is not very clear. With a 64-bit boot loader, the
> kernel could be loaded anywhere. If the kernel sits near the end of
> the first node, "near kernel" memory could end up on the second node.
> 
> We should use BRK to be safe if the buffer is not too big. The
> bootloader would need to keep the kernel's run-time size range within
> one node's RAM.

How would that make any difference?  You're just expanding the size of
kernel image instead of reserving it around the image.  It's exactly
the same thing.  You're just less flexible if you do that with BRK.
What am I missing here?

-- 
tejun


^ permalink raw reply	[flat|nested] 165+ messages in thread

* RE: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 20:54                   ` Tejun Heo
@ 2013-08-12 21:11                     ` Luck, Tony
  -1 siblings, 0 replies; 165+ messages in thread
From: Luck, Tony @ 2013-08-12 21:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, H. Peter Anvin, Tang Chen, Moore, Robert, Zheng, Lv,
	rjw, lenb, tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86,
	gong.chen@linux.intel.com

>> The only fly I see in the ointment here is the crazy fragmentation of physical
>> memory below 4G on X86 systems.  Typically it will all be on the same node.
>> But I don't know if there is any specification that requires it be that way. If some
>> "helpful" OEM decided to make some "lowmem" (below 4G) be available on
>> every node, they might in theory do something truly awesomely strange.  But
>> even here - the granularity of such mappings tends to be large enough that
>> the "allocate near where the kernel was loaded" should still work to make those
>> allocations be on the same node for the "few megabytes" level of allocations.
>
> Yeah, "near kernel" allocations are needed only till SRAT information
> is parsed and fed into memblock.  From then on, it'll be the usual
> node-affine top-down allocations, so the memory amount of interest
> here is inherently tiny; otherwise, we're doing something silly in our
> boot sequence.

Just an idle, slightly related, question.  Will a 64-bit X86 kernel work if the physical
load address is >4GB?  That would get it away from the fragmented bits of
address space and into vast tracts of same-node-ness.

-Tony


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 21:06                       ` Yinghai Lu
@ 2013-08-12 21:11                         ` H. Peter Anvin
  -1 siblings, 0 replies; 165+ messages in thread
From: H. Peter Anvin @ 2013-08-12 21:11 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Tejun Heo, Luck, Tony, Tang Chen, Tang Chen, Moore, Robert,
	Zheng, Lv, rjw, lenb, tglx, mingo, akpm, trenn, jiang.liu, wency,
	laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan@kernel.org

On 08/12/2013 02:06 PM, Yinghai Lu wrote:
> 
> We should use BRK to be safe if the buffer is not too big. The
> bootloader would need to keep the kernel's run-time size range within
> one node's RAM.
> 

The bootloader typically won't know.

>>
>> Again, how much memory are we talking about here?
> 
> Page tables, a buffer for the SLIT table, a buffer for doubling
> memblock.reserved, and ACPI table overrides.
> 
> It looks like this needs several megabytes, especially if someone uses
> 4k page mappings for debugging.
> 

We need to set a careful limit, then.  "Several megabytes" could be a
problem causing a boot failure on a small memory machine if we extend
the BRK too much... obviously, a too-small BRK can fail on large systems.

	-hpa
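
The trade-off described above can be illustrated with a small
fixed-size bump allocator in C. This is only a sketch of the idea, not
the kernel's BRK code (on x86 that is the RESERVE_BRK()/extend_brk()
mechanism). The names early_brk_alloc() and EARLY_BRK_SIZE, and the
64 KiB limit, are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical compile-time limit: too large wastes memory on small
 * machines, too small runs out on large ones. */
#define EARLY_BRK_SIZE (64 * 1024)

static unsigned char early_brk[EARLY_BRK_SIZE];
static size_t early_brk_used;

/*
 * Hand out `size` bytes from the reserved area. `align` must be a
 * power of two and is applied to the offset inside the area.
 * Returns NULL once the fixed limit would be exceeded.
 */
void *early_brk_alloc(size_t size, size_t align)
{
    size_t start = (early_brk_used + align - 1) & ~(align - 1);

    if (start > EARLY_BRK_SIZE || size > EARLY_BRK_SIZE - start)
        return NULL;

    early_brk_used = start + size;
    return &early_brk[start];
}
```

The area is known to be free because it is part of the image itself,
but the limit is fixed at build time, so a request that does not fit
simply fails.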



^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 21:08                         ` Tejun Heo
@ 2013-08-12 21:12                           ` H. Peter Anvin
  -1 siblings, 0 replies; 165+ messages in thread
From: H. Peter Anvin @ 2013-08-12 21:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yinghai Lu, Luck, Tony, Tang Chen, Tang Chen, Moore, Robert,
	Zheng, Lv, rjw, lenb, tglx, mingo, akpm, trenn, jiang.liu, wency,
	laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan@kernel.org

On 08/12/2013 02:08 PM, Tejun Heo wrote:
> On Mon, Aug 12, 2013 at 02:06:07PM -0700, Yinghai Lu wrote:
>> "Near kernel" is not very clear. With a 64-bit boot loader, the
>> kernel could be loaded anywhere. If the kernel sits near the end of
>> the first node, "near kernel" memory could end up on the second node.
>>
>> We should use BRK to be safe if the buffer is not too big. The
>> bootloader would need to keep the kernel's run-time size range within
>> one node's RAM.
> 
> How would that make any difference?  You're just expanding the size of
> kernel image instead of reserving it around the image.  It's exactly
> the same thing.  You're just less flexible if you do that with BRK.
> What am I missing here?
> 

The BRK is what we know is free.  Beyond that point you need
understanding of the memory map.

	-hpa


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 21:12                           ` H. Peter Anvin
@ 2013-08-12 21:14                             ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-12 21:14 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Yinghai Lu, Luck, Tony, Tang Chen, Tang Chen, Moore, Robert,
	Zheng, Lv, rjw, lenb, tglx, mingo, akpm, trenn, jiang.liu, wency,
	laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86,
	gong.chen@linux.intel.com

On Mon, Aug 12, 2013 at 02:12:25PM -0700, H. Peter Anvin wrote:
> The BRK is what we know is free.  Beyond that point you need
> understanding of the memory map.

Hmmm?  All this happens after e820.  We *know* which memory is useable
and free.  We just don't know which nodes they belong to.

-- 
tejun


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 21:11                     ` Luck, Tony
@ 2013-08-12 21:25                       ` Yinghai Lu
  -1 siblings, 0 replies; 165+ messages in thread
From: Yinghai Lu @ 2013-08-12 21:25 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Tejun Heo, Tang Chen, H. Peter Anvin, Tang Chen, Moore, Robert,
	Zheng, Lv, rjw, lenb, tglx, mingo, akpm, trenn, jiang.liu, wency,
	laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86,
	gong.chen@linux.intel.com

On Mon, Aug 12, 2013 at 2:11 PM, Luck, Tony <tony.luck@intel.com> wrote:
>>> The only fly I see in the ointment here is the crazy fragmentation of physical
>
> Just an idle, slightly related, question.  Will a 64-bit X86 kernel work if the physical  load address is >4GB?

Yes. For SMP booting, it will still need some pages under 1M for the AP trampoline.

Yinghai


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 21:25                       ` Yinghai Lu
@ 2013-08-12 21:28                         ` H. Peter Anvin
  -1 siblings, 0 replies; 165+ messages in thread
From: H. Peter Anvin @ 2013-08-12 21:28 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Luck, Tony, Tejun Heo, Tang Chen, Tang Chen, Moore, Robert,
	Zheng, Lv, rjw, lenb, tglx, mingo, akpm, trenn, jiang.liu, wency,
	laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan@kernel.org

On 08/12/2013 02:25 PM, Yinghai Lu wrote:
> On Mon, Aug 12, 2013 at 2:11 PM, Luck, Tony <tony.luck@intel.com> wrote:
>>>> The only fly I see in the ointment here is the crazy fragmentation of physical
>>
>> Just an idle, slightly related, question.  Will a 64-bit X86 kernel work if the physical  load address is >4GB?
> 
> Yes. For SMP booting, it will still need some pages under 1M for the AP trampoline.
> 

Not just for SMP anymore, either.

	-hpa



^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 21:11                     ` Luck, Tony
@ 2013-08-13  5:14                       ` H. Peter Anvin
  -1 siblings, 0 replies; 165+ messages in thread
From: H. Peter Anvin @ 2013-08-13  5:14 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Tejun Heo, Tang Chen, Tang Chen, Moore, Robert, Zheng, Lv, rjw,
	lenb, tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan

On 08/12/2013 02:11 PM, Luck, Tony wrote:
> 
> Just an idle, slightly related, question.  Will a 64-bit X86 kernel work if the physical
> load address is >4GB?  That would get it away from the fragmented bits of
> address space and into vast tracts of same-node-ness.
> 

It will, although not until very recently.  However, there is some fixed
memory required < 1 MB, so this may lock down two nodes.

	-hpa



^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 16:46           ` Tejun Heo
@ 2013-08-13  6:14             ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-13  6:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, H. Peter Anvin, robert.moore, lv.zheng, rjw, lenb,
	tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

On 08/13/2013 12:46 AM, Tejun Heo wrote:
......
>
> * Adding an option to tell the kernel to try to stay away from
>    hotpluggable nodes is fine.  I have no problem with that at all.
>
> * The patchsets up to this point have been trying to reorder
>    operations somehow such that *no* memory allocation happens before
>    memblock is populated with hotplug information.
>
> * However, we already *know* that the memory the kernel image is
>    occupying won't be removeable.  It's highly likely that the amount
>    of memory allocation before NUMA / hotplug information is fully
>    populated is pretty small.  Also, it's highly likely that small
>    amount of memory right after the kernel image is contained in the
>    same NUMA node, so if we allocate memory close to the kernel image,
>    it's likely that we don't contaminate hotpluggable node.  We're
>    talking about few megs at most right after the kernel image.  I
>    can't see how that would make any noticeable difference.
>
> * Once hotplug information is available, allocation can happen as
>    usual and the kernel can report the nodes which are actually
>    hotpluggable - marked as hotpluggable by the firmware && didn't get
>    contaminated during early alloc && didn't get overflow allocations
>    afterwards.  Note that we need such mechanism no matter what as the
>    kernel image can be loaded into hotpluggable nodes and reporting
>    that to userland is the only thing the kernel can do for cases like
>    that short of denying memory unplug on such nodes.
>

Hi tj, hpa, luck, yinghai,

So if all of you agree with tj's idea above, I think we can do it
this way. I will update the patches to allocate memory near the
kernel image before SRAT is parsed.

Thanks.
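
The "allocate near the kernel image" idea can be sketched roughly as
follows. This is a standalone C illustration under stated assumptions,
not memblock code: struct mem_region, find_free_above() and NO_RANGE
are hypothetical names, the free list is assumed to be sorted by
ascending base address, and `align` is assumed to be a power of two.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_RANGE UINT64_MAX   /* hypothetical "nothing found" marker */

/* Hypothetical description of one free physical range. */
struct mem_region {
    uint64_t base;
    uint64_t size;
};

/*
 * Return the lowest suitably aligned address at or above `floor`
 * (e.g. the end of the kernel image) where `size` bytes fit inside
 * one free region, scanning from low to high addresses.
 */
uint64_t find_free_above(const struct mem_region *free_regions, size_t count,
                         uint64_t floor, uint64_t size, uint64_t align)
{
    for (size_t i = 0; i < count; i++) {
        uint64_t end = free_regions[i].base + free_regions[i].size;
        uint64_t start = free_regions[i].base > floor ?
                         free_regions[i].base : floor;

        start = (start + align - 1) & ~(align - 1);
        if (start < end && end - start >= size)
            return start;
    }
    return NO_RANGE;
}
```

Passing the kernel image's end address as `floor` keeps small early
allocations right behind the image, which is likely, though not
guaranteed, to be on the same node as the image.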


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-13  6:14             ` Tang Chen
@ 2013-08-13  9:56               ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-13  9:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, H. Peter Anvin, robert.moore, lv.zheng, rjw, lenb,
	tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

Hi tj,

When doing the "near kernel memory allocation", I have a few things
about memblock that I need you to confirm.

1. First of all, memblock is platform independent. Different platforms
    have different ways to store the kernel image address. So I don't think
    we can obtain the kernel image address on the memblock side, right?

    If so, then we need to pass the kernel image address to memblock. But...

2. There are several places calling memblock_find_in_range_node() to
    allocate memory before SRAT is parsed.

    early_reserve_e820_mpc_new()
    reserve_real_mode()
    init_mem_mapping()
    setup_log_buf()
    relocate_initrd()
    acpi_initrd_override()
    reserve_crashkernel()

    There may be more that I did not find.

    And in the future, someone may add more code that allocates memory
    before SRAT is parsed. So I don't think we should pass the kernel image
    address to each of them one by one. That would change a lot of code.

So I think we need a generic way to tell memblock to allocate memory
from the kernel image end address to higher memory.


My idea is:

1. Introduce a memblock.current_limit_low to limit the lowest address
    that memblock can use.

2. Make memblock able to allocate memory from low to high.

3. Get kernel image address on x86, and set memblock.current_limit_low
    to it before SRAT is parsed. Then we achieve the goal.

4. Reset it to 0, and make memblock allocate memory from high to low.
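
As a rough illustration of steps 1 to 3 (a user-space sketch, not code from
the patchset): the toy allocator below searches from low to high and never
returns an address below a floor. Only memblock.current_limit_low comes from
the list above; struct toy_memblock, toy_find_bottom_up() and
toy_alloc_bottom_up() are hypothetical names, and the real memblock data
structures are different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for memblock's region lists; NOT the kernel's structures. */
struct toy_region {
	uint64_t base;
	uint64_t size;
};

#define TOY_MAX_RESERVED 16

struct toy_memblock {
	const struct toy_region *memory;	/* usable RAM, sorted by base */
	size_t nr_memory;
	struct toy_region reserved[TOY_MAX_RESERVED];	/* handed-out ranges */
	size_t nr_reserved;
	uint64_t current_limit_low;	/* lowest address an allocation may use */
};

static uint64_t toy_align_up(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);	/* power of two */
	return (addr + align - 1) & ~(align - 1);
}

/* First fit, scanning from low to high. Returns 0 when nothing fits. */
static uint64_t toy_find_bottom_up(const struct toy_memblock *mb,
				   uint64_t size, uint64_t align)
{
	for (size_t i = 0; i < mb->nr_memory; i++) {
		uint64_t end = mb->memory[i].base + mb->memory[i].size;
		uint64_t cand = mb->memory[i].base;
		bool moved = true;

		/* Step 1 of the proposal: never go below the floor. */
		if (cand < mb->current_limit_low)
			cand = mb->current_limit_low;
		cand = toy_align_up(cand, align);

		/* Step past reserved ranges until the candidate is free. */
		while (moved && cand <= end && end - cand >= size) {
			moved = false;
			for (size_t r = 0; r < mb->nr_reserved; r++) {
				uint64_t rb = mb->reserved[r].base;
				uint64_t re = rb + mb->reserved[r].size;

				if (cand < re && rb < cand + size) {
					cand = toy_align_up(re, align);
					moved = true;
				}
			}
		}
		if (!moved)
			return cand;	/* last scan found no overlap */
	}
	return 0;
}

/* Find a range bottom-up and record it as reserved. Returns 0 on failure. */
static uint64_t toy_alloc_bottom_up(struct toy_memblock *mb,
				    uint64_t size, uint64_t align)
{
	uint64_t base;

	if (size == 0 || mb->nr_reserved == TOY_MAX_RESERVED)
		return 0;
	base = toy_find_bottom_up(mb, size, align);
	if (base == 0)
		return 0;
	mb->reserved[mb->nr_reserved].base = base;
	mb->reserved[mb->nr_reserved].size = size;
	mb->nr_reserved++;
	return base;
}
```

With current_limit_low set to the end of the kernel image, early allocations
in this model stack up right after the image. Step 4 would then amount to
setting the floor back to 0 and switching to the usual top-down search, which
the sketch leaves out.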


What do you think of this? Or do you have a better idea?


Thanks for your patience and help. :)


On 08/13/2013 02:14 PM, Tang Chen wrote:
> On 08/13/2013 12:46 AM, Tejun Heo wrote:
> ......
>>
>> * Adding an option to tell the kernel to try to stay away from
>> hotpluggable nodes is fine. I have no problem with that at all.
>>
>> * The patchsets up to this point have been trying to reorder
>> operations somehow such that *no* memory allocation happens before
>> memblock is populated with hotplug information.
>>
>> * However, we already *know* that the memory the kernel image is
>> occupying won't be removable. It's highly likely that the amount
>> of memory allocation before NUMA / hotplug information is fully
>> populated is pretty small. Also, it's highly likely that a small
>> amount of memory right after the kernel image is contained in the
>> same NUMA node, so if we allocate memory close to the kernel image,
>> it's likely that we don't contaminate a hotpluggable node. We're
>> talking about a few megs at most right after the kernel image. I
>> can't see how that would make any noticeable difference.
>>
>> * Once hotplug information is available, allocation can happen as
>> usual and the kernel can report the nodes which are actually
>> hotpluggable - marked as hotpluggable by the firmware && didn't get
>> contaminated during early alloc && didn't get overflow allocations
>> afterwards. Note that we need such a mechanism no matter what as the
>> kernel image can be loaded into hotpluggable nodes and reporting
>> that to userland is the only thing the kernel can do for cases like
>> that short of denying memory unplug on such nodes.
>>
>
> Hi tj, hpa, luck, yinghai,
>
> So if all of you agree on the idea above from tj, I think
> we can do it in this way. Will update the patches to allocate
> memory near kernel image before SRAT is parsed.
>
> Thanks.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-13  9:56               ` Tang Chen
@ 2013-08-13 14:38                 ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-13 14:38 UTC (permalink / raw)
  To: Tang Chen
  Cc: Tang Chen, H. Peter Anvin, robert.moore, lv.zheng, rjw, lenb,
	tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, Luck, Tony (tony.luck@intel.com)

Hello, Tang.

On Tue, Aug 13, 2013 at 05:56:46PM +0800, Tang Chen wrote:
> 1. Introduce a memblock.current_limit_low to limit the lowest address
>    that memblock can use.
> 
> 2. Make memblock able to allocate memory from low to high.
> 
> 3. Get kernel image address on x86, and set memblock.current_limit_low
>    to it before SRAT is parsed. Then we achieve the goal.
> 
> 4. Reset it to 0, and make memblock allocate memory from high to low.
> 
> What do you think of this? Or do you have a better idea?

Yes, something like that.  Maybe have something like
memblock_set_alloc_range(low, high, low_to_high) in memblock?  Once
NUMA info is available arch code can call memblock_set_alloc_range(0,
0, false) to reset it to the default behavior.
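
A minimal sketch of how such an interface might behave, written as a
user-space model rather than kernel code. Only the name
memblock_set_alloc_range() and its three arguments come from the suggestion
above; the toy_ names, the single-range memory model, and the reading of
high == 0 as "no upper limit" are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one RAM range whose free part is [bottom, top). */
struct toy_memblock {
	uint64_t bottom;	/* moves up on low-to-high allocations */
	uint64_t top;		/* moves down on high-to-low allocations */
	uint64_t low;		/* allocation window start */
	uint64_t high;		/* allocation window end, 0 = no upper limit */
	bool low_to_high;
};

static void toy_init(struct toy_memblock *mb, uint64_t start, uint64_t end)
{
	mb->bottom = start;
	mb->top = end;
	mb->low = 0;
	mb->high = 0;
	mb->low_to_high = false;	/* default: allocate from the top down */
}

/* Models the suggested memblock_set_alloc_range(low, high, low_to_high). */
static void toy_set_alloc_range(struct toy_memblock *mb, uint64_t low,
				uint64_t high, bool low_to_high)
{
	mb->low = low;
	mb->high = high;
	mb->low_to_high = low_to_high;
}

/* Returns the allocated base address, or 0 on failure. */
static uint64_t toy_alloc(struct toy_memblock *mb, uint64_t size,
			  uint64_t align)
{
	uint64_t lo = mb->bottom > mb->low ? mb->bottom : mb->low;
	uint64_t hi = (mb->high != 0 && mb->high < mb->top) ? mb->high : mb->top;
	uint64_t base;

	assert(align != 0 && (align & (align - 1)) == 0);	/* power of two */
	if (size == 0 || lo >= hi || hi - lo < size)
		return 0;

	if (mb->low_to_high) {
		base = (lo + align - 1) & ~(align - 1);
		if (base > hi || hi - base < size)
			return 0;
		/* Simplification: space skipped below base is not reclaimed. */
		mb->bottom = base + size;
	} else {
		base = (hi - size) & ~(align - 1);
		if (base < lo)
			return 0;
		/* Simplification: space skipped above base is not reclaimed. */
		mb->top = base;
	}
	return base;
}
```

In this model, arch code would call toy_set_alloc_range(&mb, kernel_end, 0,
true) early in boot so that allocations land just above the kernel image, and
toy_set_alloc_range(&mb, 0, 0, false) once NUMA info is available, after which
allocations come from the top of memory again.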

> Thanks for your patience and help. :)

Heh, sorry about all the roundabouts.  Your persistence is much
appreciated. :)

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-13  9:56               ` Tang Chen
@ 2013-08-13 22:33                 ` Yinghai Lu
  -1 siblings, 0 replies; 165+ messages in thread
From: Yinghai Lu @ 2013-08-13 22:33 UTC (permalink / raw)
  To: Tang Chen
  Cc: Tejun Heo, Tang Chen, H. Peter Anvin, Bob Moore, Lv Zheng,
	Rafael J. Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel, jwe

On Tue, Aug 13, 2013 at 2:56 AM, Tang Chen <tangchen@cn.fujitsu.com> wrote:
> 2. There are several places calling memblock_find_in_range_node() to
>    allocate memory before SRAT is parsed.
>
>    early_reserve_e820_mpc_new()

this one is under 1M.

>    reserve_real_mode()

this one is under 1M

>    init_mem_mapping()

We now allocate top-down, so the initial page tables are in BRK and the other
page tables are near the top!

>    setup_log_buf()

user could specify 4M or more.

>    relocate_initrd()

The size could be very big, like several hundred megabytes.
It could be placed anywhere, but it will be freed after booting.

===> so we should not limit it to the near-kernel range.

>    acpi_initrd_override()

should be 64 * 10 about 1M.

>    reserve_crashkernel()

It could be under 4G or above 4G.
The size could be 512M or 8G, whatever.

Looks like we should move relocate_initrd() and reserve_crashkernel() down.

Yinghai

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-13 22:33                 ` Yinghai Lu
@ 2013-08-14  1:22                   ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-14  1:22 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Tejun Heo, Tang Chen, H. Peter Anvin, Bob Moore, Lv Zheng,
	Rafael J. Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis

On 08/14/2013 06:33 AM, Yinghai Lu wrote:
......
>
>>     relocate_initrd()
>
> The size could be very big, like several hundred megabytes.
> It could be placed anywhere, but it will be freed after booting.
>
> ===>  so we should not limit it to the near-kernel range.
>
>>     acpi_initrd_override()
>
> should be 64 * 10 about 1M.
>
>>     reserve_crashkernel()
>
> It could be under 4G or above 4G.
> The size could be 512M or 8G, whatever.
>
> Looks like we should move relocate_initrd() and reserve_crashkernel() down.

OK, will try to do this.

Thank you for the explanation. :)

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 18:07               ` Tejun Heo
@ 2013-08-14 18:15                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 165+ messages in thread
From: KOSAKI Motohiro @ 2013-08-14 18:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, kosaki.motohiro

(8/12/13 2:07 PM), Tejun Heo wrote:
> Hey,
>
> On Tue, Aug 13, 2013 at 01:01:09AM +0800, Tang Chen wrote:
>> Sorry for the misunderstanding.
>>
>> I was trying to answer your question: "Why can't the kernel allocate
>> hotpluggable memory opportunistically?".
>
> I've used the wrong word, I was meaning best-effort, which is the only
> thing we can do anyway given that we have no control over where the
> kernel image is linked in relation to NUMA nodes.
>
>> If the kernel has any opportunity to allocate hotpluggable memory in
>> SRAT, then the kernel should tell users which memory is hotpluggable.
>>
>> But in what way?  I think node is the best for now. But a node could
>> have a lot of memory. If the kernel uses only a little memory, we will
>> lose the whole movable node, which I don't want to do.
>>
>> So, I don't want to allow the kernel to allocate hotpluggable memory
>> opportunistically.
>
> What I was saying was that the kernel should try !hotpluggable memory
> first then fall back to hotpluggable memory instead of failing boot as
> nothing really is worse than failing to boot.

I don't follow this. We need to think about why memory hotplug is necessary:
because a system reboot is unacceptable for several critical services. So
if someone sets a wrong boot option, the system SHOULD fail to boot. At that
time, the admin has a chance to fix the mistake. On the other hand, once the
production service is running, they have no chance to fix it. In general, a default
boot option should have a fallback and a non-default option should not have a
fallback. That's a fundamental rule.



^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-12 17:23               ` H. Peter Anvin
@ 2013-08-14 18:22                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 165+ messages in thread
From: KOSAKI Motohiro @ 2013-08-14 18:22 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Tang Chen, Tejun Heo, Tang Chen, robert.moore, lv.zheng, rjw,
	lenb, tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi, kosaki.motohiro

(8/12/13 1:23 PM), H. Peter Anvin wrote:
> On 08/12/2013 10:01 AM, Tang Chen wrote:
>>>
>>>> I'm just thinking of a more extreme case. For example, if a machine
>>>> has only one node hotpluggable, and the kernel resides in that node.
>>>> Then the system has no hotpluggable node.
>>>
>>> Yeah, sure, then there's no way that node can be hotpluggable and the
>>> right thing to do is booting up the machine and informing the userland
>>> that memory is not hotpluggable.
>>>
>>>> If we can prevent the kernel from using hotpluggable memory, in such
>>>> a machine, users can still do memory hotplug.
>>>>
>>>> I wanted to do it as generic as possible. But yes, finding out the
>>>> nodes the kernel resides in and make it unhotpluggable can work.
>>>
>>> Short of being able to remap memory under the kernel, I don't think
>>> this can be very generic and as a compromise trying to keep as many
>>> hotpluggable nodes as possible doesn't sound too bad.
>>
>> I think making one of the nodes hotpluggable is better. But OK, it is
>> no big deal. There won't be such a machine in reality, I think. :)
>>
>
> The user may very well have configured a system with mirrored memory for
> the kernel node as that will be non-hotpluggable, but not for the
> others.  One can wonder how much that actually buys in real life, but
> still...

Note that such a system is much cheaper than a full memory mirroring system.
That's one of the reasons why server vendors are interested in hot plugging.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 18:15                 ` KOSAKI Motohiro
@ 2013-08-14 18:23                   ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-14 18:23 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hello,

On Wed, Aug 14, 2013 at 02:15:44PM -0400, KOSAKI Motohiro wrote:
> I don't follow this. We need to think about why memory hotplug is necessary:
> because a system reboot is unacceptable for several critical services. So
> if someone sets a wrong boot option, the system SHOULD fail to boot. At that
> time, the admin has a chance to fix the mistake. On the other hand, once the
> production service is running, they have no chance to fix it. In general, a default
> boot option should have a fallback and a non-default option should not have a
> fallback. That's a fundamental rule.

The fundamental rule is that the system has to boot.  Your argument is
pointless as the kernel has no control over where its own image is
placed w.r.t. hotpluggable nodes.  So, are we gonna fail boot if
kernel image intersects hotpluggable node and the option is specified
even if memory hotplug can be used on other nodes?  That doesn't make
any sense.

Failing to boot is a *way* worse reporting mechanism than almost
anything else.  If the sysadmin is willing to risk machines failing
to come up, she would definitely be willing to check which
memory areas are actually hotpluggable too, right?
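
The policy argued for here (prefer non-hotpluggable memory, fall back instead
of failing, then report what is still hotpluggable) could be sketched roughly
as follows. This is an illustrative user-space model, not code from the
patchset; all toy_ names and the per-node bookkeeping are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node bookkeeping; not the kernel's data structures. */
struct toy_node {
	uint64_t free_bytes;
	bool hotpluggable;	/* as marked by the firmware (SRAT) */
	bool contaminated;	/* kernel memory ended up on this node */
};

/* Try every node of one kind; returns the node id used, or -1. */
static int toy_alloc_pass(struct toy_node *nodes, size_t nr, uint64_t size,
			  bool hotpluggable)
{
	for (size_t i = 0; i < nr; i++) {
		if (nodes[i].hotpluggable != hotpluggable ||
		    nodes[i].free_bytes < size)
			continue;
		nodes[i].free_bytes -= size;
		if (hotpluggable)
			nodes[i].contaminated = true;
		return (int)i;
	}
	return -1;
}

/* Best effort: fails only when no node at all can satisfy the request. */
static int toy_alloc_best_effort(struct toy_node *nodes, size_t nr,
				 uint64_t size)
{
	int nid;

	assert(size > 0);
	nid = toy_alloc_pass(nodes, nr, size, false);	/* !hotpluggable first */
	if (nid < 0)
		nid = toy_alloc_pass(nodes, nr, size, true);	/* fall back */
	return nid;
}

/* What the kernel could report to userland for a node. */
static bool toy_node_still_hotpluggable(const struct toy_node *node)
{
	return node->hotpluggable && !node->contaminated;
}
```

In this model a fallback allocation never stops the boot; it only makes
toy_node_still_hotpluggable() return false for the affected node, which is
the information a sysadmin would check afterwards.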

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 18:23                   ` Tejun Heo
@ 2013-08-14 19:40                     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 165+ messages in thread
From: KOSAKI Motohiro @ 2013-08-14 19:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KOSAKI Motohiro, Tang Chen, Tang Chen, robert.moore, lv.zheng,
	rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, yanghy, x86, linux-doc, linux-kernel,
	linux-mm, linux-acpi

(8/14/13 2:23 PM), Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 14, 2013 at 02:15:44PM -0400, KOSAKI Motohiro wrote:
>> I don't follow this. We need to think about why memory hotplug is necessary:
>> because a system reboot is unacceptable for several critical services. So,
>> if someone sets a wrong boot option, the system SHOULD fail to boot. At that time,
>> the admin has a chance to fix the mistake. On the other hand, once a production
>> service is running, there is no chance to fix the mistake. In general, a default
>> boot option should have a fallback and a non-default option should not have a
>> fallback. That's a fundamental rule.
>
> The fundamental rule is that the system has to boot.

I don't agree with that. Please look at other kernel options. A lot of them don't
follow your rule. They behave as directives, not advice.

I mean the fallback should be implemented when the feature is turned on by default.


>  Your argument is
> pointless as the kernel has no control over where its own image is
> placed w.r.t. hotpluggable nodes.  So, are we gonna fail boot if
> kernel image intersects hotpluggable node and the option is specified
> even if memory hotplug can be used on other nodes?  That doesn't make
> any sense.

I haven't read the whole discussion, and I don't quite understand why the lack of
control over kernel placement is relevant. Every non-hotpluggable node is suitable
for the kernel. If you mean the current kernel placement logic doesn't account for
hotplug, that's a bug.

If we aim to support hot remove, we need either kernel relocation or
hotplug-aware kernel placement at boot time.

> Failing to boot is a *way* worse reporting mechanism than almost
> everything else.  If the sysadmin is willing to risk machines failing
> to come up, she would definitely be willing to check which memory
> areas are actually hotpluggable too, right?

No. See above. Your opinion is not pragmatically useful.


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 19:40                     ` KOSAKI Motohiro
@ 2013-08-14 19:55                       ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-14 19:55 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hello,

On Wed, Aug 14, 2013 at 03:40:31PM -0400, KOSAKI Motohiro wrote:
> I don't agree with that. Please look at other kernel options. A lot of them don't
> follow your rule. They behave as directives, not advice.
> 
> I mean the fallback should be implemented when the feature is turned on by default.

Yeah, some options are "please try this" and others "do this or fail".
There's no frigging fundamental rule there.

> I haven't read the whole discussion, and I don't quite understand why the lack of
> control over kernel placement is relevant. Every non-hotpluggable node is suitable
> for the kernel. If you mean the current kernel placement logic doesn't account for
> hotplug, that's a bug.
> 
> If we aim to support hot remove, we need either kernel relocation or
> hotplug-aware kernel placement at boot time.

What if all nodes are hot pluggable?  Are we moving the kernel
dynamically then?

> >Failing to boot is a *way* worse reporting mechanism than almost
> >everything else.  If the sysadmin is willing to risk machines failing
> >to come up, she would definitely be willing to check which memory
> >areas are actually hotpluggable too, right?
> 
> No. See above. Your opinion is not pragmatically useful.

No, what you're saying doesn't make any sense.  There are multiple
ways to report when something doesn't work.  Failing to boot is *one*
of them and not a very good one.  Here, for practical reasons, the end
result may differ depending on the specifics of the configuration, so
more detailed reporting is necessary anyway, so why do you insist on
failing the boot?  In what world is it a good thing for the machine to
fail boot after bios or kernel update?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 19:55                       ` Tejun Heo
@ 2013-08-14 20:29                         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 165+ messages in thread
From: KOSAKI Motohiro @ 2013-08-14 20:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KOSAKI Motohiro, Tang Chen, Tang Chen, robert.moore, lv.zheng,
	rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, yanghy, x86, linux-doc, linux-kernel,
	linux-mm, linux-acpi

(8/14/13 3:55 PM), Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 14, 2013 at 03:40:31PM -0400, KOSAKI Motohiro wrote:
>> I don't agree with that. Please look at other kernel options. A lot of them don't
>> follow your rule. They behave as directives, not advice.
>>
>> I mean the fallback should be implemented when the feature is turned on by default.
>
> Yeah, some options are "please try this" and others "do this or fail".
> There's no frigging fundamental rule there.

In this case, a fallback has zero value, right?


>> I haven't read the whole discussion, and I don't quite understand why the lack of
>> control over kernel placement is relevant. Every non-hotpluggable node is suitable
>> for the kernel. If you mean the current kernel placement logic doesn't account for
>> hotplug, that's a bug.
>>
>> If we aim to support hot remove, we need either kernel relocation or
>> hotplug-aware kernel placement at boot time.
>
> What if all nodes are hot pluggable?  Are we moving the kernel
> dynamically then?

Intel folks already told us that no such system exists in practice.


>>> Failing to boot is a *way* worse reporting mechanism than almost
>>> everything else.  If the sysadmin is willing to risk machines failing
>>> to come up, she would definitely be willing to check which memory
>>> areas are actually hotpluggable too, right?
>>
>> No. See above. Your opinion is not pragmatically useful.
>
> No, what you're saying doesn't make any sense.  There are multiple
> ways to report when something doesn't work.  Failing to boot is *one*
> of them and not a very good one.  Here, for practical reasons, the end
> result may differ depending on the specifics of the configuration, so
> more detailed reporting is necessary anyway, so why do you insist on
> failing the boot?  In what world is it a good thing for the machine to
> fail boot after bios or kernel update?

Because a boot failure cannot be overlooked, which makes it the better way in practice.



^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 20:29                         ` KOSAKI Motohiro
@ 2013-08-14 20:30                           ` H. Peter Anvin
  -1 siblings, 0 replies; 165+ messages in thread
From: H. Peter Anvin @ 2013-08-14 20:30 UTC (permalink / raw)
  To: KOSAKI Motohiro, Tejun Heo
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

There are systems which can.  They have the ability to remap in hardware.

KOSAKI Motohiro <kosaki.motohiro@gmail.com> wrote:
>(8/14/13 3:55 PM), Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Aug 14, 2013 at 03:40:31PM -0400, KOSAKI Motohiro wrote:
>>> I don't agree with that. Please look at other kernel options. A lot of them don't
>>> follow your rule. They behave as directives, not advice.
>>>
>>> I mean the fallback should be implemented when the feature is turned on by default.
>>
>> Yeah, some options are "please try this" and others "do this or fail".
>> There's no frigging fundamental rule there.
>
>In this case, a fallback has zero value, right?
>
>
>>> I haven't read the whole discussion, and I don't quite understand why the lack of
>>> control over kernel placement is relevant. Every non-hotpluggable node is suitable
>>> for the kernel. If you mean the current kernel placement logic doesn't account for
>>> hotplug, that's a bug.
>>>
>>> If we aim to support hot remove, we need either kernel relocation or
>>> hotplug-aware kernel placement at boot time.
>>
>> What if all nodes are hot pluggable?  Are we moving the kernel
>> dynamically then?
>
>Intel folks already told us that no such system exists in practice.
>
>
>>>> Failing to boot is a *way* worse reporting mechanism than almost
>>>> everything else.  If the sysadmin is willing to risk machines failing
>>>> to come up, she would definitely be willing to check which memory
>>>> areas are actually hotpluggable too, right?
>>>
>>> No. See above. Your opinion is not pragmatically useful.
>>
>> No, what you're saying doesn't make any sense.  There are multiple
>> ways to report when something doesn't work.  Failing to boot is *one*
>> of them and not a very good one.  Here, for practical reasons, the end
>> result may differ depending on the specifics of the configuration, so
>> more detailed reporting is necessary anyway, so why do you insist on
>> failing the boot?  In what world is it a good thing for the machine to
>> fail boot after bios or kernel update?
>
>Because a boot failure cannot be overlooked, which makes it the better way in practice.

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 20:29                         ` KOSAKI Motohiro
@ 2013-08-14 20:35                           ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-14 20:35 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Wed, Aug 14, 2013 at 04:29:05PM -0400, KOSAKI Motohiro wrote:
> Because a boot failure cannot be overlooked, which makes it the better way in practice.

That's an extremely poor excuse.  We favor WARNs over BUGs for good
reasons.  If a sysadmin cares about hotplug and can't deal with the
system successfully booting, it's *trivial* to make the system behave
in a way which has no chance of being overlooked.  What's next?
Panicking if somebody echoes an invalid value to an important knob file?
We sure don't want that to be overlooked either, right?

This discussion is so dumb.  Please stop.

Thanks.

-- 
tejun
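
The knob-file comparison above can be illustrated with a small userspace sketch. The knob, its 0..100 range and the function name are made up for illustration; the point is only that bad input is rejected with a warning and an error code while the previous value and the running system are left alone.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical tunable; the name and the 0..100 range are made up. */
static long demo_knob = 10;

/*
 * Parse a value written to the knob.  On bad input, log a warning and
 * return -EINVAL, leaving the old value in place rather than aborting.
 */
static int demo_knob_store(const char *buf)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno || end == buf || (*end != '\0' && *end != '\n') ||
	    val < 0 || val > 100) {
		fprintf(stderr, "warning: ignoring invalid knob value\n");
		return -EINVAL;
	}
	demo_knob = val;
	return 0;
}
```

The caller sees the error immediately and can retry with a valid value, which is the same visibility a panic would give without taking the machine down.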

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 20:35                           ` Tejun Heo
@ 2013-08-14 21:17                             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 165+ messages in thread
From: KOSAKI Motohiro @ 2013-08-14 21:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KOSAKI Motohiro, Tang Chen, Tang Chen, robert.moore, lv.zheng,
	rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, yanghy, x86, linux-doc, linux-kernel,
	linux-mm, linux-acpi

(8/14/13 4:35 PM), Tejun Heo wrote:
> On Wed, Aug 14, 2013 at 04:29:05PM -0400, KOSAKI Motohiro wrote:
>> Because a boot failure cannot be overlooked, which makes it the better way in practice.
>
> That's an extremely poor excuse.  We favor WARNs over BUGs for good
> reasons.  If a sysadmin cares about hotplug and can't deal with the
> system successfully booting, it's *trivial* to make the system behave
> in a way which has no chance of being overlooked.  What's next?
> Panicking if somebody echoes an invalid value to an important knob file?
> We sure don't want that to be overlooked either, right?
>
> This discussion is so dumb.  Please stop.

You haven't explained the practical benefit of your opinion. As long as users get
no benefit, I will never agree. Sorry.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 21:17                             ` KOSAKI Motohiro
@ 2013-08-14 21:36                               ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-14 21:36 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Wed, Aug 14, 2013 at 05:17:23PM -0400, KOSAKI Motohiro wrote:
> You haven't explained the practical benefit of your opinion. As long as users get
> no benefit, I will never agree. Sorry.

Umm... how about being more robust and actually useable to begin with?
What's the benefit of panicking?  Are you seriously saying that the
admin / boot script can use the kernel boot param to tell the kernel
to enable hotplug but can't check what nodes are hot unpluggable
afterwards?  The admin *needs* to check which nodes are hotpluggable
no matter how this part is handled.  How else is it gonna know which
nodes are hotpluggable?  Magic?

There's no such rule as kernel param should make the kernel panic if
it's not happy, so please take that out of your brain.  It of course
should be clear what the result of the kernel parameter is and
panicking is the crudest way to do that which is good enough or even
desirable in *some* cases.  It is not the required behavior by any
stretch of imagination, especially when the result of the parameter may
change due to changing circumstances.  That's an outright idiotic
thing to do.

-- 
tejun
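
One way an admin or boot script might do the post-boot check mentioned above is to read a per-block attribute from sysfs. The sketch below assumes an attribute file holding a single '0' or '1', such as /sys/devices/system/memory/memoryN/removable; whether that file reflects the result of this patch series is an assumption, and the helper name is hypothetical.

```c
#include <stdio.h>

/*
 * Read a one-character boolean attribute such as
 * /sys/devices/system/memory/memoryN/removable.
 * Returns 1 or 0 for the value read, -1 if the file is missing or malformed.
 */
static int demo_read_flag(const char *path)
{
	FILE *f = fopen(path, "r");
	int c;

	if (!f)
		return -1;
	c = fgetc(f);
	fclose(f);
	if (c == '1')
		return 1;
	if (c == '0')
		return 0;
	return -1;
}
```

A script could loop over the memory block directories, call the helper on each one, and raise whatever alarm the site requires if the expected blocks are not reported as removable.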

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 5/7] memblock, mem_hotplug: Make memblock skip hotpluggable regions by default.
  2013-08-08 10:16   ` Tang Chen
@ 2013-08-14 21:54     ` Naoya Horiguchi
  -1 siblings, 0 replies; 165+ messages in thread
From: Naoya Horiguchi @ 2013-08-14 21:54 UTC (permalink / raw)
  To: Tang Chen
  Cc: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Thu, Aug 08, 2013 at 06:16:17PM +0800, Tang Chen wrote:
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
...
> @@ -719,6 +723,10 @@ void __init_memblock __next_free_mem_range_rev(u64 *idx, int nid,
>  		if (nid != MAX_NUMNODES && nid != memblock_get_region_node(m))
>  			continue;
>  
> +		/* skip hotpluggable memory regions */
> +		if (m->flags & MEMBLOCK_HOTPLUG)
> +			continue;
> +
>  		/* scan areas before each reservation for intersection */
>  		for ( ; ri >= 0; ri--) {
>  			struct memblock_region *r = &rsv->regions[ri];
> -- 

Why don't you add this also in __next_free_mem_range()?

Thanks,
Naoya Horiguchi
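
The question above concerns applying the same filter to the forward iterator. As a rough userspace sketch only (not the kernel's memblock code), the idea can be modelled as below. The `struct region` type, the `next_usable_region()` helper and the flag value are assumptions made for illustration; only the `MEMBLOCK_HOTPLUG` name and the skip check come from the quoted patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MEMBLOCK_HOTPLUG 0x1UL	/* name from the quoted patch; value assumed */

struct region {			/* simplified stand-in for struct memblock_region */
	uint64_t base;
	uint64_t size;
	unsigned long flags;
};

/*
 * Forward walk over regs[0..n). Returns the index of the next region at or
 * after *idx that is not marked hotpluggable and advances *idx past it.
 * Returns -1 when no such region is left.
 */
static int next_usable_region(const struct region *regs, size_t n, size_t *idx)
{
	assert(regs != NULL && idx != NULL);

	while (*idx < n) {
		size_t cur = (*idx)++;

		/* skip hotpluggable memory regions, as the reverse walk does */
		if (regs[cur].flags & MEMBLOCK_HOTPLUG)
			continue;
		return (int)cur;
	}
	return -1;
}
```

A caller would start with `idx = 0` and call the helper repeatedly until it returns -1; every region flagged hotpluggable is then passed over in the forward direction, just as the patch does for the reverse direction.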

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14 21:36                               ` Tejun Heo
@ 2013-08-15  1:08                                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 165+ messages in thread
From: KOSAKI Motohiro @ 2013-08-15  1:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KOSAKI Motohiro, Tang Chen, Tang Chen, robert.moore, lv.zheng,
	rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, yanghy, x86, linux-doc, linux-kernel,
	linux-mm, linux-acpi

(8/14/13 5:36 PM), Tejun Heo wrote:
> On Wed, Aug 14, 2013 at 05:17:23PM -0400, KOSAKI Motohiro wrote:
>> You haven't explain practical benefit of your opinion. As far as users have
>> no benefit, I'm never agree. Sorry.
> 
> Umm... how about being more robust and actually useable to begin with?
> What's the benefit of panicking?  Are you seriously saying that the
> admin / boot script can use the kernel boot param to tell the kernel
> to enable hotplug but can't check what nodes are hot unpluggable
> afterwards?  The admin *needs* to check which nodes are hotpluggable
> no matter how this part is handled.  How else is it gonna know which
> nodes are hotpluggable?  Magic?
> 
> There's no such rule as kernel param should make the kernel panic if
> it's not happy, so please take that out of your brain.  It of course
> should be clear what the result of the kernel parameter is and
> panicking is the crudest way to do that which is good enough or even
> desriable in *some* cases.  It is not the required behavior by any
> stretch of imgination, especially when the result of the parameter may
> change due to changing circumstances.  That's an outright idiotic
> thing to do.

Sigh, I'd like to point to a link to the past discussion, but I can't find it now.
Let me summarize the past discussion as far as possible.

Firstly, technically you can't implement a correct fallback. You used the term
"when can't allocate memory", but it's not so simple. Consider the following scenario:
memory is enough for the kernel image, but the kernel will load memory-hogging drivers.
The system will crash within 1 min after boot. So the MM subsystem doesn't believe in
a fallback. A bogus and misguided fallback gives users false relief, and they don't
notice their mistake quickly. The answer is that there is a fundamental rule.
We have always said, "measure your system carefully, and set options carefully too".
I have not seen any reason to make an exception in this case.

Secondly, memory hotplug is now maintained by me and kamezawa-san. So I am very likely
to get hotplug-related bug reports. To protect my life, I don't
want to get false bug claims. So I wouldn't like to aim for an incomplete fallback. When
an admin makes a mistake, they should shoot their own foot, not me!

Thirdly, I haven't insist to aim verbose and kind messages as last breath. It much
likely help users. 

Last, we are now discussing the hotplug feature, so we can assume a hotpluggable machine.
They have a hotplug interface in firmware by definition. So, you need to aim a magic.



^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15  1:08                                 ` KOSAKI Motohiro
@ 2013-08-15  1:21                                   ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-15  1:21 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hello, KOSAKI.

On Wed, Aug 14, 2013 at 09:08:22PM -0400, KOSAKI Motohiro wrote:
...
> a fallback. Bogus and misguided fallback give a user false relief and they don't
> notice their mistake quickly. The answer is, there is the fundamental rule.
> We always said, "measure your system carefully, and setting option carefully too".
> I have no seen any reason to make exception in this case.

Ugh... that is one stupid rule.  Sure, there are cases when those
aren't avoidable but sticking to that when there are better ways to do
it is stupid.  Why would you make it finicky when you don't have to?
That makes no sense.

> Secondly, memory hotplug is now maintained I and kamezawa-san. Then, I much likely
> have a chance to get a hotplug related bug report. For protecting my life, I don't
> want get a false bug claim. Then, I wouldn't like to aim incomplete fallback. When
> an admin makes mistake, they should shoot their foot, not me!

Dude, it's not cool to cause users' machine to fail boot because you
want bug report.  You don't do that.  There are other ways to achieve
that.  When the kernel can't make all hotpluggable nodes hotpluggable
(I mean, it's not necessarily node aligned to begin with), generate
warning and a debug dump with appropriate log levels.
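
To make the "warn instead of failing boot" behaviour concrete, here is a minimal userspace sketch, not kernel code. It prefers non-hotpluggable regions and, only when none fits, warns and falls back to a hotpluggable one. The `struct region` type, `REGION_HOTPLUG`, `find_region()` and `alloc_with_fallback()` are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define REGION_HOTPLUG 0x1UL	/* hypothetical flag, mirrors MEMBLOCK_HOTPLUG */

struct region {			/* hypothetical, simplified memory region */
	uint64_t free_bytes;
	unsigned long flags;
};

/* First region with enough free space; optionally skip hotpluggable ones. */
static int find_region(const struct region *regs, size_t n, uint64_t size,
		       int skip_hotplug)
{
	for (size_t i = 0; i < n; i++) {
		if (skip_hotplug && (regs[i].flags & REGION_HOTPLUG))
			continue;
		if (regs[i].free_bytes >= size)
			return (int)i;
	}
	return -1;
}

/*
 * Prefer non-hotpluggable memory. If that fails, warn and retry with
 * hotpluggable regions instead of giving up. Returns the region index or
 * -1 if nothing fits; *fell_back reports whether the fallback was used.
 */
static int alloc_with_fallback(struct region *regs, size_t n, uint64_t size,
			       int *fell_back)
{
	int i;

	assert(regs != NULL && fell_back != NULL);
	*fell_back = 0;

	i = find_region(regs, n, size, 1);
	if (i < 0) {
		i = find_region(regs, n, size, 0);
		if (i < 0)
			return -1;
		*fell_back = 1;
		fprintf(stderr,
			"warning: region %d is hotpluggable but now holds a fixed allocation\n",
			i);
	}
	regs[i].free_bytes -= size;
	return i;
}
```

The design point is that the caller still learns, through `fell_back` and the warning, that a region can no longer be hot-removed, while the allocation itself succeeds.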

If you think causing users' machines to fail boot indeterministically is
acceptable, you really shouldn't be maintaining anything.  What is
this?  Are you nuts?

> Thirdly, I haven't insist to aim verbose and kind messages as last breath. It much
> likely help users. 

I have no idea what you're trying to say.

> Last, we are now discussing hotplug feature. Then, we can assume hotpluggable machine.
> They have a hotplug interface in farmware by definition. So, you need to aim a magic.

This is by no way magic.  It's a band-aid feature which aims to
achieve some portion of functionality with minimal impact on the rest
of code / runtime overhead.  If you wanna nack the whole thing, be my
guest.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15  1:21                                   ` Tejun Heo
@ 2013-08-15  1:33                                     ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-15  1:33 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Wed, Aug 14, 2013 at 09:21:33PM -0400, Tejun Heo wrote:
> > Secondly, memory hotplug is now maintained I and kamezawa-san. Then, I much likely
> > have a chance to get a hotplug related bug report. For protecting my life, I don't
> > want get a false bug claim. Then, I wouldn't like to aim incomplete fallback. When
> > an admin makes mistake, they should shoot their foot, not me!
> 
> Dude, it's not cool to cause users' machine to fail boot because you
> want bug report.  You don't do that.  There are other ways to achieve
> that.  When the kernel can't make all hotpluggable nodes hotpluggable
> (I mean, it's not necessarily node aligned to begin with), generate
> warning and a debug dump with appropriate log levels.
> 
> If you think causing users' machine fail boot indetermistically is
> acceptable, you really shouldn't be maintaining anything.  What is
> this?  Are you nuts?

This is doubly idiotic because this is all early boot.  Most users
don't even have a way to access the debug info if the machine crashes
that early.  Development convenience is something that we consider
too but, seriously, users come first.  This is not your personal
playground.  Don't frigging crash if you have any other option.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15  1:21                                   ` Tejun Heo
@ 2013-08-15  1:38                                     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 165+ messages in thread
From: KOSAKI Motohiro @ 2013-08-15  1:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KOSAKI Motohiro, Tang Chen, Tang Chen, robert.moore, lv.zheng,
	rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, yanghy, x86, linux-doc, linux-kernel,
	linux-mm, linux-acpi

(8/14/13 9:21 PM), Tejun Heo wrote:
> Hello, KOSAKI.
>
> On Wed, Aug 14, 2013 at 09:08:22PM -0400, KOSAKI Motohiro wrote:
> ...
>> a fallback. Bogus and misguided fallback give a user false relief and they don't
>> notice their mistake quickly. The answer is, there is the fundamental rule.
>> We always said, "measure your system carefully, and setting option carefully too".
>> I have no seen any reason to make exception in this case.
>
> Ugh... that is one stupid rule.  Sure, there are cases when those
> aren't avoidable but sticking to that when there are better ways to do
> it is stupid.  Why would you make it finicky when you don't have to?
> That makes no sense.

Just as you think mine makes no sense, I also think your position makes no sense. So please
stop using emotional words. That doesn't help the discussion progress.


>> Secondly, memory hotplug is now maintained I and kamezawa-san. Then, I much likely
>> have a chance to get a hotplug related bug report. For protecting my life, I don't
>> want get a false bug claim. Then, I wouldn't like to aim incomplete fallback. When
>> an admin makes mistake, they should shoot their foot, not me!
>
> Dude, it's not cool to cause users' machine to fail boot because you
> want bug report.  You don't do that.  There are other ways to achieve
> that.  When the kernel can't make all hotpluggable nodes hotpluggable
> (I mean, it's not necessarily node aligned to begin with), generate
> warning and a debug dump with appropriate log levels.

If the user were you, I would agree. But I know users don't react that way.

> If you think causing users' machine fail boot indetermistically is
> acceptable, you really shouldn't be maintaining anything.  What is
> this?  Are you nuts?

Again, there is no perfect solution if an admin is truly stupid. We can only
suggest "you are wrong, not the kernel", but nothing further. I'm sure kernel
logging alone doesn't help because they don't read it, and they say nobody reads such
plentiful, developer-oriented messages. I may accept any strong notification, but
still, I don't think it's worth it. The only sane way is for an admin to realize their
mistake and fix it themselves.


>> Thirdly, I haven't insist to aim verbose and kind messages as last breath. It much
>> likely help users.
>
> I have no idea what you're trying to say.

I meant, "which is verbose" makes no sense. I don't take it.


>> Last, we are now discussing hotplug feature. Then, we can assume hotpluggable machine.
>> They have a hotplug interface in farmware by definition. So, you need to aim a magic.
>
> This is by no way magic.  It's a band-aid feature which aims to
> achieve some portion of functionality with minimal impact on the rest
> of code / runtime overhead.  If you wanna nack the whole thing, be my
> guest.

Huh? No fallback means no additional code. I can't imagine how no code creates runtime overhead.



^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15  1:33                                     ` Tejun Heo
@ 2013-08-15  1:44                                       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 165+ messages in thread
From: KOSAKI Motohiro @ 2013-08-15  1:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KOSAKI Motohiro, Tang Chen, Tang Chen, robert.moore, lv.zheng,
	rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, yanghy, x86, linux-doc, linux-kernel,
	linux-mm, linux-acpi

(8/14/13 9:33 PM), Tejun Heo wrote:
> On Wed, Aug 14, 2013 at 09:21:33PM -0400, Tejun Heo wrote:
>>> Secondly, memory hotplug is now maintained I and kamezawa-san. Then, I much likely
>>> have a chance to get a hotplug related bug report. For protecting my life, I don't
>>> want get a false bug claim. Then, I wouldn't like to aim incomplete fallback. When
>>> an admin makes mistake, they should shoot their foot, not me!
>>
>> Dude, it's not cool to cause users' machine to fail boot because you
>> want bug report.  You don't do that.  There are other ways to achieve
>> that.  When the kernel can't make all hotpluggable nodes hotpluggable
>> (I mean, it's not necessarily node aligned to begin with), generate
>> warning and a debug dump with appropriate log levels.
>>
>> If you think causing users' machine fail boot indetermistically is
>> acceptable, you really shouldn't be maintaining anything.  What is
>> this?  Are you nuts?
>
> This is doubly idiotic because this is all early boot.  Most users
> don't even have a way to access the debug info if the machine crashes
> that early.  Developement convenience is something that we consider
> too but, seriously, users come first.  This is not your personal
> playground.  Don't frigging crash if you have any other option.

Again, what is best depends on the purpose and the goal. If someone specifies
that hotplugging be enabled, they are sure they need it. No fallback
achieves their goal. Their goal is not booting. If they don't have a sufficient
machine to achieve their goal, we have only one way: tell them that.
If we had an alternative way, I might give another answer.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15  1:38                                     ` KOSAKI Motohiro
@ 2013-08-15  1:51                                       ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-15  1:51 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Wed, Aug 14, 2013 at 09:38:12PM -0400, KOSAKI Motohiro wrote:
> Just as you think my position makes no sense, I think yours makes no sense. So please
> stop using emotional words. That doesn't help the discussion progress.

Would you then please stop making nonsense assertions like "the
fundamental rule here is to crash"?  You could have started the whole
thread with "I'm not sure about the failure mode, it can be better to
hard fail because ..." and we could have debated on the details.
Instead I now have to break the nonsense assertion.  Of course the
tension is way higher.

> If the user were you, I would agree. But I know users don't react that way.

Yeah, users react super well to machines failing boot without any way
to know what's going on.  How is that a good idea?

> Again, there is no perfect solution if an admin is truly stupid. We can only
> suggest "you are wrong, not the kernel", but nothing further. I'm sure kernel
> logging alone doesn't help because they don't read it and they say nobody reads such

There are things like automated reporting.  The system is trying to
use hotplug, right?  It would have associated tools to do that, wouldn't
it?  If you want to support it, build sensible tools and conventions
around it, and given how specialized / high-end the whole thing is, it
shouldn't be hard either.

> plentiful, developer-oriented messages. I may accept some strong notification, but
> I still don't think it's worth it. The only sane way is for an admin to realize their
> mistake and fix it themselves.

Yes, we'll show them who's the boss.  No, this is not how things are
done in the kernel.  We don't crash to give admins a lesson.  Do you even
realize that this isn't completely deterministic?  The machine might
boot fine one time and fail the next time.  What lesson would that
teach the admin?  Stay away from Linux?

> Huh? No fallback means no additional code. I can't imagine how no code adds runtime overhead.

What fallback are you talking about?  You need to report hotpluggable
nodes somehow anyway.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15  1:44                                       ` KOSAKI Motohiro
@ 2013-08-15  2:22                                         ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-15  2:22 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Tang Chen, Tang Chen, robert.moore, lv.zheng, rjw, lenb, tglx,
	mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Wed, Aug 14, 2013 at 09:44:19PM -0400, KOSAKI Motohiro wrote:
> >This is doubly idiotic because this is all early boot.  Most users
> >don't even have a way to access the debug info if the machine crashes
> >that early.  Development convenience is something that we consider
> >too but, seriously, users come first.  This is not your personal
> >playground.  Don't frigging crash if you have any other option.
> 
> Again, the best choice depends on the purpose and the goal. If someone explicitly
> enables hotplugging, they are sure they need it. No fallback
> achieves their goal. Their goal is not merely booting. If their machine is not
> sufficient to achieve their goal, we have only one option: tell them so.

Yes, you go and tell them with the blank screen.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 5/7] memblock, mem_hotplug: Make memblock skip hotpluggable regions by default.
  2013-08-14 21:54     ` Naoya Horiguchi
@ 2013-08-15  5:15       ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-15  5:15 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: robert.moore, lv.zheng, rjw, lenb, tglx, mingo, hpa, akpm, tj,
	trenn, yinghai, jiang.liu, wency, laijs, isimatu.yasuaki,
	izumi.taku, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, yanghy, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On 08/15/2013 05:54 AM, Naoya Horiguchi wrote:
> On Thu, Aug 08, 2013 at 06:16:17PM +0800, Tang Chen wrote:
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
> ...
>> @@ -719,6 +723,10 @@ void __init_memblock __next_free_mem_range_rev(u64 *idx, int nid,
>>   		if (nid != MAX_NUMNODES && nid != memblock_get_region_node(m))
>>   			continue;
>>
>> +		/* skip hotpluggable memory regions */
>> +		if (m->flags & MEMBLOCK_HOTPLUG)
>> +			continue;
>> +
>>   		/* scan areas before each reservation for intersection */
>>   		for ( ; ri >= 0; ri--) {
>>   			struct memblock_region *r = &rsv->regions[ri];
>> -- 
> 
> Why don't you add this also in __next_free_mem_range()?

Hi Naoya,

__next_free_mem_range_rev() is for for_each_free_mem_range_reverse(), which is
only called in memblock_find_in_range_node().

But I think __next_free_mem_range() is for for_each_free_mem_range(), which is
called by many others. These callers may have nothing to do with memory hotplug,
so I didn't add the check there.

Maybe adding the check here is not good. I'm trying to find somewhere to
check MEMBLOCK_HOTPLUG.
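
As a rough illustration of the check discussed above, here is a minimal userspace sketch in C of a reverse region walk that skips hotpluggable regions. The struct, the flag value, and the function name are hypothetical stand-ins, not the kernel's memblock definitions, and reserved ranges are ignored for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel flag; the value is an assumption. */
#define SKETCH_MEMBLOCK_HOTPLUG 0x1u

/* Simplified stand-in for a memblock region (not the kernel struct). */
struct sketch_region {
	uint64_t base;
	uint64_t size;
	unsigned int flags;
};

/*
 * Walk the regions from the highest index to the lowest and return the
 * index of the first region that is not hotpluggable and is large enough
 * to hold `size` bytes. Returns -1 if no such region exists.
 */
static int sketch_find_top_down(const struct sketch_region *regions,
				size_t count, uint64_t size)
{
	for (size_t i = count; i-- > 0; ) {
		const struct sketch_region *m = &regions[i];

		/* skip hotpluggable memory regions */
		if (m->flags & SKETCH_MEMBLOCK_HOTPLUG)
			continue;

		if (m->size >= size)
			return (int)i;
	}
	return -1;
}
```

With regions ordered by address, a caller gets the highest non-hotpluggable region that fits, which is roughly the effect the patch aims for in the reverse iterator.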

Thanks. :)

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-13 22:33                 ` Yinghai Lu
@ 2013-08-15  8:42                   ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-15  8:42 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Tejun Heo, Tang Chen, H. Peter Anvin, Bob Moore, Lv Zheng,
	Rafael J. Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis

On 08/14/2013 06:33 AM, Yinghai Lu wrote:
......
>
>>     init_mem_mapping()
>
> Now we go top-down, so the initial page tables are in BRK, and the other page tables
> are near the top!

Hi yinghai, tj,

About the page table, the current logic is to use BRK to map the highest range
of memory, and then use the mapped range to map the remaining ranges, downwards.

In alloc_low_pages():
		ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
					max_pfn_mapped << PAGE_SHIFT,
					PAGE_SIZE * num , PAGE_SIZE);
		......
		pfn = ret >> PAGE_SHIFT;
		......
	return __va(pfn << PAGE_SHIFT);

So if we want to allocate page tables near the kernel image, we have to do
the following:

1. Use BRK to map a range near the kernel image; let's call it range X.
2. Calculate how much memory is needed to map all the memory, let's say Y Bytes.
    Use range X to map at least Y Bytes of memory near the kernel image.
3. Use the mapped memory to map all the rest memory.
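
To give a feel for the size of Y in step 2, here is a minimal sketch in C of the arithmetic. It assumes x86_64 4-level paging with 4 KiB mappings only, 512 entries per table, and a range that starts at an aligned address; the names are hypothetical. A real kernel would use 2 MiB or 1 GiB mappings where possible, so this estimate is pessimistic.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL
#define SKETCH_ENTRIES   512ULL   /* entries per page table on x86_64 */

/* Round-up integer division. */
static uint64_t sketch_div_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/*
 * Estimate the bytes of page tables needed to direct-map `mem_bytes`
 * using only 4 KiB pages. Counts PTE, PMD and PUD tables; the top-level
 * table is assumed to exist already and is not counted.
 */
static uint64_t sketch_pgtable_bytes(uint64_t mem_bytes)
{
	uint64_t pages      = sketch_div_up(mem_bytes, SKETCH_PAGE_SIZE);
	uint64_t pte_tables = sketch_div_up(pages, SKETCH_ENTRIES);
	uint64_t pmd_tables = sketch_div_up(pte_tables, SKETCH_ENTRIES);
	uint64_t pud_tables = sketch_div_up(pmd_tables, SKETCH_ENTRIES);

	return (pte_tables + pmd_tables + pud_tables) * SKETCH_PAGE_SIZE;
}
```

Under these assumptions, mapping 1 GiB needs 512 PTE tables plus one PMD and one PUD table, i.e. 514 pages of page tables, or about 2 MiB.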

Does this sound OK to you guys ?

Thanks.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15  8:42                   ` Tang Chen
@ 2013-08-15 12:19                     ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-15 12:19 UTC (permalink / raw)
  To: Tang Chen
  Cc: Yinghai Lu, Tang Chen, H. Peter Anvin, Bob Moore, Lv Zheng,
	Rafael J. Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel

On Thu, Aug 15, 2013 at 04:42:35PM +0800, Tang Chen wrote:
> 1. Use BRK to map a range near kernel image, let's call it range X.
> 2. Calculate how much memory needed to map all the memory, let's say
> Y Bytes.
>    Use range X to map at least Y Bytes memory near kernel image.
> 3. Use the mapped memory to map all the rest memory.
> 
> Does this sound OK to you guys ?

Other than where we put the page table, the rest is the same, right?
Please note that we do want the pagetable at a high address by default,
so this should be an optional behavior dependent on the hotplug boot
param.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15 12:19                     ` Tejun Heo
@ 2013-08-15 12:44                       ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-15 12:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yinghai Lu, Tang Chen, H. Peter Anvin, Bob Moore, Lv Zheng,
	Rafael J. Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis

On 08/15/2013 08:19 PM, Tejun Heo wrote:
> On Thu, Aug 15, 2013 at 04:42:35PM +0800, Tang Chen wrote:
>> 1. Use BRK to map a range near kernel image, let's call it range X.
>> 2. Calculate how much memory needed to map all the memory, let's say
>> Y Bytes.
>>     Use range X to map at least Y Bytes memory near kernel image.
>> 3. Use the mapped memory to map all the rest memory.
>>
>> Does this sound OK to you guys ?
>
> Other than where we put the page table, the rest is the same, right?
> Please note that we do want the pagetable in high address by default,
> so this should be an optional behavior dependent on the hotplug boot
> param.
>

Yes, the new behavior should be controlled by a boot option.

I wanted to ask for some comments on the above solution. When I was
coding, I found that changing the behavior of pagetable initialization
could be ugly. We would have to start from the middle of memory, where
the kernel image is, rather than from the end of memory.

I'm not asking for detailed comments; just judging from the description,
is it acceptable?

BTW, the rest is easier to deal with.

Thanks.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15 12:44                       ` Tang Chen
@ 2013-08-15 12:49                         ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-15 12:49 UTC (permalink / raw)
  To: Tang Chen
  Cc: Yinghai Lu, Tang Chen, H. Peter Anvin, Bob Moore, Lv Zheng,
	Rafael J. Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel

Hello,

On Thu, Aug 15, 2013 at 08:44:49PM +0800, Tang Chen wrote:
> I wanted to ask for some comment to the above solution. When I was
> coding, I found that change the behavior of pagetable initialization
> could be ugly. We should start from the middle of the memory, where
> the kernel image is, but not the end of memory.
> 
> I'm not asking for any detail comment, just seeing from the description,
> is it acceptable ?

Hmmm... I can't really tell without knowing how ugly it gets and why.
Do you mind posting a draft patch so that we can have better context?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15 12:49                         ` Tejun Heo
@ 2013-08-15 12:52                           ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-15 12:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yinghai Lu, Tang Chen, H. Peter Anvin, Bob Moore, Lv Zheng,
	Rafael J. Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis

On 08/15/2013 08:49 PM, Tejun Heo wrote:
> Hello,
>
> On Thu, Aug 15, 2013 at 08:44:49PM +0800, Tang Chen wrote:
>> I wanted to ask for some comment to the above solution. When I was
>> coding, I found that change the behavior of pagetable initialization
>> could be ugly. We should start from the middle of the memory, where
>> the kernel image is, but not the end of memory.
>>
>> I'm not asking for any detail comment, just seeing from the description,
>> is it acceptable ?
>
> Hmmm... I can't really tell without knowing how ugly it gets and why.
> Do you mind posting a draft patch so that we can have better context?

Sure. I'll try my best to post a draft patch tomorrow.

Thanks. :)

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15  8:42                   ` Tang Chen
@ 2013-08-15 14:35                     ` Yinghai Lu
  -1 siblings, 0 replies; 165+ messages in thread
From: Yinghai Lu @ 2013-08-15 14:35 UTC (permalink / raw)
  To: Tang Chen, H. Peter Anvin
  Cc: Tejun Heo, Tang Chen, Bob Moore, Lv Zheng, Rafael J. Wysocki,
	Len Brown, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Thomas Renninger, Jiang Liu, Wen Congyang, Lai Jiangshan,
	Yasuaki Ishimatsu, Taku Izumi, Mel Gorman, Minchan Kim, mina86,
	gong.chen, Vasilis Liaskovitis, lwoodman, Rik van Riel,
	jweiner@redhat.com

On Thu, Aug 15, 2013 at 1:42 AM, Tang Chen <tangchen@cn.fujitsu.com> wrote:

> So if we want to allocate page tables near the kernel image, we have to do
> the following:
>
> 1. Use BRK to map a range near kernel image, let's call it range X.
> 2. Calculate how much memory needed to map all the memory, let's say Y
> Bytes.
>    Use range X to map at least Y Bytes memory near kernel image.
> 3. Use the mapped memory to map all the rest memory.
>
> Does this sound OK to you guys ?

Oh, no.
We just got rid of pre-calculating the buffer size for page tables.

Yinghai

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15 12:44                       ` Tang Chen
@ 2013-08-15 14:37                         ` Yinghai Lu
  -1 siblings, 0 replies; 165+ messages in thread
From: Yinghai Lu @ 2013-08-15 14:37 UTC (permalink / raw)
  To: Tang Chen
  Cc: Tejun Heo, Tang Chen, H. Peter Anvin, Bob Moore, Lv Zheng,
	Rafael J. Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel, jwe

On Thu, Aug 15, 2013 at 5:44 AM, Tang Chen <tangchen@cn.fujitsu.com> wrote:

> Yes, the new behavior should be controlled by boot option.

No, we should avoid a boot option.

A unified code path lets the hotplug case use the same code as the normal case,
and gives the code more test coverage.

Yinghai

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15 14:37                         ` Yinghai Lu
@ 2013-08-15 14:45                           ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-15 14:45 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Tang Chen, Tang Chen, H. Peter Anvin, Bob Moore, Lv Zheng,
	Rafael J. Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel

Hello, Yinghai.

On Thu, Aug 15, 2013 at 07:37:59AM -0700, Yinghai Lu wrote:
> On Thu, Aug 15, 2013 at 5:44 AM, Tang Chen <tangchen@cn.fujitsu.com> wrote:
> 
> > Yes, the new behavior should be controlled by boot option.
> 
> No, should avoid boot option.

It's suboptimal behavior chosen as a trade-off to enable hotplug
support, and it shouldn't be the default, just as node data and page
tables should be allocated on the same node by default.
Why would we allocate kernel page tables in low memory by default?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15 14:45                           ` Tejun Heo
@ 2013-08-15 15:05                             ` Yinghai Lu
  -1 siblings, 0 replies; 165+ messages in thread
From: Yinghai Lu @ 2013-08-15 15:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, Tang Chen, H. Peter Anvin, Bob Moore, Lv Zheng,
	Rafael J. Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel

On Thu, Aug 15, 2013 at 7:45 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Yinghai.
>
> On Thu, Aug 15, 2013 at 07:37:59AM -0700, Yinghai Lu wrote:
>> On Thu, Aug 15, 2013 at 5:44 AM, Tang Chen <tangchen@cn.fujitsu.com> wrote:
>>
>> > Yes, the new behavior should be controlled by boot option.
>>
>> No, should avoid boot option.
>
> It's suboptimal behavior which is chosen as trade-off to enable
> hotplug support and shouldn't be the default behavior just like node
> data and page table should be allocated on the same node by default.
> Why would we allocate kernel page table in low memory be default?

That is what my patchset wants to do:
put page tables on the same node, like node data.
With that, the hotplug and normal cases will share the same code path.

Yinghai

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15 15:05                             ` Yinghai Lu
@ 2013-08-15 15:10                               ` Tejun Heo
  -1 siblings, 0 replies; 165+ messages in thread
From: Tejun Heo @ 2013-08-15 15:10 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Tang Chen, Tang Chen, H. Peter Anvin, Bob Moore, Lv Zheng,
	Rafael J. Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel

On Thu, Aug 15, 2013 at 08:05:38AM -0700, Yinghai Lu wrote:
> > It's suboptimal behavior which is chosen as trade-off to enable
> > hotplug support and shouldn't be the default behavior just like node
> > data and page table should be allocated on the same node by default.
> > Why would we allocate kernel page table in low memory be default?
> 
> That is what my patchset want to do.
> put page tables on the same node like node data.
> with that, hotplug and normal case will be the same code path.

Yeah, sure, when that works, that can be the default and only
behavior.  Right now, we do want a switch to control that, right?  I'm
not sure we have a good choice that could serve as the only
behavior for kernel page tables.  Maybe we can implement some
heuristics to decide whether there's enough lowmem, but given how niche
memory hotplug is, at least for now, that feels like overkill.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-14  1:22                   ` Tang Chen
@ 2013-08-15 19:06                     ` Toshi Kani
  -1 siblings, 0 replies; 165+ messages in thread
From: Toshi Kani @ 2013-08-15 19:06 UTC (permalink / raw)
  To: Tang Chen
  Cc: Yinghai Lu, Tejun Heo, Tang Chen, H. Peter Anvin, Bob Moore,
	Lv Zheng, Rafael J. Wysocki, Len Brown, Thomas Gleixner,
	Ingo Molnar, Andrew Morton, Thomas Renninger, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi,
	Mel Gorman, Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis,
	lwoodman, Rik

On Wed, 2013-08-14 at 09:22 +0800, Tang Chen wrote:
> On 08/14/2013 06:33 AM, Yinghai Lu wrote:
> ......
> >
> >>     relocate_initrd()
> >
> > size could be very big, like several hundreds mega bytes.
> > should be anywhere, but will be freed after booting.
> >
> > ===>  so we should not limit it to near kernel range.
> >
> >>     acpi_initrd_override()
> >
> > should be 64 * 10 about 1M.
> >
> >>     reserve_crashkernel()
> >
> > could be under 4G, or above 4G.
> > size could be 512M or 8G whatever.
> >
> > looks like
> > should move down relocated_initrd and reserve_crashkernel.
> 
> OK, will try to do this.
> 
> Thank you for the explanation. :)

So, we still need reordering, and this adds a new requirement that all earlier
allocations must be small...

I think the root of this issue is that the ACPI init point is not early
enough in the boot sequence.  If it were already much earlier, the whole
thing would have been very simple.  We are now trying to work around this
issue in the memblock code (which itself is a fine idea), but this ACPI
issue still remains and similar issues may come up again in the future.

For instance, the ACPI SPCR/DBGP/DBG2 tables allow the OS to initialize
serial console/debug ports at early boot time.  The earlier they can be
initialized, the better this feature will be.  These tables are not
currently used by Linux due to a licensing issue, but that could be
addressed some time soon.  As platforms become more complex & legacy
free, the need for ACPI tables will increase.

I think moving the ACPI init point earlier is a good direction.

Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 165+ messages in thread

* RE: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15 15:05                             ` Yinghai Lu
@ 2013-08-15 19:08                               ` Luck, Tony
  -1 siblings, 0 replies; 165+ messages in thread
From: Luck, Tony @ 2013-08-15 19:08 UTC (permalink / raw)
  To: Yinghai Lu, Tejun Heo
  Cc: Tang Chen, Tang Chen, H. Peter Anvin, Moore, Robert, Zheng, Lv,
	Rafael J. Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel

> That is what my patchset want to do.
> put page tables on the same node like node data.
> with that, hotplug and normal case will be the same code path.

Page tables are a big issue if we have 4K mappings (8 byte entry per
4K page means 2MB of page tables per GB of memory) ... but only
used for DEBUG cases, right?

If we use 2M mappings, then allocations are 512x smaller - so only
4K per GB - hard to justify spreading that across nodes.

If we can use 1GB mappings - then another 512x reduction to 8 bytes per GB (or 8KB per TB)
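
The arithmetic above can be checked with a small C sketch. It assumes x86-64 style 8-byte entries and counts only the lowest-level entries for each mapping size; the name leaf_entry_bytes is made up for this illustration.

```c
#include <stdint.h>

#define ENTRY_BYTES 8ULL          /* assumed size of one page-table entry */
#define GIB         (1ULL << 30)

/*
 * Bytes of lowest-level entries needed to map `mem_bytes` of memory
 * using mappings of `map_bytes` each (4 KiB, 2 MiB or 1 GiB).
 * Upper-level tables are ignored here, as is the fact that tables are
 * allocated in whole 4 KiB pages.
 */
uint64_t leaf_entry_bytes(uint64_t mem_bytes, uint64_t map_bytes)
{
    uint64_t entries = (mem_bytes + map_bytes - 1) / map_bytes;  /* round up */

    return entries * ENTRY_BYTES;
}
```

With these assumptions, leaf_entry_bytes(GIB, 4096) is 2 MiB, leaf_entry_bytes(GIB, 2 << 20) is 4 KiB, and leaf_entry_bytes(1024 * GIB, GIB) is 8 KiB, matching the figures above.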

Aren't page structures a bigger issue?  ~64 bytes per 4K page.  Do we
make sure these get allocated from the NUMA node that they describe?
This should not hurt the ZONE_MOVABLE-ness of the node: although they are
kernel structures, they can be freed when the node is removed (at that point
they describe memory that is no longer present). From a scalability perspective
we don't want to run node0 low on memory by using it for every other node.
From a NUMA perspective we want the struct page that describes a page to be
in the same locality as the page itself.
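
To put a number on the node0 concern, here is a minimal C sketch. It assumes 64-byte page structures, 4 KiB pages, and equally sized nodes; the function names are hypothetical.

```c
#include <stdint.h>

#define PAGE_BYTES        4096ULL  /* assumed base page size */
#define STRUCT_PAGE_BYTES 64ULL    /* approximate size of one page structure */

/* Bytes of page structures describing `node_bytes` of memory. */
uint64_t memmap_bytes(uint64_t node_bytes)
{
    return node_bytes / PAGE_BYTES * STRUCT_PAGE_BYTES;
}

/*
 * Memory node0 gives up if it hosts the page structures of all
 * `nr_nodes` equally sized nodes instead of each node hosting its own.
 */
uint64_t node0_cost_centralized(uint64_t node_bytes, unsigned int nr_nodes)
{
    return memmap_bytes(node_bytes) * nr_nodes;
}
```

For example, 8 nodes of 128 GiB each need 2 GiB of page structures per node; hosting all of them on node0 would cost it 16 GiB, while node-local placement costs each node only its own 2 GiB.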

-Tony

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15 19:08                               ` Luck, Tony
@ 2013-08-15 19:34                                 ` Yinghai Lu
  -1 siblings, 0 replies; 165+ messages in thread
From: Yinghai Lu @ 2013-08-15 19:34 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Tejun Heo, Tang Chen, Tang Chen, H. Peter Anvin, Moore, Robert,
	Zheng, Lv, Rafael J. Wysocki, Len Brown, Thomas Gleixner,
	Ingo Molnar, Andrew Morton, Thomas Renninger, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi,
	Mel Gorman, Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis,
	lwoodman@redhat.com

On Thu, Aug 15, 2013 at 12:08 PM, Luck, Tony <tony.luck@intel.com> wrote:
>> That is what my patchset want to do.
>> put page tables on the same node like node data.
>> with that, hotplug and normal case will be the same code path.
>
> Page tables are a big issue if we have 4K mappings (8 byte entry per
> 4K page means 2MB of page tables per GB of memory) ... but only
> used for DEBUG cases, right?

yes.

>
> If we use 2M mappings, then allocations are 512x smaller - so only
> 4K per GB - hard to justify spreading that across nodes.
>
> If we can use 1GB mappings - then another 512x reduction to 8 bytes per GB (or 8KB per TB)

Yes. 4k for 512G.

Just make all cases use the same code path, even DEBUG_PAGEALLOC with 4K page
mappings.

>
> Aren't page structures a bigger issue?  ~64 bytes per 4K page.  Do we
> make sure these get allocated from the NUMA node that they describe?
> This should not hurt the ZONE_MOVEABLE-ness of this - although they are
> kernel structures they can be freed when the node is removed (at that point
> they describe memory that is no longer present). From a scalability perspective
> we don't want to run node0 low on memory by using it for every other node.
> From a NUMA perspective we want the page_t that describes a page to be
> in the same locality as the page itself.

Yes, that is vmemmap, and it is already NUMA-aware.

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15 15:10                               ` Tejun Heo
@ 2013-08-15 19:49                                 ` Toshi Kani
  -1 siblings, 0 replies; 165+ messages in thread
From: Toshi Kani @ 2013-08-15 19:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yinghai Lu, Tang Chen, Tang Chen, H. Peter Anvin, Bob Moore,
	Lv Zheng, Rafael J. Wysocki, Len Brown, Thomas Gleixner,
	Ingo Molnar, Andrew Morton, Thomas Renninger, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi,
	Mel Gorman, Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis,
	lwoodman

On Thu, 2013-08-15 at 11:10 -0400, Tejun Heo wrote:
> On Thu, Aug 15, 2013 at 08:05:38AM -0700, Yinghai Lu wrote:
> > > It's suboptimal behavior which is chosen as trade-off to enable
> > > hotplug support and shouldn't be the default behavior just like node
> > > data and page table should be allocated on the same node by default.
> > > Why would we allocate kernel page table in low memory be default?
> > 
> > That is what my patchset want to do.
> > put page tables on the same node like node data.
> > with that, hotplug and normal case will be the same code path.
> 
> Yeah, sure, when that works, that can be the default and only
> behavior.  Right now, we do want a switch to control that, right?  I'm
> not sure we have a good choice which we can choose as the only
> behavior for kernel page table.  Maybe we can implement some
> heuristics to decide whether there's enough lowmem but given how niche
> memory hotplug is, at least for now, that feels like an overkill.

I think the key point here is that putting page tables on local nodes
also requires reading the ACPI SRAT table earlier.  There seems to be
little point in avoiding this change.

Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15 19:06                     ` Toshi Kani
@ 2013-08-15 20:28                       ` Yinghai Lu
  -1 siblings, 0 replies; 165+ messages in thread
From: Yinghai Lu @ 2013-08-15 20:28 UTC (permalink / raw)
  To: Toshi Kani
  Cc: Tang Chen, Tejun Heo, Tang Chen, H. Peter Anvin, Bob Moore,
	Lv Zheng, Rafael J. Wysocki, Len Brown, Thomas Gleixner,
	Ingo Molnar, Andrew Morton, Thomas Renninger, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi,
	Mel Gorman, Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis,
	lwoodman

On Thu, Aug 15, 2013 at 12:06 PM, Toshi Kani <toshi.kani@hp.com> wrote:
> On Wed, 2013-08-14 at 09:22 +0800, Tang Chen wrote:
>> On 08/14/2013 06:33 AM, Yinghai Lu wrote:
>> ......
>> >
>> >>     relocate_initrd()
>> >
>> > size could be very big, like several hundreds mega bytes.
>> > should be anywhere, but will be freed after booting.
>> >
>> > ===>  so we should not limit it to near kernel range.
>> >
>> >>     acpi_initrd_override()
>> >
>> > should be 64 * 10 about 1M.
>> >
>> >>     reserve_crashkernel()
>> >
>> > could be under 4G, or above 4G.
>> > size could be 512M or 8G whatever.
>> >
>> > looks like
>> > should move down relocated_initrd and reserve_crashkernel.
>>
>> OK, will try to do this.
>>
>> Thank you for the explanation. :)
>
> So, we still need reordering, and put a new requirement that all earlier
> allocations must be small...
>
> I think the root of this issue is that ACPI init point is not early
> enough in the boot sequence.  If it were much earlier already, the whole
> thing would have been very simple.  We are now trying to workaround this
> issue in the mblock code (which itself is a fine idea), but this ACPI
> issue still remains and similar issues may come up again in future.
>
> For instance, ACPI SCPR/DBGP/DBG2 tables allow the OS to initialize
> serial console/debug ports at early boot time.  The earlier it can be
> initialized, the better this feature will be.  These tables are not
> currently used by Linux due to a licensing issue, but it could be
> addressed some time soon.  As platforms becoming more complex & legacy
> free, the needs of ACPI tables will increase.
>
> I think moving up the ACPI init point earlier is a good direction.

Good point.

If we put acpi_initrd_override in BRK, then we can move acpi_boot_table_init()
much earlier.

Yinghai

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15 14:35                     ` Yinghai Lu
@ 2013-08-16  1:16                       ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-16  1:16 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: H. Peter Anvin, Tejun Heo, Tang Chen, Bob Moore, Lv Zheng,
	Rafael J. Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis

On 08/15/2013 10:35 PM, Yinghai Lu wrote:
> On Thu, Aug 15, 2013 at 1:42 AM, Tang Chen<tangchen@cn.fujitsu.com>  wrote:
>
>> So if we want to allocate page tables near the kernelimage, we have to do
>> the following:
>>
>> 1. Use BRK to map a range near kernel image, let's call it range X.
>> 2. Calculate how much memory needed to map all the memory, let's say Y
>> Bytes.
>>     Use range X to map at least Y Bytes memory near kernel image.
>> 3. Use the mapped memory to map all the rest memory.
>>
>> Does this sound OK to you guys ?
>
> oh, no.
> We just get rid of pre-calculate the buffer size for page tables.
>

You mean BRK?  I know that, and I will use up that memory first.
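
For context, the pre-calculation that step 2 describes (and that Yinghai
says was just removed) amounts to counting page-table pages. A rough sketch,
assuming x86-64 4-level paging, a direct mapping that starts at physical
address 0, and a single page size for the whole range. pgtable_bytes() is a
hypothetical helper for illustration, not the kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE (1ULL << 12)	/* every page-table page is 4 KiB */
#define PMD_SPAN   (1ULL << 21)	/* memory covered by one PTE table: 2 MiB */
#define PUD_SPAN   (1ULL << 30)	/* memory covered by one PMD table: 1 GiB */
#define PGD_SPAN   (1ULL << 39)	/* memory covered by one PUD table: 512 GiB */

static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/*
 * Bytes of page tables needed to map [0, mem_bytes), not counting the
 * top-level PGD page. With use_2m_pages set, the mapping uses 2 MiB pages
 * and therefore needs no PTE tables.
 */
uint64_t pgtable_bytes(uint64_t mem_bytes, int use_2m_pages)
{
	uint64_t tables = 0;

	/* Keep the rounding in div_round_up() from overflowing. */
	assert(mem_bytes <= UINT64_MAX - PGD_SPAN);

	if (!use_2m_pages)
		tables += div_round_up(mem_bytes, PMD_SPAN);	/* PTE tables */
	tables += div_round_up(mem_bytes, PUD_SPAN);		/* PMD tables */
	tables += div_round_up(mem_bytes, PGD_SPAN);		/* PUD tables */

	return tables * TABLE_SIZE;
}
```

Under these assumptions, mapping 64 GiB with 2 MiB pages needs 64 PMD tables
plus 1 PUD table, i.e. 65 pages (260 KiB).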

Thanks.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-15 20:28                       ` Yinghai Lu
@ 2013-08-16  2:08                         ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-16  2:08 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Toshi Kani, Tejun Heo, Tang Chen, H. Peter Anvin, Bob Moore,
	Lv Zheng, Rafael J. Wysocki, Len Brown, Thomas Gleixner,
	Ingo Molnar, Andrew Morton, Thomas Renninger, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi,
	Mel Gorman, Minchan Kim, mina86, gong.chen, Vasilis

On 08/16/2013 04:28 AM, Yinghai Lu wrote:
......
>>
>> So, we still need reordering, and put a new requirement that all earlier
>> allocations must be small...
>>
>> I think the root of this issue is that ACPI init point is not early
>> enough in the boot sequence.  If it were much earlier already, the whole
>> thing would have been very simple.  We are now trying to workaround this
>> issue in the mblock code (which itself is a fine idea), but this ACPI
>> issue still remains and similar issues may come up again in future.
>>
>> For instance, ACPI SCPR/DBGP/DBG2 tables allow the OS to initialize
>> serial console/debug ports at early boot time.  The earlier it can be
>> initialized, the better this feature will be.  These tables are not
>> currently used by Linux due to a licensing issue, but it could be
>> addressed some time soon.  As platforms becoming more complex & legacy
>> free, the needs of ACPI tables will increase.
>>
>> I think moving up the ACPI init point earlier is a good direction.
>
> Good point.
>
> If we put acpi_initrd_override in BRK, and can more acpi_boot_table_init()
> much early.

Hi Yinghai, Toshi,

It has been a long time since I brought up this issue, and a lot of
different solutions have come up. No solution is perfect enough for
everyone. I have tried a lot, and most of them failed. But I think most
things cannot be seen clearly without a real patch posted. Many good ideas
came up during patch review.

So I think I'm going to try as many ways as possible.  :)


Parsing SRAT earlier is indeed what I wanted to do from the very beginning.
And now it seems that moving the whole ACPI table installation and overriding
earlier will bring us many more benefits. I have tried this, without moving
up acpi_initrd_override, in my part1 patch-set, but not in the way Yinghai
mentioned above.

Judging from the code, there are 5 pages in BRK for page tables.

   /* need 4 4k for initial PMD_SIZE, 4k for 0-ISA_END_ADDRESS */
   #define INIT_PGT_BUF_SIZE       (5 * PAGE_SIZE)
   RESERVE_BRK(early_pgt_alloc, INIT_PGT_BUF_SIZE);

By "put acpi_initrd_override in BRK", do you mean increasing the BRK size by
default?

Thanks.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-16  2:08                         ` Tang Chen
@ 2013-08-16  4:21                           ` Yinghai Lu
  -1 siblings, 0 replies; 165+ messages in thread
From: Yinghai Lu @ 2013-08-16  4:21 UTC (permalink / raw)
  To: Tang Chen, H. Peter Anvin, Konrad Rzeszutek Wilk
  Cc: Toshi Kani, Tejun Heo, Tang Chen, Bob Moore, Lv Zheng,
	Rafael J. Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi, Mel Gorman,
	Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis, lwoodman,
	Rik van Riel, jwein

On Thu, Aug 15, 2013 at 7:08 PM, Tang Chen <tangchen@cn.fujitsu.com> wrote:
> On 08/16/2013 04:28 AM, Yinghai Lu wrote:
> ......
>>>
>>>
>>> So, we still need reordering, and put a new requirement that all earlier
>>> allocations must be small...
>>>
>>> I think the root of this issue is that ACPI init point is not early
>>> enough in the boot sequence.  If it were much earlier already, the whole
>>> thing would have been very simple.  We are now trying to workaround this
>>> issue in the mblock code (which itself is a fine idea), but this ACPI
>>> issue still remains and similar issues may come up again in future.
>>>
>>> For instance, ACPI SCPR/DBGP/DBG2 tables allow the OS to initialize
>>> serial console/debug ports at early boot time.  The earlier it can be
>>> initialized, the better this feature will be.  These tables are not
>>> currently used by Linux due to a licensing issue, but it could be
>>> addressed some time soon.  As platforms becoming more complex & legacy
>>> free, the needs of ACPI tables will increase.
>>>
>>> I think moving up the ACPI init point earlier is a good direction.
>>
>>
>> Good point.
>>
>> If we put acpi_initrd_override in BRK, and can more acpi_boot_table_init()
>> much early.
...
>
> Parsing SRAT earlier is what I want to do in the very beginning indeed. And
> now, seems that moving the whole acpi table installation and overriding
> earlier
> will bring us much more benefits. I have tried this without moving up
> acpi_initrd_override in my part1 patch-set. But not in the way Yinghai
> mentioned
> above.
...
>
> By "put acpi_initrd_override in BRK", do you mean increase the BRK by
> default ?

Peter,

Do you agree to extending BRK by 256k to hold the copied override ACPI tables?

Then we can find and copy them early in
arch/x86/kernel/head64.c::x86_64_start_kernel() or
arch/x86/kernel/head_32.S.

With that, we can move the ACPI table init as early as possible in setup_arch().
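
A userspace sketch of the idea, assuming the reserved area behaves like a
simple bump allocator: the space is set aside at build time, so tables can be
copied into it before any page tables or memblock allocations exist. The 256k
size comes from the proposal above; brk_area, brk_extend() and
copy_override_table() are hypothetical names for illustration, not the
kernel's RESERVE_BRK()/extend_brk() code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical size: the 256k extension proposed in this thread. */
#define OVERRIDE_BRK_SIZE (256 * 1024)

/* Userspace stand-in for an area reserved in the image at build time. */
static unsigned char brk_area[OVERRIDE_BRK_SIZE];
static size_t brk_used;

/*
 * Bump-allocate size bytes at a power-of-two alignment (measured from the
 * start of the area). Returns NULL once the reserved area is exhausted;
 * nothing is ever freed.
 */
void *brk_extend(size_t size, size_t align)
{
	size_t start;

	assert(align != 0 && (align & (align - 1)) == 0);

	start = (brk_used + align - 1) & ~(align - 1);
	if (start > sizeof(brk_area) || size > sizeof(brk_area) - start)
		return NULL;

	brk_used = start + size;
	return brk_area + start;
}

/* Copy one override table into the reserved area; NULL if it does not fit. */
void *copy_override_table(const void *table, size_t len)
{
	void *dst = brk_extend(len, 16);

	if (dst != NULL)
		memcpy(dst, table, len);
	return dst;
}
```

The trade-off in such a scheme is that the area is a fixed cost in every
kernel image, whether or not any override tables are present.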

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-16  4:21                           ` Yinghai Lu
@ 2013-08-19  3:07                             ` Tang Chen
  -1 siblings, 0 replies; 165+ messages in thread
From: Tang Chen @ 2013-08-19  3:07 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: H. Peter Anvin, Konrad Rzeszutek Wilk, Toshi Kani, Tejun Heo,
	Tang Chen, Bob Moore, Lv Zheng, Rafael J. Wysocki, Len Brown,
	Thomas Gleixner, Ingo Molnar, Andrew Morton, Thomas Renninger,
	Jiang Liu, Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu,
	Taku Izumi, Mel Gorman, Minchan Kim, mina86, gong.chen

On 08/16/2013 12:21 PM, Yinghai Lu wrote:
......
>> By "put acpi_initrd_override in BRK", do you mean increase the BRK by
>> default ?
>
> Peter,
>
> Do you agree on extending BRK 256k to put copied override acpi tables?
>
> then we can find and copy them early in
> arch/x86/kernel/head64.c::x86_64_start_kernel() or
> arch/x86/kernel/head_32.S.

Hi Yinghai,

If we use BRK to store the ACPI tables, we don't need to set up page tables.
If we do acpi_initrd_override() in setup_arch(), after early_ioremap is
available, we don't need to split it into find & copy. It would be much
easier.

Would you agree to doing acpi_initrd_override() in setup_arch()?  Is that
too late for Xen?

Thanks.


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE.
  2013-08-19  3:07                             ` Tang Chen
@ 2013-08-19  3:28                               ` Yinghai Lu
  -1 siblings, 0 replies; 165+ messages in thread
From: Yinghai Lu @ 2013-08-19  3:28 UTC (permalink / raw)
  To: Tang Chen
  Cc: H. Peter Anvin, Konrad Rzeszutek Wilk, Toshi Kani, Tejun Heo,
	Tang Chen, Bob Moore, Lv Zheng, Rafael J. Wysocki, Len Brown,
	Thomas Gleixner, Ingo Molnar, Andrew Morton, Thomas Renninger,
	Jiang Liu, Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu,
	Taku Izumi, Mel Gorman, Minchan Kim, mina86, gong.chen,
	Vasilis Liaskovitis

On Sun, Aug 18, 2013 at 8:07 PM, Tang Chen <tangchen@cn.fujitsu.com> wrote:
> On 08/16/2013 12:21 PM, Yinghai Lu wrote:
> ......
>
>>> By "put acpi_initrd_override in BRK", do you mean increase the BRK by
>>> default ?
>>
>>
>> Peter,
>>
>> Do you agree on extending BRK 256k to put copied override acpi tables?
>>
>> then we can find and copy them early in
>> arch/x86/kernel/head64.c::x86_64_start_kernel() or
>> arch/x86/kernel/head_32.S.
>
>
> Hi Yinghai,
>
> If we use BRK to store acpi tables, we don't need to setup page tables.
> If we do acpi_initrd_override() in setup_arch(), after early_ioremap is
> available, we don't need to split it into find & copy. It would be much
> easier.

We don't need to use early_ioremap if acpi_initrd_override() is called in
arch/x86/kernel/head64.c::x86_64_start_kernel() or arch/x86/kernel/head_32.S.

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 165+ messages in thread

end of thread, other threads:[~2013-08-19  3:28 UTC | newest]

Thread overview: 165+ messages
2013-08-08 10:16 [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE Tang Chen
2013-08-08 10:16 ` [PATCH part5 1/7] x86: get pg_data_t's memory from other node Tang Chen
2013-08-12 14:39   ` Tejun Heo
2013-08-12 15:12     ` Tang Chen
2013-08-08 10:16 ` [PATCH part5 2/7] x86, numa, mem_hotplug: Skip all the regions the kernel resides in Tang Chen
2013-08-08 10:16 ` [PATCH part5 3/7] memblock, numa: Introduce flag into memblock Tang Chen
2013-08-08 10:16 ` [PATCH part5 4/7] memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions Tang Chen
2013-08-08 10:16 ` [PATCH part5 5/7] memblock, mem_hotplug: Make memblock skip hotpluggable regions by default Tang Chen
2013-08-14 21:54   ` Naoya Horiguchi
2013-08-15  5:15     ` Tang Chen
2013-08-08 10:16 ` [PATCH part5 6/7] mem-hotplug: Introduce movablenode boot option to {en|dis}able using SRAT Tang Chen
2013-08-08 10:16 ` [PATCH part5 7/7] x86, numa, acpi, memory-hotplug: Make movablenode have higher priority Tang Chen
2013-08-09 16:32 ` [PATCH part5 0/7] Arrange hotpluggable memory as ZONE_MOVABLE Tejun Heo
2013-08-12  6:33   ` Tang Chen
2013-08-12  8:54   ` Tang Chen
2013-08-12 14:50 ` Tejun Heo
2013-08-12 15:14   ` H. Peter Anvin
2013-08-12 15:23     ` Tejun Heo
2013-08-12 16:29       ` Tang Chen
2013-08-12 16:46         ` Tejun Heo
2013-08-12 18:23           ` Tang Chen
2013-08-12 20:20             ` Tejun Heo
2013-08-12 20:49               ` Luck, Tony
2013-08-12 20:54                 ` Tejun Heo
2013-08-12 20:57                   ` H. Peter Anvin
2013-08-12 21:06                     ` Yinghai Lu
2013-08-12 21:08                       ` Tejun Heo
2013-08-12 21:12                         ` H. Peter Anvin
2013-08-12 21:14                           ` Tejun Heo
2013-08-12 21:11                       ` H. Peter Anvin
2013-08-12 21:11                   ` Luck, Tony
2013-08-12 21:25                     ` Yinghai Lu
2013-08-12 21:28                       ` H. Peter Anvin
2013-08-13  5:14                     ` H. Peter Anvin
2013-08-13  6:14           ` Tang Chen
2013-08-13  9:56             ` Tang Chen
2013-08-13 14:38               ` Tejun Heo
2013-08-13 22:33               ` Yinghai Lu
2013-08-14  1:22                 ` Tang Chen
2013-08-15 19:06                   ` Toshi Kani
2013-08-15 20:28                     ` Yinghai Lu
2013-08-16  2:08                       ` Tang Chen
2013-08-16  4:21                         ` Yinghai Lu
2013-08-19  3:07                           ` Tang Chen
2013-08-19  3:28                             ` Yinghai Lu
2013-08-15  8:42                 ` Tang Chen
2013-08-15 12:19                   ` Tejun Heo
2013-08-15 12:44                     ` Tang Chen
2013-08-15 12:49                       ` Tejun Heo
2013-08-15 12:52                         ` Tang Chen
2013-08-15 14:37                       ` Yinghai Lu
2013-08-15 14:45                         ` Tejun Heo
2013-08-15 15:05                           ` Yinghai Lu
2013-08-15 15:10                             ` Tejun Heo
2013-08-15 19:49                               ` Toshi Kani
2013-08-15 19:08                             ` Luck, Tony
2013-08-15 19:34                               ` Yinghai Lu
2013-08-15 19:34                                 ` Yinghai Lu
2013-08-15 14:35                   ` Yinghai Lu
2013-08-15 14:35                     ` Yinghai Lu
2013-08-16  1:16                     ` Tang Chen
2013-08-16  1:16                       ` Tang Chen
2013-08-12 15:41   ` Tang Chen
2013-08-12 15:41     ` Tang Chen
2013-08-12 15:46     ` Tejun Heo
2013-08-12 15:46       ` Tejun Heo
2013-08-12 16:19       ` Tang Chen
2013-08-12 16:19         ` Tang Chen
2013-08-12 16:22         ` Tejun Heo
2013-08-12 16:22           ` Tejun Heo
2013-08-12 17:01           ` Tang Chen
2013-08-12 17:01             ` Tang Chen
2013-08-12 17:23             ` H. Peter Anvin
2013-08-12 17:23               ` H. Peter Anvin
2013-08-14 18:22               ` KOSAKI Motohiro
2013-08-14 18:22                 ` KOSAKI Motohiro
2013-08-12 18:07             ` Tejun Heo
2013-08-12 18:07               ` Tejun Heo
2013-08-14 18:15               ` KOSAKI Motohiro
2013-08-14 18:15                 ` KOSAKI Motohiro
2013-08-14 18:23                 ` Tejun Heo
2013-08-14 18:23                   ` Tejun Heo
2013-08-14 19:40                   ` KOSAKI Motohiro
2013-08-14 19:40                     ` KOSAKI Motohiro
2013-08-14 19:55                     ` Tejun Heo
2013-08-14 19:55                       ` Tejun Heo
2013-08-14 20:29                       ` KOSAKI Motohiro
2013-08-14 20:29                         ` KOSAKI Motohiro
2013-08-14 20:30                         ` H. Peter Anvin
2013-08-14 20:30                           ` H. Peter Anvin
2013-08-14 20:35                         ` Tejun Heo
2013-08-14 20:35                           ` Tejun Heo
2013-08-14 21:17                           ` KOSAKI Motohiro
2013-08-14 21:17                             ` KOSAKI Motohiro
2013-08-14 21:36                             ` Tejun Heo
2013-08-14 21:36                               ` Tejun Heo
2013-08-15  1:08                               ` KOSAKI Motohiro
2013-08-15  1:08                                 ` KOSAKI Motohiro
2013-08-15  1:21                                 ` Tejun Heo
2013-08-15  1:21                                   ` Tejun Heo
2013-08-15  1:33                                   ` Tejun Heo
2013-08-15  1:33                                     ` Tejun Heo
2013-08-15  1:44                                     ` KOSAKI Motohiro
2013-08-15  1:44                                       ` KOSAKI Motohiro
2013-08-15  2:22                                       ` Tejun Heo
2013-08-15  2:22                                         ` Tejun Heo
2013-08-15  1:38                                   ` KOSAKI Motohiro
2013-08-15  1:38                                     ` KOSAKI Motohiro
2013-08-15  1:51                                     ` Tejun Heo
2013-08-15  1:51                                       ` Tejun Heo
