* [PATCH part1 v6 0/6] x86, memblock: Allocate memory near kernel image before SRAT parsed
@ 2013-10-04  1:56 ` Zhang Yanfei
  0 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-04  1:56 UTC (permalink / raw)
  To: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
	Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
	Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
	Rik van Riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, Linux MM, linux-acpi, imtangchen,
	Zhang Yanfei, Tang Chen

Hello, here is the v6 version. Any comments are welcome!

The v6 version is based on linus's tree (3.12-rc3)
HEAD is:
commit 15c03dd4859ab16f9212238f29dd315654aa94f6
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Sun Sep 29 15:02:38 2013 -0700

    Linux 3.12-rc3


[Problem]

Linux currently cannot migrate pages used by the kernel because
of the kernel direct mapping. In Linux kernel space, va = pa + PAGE_OFFSET.
When the pa changes, we cannot simply update the page table and
keep the va unmodified. So kernel pages are not migratable.
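
To illustrate the va = pa + PAGE_OFFSET relation, here is a minimal user-space
sketch. The demo_* names and the DEMO_PAGE_OFFSET value are assumptions made
for this example only; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative base of the direct mapping; the value is an assumption. */
#define DEMO_PAGE_OFFSET 0xffff880000000000ULL

/* Direct mapping: every physical address is visible at one fixed offset. */
uint64_t demo_pa_to_va(uint64_t pa)
{
	/* reject physical addresses that would wrap around */
	assert(pa <= UINT64_MAX - DEMO_PAGE_OFFSET);
	return pa + DEMO_PAGE_OFFSET;
}

/* Inverse translation, valid only for addresses inside the direct mapping. */
uint64_t demo_va_to_pa(uint64_t va)
{
	assert(va >= DEMO_PAGE_OFFSET);
	return va - DEMO_PAGE_OFFSET;
}
```

Because the translation is pure arithmetic, moving a page to a different pa
necessarily changes its va, so every pointer into that page would have to be
found and fixed.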

There are also other issues that make kernel pages non-migratable.
For example, a physical address may be cached somewhere and used later.
It is not feasible to update all of these caches.

When doing memory hotplug in Linux, we first migrate all the pages in one
memory device somewhere else, and then remove the device. But if pages are
used by the kernel, they are not migratable. As a result, memory used by
the kernel cannot be hot-removed.

Modifying the kernel direct mapping mechanism is too difficult. It may
also hurt kernel performance and stability. So we do memory hotplug in
the following way.


[What we are doing]

In Linux, memory in one NUMA node is divided into several zones. One of the
zones is ZONE_MOVABLE, which the kernel won't use for its own (unmovable)
allocations.

In order to implement memory hotplug in Linux, we are going to arrange all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this memory.
To do this, we need ACPI's help.

In ACPI, SRAT (System Resource Affinity Table) contains NUMA info. The memory
affinity entries in SRAT record every memory range in the system, along with
a flag specifying whether the range is hotpluggable.
(Please refer to ACPI spec 5.0, section 5.2.16.)
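
A sketch of how the hotpluggable flag could be tested. The struct below is a
simplified, hypothetical stand-in for a SRAT memory affinity entry (it is not
the table's binary layout), and the demo_* names are invented for this example.
The bit positions follow the flags field of the Memory Affinity Structure in
the ACPI spec (bit 0: enabled, bit 1: hot pluggable).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SRAT_MEM_ENABLED       (1u << 0)	/* entry is valid */
#define DEMO_SRAT_MEM_HOT_PLUGGABLE (1u << 1)	/* range can be hot-removed */

/* Simplified view of one memory affinity entry (hypothetical layout). */
struct demo_mem_affinity {
	uint32_t proximity_domain;	/* NUMA node the range belongs to */
	uint64_t base;			/* physical start address */
	uint64_t length;		/* size in bytes */
	uint32_t flags;
};

/* A range counts as hotpluggable only if the entry is enabled as well. */
bool demo_range_is_hotpluggable(const struct demo_mem_affinity *ma)
{
	assert(ma != NULL);
	if (!(ma->flags & DEMO_SRAT_MEM_ENABLED))
		return false;
	return (ma->flags & DEMO_SRAT_MEM_HOT_PLUGGABLE) != 0;
}
```

Ranges for which such a check returns true are the ones we want to keep away
from the kernel and place in ZONE_MOVABLE.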

With the help of SRAT, we have to do the following two things to achieve our
goal:

1. When doing memory hot-add, allow users to arrange hotpluggable memory as
   ZONE_MOVABLE.
   (This has been done by the MOVABLE_NODE functionality in Linux.)

2. When the system is booting, prevent the bootmem allocator from allocating
   hotpluggable memory for the kernel before the memory initialization
   finishes.

Problem 2 is the key problem we are going to solve. But before solving it,
we need some preparation. Please see below.


[Preparation]

The bootloader has to load the kernel image into memory, and this memory must
not be hotpluggable. We cannot avoid this. So in a memory hotplug system,
we can assume that any node the kernel resides in is not hotpluggable.

Before SRAT is parsed, we don't know which memory ranges are hotpluggable. But
memblock has already started to work. In the current kernel, memblock allocates 
the following memory before SRAT is parsed:

setup_arch()
 |->memblock_x86_fill()            /* memblock is ready */
 |......
 |->early_reserve_e820_mpc_new()   /* allocate memory under 1MB */
 |->reserve_real_mode()            /* allocate memory under 1MB */
 |->init_mem_mapping()             /* allocate page tables, about 2MB to map 1GB memory */
 |->dma_contiguous_reserve()       /* specified by user, should be low */
 |->setup_log_buf()                /* specified by user, several mega bytes */
 |->relocate_initrd()              /* could be large, but will be freed after boot, should reorder */
 |->acpi_initrd_override()         /* several mega bytes */
 |->reserve_crashkernel()          /* could be large, should reorder */
 |......
 |->initmem_init()                 /* Parse SRAT */

Following Tejun's advice, before SRAT is parsed we should try our best to
allocate memory near the kernel image. Since the whole node the kernel resides
in won't be hotpluggable, and a node in a modern server may have at least 16GB
of memory, allocating several megabytes around the kernel image is very
unlikely to cross into hotpluggable memory.
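
As a sanity check on the "about 2MB to map 1GB memory" note next to
init_mem_mapping() in the call chain above, here is a small sketch of the
arithmetic. It assumes 4-level paging with 8-byte entries and only 4KB pages
(no 2MB/1GB large pages, which would need far less); the demo_* names are
invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE         4096ULL	/* assumed 4KB base pages */
#define DEMO_ENTRIES_PER_TABLE 512ULL	/* 4096 bytes / 8-byte entries */

/* Number of tables needed at one level to hold `entries` entries. */
static uint64_t demo_tables_for(uint64_t entries)
{
	return (entries + DEMO_ENTRIES_PER_TABLE - 1) / DEMO_ENTRIES_PER_TABLE;
}

/* Bytes of page-table memory needed to map `bytes` with 4KB pages only. */
uint64_t demo_page_table_bytes(uint64_t bytes)
{
	uint64_t entries, total = 0;
	int level;

	assert(bytes > 0);

	/* the lowest level needs one entry per mapped page */
	entries = (bytes + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;

	/* each higher level needs one entry per table of the level below */
	for (level = 0; level < 4; level++) {
		entries = demo_tables_for(entries);
		total += entries;
	}

	return total * DEMO_PAGE_SIZE;
}
```

For 1GB this gives 512 lowest-level tables plus one table at each of the three
upper levels, i.e. 515 pages, or just over 2MB. That is tiny compared with a
16GB node.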


[About this patch-set]

This patch-set is the preparation for problem 2. It does the following:

1. Make memblock able to allocate memory bottom-up.
   1) Keep all the memblock API prototypes unmodified.
   2) When the direction is bottom-up, keep the start address above the
      end of the kernel image.

2. Improve init_mem_mapping() to support allocating page tables in the
   bottom-up direction.

3. Introduce the "movable_node" boot option to enable and disable this functionality.
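
The allocation policy in item 1 can be sketched in user space as follows. This
is a simplified model, not the memblock code: free memory is a plain sorted
array of ranges, and all demo_* names are invented for this example. The model
leaves out NUMA node filtering, the current_limit handling and the
first-page exclusion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_range { uint64_t start, end; };	/* free range [start, end) */

static uint64_t demo_clamp(uint64_t v, uint64_t lo, uint64_t hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Lowest fit: walk ranges in ascending order and round the start up. */
static uint64_t demo_find_bottom_up(const struct demo_range *r, size_t n,
				    uint64_t start, uint64_t end,
				    uint64_t size, uint64_t align)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t s = demo_clamp(r[i].start, start, end);
		uint64_t e = demo_clamp(r[i].end, start, end);
		uint64_t cand = (s + align - 1) & ~(align - 1);

		if (cand < e && e - cand >= size)
			return cand;
	}
	return 0;
}

/* Highest fit: walk ranges in descending order and round the start down. */
static uint64_t demo_find_top_down(const struct demo_range *r, size_t n,
				   uint64_t start, uint64_t end,
				   uint64_t size, uint64_t align)
{
	for (size_t i = n; i-- > 0; ) {
		uint64_t s = demo_clamp(r[i].start, start, end);
		uint64_t e = demo_clamp(r[i].end, start, end);
		uint64_t cand;

		if (e < size)
			continue;
		cand = (e - size) & ~(align - 1);
		if (cand >= s)
			return cand;
	}
	return 0;
}

/* Returns the found address, or 0 on failure. */
uint64_t demo_find(const struct demo_range *r, size_t n, bool bottom_up,
		   uint64_t kernel_end, uint64_t start, uint64_t end,
		   uint64_t size, uint64_t align)
{
	/* align must be a power of two for the mask arithmetic above */
	assert(size > 0 && align != 0 && (align & (align - 1)) == 0);

	if (end < start)
		end = start;

	if (bottom_up && end > kernel_end) {
		/* never hand out memory below the end of the kernel image */
		uint64_t lo = start > kernel_end ? start : kernel_end;
		uint64_t ret = demo_find_bottom_up(r, n, lo, end, size, align);

		if (ret)
			return ret;
		/* top-down has no such lower limit, so it may still succeed */
	}

	return demo_find_top_down(r, n, start, end, size, align);
}
```

With free ranges [4KB, 16MB) and [32MB, 1GB) and a kernel image ending at 32MB,
a 2MB request in bottom-up mode lands at 32MB, right after the kernel, while
the same request in top-down mode lands just below 1GB.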

Change log v5 -> v6:
1. Add tejun and toshi's ack in several patches.
2. Change movablenode to movable_node boot option and update the description
   for movable_node and CONFIG_MOVABLE_NODE. Thanks Ingo!
3. Fix the __pa_symbol() issue pointed out by Andrew Morton.
4. Update some functions' comments and names.

Change log v4 -> v5:
1. Change memblock.current_direction to a boolean memblock.bottom_up, and remove
   the direction enum.
2. Update and add some comments to explain things more clearly.
3. Misc fixes, such as removing unnecessary #ifdef

Change log v3 -> v4:
1. Use bottom-up/top-down to unify things. Thanks tj.
2. Factor out the current top-down implementation and then introduce bottom-up mode,
   not mixing them in one patch. Thanks tj.
3. Change function naming: memblock_direction_bottom_up -> memblock_bottom_up
4. Use memblock_set_bottom_up to replace memblock_set_current_direction, which makes
   the code simpler. Thanks tj.
5. Define two implementations of the functions memblock_bottom_up and memblock_set_bottom_up
   in order not to use #ifdef in the boot code. Thanks tj.
6. Add comments to explain why top-down allocation is retried when bottom-up allocation
   fails. Thanks tj and toshi!

Change log v2 -> v3:
1. According to Toshi's suggestion, move the direction checking logic into memblock,
   and simplify the code further.

Change log v1 -> v2:
1. According to tj's suggestion, implemented a new function memblock_alloc_bottom_up()
   to allocate memory from the bottom upwards, which can simplify the code.
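
Item 5 of the v3 -> v4 change log (two implementations so that callers need no
#ifdef) follows a common pattern, sketched here in user space. The
DEMO_CONFIG_MOVABLE_NODE macro and the demo_* names are hypothetical stand-ins
invented for this example; in this sketch the option is left undefined.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the config option; define it to enable the code. */
/* #define DEMO_CONFIG_MOVABLE_NODE 1 */

struct demo_memblock {
	bool bottom_up;	/* allocate bottom-up instead of top-down? */
};

struct demo_memblock demo_memblock_state;

#ifdef DEMO_CONFIG_MOVABLE_NODE
static inline void demo_set_bottom_up(bool enable)
{
	demo_memblock_state.bottom_up = enable;
}

static inline bool demo_bottom_up(void)
{
	return demo_memblock_state.bottom_up;
}
#else
/* Option disabled: the setter does nothing, the getter is always false. */
static inline void demo_set_bottom_up(bool enable) { (void)enable; }
static inline bool demo_bottom_up(void) { return false; }
#endif

/* Boot-code style caller: the same source compiles either way, no #ifdef. */
bool demo_early_boot_direction(bool movable_node_requested)
{
	demo_set_bottom_up(movable_node_requested);
	return demo_bottom_up();
}
```

With the option disabled, the compiler can see that the getter is constant
false and drop the bottom-up branch in callers entirely.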

Tang Chen (6):
  memblock: Factor out of top-down allocation
  memblock: Introduce bottom-up allocation mode
  x86/mm: Factor out of top-down direct mapping setup
  x86/mem-hotplug: Support initialize page tables in bottom-up
  x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is
    parsed.
  mem-hotplug: Introduce movable_node boot option

 Documentation/kernel-parameters.txt |    3 +
 arch/x86/kernel/setup.c             |    9 ++-
 arch/x86/mm/init.c                  |  127 ++++++++++++++++++++++++++++------
 arch/x86/mm/numa.c                  |   11 +++
 include/linux/memblock.h            |   24 +++++++
 mm/Kconfig                          |   17 +++--
 mm/memblock.c                       |  130 +++++++++++++++++++++++++++++++----
 mm/memory_hotplug.c                 |   31 ++++++++
 8 files changed, 311 insertions(+), 41 deletions(-)


^ permalink raw reply	[flat|nested] 109+ messages in thread

* [PATCH part1 v6 1/6] memblock: Factor out of top-down allocation
  2013-10-04  1:56 ` Zhang Yanfei
@ 2013-10-04  1:57   ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-04  1:57 UTC (permalink / raw)
  To: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
	Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
	Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
	Rik van Riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, Linux MM, linux-acpi, imtangchen,
	Zhang Yanfei, Tang Chen

From: Tang Chen <tangchen@cn.fujitsu.com>

This patch creates a new function, __memblock_find_range_top_down(),
to factor the top-down allocation out of memblock_find_in_range_node().
This is preparation for the new bottom-up allocation mode introduced
in the following patch.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 mm/memblock.c |   47 ++++++++++++++++++++++++++++++++++-------------
 1 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 0ac412a..accff10 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -83,33 +83,25 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
 }
 
 /**
- * memblock_find_in_range_node - find free area in given range and node
+ * __memblock_find_range_top_down - find free area utility, in top-down
  * @start: start of candidate range
  * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
  * @size: size of free area to find
  * @align: alignment of free area to find
  * @nid: nid of the free area to find, %MAX_NUMNODES for any node
  *
- * Find @size free area aligned to @align in the specified range and node.
+ * Utility called from memblock_find_in_range_node(), find free area top-down.
  *
  * RETURNS:
  * Found address on success, %0 on failure.
  */
-phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
-					phys_addr_t end, phys_addr_t size,
-					phys_addr_t align, int nid)
+static phys_addr_t __init_memblock
+__memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
+			       phys_addr_t size, phys_addr_t align, int nid)
 {
 	phys_addr_t this_start, this_end, cand;
 	u64 i;
 
-	/* pump up @end */
-	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
-		end = memblock.current_limit;
-
-	/* avoid allocating the first page */
-	start = max_t(phys_addr_t, start, PAGE_SIZE);
-	end = max(start, end);
-
 	for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
 		this_start = clamp(this_start, start, end);
 		this_end = clamp(this_end, start, end);
@@ -121,10 +113,39 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 		if (cand >= this_start)
 			return cand;
 	}
+
 	return 0;
 }
 
 /**
+ * memblock_find_in_range_node - find free area in given range and node
+ * @start: start of candidate range
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ * @nid: nid of the free area to find, %MAX_NUMNODES for any node
+ *
+ * Find @size free area aligned to @align in the specified range and node.
+ *
+ * RETURNS:
+ * Found address on success, %0 on failure.
+ */
+phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
+					phys_addr_t end, phys_addr_t size,
+					phys_addr_t align, int nid)
+{
+	/* pump up @end */
+	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
+		end = memblock.current_limit;
+
+	/* avoid allocating the first page */
+	start = max_t(phys_addr_t, start, PAGE_SIZE);
+	end = max(start, end);
+
+	return __memblock_find_range_top_down(start, end, size, align, nid);
+}
+
+/**
  * memblock_find_in_range - find free area in given range
  * @start: start of candidate range
  * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH part1 v6 2/6] memblock: Introduce bottom-up allocation mode
  2013-10-04  1:56 ` Zhang Yanfei
@ 2013-10-04  1:58   ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-04  1:58 UTC (permalink / raw)
  To: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
	Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
	Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
	Rik van Riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, Linux MM, linux-acpi, imtangchen,
	Zhang Yanfei, Tang Chen

From: Tang Chen <tangchen@cn.fujitsu.com>

The Linux kernel cannot migrate pages used by the kernel. As a result, kernel
pages cannot be hot-removed. So we cannot allocate hotpluggable memory for
the kernel.

ACPI SRAT (System Resource Affinity Table) contains the memory hotplug info.
But before SRAT is parsed, memblock has already started to allocate memory
for the kernel. So we need to prevent memblock from allocating hotpluggable
memory at that stage.

In a memory hotplug system, any NUMA node the kernel resides in should
not be hotpluggable. And for a modern server, each node could have at least
16GB of memory. So memory around the kernel image is highly likely not
hotpluggable.

So the basic idea is: allocate memory starting from the end of the kernel
image and going toward higher memory. Since not much memory is allocated
before SRAT is parsed, it is highly likely to be in the same node as the
kernel image.

The current memblock can only allocate memory top-down. So this patch introduces
a new bottom-up allocation mode. Later, when we use this allocation direction
to allocate memory, we will limit the start address to above the kernel.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 include/linux/memblock.h |   24 +++++++++++++
 mm/memblock.c            |   87 ++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 108 insertions(+), 3 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 31e95ac..77c60e5 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -35,6 +35,7 @@ struct memblock_type {
 };
 
 struct memblock {
+	bool bottom_up;  /* is bottom up direction? */
 	phys_addr_t current_limit;
 	struct memblock_type memory;
 	struct memblock_type reserved;
@@ -148,6 +149,29 @@ phys_addr_t memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, int nid)
 
 phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align);
 
+#ifdef CONFIG_MOVABLE_NODE
+/*
+ * Set the allocation direction to bottom-up or top-down.
+ */
+static inline void memblock_set_bottom_up(bool enable)
+{
+	memblock.bottom_up = enable;
+}
+
+/*
+ * Check if the allocation direction is bottom-up or not.
+ * If this returns true, memblock will allocate memory
+ * in the bottom-up direction.
+ */
+static inline bool memblock_bottom_up(void)
+{
+	return memblock.bottom_up;
+}
+#else
+static inline void memblock_set_bottom_up(bool enable) {}
+static inline bool memblock_bottom_up(void) { return false; }
+#endif
+
 /* Flags for memblock_alloc_base() amd __memblock_alloc_base() */
 #define MEMBLOCK_ALLOC_ANYWHERE	(~(phys_addr_t)0)
 #define MEMBLOCK_ALLOC_ACCESSIBLE	0
diff --git a/mm/memblock.c b/mm/memblock.c
index accff10..04f20f4 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -20,6 +20,8 @@
 #include <linux/seq_file.h>
 #include <linux/memblock.h>
 
+#include <asm-generic/sections.h>
+
 static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
 static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
 
@@ -32,6 +34,7 @@ struct memblock memblock __initdata_memblock = {
 	.reserved.cnt		= 1,	/* empty dummy entry */
 	.reserved.max		= INIT_MEMBLOCK_REGIONS,
 
+	.bottom_up		= false,
 	.current_limit		= MEMBLOCK_ALLOC_ANYWHERE,
 };
 
@@ -82,6 +85,38 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
 	return (i < type->cnt) ? i : -1;
 }
 
+/*
+ * __memblock_find_range_bottom_up - find free area utility in bottom-up
+ * @start: start of candidate range
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ * @nid: nid of the free area to find, %MAX_NUMNODES for any node
+ *
+ * Utility called from memblock_find_in_range_node(), find free area bottom-up.
+ *
+ * RETURNS:
+ * Found address on success, 0 on failure.
+ */
+static phys_addr_t __init_memblock
+__memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
+				phys_addr_t size, phys_addr_t align, int nid)
+{
+	phys_addr_t this_start, this_end, cand;
+	u64 i;
+
+	for_each_free_mem_range(i, nid, &this_start, &this_end, NULL) {
+		this_start = clamp(this_start, start, end);
+		this_end = clamp(this_end, start, end);
+
+		cand = round_up(this_start, align);
+		if (cand < this_end && this_end - cand >= size)
+			return cand;
+	}
+
+	return 0;
+}
+
 /**
  * __memblock_find_range_top_down - find free area utility, in top-down
  * @start: start of candidate range
@@ -93,7 +128,7 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
  * Utility called from memblock_find_in_range_node(), find free area top-down.
  *
  * RETURNS:
- * Found address on success, %0 on failure.
+ * Found address on success, 0 on failure.
  */
 static phys_addr_t __init_memblock
 __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
@@ -127,13 +162,24 @@ __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
  *
  * Find @size free area aligned to @align in the specified range and node.
  *
+ * When allocation direction is bottom-up, the @start should be greater
+ * than the end of the kernel image. Otherwise, it will be trimmed. The
+ * reason is that we want the bottom-up allocation just near the kernel
+ * image so it is highly likely that the allocated memory and the kernel
+ * will reside in the same node.
+ *
+ * If bottom-up allocation failed, will try to allocate memory top-down.
+ *
  * RETURNS:
- * Found address on success, %0 on failure.
+ * Found address on success, 0 on failure.
  */
 phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 					phys_addr_t end, phys_addr_t size,
 					phys_addr_t align, int nid)
 {
+	phys_addr_t ret;
+	phys_addr_t kernel_end;
+
 	/* pump up @end */
 	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
 		end = memblock.current_limit;
@@ -141,6 +187,41 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 	/* avoid allocating the first page */
 	start = max_t(phys_addr_t, start, PAGE_SIZE);
 	end = max(start, end);
+#ifdef CONFIG_X86
+	kernel_end = __pa_symbol(_end);
+#else
+	kernel_end = __pa(RELOC_HIDE((unsigned long)(_end), 0));
+#endif
+
+	/*
+	 * try bottom-up allocation only when bottom-up mode
+	 * is set and @end is above the kernel image.
+	 */
+	if (memblock_bottom_up() && end > kernel_end) {
+		phys_addr_t bottom_up_start;
+
+		/* make sure we will allocate above the kernel */
+		bottom_up_start = max(start, kernel_end);
+
+		/* ok, try bottom-up allocation first */
+		ret = __memblock_find_range_bottom_up(bottom_up_start, end,
+						      size, align, nid);
+		if (ret)
+			return ret;
+
+		/*
+		 * we always limit bottom-up allocation above the kernel,
+		 * but top-down allocation doesn't have the limit, so
+		 * retrying top-down allocation may succeed when bottom-up
+		 * allocation failed.
+		 *
+		 * bottom-up allocation is expected to fail very rarely,
+		 * so we use WARN_ONCE() here to see the stack trace if
+		 * a failure happens.
+		 */
+		WARN_ONCE(1, "memblock: bottom-up allocation failed, "
+			     "memory hotunplug may be affected\n");
+	}
 
 	return __memblock_find_range_top_down(start, end, size, align, nid);
 }
@@ -155,7 +236,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
  * Find @size free area aligned to @align in the specified range.
  *
  * RETURNS:
- * Found address on success, %0 on failure.
+ * Found address on success, 0 on failure.
  */
 phys_addr_t __init_memblock memblock_find_in_range(phys_addr_t start,
 					phys_addr_t end, phys_addr_t size,
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH part1 v6 2/6] memblock: Introduce bottom-up allocation mode
@ 2013-10-04  1:58   ` Zhang Yanfei
  0 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-04  1:58 UTC (permalink / raw)
  To: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
	Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
	Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
	Rik van Riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, Linux MM, linux-acpi, imtangchen,
	Zhang Yanfei, Tang Chen

From: Tang Chen <tangchen@cn.fujitsu.com>

The Linux kernel cannot migrate pages used by the kernel itself. As a result,
memory holding kernel pages cannot be hot-removed, so the kernel must not be
given hotpluggable memory.

ACPI SRAT (System Resource Affinity Table) contains the memory hotplug info.
But memblock starts allocating memory for the kernel before SRAT is parsed,
so we need to keep those early allocations out of hotpluggable memory.

In a memory hotplug system, any NUMA node the kernel resides in should be
unhotpluggable. And on a modern server, each node could have at least 16GB
of memory. So memory around the kernel image is highly likely to be
unhotpluggable.

So the basic idea is: allocate memory starting from the end of the kernel
image and going toward higher addresses. Since not much memory is allocated
before SRAT is parsed, it is highly likely to be in the same node as the
kernel image.

The current memblock can only allocate memory top-down. So this patch
introduces a new bottom-up allocation mode. Later, when we use this
allocation direction, we will limit the start address to above the kernel.
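
As a rough illustration of the search described above, here is a minimal
userspace sketch, not the kernel code. The names free_range,
find_bottom_up() and find_above_kernel() are hypothetical, and the free
ranges are assumed to be sorted by ascending address:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;

/* Hypothetical stand-in for one free memblock range: [base, end). */
struct free_range {
	phys_addr_t base;
	phys_addr_t end;
};

static phys_addr_t clamp_addr(phys_addr_t v, phys_addr_t lo, phys_addr_t hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Round v up to a multiple of align; align must be a power of two. */
static phys_addr_t round_up_pow2(phys_addr_t v, phys_addr_t align)
{
	return (v + align - 1) & ~(align - 1);
}

/*
 * Scan the free ranges from low to high and return the lowest aligned
 * address in [start, end) that can hold size bytes, or 0 on failure.
 */
phys_addr_t find_bottom_up(const struct free_range *free, size_t n,
			   phys_addr_t start, phys_addr_t end,
			   phys_addr_t size, phys_addr_t align)
{
	for (size_t i = 0; i < n; i++) {
		phys_addr_t this_start = clamp_addr(free[i].base, start, end);
		phys_addr_t this_end = clamp_addr(free[i].end, start, end);
		phys_addr_t cand = round_up_pow2(this_start, align);

		if (cand < this_end && this_end - cand >= size)
			return cand;
	}
	return 0;
}

/* Same search, but never starting below the end of the kernel image. */
phys_addr_t find_above_kernel(const struct free_range *free, size_t n,
			      phys_addr_t start, phys_addr_t end,
			      phys_addr_t size, phys_addr_t align,
			      phys_addr_t kernel_end)
{
	if (end <= kernel_end)
		return 0;	/* nothing above the kernel to search */
	if (start < kernel_end)
		start = kernel_end;
	return find_bottom_up(free, n, start, end, size, align);
}
```

For example, with free ranges [0x1000, 0x9000) and [0x200000, 0x400000) and
a kernel image ending at 0x250000, a page-sized request returns 0x250000:
the first range lies entirely below the kernel and is skipped. A caller
that gets 0 back would then fall back to a top-down search, as the patch
does.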

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 include/linux/memblock.h |   24 +++++++++++++
 mm/memblock.c            |   87 ++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 108 insertions(+), 3 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 31e95ac..77c60e5 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -35,6 +35,7 @@ struct memblock_type {
 };
 
 struct memblock {
+	bool bottom_up;  /* is bottom up direction? */
 	phys_addr_t current_limit;
 	struct memblock_type memory;
 	struct memblock_type reserved;
@@ -148,6 +149,29 @@ phys_addr_t memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, int nid)
 
 phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align);
 
+#ifdef CONFIG_MOVABLE_NODE
+/*
+ * Set the allocation direction to bottom-up or top-down.
+ */
+static inline void memblock_set_bottom_up(bool enable)
+{
+	memblock.bottom_up = enable;
+}
+
+/*
+ * Check if the allocation direction is bottom-up or not.
+ * If this returns true, memblock will allocate memory
+ * in the bottom-up direction.
+ */
+static inline bool memblock_bottom_up(void)
+{
+	return memblock.bottom_up;
+}
+#else
+static inline void memblock_set_bottom_up(bool enable) {}
+static inline bool memblock_bottom_up(void) { return false; }
+#endif
+
 /* Flags for memblock_alloc_base() amd __memblock_alloc_base() */
 #define MEMBLOCK_ALLOC_ANYWHERE	(~(phys_addr_t)0)
 #define MEMBLOCK_ALLOC_ACCESSIBLE	0
diff --git a/mm/memblock.c b/mm/memblock.c
index accff10..04f20f4 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -20,6 +20,8 @@
 #include <linux/seq_file.h>
 #include <linux/memblock.h>
 
+#include <asm-generic/sections.h>
+
 static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
 static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
 
@@ -32,6 +34,7 @@ struct memblock memblock __initdata_memblock = {
 	.reserved.cnt		= 1,	/* empty dummy entry */
 	.reserved.max		= INIT_MEMBLOCK_REGIONS,
 
+	.bottom_up		= false,
 	.current_limit		= MEMBLOCK_ALLOC_ANYWHERE,
 };
 
@@ -82,6 +85,38 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
 	return (i < type->cnt) ? i : -1;
 }
 
+/**
+ * __memblock_find_range_bottom_up - find free area utility in bottom-up
+ * @start: start of candidate range
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ * @nid: nid of the free area to find, %MAX_NUMNODES for any node
+ *
+ * Utility called from memblock_find_in_range_node(), find free area bottom-up.
+ *
+ * RETURNS:
+ * Found address on success, 0 on failure.
+ */
+static phys_addr_t __init_memblock
+__memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
+				phys_addr_t size, phys_addr_t align, int nid)
+{
+	phys_addr_t this_start, this_end, cand;
+	u64 i;
+
+	for_each_free_mem_range(i, nid, &this_start, &this_end, NULL) {
+		this_start = clamp(this_start, start, end);
+		this_end = clamp(this_end, start, end);
+
+		cand = round_up(this_start, align);
+		if (cand < this_end && this_end - cand >= size)
+			return cand;
+	}
+
+	return 0;
+}
+
 /**
  * __memblock_find_range_top_down - find free area utility, in top-down
  * @start: start of candidate range
@@ -93,7 +128,7 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
  * Utility called from memblock_find_in_range_node(), find free area top-down.
  *
  * RETURNS:
- * Found address on success, %0 on failure.
+ * Found address on success, 0 on failure.
  */
 static phys_addr_t __init_memblock
 __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
@@ -127,13 +162,24 @@ __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
  *
  * Find @size free area aligned to @align in the specified range and node.
  *
+ * When allocation direction is bottom-up, the @start should be greater
+ * than the end of the kernel image. Otherwise, it will be trimmed. The
+ * reason is that we want the bottom-up allocation just near the kernel
+ * image so it is highly likely that the allocated memory and the kernel
+ * will reside in the same node.
+ *
+ * If bottom-up allocation fails, top-down allocation is tried instead.
+ *
  * RETURNS:
- * Found address on success, %0 on failure.
+ * Found address on success, 0 on failure.
  */
 phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 					phys_addr_t end, phys_addr_t size,
 					phys_addr_t align, int nid)
 {
+	phys_addr_t ret;
+	phys_addr_t kernel_end;
+
 	/* pump up @end */
 	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
 		end = memblock.current_limit;
@@ -141,6 +187,41 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 	/* avoid allocating the first page */
 	start = max_t(phys_addr_t, start, PAGE_SIZE);
 	end = max(start, end);
+#ifdef CONFIG_X86
+	kernel_end = __pa_symbol(_end);
+#else
+	kernel_end = __pa(RELOC_HIDE((unsigned long)(_end), 0));
+#endif
+
+	/*
+	 * try bottom-up allocation only when bottom-up mode
+	 * is set and @end is above the kernel image.
+	 */
+	if (memblock_bottom_up() && end > kernel_end) {
+		phys_addr_t bottom_up_start;
+
+		/* make sure we will allocate above the kernel */
+		bottom_up_start = max(start, kernel_end);
+
+		/* ok, try bottom-up allocation first */
+		ret = __memblock_find_range_bottom_up(bottom_up_start, end,
+						      size, align, nid);
+		if (ret)
+			return ret;
+
+		/*
+		 * we always limit bottom-up allocation above the kernel,
+		 * but top-down allocation doesn't have the limit, so
+		 * retrying top-down allocation may succeed when bottom-up
+		 * allocation failed.
+		 *
+		 * bottom-up allocation is expected to fail very rarely,
+		 * so we use WARN_ONCE() here to see the stack trace if
+		 * a failure happens.
+		 */
+		WARN_ONCE(1, "memblock: bottom-up allocation failed, "
+			     "memory hotunplug may be affected\n");
+	}
 
 	return __memblock_find_range_top_down(start, end, size, align, nid);
 }
@@ -155,7 +236,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
  * Find @size free area aligned to @align in the specified range.
  *
  * RETURNS:
- * Found address on success, %0 on failure.
+ * Found address on success, 0 on failure.
  */
 phys_addr_t __init_memblock memblock_find_in_range(phys_addr_t start,
 					phys_addr_t end, phys_addr_t size,
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH part1 v6 3/6] x86/mm: Factor out of top-down direct mapping setup
  2013-10-04  1:56 ` Zhang Yanfei
@ 2013-10-04  1:59   ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-04  1:59 UTC (permalink / raw)
  To: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
	Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
	Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
	Rik van Riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, Linux MM, linux-acpi, imtangchen,
	Zhang Yanfei, Tang Chen

From: Tang Chen <tangchen@cn.fujitsu.com>

This patch creates a new function, memory_map_top_down(), to
factor out the top-down direct memory mapping pagetable
setup. This is also preparation for the following patch,
which introduces bottom-up memory mapping. That is, we put
the two ways of pagetable setup into separate functions and
choose which one to use in init_mem_mapping(), which makes
the code clearer.
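
To make the walk performed by memory_map_top_down() easier to follow,
here is a small userspace sketch of how the loop picks its chunks. It is
an illustration only: struct chunk and plan_top_down() are hypothetical
names, the whole range is assumed to be RAM (so each chunk counts fully
as newly mapped memory), and the initial values step_size = PMD_SIZE and
last_start = real_end are assumptions, since those lines fall outside the
hunks of this patch:

```c
#include <stddef.h>
#include <stdint.h>

#define PMD_SIZE        (UINT64_C(1) << 21)  /* 2 MiB, as on x86_64 */
#define STEP_SIZE_SHIFT 5

/* Hypothetical record of one mapped chunk [start, end). */
struct chunk {
	uint64_t start;
	uint64_t end;
};

/* Round v down to a multiple of step; step must be a power of two. */
static uint64_t round_down_pow2(uint64_t v, uint64_t step)
{
	return v & ~(step - 1);
}

/*
 * Record the chunks a top-down walk from real_end down to map_start
 * would map. Returns the number of chunks written to out.
 */
size_t plan_top_down(uint64_t map_start, uint64_t real_end,
		     struct chunk *out, size_t max_chunks)
{
	uint64_t step_size = PMD_SIZE;
	uint64_t last_start = real_end, start;
	uint64_t mapped = 0, new_mapped;
	size_t n = 0;

	while (last_start > map_start && n < max_chunks) {
		if (last_start > step_size) {
			start = round_down_pow2(last_start - 1, step_size);
			if (start < map_start)
				start = map_start;
		} else {
			start = map_start;
		}

		out[n].start = start;
		out[n].end = last_start;
		n++;

		new_mapped = last_start - start;
		last_start = start;

		/* grow the step only after a larger range got mapped */
		if (new_mapped > mapped)
			step_size <<= STEP_SIZE_SHIFT;
		mapped += new_mapped;
	}
	return n;
}
```

With map_start = 0x100000 and real_end = 0x10000000, the sketch yields
three chunks starting at 0xfe00000, 0xc000000 and 0x100000: a small chunk
at the top is mapped first, and each later chunk may be larger because
page tables can now come from the memory already mapped.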

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 arch/x86/mm/init.c |   60 ++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 04664cd..ea2be79 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -401,27 +401,28 @@ static unsigned long __init init_range_memory_mapping(
 
 /* (PUD_SHIFT-PMD_SHIFT)/2 */
 #define STEP_SIZE_SHIFT 5
-void __init init_mem_mapping(void)
+
+/**
+ * memory_map_top_down - Map [map_start, map_end) top down
+ * @map_start: start address of the target memory range
+ * @map_end: end address of the target memory range
+ *
+ * This function sets up the direct mapping for the memory range
+ * [map_start, map_end) top-down. That is, the page tables
+ * are allocated at the end of the memory, and we map the
+ * memory from top to bottom.
+ */
+static void __init memory_map_top_down(unsigned long map_start,
+				       unsigned long map_end)
 {
-	unsigned long end, real_end, start, last_start;
+	unsigned long real_end, start, last_start;
 	unsigned long step_size;
 	unsigned long addr;
 	unsigned long mapped_ram_size = 0;
 	unsigned long new_mapped_ram_size;
 
-	probe_page_size_mask();
-
-#ifdef CONFIG_X86_64
-	end = max_pfn << PAGE_SHIFT;
-#else
-	end = max_low_pfn << PAGE_SHIFT;
-#endif
-
-	/* the ISA range is always mapped regardless of memory holes */
-	init_memory_mapping(0, ISA_END_ADDRESS);
-
 	/* xen has big range in reserved near end of ram, skip it at first.*/
-	addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
+	addr = memblock_find_in_range(map_start, map_end, PMD_SIZE, PMD_SIZE);
 	real_end = addr + PMD_SIZE;
 
 	/* step_size need to be small so pgt_buf from BRK could cover it */
@@ -436,13 +437,13 @@ void __init init_mem_mapping(void)
 	 * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
 	 * for page table.
 	 */
-	while (last_start > ISA_END_ADDRESS) {
+	while (last_start > map_start) {
 		if (last_start > step_size) {
 			start = round_down(last_start - 1, step_size);
-			if (start < ISA_END_ADDRESS)
-				start = ISA_END_ADDRESS;
+			if (start < map_start)
+				start = map_start;
 		} else
-			start = ISA_END_ADDRESS;
+			start = map_start;
 		new_mapped_ram_size = init_range_memory_mapping(start,
 							last_start);
 		last_start = start;
@@ -453,8 +454,27 @@ void __init init_mem_mapping(void)
 		mapped_ram_size += new_mapped_ram_size;
 	}
 
-	if (real_end < end)
-		init_range_memory_mapping(real_end, end);
+	if (real_end < map_end)
+		init_range_memory_mapping(real_end, map_end);
+}
+
+void __init init_mem_mapping(void)
+{
+	unsigned long end;
+
+	probe_page_size_mask();
+
+#ifdef CONFIG_X86_64
+	end = max_pfn << PAGE_SHIFT;
+#else
+	end = max_low_pfn << PAGE_SHIFT;
+#endif
+
+	/* the ISA range is always mapped regardless of memory holes */
+	init_memory_mapping(0, ISA_END_ADDRESS);
+
+	/* setup direct mapping for range [ISA_END_ADDRESS, end) in top-down*/
+	memory_map_top_down(ISA_END_ADDRESS, end);
 
 #ifdef CONFIG_X86_64
 	if (max_pfn > max_low_pfn) {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-04  1:56 ` Zhang Yanfei
@ 2013-10-04  2:00   ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-04  2:00 UTC (permalink / raw)
  To: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
	Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
	Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
	Rik van Riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, Linux MM, linux-acpi, imtangchen,
	Zhang Yanfei, Tang Chen

From: Tang Chen <tangchen@cn.fujitsu.com>

The Linux kernel cannot migrate pages used by the kernel itself. As a
result, memory holding kernel pages cannot be hot-removed, so the
kernel must not be given hotpluggable memory.

In a memory hotplug system, any NUMA node the kernel resides in
should be unhotpluggable. And on a modern server, each node could
have at least 16GB of memory. So memory around the kernel image is
highly likely to be unhotpluggable.

ACPI SRAT (System Resource Affinity Table) contains the memory
hotplug info. But memblock starts allocating memory for the kernel
before SRAT is parsed, so we need to keep those early allocations
out of hotpluggable memory.

Direct memory mapping page table setup is one such case: init_mem_mapping()
is called before SRAT is parsed. To prevent page tables from being allocated
within hotpluggable memory, we allocate them bottom-up, from the end of the
kernel image toward higher memory.
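
The order in which ranges get mapped can be illustrated with a small
userspace sketch. It is not the kernel code: struct chunk,
plan_bottom_up() and plan_mapping() are hypothetical names, the whole
range is assumed to be RAM (so each chunk counts fully as newly mapped
memory), and the ISA_END_ADDRESS value of 0x100000 is an assumption
matching the usual x86 definition:

```c
#include <stddef.h>
#include <stdint.h>

#define PMD_SIZE        (UINT64_C(1) << 21)  /* 2 MiB, as on x86_64 */
#define STEP_SIZE_SHIFT 5
#define ISA_END_ADDRESS UINT64_C(0x100000)   /* assumed value */

/* Hypothetical record of one mapped chunk [start, end). */
struct chunk {
	uint64_t start;
	uint64_t end;
};

/* Round v up to a multiple of step; step must be a power of two. */
static uint64_t round_up_pow2(uint64_t v, uint64_t step)
{
	return (v + step - 1) & ~(step - 1);
}

/*
 * Record the chunks a bottom-up walk of [map_start, map_end) would map.
 * Returns the number of chunks written to out.
 */
size_t plan_bottom_up(uint64_t map_start, uint64_t map_end,
		      struct chunk *out, size_t max_chunks)
{
	uint64_t step_size = PMD_SIZE;
	uint64_t start = map_start, next;
	uint64_t mapped = 0, new_mapped;
	size_t n = 0;

	while (start < map_end && n < max_chunks) {
		if (map_end - start > step_size) {
			next = round_up_pow2(start + 1, step_size);
			if (next > map_end)
				next = map_end;
		} else {
			next = map_end;
		}

		out[n].start = start;
		out[n].end = next;
		n++;

		new_mapped = next - start;
		start = next;

		/* grow the step only after a larger range got mapped */
		if (new_mapped > mapped)
			step_size <<= STEP_SIZE_SHIFT;
		mapped += new_mapped;
	}
	return n;
}

/* Two walks in the patch's order: above the kernel first, then below. */
size_t plan_mapping(uint64_t kernel_end, uint64_t end,
		    struct chunk *out, size_t max_chunks)
{
	size_t n = plan_bottom_up(kernel_end, end, out, max_chunks);

	return n + plan_bottom_up(ISA_END_ADDRESS, kernel_end,
				  out + n, max_chunks - n);
}
```

With kernel_end = 0x2500000 and end = 0x10000000, the first recorded chunk
is [0x2500000, 0x2600000): memory right above the kernel is mapped first,
so page tables for everything else, including [ISA_END_ADDRESS, kernel_end),
can be allocated there.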

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 arch/x86/mm/init.c |   71 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index ea2be79..5cea9ed 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -458,6 +458,51 @@ static void __init memory_map_top_down(unsigned long map_start,
 		init_range_memory_mapping(real_end, map_end);
 }
 
+/**
+ * memory_map_bottom_up - Map [map_start, map_end) bottom up
+ * @map_start: start address of the target memory range
+ * @map_end: end address of the target memory range
+ *
+ * This function sets up the direct mapping for the memory range
+ * [map_start, map_end) bottom-up. Since we have limited
+ * bottom-up allocation to above the kernel, the page tables will
+ * be allocated just above the kernel, and we map the memory
+ * in [map_start, map_end) from bottom to top.
+ */
+static void __init memory_map_bottom_up(unsigned long map_start,
+					unsigned long map_end)
+{
+	unsigned long next, new_mapped_ram_size, start;
+	unsigned long mapped_ram_size = 0;
+	/* step_size need to be small so pgt_buf from BRK could cover it */
+	unsigned long step_size = PMD_SIZE;
+
+	start = map_start;
+	min_pfn_mapped = start >> PAGE_SHIFT;
+
+	/*
+	 * We start from the bottom (@map_start) and go to the top (@map_end).
+	 * memblock_find_in_range() gets us a block of RAM in
+	 * [min_pfn_mapped, max_pfn_mapped), taken bottom-up from just
+	 * above the kernel, to be used as new pages for page tables.
+	 */
+	while (start < map_end) {
+		if (map_end - start > step_size) {
+			next = round_up(start + 1, step_size);
+			if (next > map_end)
+				next = map_end;
+		} else
+			next = map_end;
+
+		new_mapped_ram_size = init_range_memory_mapping(start, next);
+		start = next;
+
+		if (new_mapped_ram_size > mapped_ram_size)
+			step_size <<= STEP_SIZE_SHIFT;
+		mapped_ram_size += new_mapped_ram_size;
+	}
+}
+
 void __init init_mem_mapping(void)
 {
 	unsigned long end;
@@ -473,8 +518,30 @@ void __init init_mem_mapping(void)
 	/* the ISA range is always mapped regardless of memory holes */
 	init_memory_mapping(0, ISA_END_ADDRESS);
 
-	/* setup direct mapping for range [ISA_END_ADDRESS, end) in top-down*/
-	memory_map_top_down(ISA_END_ADDRESS, end);
+	/*
+	 * If the allocation is in bottom-up direction, we setup direct mapping
+	 * in bottom-up, otherwise we setup direct mapping in top-down.
+	 */
+	if (memblock_bottom_up()) {
+		unsigned long kernel_end;
+
+#ifdef CONFIG_X86
+		kernel_end = __pa_symbol(_end);
+#else
+		kernel_end = __pa(RELOC_HIDE((unsigned long)(_end), 0));
+#endif
+		/*
+		 * we need two separate calls here. This is because we want to
+		 * allocate page tables above the kernel. So we first map
+		 * [kernel_end, end) to make memory above the kernel be mapped
+		 * as soon as possible. And then use page tables allocated above
+		 * the kernel to map [ISA_END_ADDRESS, kernel_end).
+		 */
+		memory_map_bottom_up(kernel_end, end);
+		memory_map_bottom_up(ISA_END_ADDRESS, kernel_end);
+	} else {
+		memory_map_top_down(ISA_END_ADDRESS, end);
+	}
 
 #ifdef CONFIG_X86_64
 	if (max_pfn > max_low_pfn) {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH part1 v6 5/6] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed.
  2013-10-04  1:56 ` Zhang Yanfei
@ 2013-10-04  2:01   ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-04  2:01 UTC (permalink / raw)
  To: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
	Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
	Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
	Rik van Riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, Linux MM, linux-acpi, imtangchen,
	Zhang Yanfei, Tang Chen

From: Tang Chen <tangchen@cn.fujitsu.com>

Memory reserved for the crashkernel could be large, so we should not
allocate it bottom-up from the end of the kernel image.

Once SRAT is parsed, we know which memory is hotpluggable and can avoid
allocating that memory for the kernel. So move reserve_crashkernel() to
after SRAT is parsed.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 arch/x86/kernel/setup.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f0de629..b5e350d 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1120,8 +1120,6 @@ void __init setup_arch(char **cmdline_p)
 	acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
 #endif
 
-	reserve_crashkernel();
-
 	vsmp_init();
 
 	io_delay_init();
@@ -1134,6 +1132,13 @@ void __init setup_arch(char **cmdline_p)
 	early_acpi_boot_init();
 
 	initmem_init();
+
+	/*
+	 * Reserve memory for crash kernel after SRAT is parsed so that it
+	 * won't consume hotpluggable memory.
+	 */
+	reserve_crashkernel();
+
 	memblock_find_dma_reserve();
 
 #ifdef CONFIG_KVM_GUEST
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH part1 v6 6/6] mem-hotplug: Introduce movable_node boot option
  2013-10-04  1:56 ` Zhang Yanfei
@ 2013-10-04  2:02   ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-04  2:02 UTC (permalink / raw)
  To: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
	Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
	Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
	Rik van Riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, Linux MM, linux-acpi, imtangchen,
	Zhang Yanfei, Tang Chen

From: Tang Chen <tangchen@cn.fujitsu.com>

The Hot Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel,
it cannot be hot-removed. So memory hotplug users may want to put all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set a node as a movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
kernel cannot use memory in movable nodes. This will degrade NUMA
performance, and other users may be unhappy.

So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce the movable_node boot option to allow users to
choose not to consume hotpluggable memory at early boot time, so that it
can later be set as ZONE_MOVABLE.

To achieve this, the movable_node boot option will control the memblock
allocation direction. That is, after memblock is ready and before SRAT is
parsed, we should allocate memory near the kernel image as we explained
in the previous patches. So if the movable_node boot option is set, the
kernel does the following:

1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default and allocate
   memory top down.

Users can specify "movable_node" on the kernel command line to enable this
functionality. Those who don't use memory hotplug or who don't want to
lose their NUMA performance can simply leave it out, and the kernel will
work as before.

Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
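Note, not part of the commit: the two steps in the changelog can be
modelled in user space with the sketch below. It is only an illustration
under stated assumptions. Every toy_* name, the memory layout constants
and the bump-pointer allocator are hypothetical and are not the memblock
API; only the ordering (bottom up before SRAT is parsed, top down
afterwards) is taken from this patch.

```c
/* User-space sketch only: models the direction switch, not real memblock. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MEM_END    0x1000u	/* hypothetical end of usable memory */
#define TOY_KERNEL_END 0x0200u	/* hypothetical end of the kernel image */

struct toy_memblock {
	bool bottom_up;		/* stands in for the memblock direction flag */
	uint32_t low_next;	/* next free address growing up from kernel end */
	uint32_t high_next;	/* next free address growing down from memory end */
};

static void toy_init(struct toy_memblock *mb)
{
	mb->bottom_up = false;	/* the default direction is top down */
	mb->low_next = TOY_KERNEL_END;
	mb->high_next = TOY_MEM_END;
}

static void toy_set_bottom_up(struct toy_memblock *mb, bool enable)
{
	mb->bottom_up = enable;
}

/* Returns the start of the allocated range, or 0 when space runs out. */
static uint32_t toy_alloc(struct toy_memblock *mb, uint32_t size)
{
	uint32_t start;

	assert(size > 0);
	if (size > mb->high_next - mb->low_next)
		return 0;

	if (mb->bottom_up) {
		start = mb->low_next;	/* just above the kernel image */
		mb->low_next += size;
	} else {
		mb->high_next -= size;	/* from the top of memory */
		start = mb->high_next;
	}
	return start;
}

/* Replays the boot order from the changelog and reports both addresses. */
static void toy_boot(bool movable_node, uint32_t *early, uint32_t *late)
{
	struct toy_memblock mb;

	toy_init(&mb);
	if (movable_node)
		toy_set_bottom_up(&mb, true);	/* step 1: boot option handler */
	*early = toy_alloc(&mb, 0x100);		/* before SRAT is parsed */
	toy_set_bottom_up(&mb, false);		/* step 2: after SRAT is parsed */
	*late = toy_alloc(&mb, 0x100);		/* after SRAT is parsed */
}
```

In this toy model, booting with the option places the early allocation
right after the kernel image and the late one at the top of memory, while
booting without it places both at the top. The sketch ignores NUMA nodes
and any fallback path when the bottom-up search fails.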
 Documentation/kernel-parameters.txt |    3 +++
 arch/x86/mm/numa.c                  |   11 +++++++++++
 mm/Kconfig                          |   17 ++++++++++++-----
 mm/memory_hotplug.c                 |   31 +++++++++++++++++++++++++++++++
 4 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 539a236..13201d4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1769,6 +1769,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			that the amount of memory usable for all allocations
 			is not too small.
 
+	movable_node	[KNL,X86] Boot-time switch to disable the effects
+			of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details.
+
 	MTD_Partition=	[MTD]
 			Format: <name>,<region-number>,<size>,<offset>
 
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8bf93ba..24aec58 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -567,6 +567,17 @@ static int __init numa_init(int (*init_func)(void))
 	ret = init_func();
 	if (ret < 0)
 		return ret;
+
+	/*
+	 * We reset memblock back to the top-down direction
+	 * here because if we configured ACPI_NUMA, we have
+	 * parsed SRAT in init_func(). It is ok to have the
+	 * reset here even if we didn't configure ACPI_NUMA
+	 * or acpi numa init fails and falls back to dummy
+	 * numa init.
+	 */
+	memblock_set_bottom_up(false);
+
 	ret = numa_cleanup_meminfo(&numa_meminfo);
 	if (ret < 0)
 		return ret;
diff --git a/mm/Kconfig b/mm/Kconfig
index 026771a..0db1cc6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -153,11 +153,18 @@ config MOVABLE_NODE
 	help
 	  Allow a node to have only movable memory.  Pages used by the kernel,
 	  such as direct mapping pages cannot be migrated.  So the corresponding
-	  memory device cannot be hotplugged.  This option allows users to
-	  online all the memory of a node as movable memory so that the whole
-	  node can be hotplugged.  Users who don't use the memory hotplug
-	  feature are fine with this option on since they don't online memory
-	  as movable.
+	  memory device cannot be hotplugged.  This option allows the following
+	  two things:
+	  - When the system is booting, node full of hotpluggable memory can
+	  be arranged to have only movable memory so that the whole node can
+	  be hotplugged. (need movable_node boot option specified).
+	  - After the system is up, the option allows users to online all the
+	  memory of a node as movable memory so that the whole node can be
+	  hotplugged.
+
+	  Users who don't use the memory hotplug feature are fine with this
+	  option on since they don't specify movable_node boot option or they
+	  don't online memory as movable.
 
 	  Say Y here if you want to hotplug a whole node.
 	  Say N here if you want kernel to use memory on all nodes evenly.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ed85fe3..6874c31 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -31,6 +31,7 @@
 #include <linux/firmware-map.h>
 #include <linux/stop_machine.h>
 #include <linux/hugetlb.h>
+#include <linux/memblock.h>
 
 #include <asm/tlbflush.h>
 
@@ -1412,6 +1413,36 @@ static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
 }
 #endif /* CONFIG_MOVABLE_NODE */
 
+static int __init cmdline_parse_movable_node(char *p)
+{
+#ifdef CONFIG_MOVABLE_NODE
+	/*
+	 * Memory used by the kernel cannot be hot-removed because Linux
+	 * cannot migrate the kernel pages. When memory hotplug is
+	 * enabled, we should prevent memblock from allocating memory
+	 * for the kernel.
+	 *
+	 * ACPI SRAT records all hotpluggable memory ranges. But before
+	 * SRAT is parsed, we don't know about it.
+	 *
+	 * The kernel image is loaded into memory at very early time. We
+	 * cannot prevent this anyway. So on NUMA system, we set any
+	 * node the kernel resides in as un-hotpluggable.
+	 *
+	 * Since on modern servers, one node could have double-digit
+	 * gigabytes memory, we can assume the memory around the kernel
+	 * image is also un-hotpluggable. So before SRAT is parsed, just
+	 * allocate memory near the kernel image to try the best to keep
+	 * the kernel away from hotpluggable memory.
+	 */
+	memblock_set_bottom_up(true);
+#else
+	pr_warn("movable_node option not supported\n");
+#endif
+	return 0;
+}
+early_param("movable_node", cmdline_parse_movable_node);
+
 /* check which state of node_states will be changed when offline memory */
 static void node_states_check_changes_offline(unsigned long nr_pages,
 		struct zone *zone, struct memory_notify *arg)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 2/6] memblock: Introduce bottom-up allocation mode
  2013-10-04  1:58   ` Zhang Yanfei
@ 2013-10-05 21:30     ` Toshi Kani
  -1 siblings, 0 replies; 109+ messages in thread
From: Toshi Kani @ 2013-10-05 21:30 UTC (permalink / raw)
  To: Zhang Yanfei
  Cc: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	H. Peter Anvin, Tejun Heo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc, linux-kernel, Linux MM

On Fri, 2013-10-04 at 09:58 +0800, Zhang Yanfei wrote:
> From: Tang Chen <tangchen@cn.fujitsu.com>
> 
> The Linux kernel cannot migrate pages used by the kernel. As a result, kernel
> pages cannot be hot-removed. So we cannot allocate hotpluggable memory for
> the kernel.
> 
> ACPI SRAT (System Resource Affinity Table) contains the memory hotplug info.
> But before SRAT is parsed, memblock has already started to allocate memory
> for the kernel. So we need to prevent memblock from doing this.
> 
> In a memory hotplug system, any numa node the kernel resides in should
> be unhotpluggable. And for a modern server, each node could have at least
> 16GB memory. So memory around the kernel image is highly likely unhotpluggable.
> 
> So the basic idea is: Allocate memory from the end of the kernel image and
> to the higher memory. Since memory allocation before SRAT is parsed won't
> be too much, it could highly likely be in the same node with kernel image.
> 
> The current memblock can only allocate memory top-down. So this patch introduces
> a new bottom-up allocation mode to allocate memory bottom-up. And later
> when we use this allocation direction to allocate memory, we will limit
> the start address above the kernel.
> 
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

Thanks for the update.

Acked-by: Toshi Kani <toshi.kani@hp.com>

-Toshi




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-04  2:00   ` Zhang Yanfei
@ 2013-10-05 22:09     ` Toshi Kani
  -1 siblings, 0 replies; 109+ messages in thread
From: Toshi Kani @ 2013-10-05 22:09 UTC (permalink / raw)
  To: Zhang Yanfei
  Cc: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	H. Peter Anvin, Tejun Heo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc, linux-kernel, Linux MM

On Fri, 2013-10-04 at 10:00 +0800, Zhang Yanfei wrote:
> From: Tang Chen <tangchen@cn.fujitsu.com>
> 
> The Linux kernel cannot migrate pages used by the kernel. As a
> result, kernel pages cannot be hot-removed. So we cannot allocate
> hotpluggable memory for the kernel.
> 
> In a memory hotplug system, any numa node the kernel resides in
> should be unhotpluggable. And for a modern server, each node could
> have at least 16GB memory. So memory around the kernel image is
> highly likely unhotpluggable.
> 
> ACPI SRAT (System Resource Affinity Table) contains the memory
> hotplug info. But before SRAT is parsed, memblock has already
> started to allocate memory for the kernel. So we need to prevent
> memblock from doing this.
> 
> So direct memory mapping page tables setup is the case. init_mem_mapping()
> is called before SRAT is parsed. To prevent page tables being allocated
> within hotpluggable memory, we will use bottom-up direction to allocate
> page tables from the end of kernel image to the higher memory.
> 
> Acked-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

Acked-by: Toshi Kani <toshi.kani@hp.com>

Thanks,
-Toshi



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 5/6] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed.
  2013-10-04  2:01   ` Zhang Yanfei
@ 2013-10-05 22:10     ` Toshi Kani
  -1 siblings, 0 replies; 109+ messages in thread
From: Toshi Kani @ 2013-10-05 22:10 UTC (permalink / raw)
  To: Zhang Yanfei
  Cc: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	H. Peter Anvin, Tejun Heo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc, linux-kernel, Linux MM

On Fri, 2013-10-04 at 10:01 +0800, Zhang Yanfei wrote:
> From: Tang Chen <tangchen@cn.fujitsu.com>
> 
> Memory reserved for crashkernel could be large. So we should not allocate
> this memory bottom up from the end of kernel image.
> 
> When SRAT is parsed, we will be able to know which memory is hotpluggable,
> and we can avoid allocating this memory for the kernel. So reorder
> reserve_crashkernel() after SRAT is parsed.
> 
> Acked-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

Acked-by: Toshi Kani <toshi.kani@hp.com>

Thanks,
-Toshi


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 6/6] mem-hotplug: Introduce movable_node boot option
  2013-10-04  2:02   ` Zhang Yanfei
@ 2013-10-05 22:28     ` Toshi Kani
  -1 siblings, 0 replies; 109+ messages in thread
From: Toshi Kani @ 2013-10-05 22:28 UTC (permalink / raw)
  To: Zhang Yanfei
  Cc: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	H. Peter Anvin, Tejun Heo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc, linux-kernel, Linux MM

On Fri, 2013-10-04 at 10:02 +0800, Zhang Yanfei wrote:
> From: Tang Chen <tangchen@cn.fujitsu.com>
> 
> The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
> As we mentioned before, if hotpluggable memory is used by the kernel,
> it cannot be hot-removed. So memory hotplug users may want to set all
> hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
> 
> Memory hotplug users may also set a node as movable node, which has
> ZONE_MOVABLE only, so that the whole node can be hot-removed.
> 
> But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
> kernel cannot use memory in movable nodes. This will cause NUMA
> performance down. And other users may be unhappy.
> 
> So we need a way to allow users to enable and disable this functionality.
> In this patch, we introduce movable_node boot option to allow users to
> choose to not to consume hotpluggable memory at early boot time and
> later we can set it as ZONE_MOVABLE.
> 
> To achieve this, the movable_node boot option will control the memblock
> allocation direction. That said, after memblock is ready, before SRAT is
> parsed, we should allocate memory near the kernel image as we explained
> in the previous patches. So if movable_node boot option is set, the kernel
> does the following:
> 
> 1. After memblock is ready, make memblock allocate memory bottom up.
> 2. After SRAT is parsed, make memblock behave as default, allocate memory
>    top down.
> 
> Users can specify "movable_node" in kernel commandline to enable this
> functionality. For those who don't use memory hotplug or who don't want
> to lose their NUMA performance, just don't specify anything. The kernel
> will work as before.
> 
> Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Suggested-by: Ingo Molnar <mingo@kernel.org>
> Acked-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> ---
>  Documentation/kernel-parameters.txt |    3 +++
>  arch/x86/mm/numa.c                  |   11 +++++++++++
>  mm/Kconfig                          |   17 ++++++++++++-----
>  mm/memory_hotplug.c                 |   31 +++++++++++++++++++++++++++++++
>  4 files changed, 57 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 539a236..13201d4 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1769,6 +1769,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  			that the amount of memory usable for all allocations
>  			is not too small.
>  
> +	movable_node	[KNL,X86] Boot-time switch to disable the effects
> +			of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details.

I thought this is the option to "enable", not disable.

> +
>  	MTD_Partition=	[MTD]
>  			Format: <name>,<region-number>,<size>,<offset>
>  
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 8bf93ba..24aec58 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -567,6 +567,17 @@ static int __init numa_init(int (*init_func)(void))
>  	ret = init_func();
>  	if (ret < 0)
>  		return ret;
> +
> +	/*
> +	 * We reset memblock back to the top-down direction
> +	 * here because if we configured ACPI_NUMA, we have
> +	 * parsed SRAT in init_func(). It is ok to have the
> +	 * reset here even if we did't configure ACPI_NUMA
> +	 * or acpi numa init fails and fallbacks to dummy
> +	 * numa init.
> +	 */
> +	memblock_set_bottom_up(false);
> +
>  	ret = numa_cleanup_meminfo(&numa_meminfo);
>  	if (ret < 0)
>  		return ret;
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 026771a..0db1cc6 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -153,11 +153,18 @@ config MOVABLE_NODE
>  	help
>  	  Allow a node to have only movable memory.  Pages used by the kernel,
>  	  such as direct mapping pages cannot be migrated.  So the corresponding
> -	  memory device cannot be hotplugged.  This option allows users to
> -	  online all the memory of a node as movable memory so that the whole
> -	  node can be hotplugged.  Users who don't use the memory hotplug
> -	  feature are fine with this option on since they don't online memory
> -	  as movable.
> +	  memory device cannot be hotplugged.  This option allows the following
> +	  two things:
> +	  - When the system is booting, node full of hotpluggable memory can
> +	  be arranged to have only movable memory so that the whole node can
> +	  be hotplugged. (need movable_node boot option specified).

I think "hotplugged" should be "hot-removed".

> +	  - After the system is up, the option allows users to online all the
> +	  memory of a node as movable memory so that the whole node can be
> +	  hotplugged.

Same here. 

> +
> +	  Users who don't use the memory hotplug feature are fine with this
> +	  option on since they don't specify movable_node boot option or they
> +	  don't online memory as movable.
>  
>  	  Say Y here if you want to hotplug a whole node.
>  	  Say N here if you want kernel to use memory on all nodes evenly.
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index ed85fe3..6874c31 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -31,6 +31,7 @@
>  #include <linux/firmware-map.h>
>  #include <linux/stop_machine.h>
>  #include <linux/hugetlb.h>
> +#include <linux/memblock.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -1412,6 +1413,36 @@ static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
>  }
>  #endif /* CONFIG_MOVABLE_NODE */
>  
> +static int __init cmdline_parse_movable_node(char *p)
> +{
> +#ifdef CONFIG_MOVABLE_NODE
> +	/*
> +	 * Memory used by the kernel cannot be hot-removed because Linux
> +	 * cannot migrate the kernel pages. When memory hotplug is
> +	 * enabled, we should prevent memblock from allocating memory
> +	 * for the kernel.
> +	 *
> +	 * ACPI SRAT records all hotpluggable memory ranges. But before
> +	 * SRAT is parsed, we don't know about it.
> +	 *
> +	 * The kernel image is loaded into memory at very early time. We
> +	 * cannot prevent this anyway. So on NUMA system, we set any
> +	 * node the kernel resides in as un-hotpluggable.
> +	 *
> +	 * Since on modern servers, one node could have double-digit
> +	 * gigabytes memory, we can assume the memory around the kernel
> +	 * image is also un-hotpluggable. So before SRAT is parsed, just
> +	 * allocate memory near the kernel image to try the best to keep
> +	 * the kernel away from hotpluggable memory.
> +	 */
> +	memblock_set_bottom_up(true);
> +#else
> +	pr_warn("movable_node option not supported");

"\n" is missing.

Thanks,
-Toshi


> +#endif
> +	return 0;
> +}
> +early_param("movable_node", cmdline_parse_movable_node);
> +
>  /* check which state of node_states will be changed when offline memory */
>  static void node_states_check_changes_offline(unsigned long nr_pages,
>  		struct zone *zone, struct memory_notify *arg)



^ permalink raw reply	[flat|nested] 109+ messages in thread

* [PATCH part1 v6 update 6/6] mem-hotplug: Introduce movable_node boot option
  2013-10-05 22:28     ` Toshi Kani
  (?)
@ 2013-10-06 14:43       ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-06 14:43 UTC (permalink / raw)
  To: Toshi Kani, Andrew Morton
  Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
	Tejun Heo, Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu,
	Wen Congyang, Lai Jiangshan, isimatu.yasuaki, izumi.taku,
	Mel Gorman, Minchan Kim, mina86, gong.chen, vasilis.liaskovitis,
	lwoodman, Rik van Riel, jweiner, prarit, x86, linux-doc,
	linux-kernel, Linux MM, linux-acpi, imtangchen

From: Tang Chen <tangchen@cn.fujitsu.com>

The Hot Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel,
it cannot be hot-removed. So memory hotplug users may want to put all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set a node as a movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE, so with this setup it
cannot use memory in movable nodes either. This degrades NUMA
performance, and other users may be unhappy with that.

So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce the movable_node boot option, which lets users
choose not to consume hotpluggable memory at early boot time so that it
can later be set as ZONE_MOVABLE.

To achieve this, the movable_node boot option controls the memblock
allocation direction. That is, after memblock is ready and before SRAT is
parsed, we should allocate memory near the kernel image, as we explained
in the previous patches. So if the movable_node boot option is set, the
kernel does the following:

1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, restore the default memblock behavior and
   allocate memory top down.

Users can specify "movable_node" on the kernel command line to enable this
functionality. Those who don't use memory hotplug or who don't want
to lose NUMA performance can simply leave the option out, and the kernel
will work as before.

Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
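Note for readers (not part of the changelog): the two steps described
above can be pictured with a small userspace model. This is only a sketch
under stated assumptions, not the memblock code. The names
struct toy_memblock, toy_set_bottom_up() and toy_alloc() are hypothetical,
the model tracks a single free range, and the lower bound of that range
stands in for the end of the kernel image.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model: one free range [start, end) plus a direction flag.
 * "start" must be non-zero, because toy_alloc() returns 0 on failure.
 */
struct toy_memblock {
	unsigned long start;	/* lowest free address (just above the image) */
	unsigned long end;	/* one past the highest free address */
	bool bottom_up;		/* false means top-down, the default */
};

/* Select the allocation direction, in the spirit of the boot option. */
static void toy_set_bottom_up(struct toy_memblock *mb, bool enable)
{
	mb->bottom_up = enable;
}

/* Return the base of the new block, or 0 if the range cannot satisfy it. */
static unsigned long toy_alloc(struct toy_memblock *mb, unsigned long size)
{
	unsigned long base;

	assert(size > 0);
	if (mb->end - mb->start < size)
		return 0;

	if (mb->bottom_up) {
		/* Take the block from the low end, next to the image. */
		base = mb->start;
		mb->start += size;
	} else {
		/* Take the block from the high end, the default policy. */
		mb->end -= size;
		base = mb->end;
	}
	return base;
}
```

With a free range of [0x1000, 0x9000), enabling bottom-up hands out 0x1000
and then 0x2000 for two 0x1000-byte requests. After switching back, the next
request of the same size comes from the top, at 0x8000. In this patch the
first switch corresponds to memblock_set_bottom_up(true) in the movable_node
handler and the second to memblock_set_bottom_up(false) in numa_init().
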
 Documentation/kernel-parameters.txt |    3 +++
 arch/x86/mm/numa.c                  |   11 +++++++++++
 mm/Kconfig                          |   17 ++++++++++++-----
 mm/memory_hotplug.c                 |   31 +++++++++++++++++++++++++++++++
 4 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 539a236..13201d4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1769,6 +1769,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			that the amount of memory usable for all allocations
 			is not too small.
 
+	movable_node	[KNL,X86] Boot-time switch to enable the effects
+			of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details.
+
 	MTD_Partition=	[MTD]
 			Format: <name>,<region-number>,<size>,<offset>
 
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8bf93ba..24aec58 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -567,6 +567,17 @@ static int __init numa_init(int (*init_func)(void))
 	ret = init_func();
 	if (ret < 0)
 		return ret;
+
+	/*
+	 * We reset memblock back to the top-down direction
+	 * here because if we configured ACPI_NUMA, we have
+	 * parsed SRAT in init_func(). It is ok to have the
+	 * reset here even if we didn't configure ACPI_NUMA
+	 * or acpi numa init fails and falls back to dummy
+	 * numa init.
+	 */
+	memblock_set_bottom_up(false);
+
 	ret = numa_cleanup_meminfo(&numa_meminfo);
 	if (ret < 0)
 		return ret;
diff --git a/mm/Kconfig b/mm/Kconfig
index 026771a..0db1cc6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -153,11 +153,18 @@ config MOVABLE_NODE
 	help
 	  Allow a node to have only movable memory.  Pages used by the kernel,
 	  such as direct mapping pages cannot be migrated.  So the corresponding
-	  memory device cannot be hotplugged.  This option allows users to
-	  online all the memory of a node as movable memory so that the whole
-	  node can be hotplugged.  Users who don't use the memory hotplug
-	  feature are fine with this option on since they don't online memory
-	  as movable.
+	  memory device cannot be hotplugged.  This option allows the following
+	  two things:
+	  - When the system is booting, node full of hotpluggable memory can
+	  be arranged to have only movable memory so that the whole node can
+	  be hot-removed. (need movable_node boot option specified).
+	  - After the system is up, the option allows users to online all the
+	  memory of a node as movable memory so that the whole node can be
+	  hot-removed.
+
+	  Users who don't use the memory hotplug feature are fine with this
+	  option on since they don't specify movable_node boot option or they
+	  don't online memory as movable.
 
 	  Say Y here if you want to hotplug a whole node.
 	  Say N here if you want kernel to use memory on all nodes evenly.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ed85fe3..6874c31 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -31,6 +31,7 @@
 #include <linux/firmware-map.h>
 #include <linux/stop_machine.h>
 #include <linux/hugetlb.h>
+#include <linux/memblock.h>
 
 #include <asm/tlbflush.h>
 
@@ -1412,6 +1413,36 @@ static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
 }
 #endif /* CONFIG_MOVABLE_NODE */
 
+static int __init cmdline_parse_movable_node(char *p)
+{
+#ifdef CONFIG_MOVABLE_NODE
+	/*
+	 * Memory used by the kernel cannot be hot-removed because Linux
+	 * cannot migrate the kernel pages. When memory hotplug is
+	 * enabled, we should prevent memblock from allocating
+	 * hotpluggable memory for the kernel.
+	 *
+	 * ACPI SRAT records all hotpluggable memory ranges. But before
+	 * SRAT is parsed, we don't know about it.
+	 *
+	 * The kernel image is loaded into memory very early in boot. We
+	 * cannot prevent this anyway. So on a NUMA system, we treat any
+	 * node the kernel resides in as un-hotpluggable.
+	 *
+	 * Since on modern servers, one node could have double-digit
+	 * gigabytes memory, we can assume the memory around the kernel
+	 * image is also un-hotpluggable. So before SRAT is parsed, just
+	 * allocate memory near the kernel image to try the best to keep
+	 * the kernel away from hotpluggable memory.
+	 */
+	memblock_set_bottom_up(true);
+#else
+	pr_warn("movable_node option not supported\n");
+#endif
+	return 0;
+}
+early_param("movable_node", cmdline_parse_movable_node);
+
 /* check which state of node_states will be changed when offline memory */
 static void node_states_check_changes_offline(unsigned long nr_pages,
 		struct zone *zone, struct memory_notify *arg)
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH part1 v6 update 6/6] mem-hotplug: Introduce movable_node boot option
@ 2013-10-06 14:43       ` Zhang Yanfei
  0 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-06 14:43 UTC (permalink / raw)
  To: Toshi Kani, Andrew Morton
  Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
	Tejun Heo, Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu,
	Wen Congyang, Lai Jiangshan, isimatu.yasuaki, izumi.taku,
	Mel Gorman, Minchan Kim, mina86, gong.chen, vasilis.liaskovitis,
	lwoodman, Rik van Riel, jweiner, prarit, x86, linux-doc,
	linux-kernel, Linux MM, linux-acpi, imtangchen, Zhang Yanfei,
	Tang Chen

From: Tang Chen <tangchen@cn.fujitsu.com>

The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel,
it cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
kernel cannot use memory in movable nodes. This will cause NUMA
performance down. And other users may be unhappy.

So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movable_node boot option to allow users to
choose to not to consume hotpluggable memory at early boot time and
later we can set it as ZONE_MOVABLE.

To achieve this, the movable_node boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained
in the previous patches. So if movable_node boot option is set, the kernel
does the following:

1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
   top down.

Users can specify "movable_node" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.

Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 Documentation/kernel-parameters.txt |    3 +++
 arch/x86/mm/numa.c                  |   11 +++++++++++
 mm/Kconfig                          |   17 ++++++++++++-----
 mm/memory_hotplug.c                 |   31 +++++++++++++++++++++++++++++++
 4 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 539a236..13201d4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1769,6 +1769,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			that the amount of memory usable for all allocations
 			is not too small.
 
+	movable_node	[KNL,X86] Boot-time switch to enable the effects
+			of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details.
+
 	MTD_Partition=	[MTD]
 			Format: <name>,<region-number>,<size>,<offset>
 
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8bf93ba..24aec58 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -567,6 +567,17 @@ static int __init numa_init(int (*init_func)(void))
 	ret = init_func();
 	if (ret < 0)
 		return ret;
+
+	/*
+	 * We reset memblock back to the top-down direction
+	 * here because if we configured ACPI_NUMA, we have
+	 * parsed SRAT in init_func(). It is ok to have the
+	 * reset here even if we did't configure ACPI_NUMA
+	 * or acpi numa init fails and fallbacks to dummy
+	 * numa init.
+	 */
+	memblock_set_bottom_up(false);
+
 	ret = numa_cleanup_meminfo(&numa_meminfo);
 	if (ret < 0)
 		return ret;
diff --git a/mm/Kconfig b/mm/Kconfig
index 026771a..0db1cc6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -153,11 +153,18 @@ config MOVABLE_NODE
 	help
 	  Allow a node to have only movable memory.  Pages used by the kernel,
 	  such as direct mapping pages cannot be migrated.  So the corresponding
-	  memory device cannot be hotplugged.  This option allows users to
-	  online all the memory of a node as movable memory so that the whole
-	  node can be hotplugged.  Users who don't use the memory hotplug
-	  feature are fine with this option on since they don't online memory
-	  as movable.
+	  memory device cannot be hotplugged.  This option allows the following
+	  two things:
+	  - When the system is booting, node full of hotpluggable memory can
+	  be arranged to have only movable memory so that the whole node can
+	  be hot-removed. (need movable_node boot option specified).
+	  - After the system is up, the option allows users to online all the
+	  memory of a node as movable memory so that the whole node can be
+	  hot-removed.
+
+	  Users who don't use the memory hotplug feature are fine with this
+	  option on since they don't specify movable_node boot option or they
+	  don't online memory as movable.
 
 	  Say Y here if you want to hotplug a whole node.
 	  Say N here if you want kernel to use memory on all nodes evenly.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ed85fe3..6874c31 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -31,6 +31,7 @@
 #include <linux/firmware-map.h>
 #include <linux/stop_machine.h>
 #include <linux/hugetlb.h>
+#include <linux/memblock.h>
 
 #include <asm/tlbflush.h>
 
@@ -1412,6 +1413,36 @@ static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
 }
 #endif /* CONFIG_MOVABLE_NODE */
 
+static int __init cmdline_parse_movable_node(char *p)
+{
+#ifdef CONFIG_MOVABLE_NODE
+	/*
+	 * Memory used by the kernel cannot be hot-removed because Linux
+	 * cannot migrate the kernel pages. When memory hotplug is
+	 * enabled, we should prevent memblock from allocating memory
+	 * for the kernel.
+	 *
+	 * ACPI SRAT records all hotpluggable memory ranges. But before
+	 * SRAT is parsed, we don't know about it.
+	 *
+	 * The kernel image is loaded into memory at very early time. We
+	 * cannot prevent this anyway. So on NUMA system, we set any
+	 * node the kernel resides in as un-hotpluggable.
+	 *
+	 * Since on modern servers, one node could have double-digit
+	 * gigabytes memory, we can assume the memory around the kernel
+	 * image is also un-hotpluggable. So before SRAT is parsed, just
+	 * allocate memory near the kernel image to try the best to keep
+	 * the kernel away from hotpluggable memory.
+	 */
+	memblock_set_bottom_up(true);
+#else
+	pr_warn("movable_node option not supported\n");
+#endif
+	return 0;
+}
+early_param("movable_node", cmdline_parse_movable_node);
+
 /* check which state of node_states will be changed when offline memory */
 static void node_states_check_changes_offline(unsigned long nr_pages,
 		struct zone *zone, struct memory_notify *arg)
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH part1 v6 update 6/6] mem-hotplug: Introduce movable_node boot option
@ 2013-10-06 14:43       ` Zhang Yanfei
  0 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-06 14:43 UTC (permalink / raw)
  To: Toshi Kani, Andrew Morton
  Cc: Rafael J . Wysocki, lenb, Thomas Gleixner, mingo, H. Peter Anvin,
	Tejun Heo, Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu,
	Wen Congyang, Lai Jiangshan, isimatu.yasuaki, izumi.taku,
	Mel Gorman, Minchan Kim, mina86, gong.chen, vasilis.liaskovitis,
	lwoodman, Rik van Riel, jweiner, prarit, x86, linux-doc,
	linux-kernel, Linux MM, linux-acpi, imtangchen, Zhang Yanfei,
	Tang Chen

From: Tang Chen <tangchen@cn.fujitsu.com>

The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel,
it cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
kernel cannot use memory in movable nodes. This will cause NUMA
performance down. And other users may be unhappy.

So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movable_node boot option to allow users to
choose to not to consume hotpluggable memory at early boot time and
later we can set it as ZONE_MOVABLE.

To achieve this, the movable_node boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained
in the previous patches. So if movable_node boot option is set, the kernel
does the following:

1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
   top down.

Users can specify "movable_node" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.

Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 Documentation/kernel-parameters.txt |    3 +++
 arch/x86/mm/numa.c                  |   11 +++++++++++
 mm/Kconfig                          |   17 ++++++++++++-----
 mm/memory_hotplug.c                 |   31 +++++++++++++++++++++++++++++++
 4 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 539a236..13201d4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1769,6 +1769,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			that the amount of memory usable for all allocations
 			is not too small.
 
+	movable_node	[KNL,X86] Boot-time switch to enable the effects
+			of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details.
+
 	MTD_Partition=	[MTD]
 			Format: <name>,<region-number>,<size>,<offset>
 
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8bf93ba..24aec58 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -567,6 +567,17 @@ static int __init numa_init(int (*init_func)(void))
 	ret = init_func();
 	if (ret < 0)
 		return ret;
+
+	/*
+	 * We reset memblock back to the top-down direction
+	 * here because if we configured ACPI_NUMA, we have
+	 * parsed SRAT in init_func(). It is ok to have the
+	 * reset here even if we did't configure ACPI_NUMA
+	 * or acpi numa init fails and fallbacks to dummy
+	 * numa init.
+	 */
+	memblock_set_bottom_up(false);
+
 	ret = numa_cleanup_meminfo(&numa_meminfo);
 	if (ret < 0)
 		return ret;
diff --git a/mm/Kconfig b/mm/Kconfig
index 026771a..0db1cc6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -153,11 +153,18 @@ config MOVABLE_NODE
 	help
 	  Allow a node to have only movable memory.  Pages used by the kernel,
 	  such as direct mapping pages cannot be migrated.  So the corresponding
-	  memory device cannot be hotplugged.  This option allows users to
-	  online all the memory of a node as movable memory so that the whole
-	  node can be hotplugged.  Users who don't use the memory hotplug
-	  feature are fine with this option on since they don't online memory
-	  as movable.
+	  memory device cannot be hotplugged.  This option allows the following
+	  two things:
+	  - When the system is booting, node full of hotpluggable memory can
+	  be arranged to have only movable memory so that the whole node can
+	  be hot-removed. (need movable_node boot option specified).
+	  - After the system is up, the option allows users to online all the
+	  memory of a node as movable memory so that the whole node can be
+	  hot-removed.
+
+	  Users who don't use the memory hotplug feature are fine with this
+	  option on since they don't specify movable_node boot option or they
+	  don't online memory as movable.
 
 	  Say Y here if you want to hotplug a whole node.
 	  Say N here if you want kernel to use memory on all nodes evenly.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ed85fe3..6874c31 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -31,6 +31,7 @@
 #include <linux/firmware-map.h>
 #include <linux/stop_machine.h>
 #include <linux/hugetlb.h>
+#include <linux/memblock.h>
 
 #include <asm/tlbflush.h>
 
@@ -1412,6 +1413,36 @@ static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
 }
 #endif /* CONFIG_MOVABLE_NODE */
 
+static int __init cmdline_parse_movable_node(char *p)
+{
+#ifdef CONFIG_MOVABLE_NODE
+	/*
+	 * Memory used by the kernel cannot be hot-removed because Linux
+	 * cannot migrate kernel pages. When memory hotplug is enabled,
+	 * we should prevent memblock from allocating hotpluggable
+	 * memory for the kernel.
+	 *
+	 * ACPI SRAT records all hotpluggable memory ranges. But before
+	 * SRAT is parsed, we don't know which ranges those are.
+	 *
+	 * The kernel image is loaded into memory very early in boot,
+	 * and we cannot prevent that. So on a NUMA system, we treat any
+	 * node the kernel resides in as un-hotpluggable.
+	 *
+	 * Since on modern servers one node can have tens of gigabytes
+	 * of memory, we can assume the memory around the kernel image
+	 * is also un-hotpluggable. So before SRAT is parsed, just
+	 * allocate memory near the kernel image to do our best to keep
+	 * the kernel away from hotpluggable memory.
+	 */
+	memblock_set_bottom_up(true);
+#else
+	pr_warn("movable_node option not supported\n");
+#endif
+	return 0;
+}
+early_param("movable_node", cmdline_parse_movable_node);
+
 /* check which state of node_states will be changed when offline memory */
 static void node_states_check_changes_offline(unsigned long nr_pages,
 		struct zone *zone, struct memory_notify *arg)
-- 
1.7.1
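
For readers following the thread, here is a minimal userspace sketch of the
allocation-direction switch that the comment in cmdline_parse_movable_node()
describes. It is a toy model, not memblock itself: struct toy_memblock,
toy_set_bottom_up() and toy_alloc() are hypothetical names, and the single
free range above the kernel image is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model (hypothetical): one free range [free_start, free_end). */
struct toy_memblock {
	uint64_t free_start;	/* lowest free address, just above the kernel image */
	uint64_t free_end;	/* one past the highest free address */
	bool bottom_up;		/* false models the default top-down behaviour */
};

/* Stand-in for the direction switch the movable_node handler performs. */
static void toy_set_bottom_up(struct toy_memblock *mb, bool enable)
{
	assert(mb);
	mb->bottom_up = enable;
}

/* Returns the base of the allocated block, or 0 if it does not fit. */
static uint64_t toy_alloc(struct toy_memblock *mb, uint64_t size)
{
	uint64_t base;

	assert(mb);
	if (size == 0 || mb->free_end - mb->free_start < size)
		return 0;

	if (mb->bottom_up) {
		/* Grow upwards from the end of the kernel image. */
		base = mb->free_start;
		mb->free_start += size;
	} else {
		/* Default: carve the block off the top of memory. */
		mb->free_end -= size;
		base = mb->free_end;
	}
	return base;
}
```

With a free range from 16 MiB (a made-up kernel end) to 64 GiB, enabling
bottom-up and allocating 4 KiB returns 0x1000000, right after the image.
Switching back to top-down, as the series does once SRAT is parsed, makes
the next 4 KiB allocation return 0xffffff000, at the top of memory.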

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH part1 v6 update 6/6] mem-hotplug: Introduce movable_node boot option
  2013-10-06 14:43       ` Zhang Yanfei
@ 2013-10-06 23:03         ` Toshi Kani
  -1 siblings, 0 replies; 109+ messages in thread
From: Toshi Kani @ 2013-10-06 23:03 UTC (permalink / raw)
  To: Zhang Yanfei
  Cc: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	H. Peter Anvin, Tejun Heo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik

On Sun, 2013-10-06 at 14:43 +0000, Zhang Yanfei wrote:
> From: Tang Chen <tangchen@cn.fujitsu.com>
> 
> The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
> As we mentioned before, if hotpluggable memory is used by the kernel,
> it cannot be hot-removed. So memory hotplug users may want to set all
> hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
> 
> Memory hotplug users may also set a node as movable node, which has
> ZONE_MOVABLE only, so that the whole node can be hot-removed.
> 
> But the kernel cannot use memory in ZONE_MOVABLE, so with this setup the
> kernel cannot use memory in movable nodes. This will degrade NUMA
> performance, and other users may be unhappy.
> 
> So we need a way to allow users to enable and disable this functionality.
> In this patch, we introduce the movable_node boot option to allow users to
> choose not to consume hotpluggable memory at early boot time, so that
> later we can set it as ZONE_MOVABLE.
> 
> To achieve this, the movable_node boot option will control the memblock
> allocation direction. That is, after memblock is ready and before SRAT is
> parsed, we should allocate memory near the kernel image, as we explained
> in the previous patches. So if the movable_node boot option is set, the
> kernel does the following:
> 
> 1. After memblock is ready, make memblock allocate memory bottom up.
> 2. After SRAT is parsed, make memblock behave as default, allocate memory
>    top down.
> 
> Users can specify "movable_node" in kernel commandline to enable this
> functionality. For those who don't use memory hotplug or who don't want
> to lose their NUMA performance, just don't specify anything. The kernel
> will work as before.
> 
> Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Acked-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

Thanks for the quick update.

Acked-by: Toshi Kani <toshi.kani@hp.com>

-Toshi


* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-04  2:00   ` Zhang Yanfei
@ 2013-10-07  0:00     ` H. Peter Anvin
  -1 siblings, 0 replies; 109+ messages in thread
From: H. Peter Anvin @ 2013-10-07  0:00 UTC (permalink / raw)
  To: Zhang Yanfei
  Cc: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	Tejun Heo, Toshi Kani, Wanpeng Li, Thomas Renninger, Yinghai Lu,
	Jiang Liu, Wen Congyang, Lai Jiangshan, isimatu.yasuaki,
	izumi.taku, Mel Gorman, Minchan Kim, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner, prarit,
	x86, linux-doc

On 10/03/2013 07:00 PM, Zhang Yanfei wrote:
> From: Tang Chen <tangchen@cn.fujitsu.com>
> 
> The Linux kernel cannot migrate pages used by the kernel. As a
> result, kernel pages cannot be hot-removed. So we cannot allocate
> hotpluggable memory for the kernel.
> 
> In a memory hotplug system, any numa node the kernel resides in
> should be unhotpluggable. And for a modern server, each node could
> have at least 16GB memory. So memory around the kernel image is
> highly likely unhotpluggable.
> 
> ACPI SRAT (System Resource Affinity Table) contains the memory
> hotplug info. But before SRAT is parsed, memblock has already
> started to allocate memory for the kernel. So we need to prevent
> memblock from doing this.
> 
> Setting up the direct memory mapping page tables is one such case:
> init_mem_mapping() is called before SRAT is parsed. To prevent page tables
> from being allocated within hotpluggable memory, we allocate them bottom-up,
> from the end of the kernel image towards higher memory.
> 
> Acked-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

I'm still seriously concerned about this.  This unconditionally
introduces new behavior which may very well break some classes of
systems -- the whole point of creating the page tables top down is
because the kernel tends to be allocated in lower memory, which is also
the memory that some devices need for DMA.

+#ifdef CONFIG_X86
+		kernel_end = __pa_symbol(_end);
+#else
+		kernel_end = __pa(RELOC_HIDE((unsigned long)(_end), 0));
+#endif

We really should make __pa_symbol() available everywhere by putting
something like the above in a global define (under #ifndef __pa_symbol).

Is RELOC_HIDE() even correct here?

	-hpa
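
A minimal sketch of the "global define under #ifndef __pa_symbol" idea
suggested above, written so that it builds in userspace. Everything besides
the fallback pattern itself is an assumption: TOY_PAGE_OFFSET, the toy
__pa(), the no-op RELOC_HIDE() placeholder and toy_kernel_end() are
hypothetical stand-ins, and the sketch does not settle whether RELOC_HIDE()
is the right thing to use here.

```c
/*
 * Toy stand-ins so the sketch builds in userspace. They are NOT the
 * kernel's definitions: TOY_PAGE_OFFSET is a made-up direct-mapping
 * base and this RELOC_HIDE() is a plain no-op placeholder.
 */
#define TOY_PAGE_OFFSET		0xC0000000UL
#define __pa(x)			((unsigned long)(x) - TOY_PAGE_OFFSET)
#define RELOC_HIDE(ptr, off)	((ptr) + (off))

/*
 * Generic fallback: an architecture that provides its own
 * __pa_symbol() (x86 in the quoted hunk) would define it before this
 * point, so the #ifndef keeps this version out of its way.
 */
#ifndef __pa_symbol
#define __pa_symbol(x)	__pa(RELOC_HIDE((unsigned long)(x), 0))
#endif

/* Callers then need no #ifdef CONFIG_X86 of their own. */
static unsigned long toy_kernel_end(unsigned long end_vaddr)
{
	return __pa_symbol(end_vaddr);
}
```

With the made-up offset, toy_kernel_end(0xC1000000UL) yields 0x1000000. The
point is only that the #ifdef moves out of the call site and into one shared
header.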



* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-07  0:00     ` H. Peter Anvin
@ 2013-10-07 14:17       ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-07 14:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	Tejun Heo, Toshi Kani, Wanpeng Li, Thomas Renninger, Yinghai Lu,
	Jiang Liu, Wen Congyang, Lai Jiangshan, isimatu.yasuaki,
	izumi.taku, Mel Gorman, Minchan Kim, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner, prarit,
	x86, linux-doc, linux-kernel, Linux MM

Hello Peter,

On 10/07/2013 08:00 AM, H. Peter Anvin wrote:
> On 10/03/2013 07:00 PM, Zhang Yanfei wrote:
>> From: Tang Chen <tangchen@cn.fujitsu.com>
>>
>> The Linux kernel cannot migrate pages used by the kernel. As a
>> result, kernel pages cannot be hot-removed. So we cannot allocate
>> hotpluggable memory for the kernel.
>>
>> In a memory hotplug system, any numa node the kernel resides in
>> should be unhotpluggable. And for a modern server, each node could
>> have at least 16GB memory. So memory around the kernel image is
>> highly likely unhotpluggable.
>>
>> ACPI SRAT (System Resource Affinity Table) contains the memory
>> hotplug info. But before SRAT is parsed, memblock has already
>> started to allocate memory for the kernel. So we need to prevent
>> memblock from doing this.
>>
>> Setting up the direct memory mapping page tables is one such case:
>> init_mem_mapping() is called before SRAT is parsed. To prevent page tables
>> from being allocated within hotpluggable memory, we allocate them bottom-up,
>> from the end of the kernel image towards higher memory.
>>
>> Acked-by: Tejun Heo <tj@kernel.org>
>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> 
> I'm still seriously concerned about this.  This unconditionally
> introduces new behavior which may very well break some classes of

Well, this new behaviour is not unconditional: if the user doesn't specify
the movable_node option, the kernel will act as before, allocating
memory top-down.

> systems -- the whole point of creating the page tables top down is
> because the kernel tends to be allocated in lower memory, which is also
> the memory that some devices need for DMA.

How much memory do these devices need for DMA? And do you mean memory
below 16MB or below 4GB?

> 
> +#ifdef CONFIG_X86
> +		kernel_end = __pa_symbol(_end);
> +#else
> +		kernel_end = __pa(RELOC_HIDE((unsigned long)(_end), 0));
> +#endif
> 
> We really should make __pa_symbol() available everywhere by putting
> something like the above in a global define (under #ifndef __pa_symbol).

Hmmmm...in include/asm-generic/page.h?

> 
> Is RELOC_HIDE() even correct here?

Sorry, could you explain a bit?

-- 
Thanks.
Zhang Yanfei


* Re: [PATCH part1 v6 0/6] x86, memblock: Allocate memory near kernel image before SRAT parsed
  2013-10-04  1:56 ` Zhang Yanfei
@ 2013-10-08  4:23   ` Ingo Molnar
  -1 siblings, 0 replies; 109+ messages in thread
From: Ingo Molnar @ 2013-10-08  4:23 UTC (permalink / raw)
  To: Zhang Yanfei
  Cc: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
	Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
	Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
	Rik van Riel, jweiner, prarit, x86, linux-doc, linux-kernel


* Zhang Yanfei <zhangyanfei.yes@gmail.com> wrote:

> Hello, here is the v6 version. Any comments are welcome!

Ok, I think this is as good as this feature can get without hardware 
support.

Thanks,

	Ingo


* Re: [PATCH part1 v6 0/6] x86, memblock: Allocate memory near kernel image before SRAT parsed
  2013-10-08  4:23   ` Ingo Molnar
@ 2013-10-08 15:28     ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-08 15:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	H. Peter Anvin, Tejun Heo, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
	Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
	Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
	Rik van Riel, jweiner, prarit, x86, linux-doc,
	linux-kernel@vger.kernel.org

Hello Ingo,

On 10/08/2013 12:23 PM, Ingo Molnar wrote:
> 
> * Zhang Yanfei <zhangyanfei.yes@gmail.com> wrote:
> 
>> Hello, here is the v6 version. Any comments are welcome!
> 
> Ok, I think this is as good as this feature can get without hardware 
> support.
> 

Without hardware/firmware support, we cannot know which memory is
hotpluggable.

-- 
Thanks.
Zhang Yanfei


* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-07  0:00     ` H. Peter Anvin
  (?)
@ 2013-10-08 17:36       ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-08 17:36 UTC (permalink / raw)
  To: H. Peter Anvin, Tejun Heo
  Cc: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	Toshi Kani, Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu,
	Wen Congyang, Lai Jiangshan, isimatu.yasuaki, izumi.taku,
	Mel Gorman, Minchan Kim, mina86, gong.chen, vasilis.liaskovitis,
	lwoodman, Rik van Riel, jweiner, prarit, x86, linux-doc,
	linux-kernel, Linux MM, linux-acpi

Hello Tejun,
CC: Peter

On 10/07/2013 08:00 AM, H. Peter Anvin wrote:
> On 10/03/2013 07:00 PM, Zhang Yanfei wrote:
>> From: Tang Chen <tangchen@cn.fujitsu.com>
>>
>> The Linux kernel cannot migrate pages used by the kernel. As a
>> result, kernel pages cannot be hot-removed. So we cannot allocate
>> hotpluggable memory for the kernel.
>>
>> In a memory hotplug system, any numa node the kernel resides in
>> should be unhotpluggable. And for a modern server, each node could
>> have at least 16GB memory. So memory around the kernel image is
>> highly likely unhotpluggable.
>>
>> ACPI SRAT (System Resource Affinity Table) contains the memory
>> hotplug info. But before SRAT is parsed, memblock has already
>> started to allocate memory for the kernel. So we need to prevent
>> memblock from doing this.
>>
>> Setting up the direct memory mapping page tables is one such case:
>> init_mem_mapping() is called before SRAT is parsed. To prevent page tables
>> from being allocated within hotpluggable memory, we allocate them bottom-up,
>> from the end of the kernel image towards higher memory.
>>
>> Acked-by: Tejun Heo <tj@kernel.org>
>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> 
> I'm still seriously concerned about this.  This unconditionally
> introduces new behavior which may very well break some classes of
> systems -- the whole point of creating the page tables top down is
> because the kernel tends to be allocated in lower memory, which is also
> the memory that some devices need for DMA.
> 

After thinking about it for a while, the issue Peter pointed out seems to
be real. Looking back at what you suggested about allocating close to the
kernel image:

> so if we allocate memory close to the kernel image,
>   it's likely that we don't contaminate hotpluggable node.  We're
>   talking about few megs at most right after the kernel image.  I
>   can't see how that would make any noticeable difference.

You meant that the amount of memory involved is only a few megs. But on
big-memory machines the page tables seem to be large enough to consume
the precious lower memory. So I think we may really need to move the page
table setup to after we have the hotplug info in some way, just as patch 5
moves reserve_crashkernel() to be called after initmem_init().

So do you still have any objection to reordering the page table setup?
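
To put rough numbers on how large the direct-mapping page tables can get,
here is a back-of-the-envelope sketch. It assumes x86-64 4-level paging
(4 KiB tables of 512 eight-byte entries), a mapping that starts at an
aligned address, and a single leaf page size for the whole range. The real
init_mem_mapping() can mix page sizes, and which size it can use depends on
CPU features and kernel configuration, so treat the results as estimates
only. direct_map_table_bytes() and div_round_up() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_BYTES	4096ULL		/* size of one page-table page */
#define ENTRIES		512ULL		/* 8-byte entries per table */
#define MAX_SPAN	(1ULL << 48)	/* reach of the top-level table */

static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/*
 * Bytes of page tables needed to map mem_bytes with leaf pages of
 * leaf_bytes (4 KiB, 2 MiB or 1 GiB), counting every table level.
 */
static uint64_t direct_map_table_bytes(uint64_t mem_bytes, uint64_t leaf_bytes)
{
	uint64_t tables = 0;
	uint64_t span;

	assert(mem_bytes > 0 && mem_bytes <= MAX_SPAN);
	assert(leaf_bytes == (1ULL << 12) || leaf_bytes == (1ULL << 21) ||
	       leaf_bytes == (1ULL << 30));

	/* Walk up from the lowest-level table to the top-level one. */
	for (span = leaf_bytes * ENTRIES; span <= MAX_SPAN; span *= ENTRIES)
		tables += div_round_up(mem_bytes, span);

	return tables * TABLE_BYTES;
}
```

For 1 TiB of memory this gives 2,151,690,240 bytes (about 2 GiB) with 4 KiB
pages, 4,206,592 bytes (about 4 MiB) with 2 MiB pages, and 12,288 bytes with
1 GiB pages. So whether the tables amount to "a few megs" or much more
depends heavily on the page size the mapping ends up using.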

-- 
Thanks.
Zhang Yanfei


* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
@ 2013-10-08 17:36       ` Zhang Yanfei
  0 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-08 17:36 UTC (permalink / raw)
  To: H. Peter Anvin, Tejun Heo
  Cc: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	Toshi Kani, Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu,
	Wen Congyang, Lai Jiangshan, isimatu.yasuaki, izumi.taku,
	Mel Gorman, Minchan Kim, mina86, gong.chen, vasilis.liaskovitis,
	lwoodman, Rik van Riel, jweiner, prarit, x86, linux-doc,
	linux-kernel, Linux MM, linux-acpi, imtangchen, Zhang Yanfei,
	Tang Chen

Hello tejun
CC: Peter

On 10/07/2013 08:00 AM, H. Peter Anvin wrote:
> On 10/03/2013 07:00 PM, Zhang Yanfei wrote:
>> From: Tang Chen <tangchen@cn.fujitsu.com>
>>
>> The Linux kernel cannot migrate pages used by the kernel. As a
>> result, kernel pages cannot be hot-removed. So we cannot allocate
>> hotpluggable memory for the kernel.
>>
>> In a memory hotplug system, any numa node the kernel resides in
>> should be unhotpluggable. And for a modern server, each node could
>> have at least 16GB memory. So memory around the kernel image is
>> highly likely unhotpluggable.
>>
>> ACPI SRAT (System Resource Affinity Table) contains the memory
>> hotplug info. But before SRAT is parsed, memblock has already
>> started to allocate memory for the kernel. So we need to prevent
>> memblock from doing this.
>>
>> So direct memory mapping page tables setup is the case. init_mem_mapping()
>> is called before SRAT is parsed. To prevent page tables being allocated
>> within hotpluggable memory, we will use bottom-up direction to allocate
>> page tables from the end of kernel image to the higher memory.
>>
>> Acked-by: Tejun Heo <tj@kernel.org>
>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> 
> I'm still seriously concerned about this.  This unconditionally
> introduces new behavior which may very well break some classes of
> systems -- the whole point of creating the page tables top down is
> because the kernel tends to be allocated in lower memory, which is also
> the memory that some devices need for DMA.
> 

After thinking for a while, this issue pointed by Peter seems to be really
existing. And looking back to what you suggested the allocation close to the
kernel, 

> so if we allocate memory close to the kernel image,
>   it's likely that we don't contaminate hotpluggable node.  We're
>   talking about few megs at most right after the kernel image.  I
>   can't see how that would make any noticeable difference.

You meant that the memory size is about few megs. But here, page tables
seems to be large enough in big memory machines, so that page tables will
consume the precious lower memory. So I think we may really reorder
the page table setup after we get the hotplug info in some way. Just like
we have done in patch 5, we reorder reserve_crashkernel() to be called
after initmem_init().

So do you still have any objection to the pagetable setup reorder?

-- 
Thanks.
Zhang Yanfei

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
@ 2013-10-08 17:36       ` Zhang Yanfei
  0 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-08 17:36 UTC (permalink / raw)
  To: H. Peter Anvin, Tejun Heo
  Cc: Andrew Morton, Rafael J . Wysocki, lenb, Thomas Gleixner, mingo,
	Toshi Kani, Wanpeng Li, Thomas Renninger, Yinghai Lu, Jiang Liu,
	Wen Congyang, Lai Jiangshan, isimatu.yasuaki, izumi.taku,
	Mel Gorman, Minchan Kim, mina86, gong.chen, vasilis.liaskovitis,
	lwoodman, Rik van Riel, jweiner, prarit, x86, linux-doc,
	linux-kernel, Linux MM, linux-acpi, imtangchen, Zhang Yanfei,
	Tang Chen

Hello tejun
CC: Peter

On 10/07/2013 08:00 AM, H. Peter Anvin wrote:
> On 10/03/2013 07:00 PM, Zhang Yanfei wrote:
>> From: Tang Chen <tangchen@cn.fujitsu.com>
>>
>> The Linux kernel cannot migrate pages used by the kernel. As a
>> result, kernel pages cannot be hot-removed. So we cannot allocate
>> hotpluggable memory for the kernel.
>>
>> In a memory hotplug system, any numa node the kernel resides in
>> should be unhotpluggable. And for a modern server, each node could
>> have at least 16GB memory. So memory around the kernel image is
>> highly likely unhotpluggable.
>>
>> ACPI SRAT (System Resource Affinity Table) contains the memory
>> hotplug info. But before SRAT is parsed, memblock has already
>> started to allocate memory for the kernel. So we need to prevent
>> memblock from doing this.
>>
>> So direct memory mapping page tables setup is the case. init_mem_mapping()
>> is called before SRAT is parsed. To prevent page tables being allocated
>> within hotpluggable memory, we will use bottom-up direction to allocate
>> page tables from the end of kernel image to the higher memory.
>>
>> Acked-by: Tejun Heo <tj@kernel.org>
>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> 
> I'm still seriously concerned about this.  This unconditionally
> introduces new behavior which may very well break some classes of
> systems -- the whole point of creating the page tables top down is
> because the kernel tends to be allocated in lower memory, which is also
> the memory that some devices need for DMA.
> 

After thinking about it for a while, the issue Peter pointed out seems to be
real. Looking back at what you suggested about allocating close to the
kernel:

> so if we allocate memory close to the kernel image,
>   it's likely that we don't contaminate hotpluggable node.  We're
>   talking about few megs at most right after the kernel image.  I
>   can't see how that would make any noticeable difference.

You meant that the amount of memory is a few megs at most. But page tables
can be quite large on big-memory machines, so they would consume the
precious lower memory. So I think we may really need to reorder the page
table setup so that it runs after we get the hotplug info in some way, just
like patch 5 reorders reserve_crashkernel() to be called after
initmem_init().

So do you still have any objection to the pagetable setup reorder?

-- 
Thanks.
Zhang Yanfei

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-08 17:36       ` Zhang Yanfei
@ 2013-10-09 16:44         ` Tejun Heo
  -1 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2013-10-09 16:44 UTC (permalink / raw)
  To: Zhang Yanfei
  Cc: H. Peter Anvin, Andrew Morton, Rafael J . Wysocki, lenb,
	Thomas Gleixner, mingo, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc, linux-kernel, Linux MM

Hello,

On Wed, Oct 09, 2013 at 01:36:36AM +0800, Zhang Yanfei wrote:
> > I'm still seriously concerned about this.  This unconditionally
> > introduces new behavior which may very well break some classes of

This is an optional behavior which is triggered by a very specific
kernel boot param, which I suspect is gonna need to stick around to
support memory hotplug in the current setup unless we add another
layer of address translation to support memory hotplug.

> > systems -- the whole point of creating the page tables top down is
> > because the kernel tends to be allocated in lower memory, which is also
> > the memory that some devices need for DMA.

Would that really matter for the target use cases here?  These are
likely fairly huge highend machines.  ISA DMA limit is below the
kernel image and 32bit limit is pretty big in comparison and at this
point even that limit is likely to be irrelevant at least for the
target machines, which are gonna be almost inherently extremely niche.

> > so if we allocate memory close to the kernel image,
> >   it's likely that we don't contaminate hotpluggable node.  We're
> >   talking about few megs at most right after the kernel image.  I
> >   can't see how that would make any noticeable difference.
> 
> You meant that the memory size is about few megs. But here, page tables
> seems to be large enough in big memory machines, so that page tables will

Hmmm?  Even with 4k mappings and, say, 16Gigs of memory, it's still
somewhere above 32MiB, right?  And, these physical mappings don't
usually use 4k mappings to begin with.  Unless we're worrying about
ISA DMA limit, I don't think it'd be problematic.
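The arithmetic behind this estimate can be checked with a small sketch. It assumes x86_64 4-level paging with 8-byte entries and 512 entries per 4 KiB table, and it ignores any memory holes; the helper names pgtable_bytes() and div_round_up() are hypothetical and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_BYTES 4096ULL                 /* one page table fills one 4 KiB page */
#define ENTRIES_PER_TABLE (TABLE_BYTES / 8) /* 8-byte entries -> 512 per table */

/* Integer division rounding up. */
static uint64_t div_round_up(uint64_t a, uint64_t b)
{
	return (a + b - 1) / b;
}

/*
 * Estimate how many bytes of page tables are needed to direct-map
 * mem_bytes of RAM.  leaf_shift selects the mapping size:
 * 12 = 4 KiB pages, 21 = 2 MiB pages, 30 = 1 GiB pages.
 */
static uint64_t pgtable_bytes(uint64_t mem_bytes, unsigned int leaf_shift)
{
	assert(leaf_shift == 12 || leaf_shift == 21 || leaf_shift == 30);

	/* 4 KiB mappings use all 4 levels; each larger size drops one level. */
	unsigned int levels = 4 - (leaf_shift - 12) / 9;
	uint64_t entries = div_round_up(mem_bytes, 1ULL << leaf_shift);
	uint64_t tables = 0;

	for (unsigned int i = 0; i < levels; i++) {
		/* Tables needed at this level to hold the entries below it. */
		entries = div_round_up(entries, ENTRIES_PER_TABLE);
		tables += entries;
	}
	return tables * TABLE_BYTES;
}
```

Under these assumptions, pgtable_bytes(16ULL << 30, 12) is 33628160 bytes (32 MiB of PTE tables plus 18 upper-level tables), pgtable_bytes(16ULL << 30, 21) is only 73728 bytes, and pgtable_bytes(2ULL << 40, 12) is a little over 4 GiB.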

> consume the precious lower memory. So I think we may really reorder
> the page table setup after we get the hotplug info in some way. Just like
> we have done in patch 5, we reorder reserve_crashkernel() to be called
> after initmem_init().
> 
> So do you still have any objection to the pagetable setup reorder?

I still feel quite uneasy about pulling SRAT parsing and ACPI initrd
overriding into early boot.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 16:44         ` Tejun Heo
@ 2013-10-09 17:14           ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-09 17:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: H. Peter Anvin, Andrew Morton, Rafael J . Wysocki, lenb,
	Thomas Gleixner, mingo, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc, linux-kernel

Hello tejun,

Thanks for the response:)

On 10/10/2013 12:44 AM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Oct 09, 2013 at 01:36:36AM +0800, Zhang Yanfei wrote:
>>> I'm still seriously concerned about this.  This unconditionally
>>> introduces new behavior which may very well break some classes of
> 
> This is an optional behavior which is triggered by a very specific
> kernel boot param, which I suspect is gonna need to stick around to
> support memory hotplug in the current setup unless we add another
> layer of address translation to support memory hotplug.

Yeah, I have explained that this is conditional.

> 
>>> systems -- the whole point of creating the page tables top down is
>>> because the kernel tends to be allocated in lower memory, which is also
>>> the memory that some devices need for DMA.
> 
> Would that really matter for the target use cases here?  These are
> likely fairly huge highend machines.  ISA DMA limit is below the
> kernel image and 32bit limit is pretty big in comparison and at this
> point even that limit is likely to be irrelevant at least for the
> target machines, which are gonna be almost inherently extremely niche.
> 
>>> so if we allocate memory close to the kernel image,
>>>   it's likely that we don't contaminate hotpluggable node.  We're
>>>   talking about few megs at most right after the kernel image.  I
>>>   can't see how that would make any noticeable difference.
>>
>> You meant that the memory size is about few megs. But here, page tables
>> seems to be large enough in big memory machines, so that page tables will
> 
> Hmmm?  Even with 4k mappings and, say, 16Gigs of memory, it's still
> somewhere above 32MiB, right?  And, these physical mappings don't
> usually use 4k mappings to begin with.  Unless we're worrying about
> ISA DMA limit, I don't think it'd be problematic.

I think Peter meant machines with very large memory, say 2T? In the worst
case, this may need 2G of memory for page tables, which seems huge....

I am also not familiar with the ISA DMA limit. Does it mean the memory
below 4G, like ZONE_DMA32 on x86_64? (The 16MB limit does not seem to be
the case here.)

> 
>> consume the precious lower memory. So I think we may really reorder
>> the page table setup after we get the hotplug info in some way. Just like
>> we have done in patch 5, we reorder reserve_crashkernel() to be called
>> after initmem_init().
>>
>> So do you still have any objection to the pagetable setup reorder?
> 
> I still feel quite uneasy about pulling SRAT parsing and ACPI initrd
> overriding into early boot.
> 

I have been reading through the earlier discussion. Maybe it was the very
first patchset that made you uneasy about parsing SRAT earlier; that patchset
may have done too much splitting and registering. So I am wondering whether we
could combine two things to make this cleaner:

1. Introduce bottom-up allocation to allocate memory near the kernel before
   we parse SRAT.
2. Since Peter has a serious concern about bottom-up pagetable setup, and
   Ingo also said we'd better not touch the current top-down pagetable
   setup, could we just put acpi_initrd_override and the numa_init related
   functions before init_mem_mapping()? After the numa info is parsed
   (including SRAT), we reset the allocation direction back to top-down, so we
   needn't change the page table setup process. Before the numa info is
   parsed, we use bottom-up allocation to make sure all memory allocated by
   memblock is near the kernel image.
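The direction switch proposed in the two points above can be illustrated with a toy model. This is only a sketch over a single free range, not the kernel's memblock code; struct toy_range, toy_alloc() and the bottom_up flag are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one free physical range; NOT the kernel's memblock. */
struct toy_range {
	uint64_t start;   /* lowest free address, e.g. just past the kernel image */
	uint64_t end;     /* one past the highest free address */
	bool bottom_up;   /* true: allocate upward from start; false: downward from end */
};

/*
 * Allocate size bytes aligned to align (a power of two).
 * Returns the base address, or 0 on failure (the toy range never starts at 0).
 */
static uint64_t toy_alloc(struct toy_range *r, uint64_t size, uint64_t align)
{
	assert(align && (align & (align - 1)) == 0);

	if (r->bottom_up) {
		uint64_t base = (r->start + align - 1) & ~(align - 1);

		if (base < r->start || base > r->end || r->end - base < size)
			return 0;
		r->start = base + size;   /* free space now begins above this block */
		return base;
	}

	if (r->end - r->start < size)
		return 0;
	uint64_t base = (r->end - size) & ~(align - 1);

	if (base < r->start)
		return 0;
	r->end = base;                    /* free space now ends below this block */
	return base;
}
```

With start at 16 MiB, end at 4 GiB and bottom_up set, two 4 KiB allocations return 0x1000000 and 0x1001000, right after the assumed kernel end. Clearing bottom_up afterwards makes the next 4 KiB allocation come from the top of the range at 0xFFFFF000, which mirrors resetting the direction once the NUMA info has been parsed.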

What do you think?

-- 
Thanks.
Zhang Yanfei

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 16:44         ` Tejun Heo
@ 2013-10-09 19:10           ` Yinghai Lu
  -1 siblings, 0 replies; 109+ messages in thread
From: Yinghai Lu @ 2013-10-09 19:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zhang Yanfei, H. Peter Anvin, Andrew Morton, Rafael J . Wysocki,
	Len Brown, Thomas Gleixner, Ingo Molnar, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Jiang Liu, Wen Congyang, Lai Jiangshan,
	Yasuaki Ishimatsu, Taku Izumi, Mel Gorman, Minchan Kim, mina86,
	gong.chen, Vasilis Liaskovitis, lwoodman, Rik van Riel, jweiner

On Wed, Oct 9, 2013 at 9:44 AM, Tejun Heo <tj@kernel.org> wrote:
>> consume the precious lower memory. So I think we may really reorder
>> the page table setup after we get the hotplug info in some way. Just like
>> we have done in patch 5, we reorder reserve_crashkernel() to be called
>> after initmem_init().
>>
>> So do you still have any objection to the pagetable setup reorder?
>
> I still feel quite uneasy about pulling SRAT parsing and ACPI initrd
> overriding into early boot.

To help you reconsider parsing SRAT early, I have refreshed that old patchset
at

https://git.kernel.org/cgit/linux/kernel/git/yinghai/linux-yinghai.git/log/?h=for-x86-mm-3.13

It actually looks like one-third to half of the patches already have your ack.


Thanks

Yinghai

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 17:14           ` Zhang Yanfei
@ 2013-10-09 19:20             ` Tejun Heo
  -1 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2013-10-09 19:20 UTC (permalink / raw)
  To: Zhang Yanfei
  Cc: H. Peter Anvin, Andrew Morton, Rafael J . Wysocki, lenb,
	Thomas Gleixner, mingo, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc, linux-kernel, Linux MM

Hello,

On Thu, Oct 10, 2013 at 01:14:23AM +0800, Zhang Yanfei wrote:
> >> You meant that the memory size is about few megs. But here, page tables
> >> seems to be large enough in big memory machines, so that page tables will
> > 
> > Hmmm?  Even with 4k mappings and, say, 16Gigs of memory, it's still
> > somewhere above 32MiB, right?  And, these physical mappings don't
> > usually use 4k mappings to begin with.  Unless we're worrying about
> > ISA DMA limit, I don't think it'd be problematic.
> 
> I think Peter meant very huge memory machines, say 2T memory? In the worst
> case, this may need 2G memory for page tables, seems huge....

Realistically tho, why would people be using 4k mappings on 2T
machines?  For the sake of argument, let's say 4k mappings are
required for some weird reason, even then, doing SRAT parsing early
doesn't necessarily solve the problem in itself.  It'd still need
heuristics to avoid occupying too much of 32bit memory because it
isn't difficult to imagine specific NUMA settings which would drive
page table allocation into low address.

No matter what we do, there's no way around the fact that this whole
effort is mostly an incomplete solution in its nature and that's why I
think we better keep things isolated and simple.  It isn't a good idea
to make structural changes to accommodate something which isn't and
doesn't have much chance of becoming a full solution.  In addition,
the problem itself is niche to begin with.

> And I am not familiar with the ISA DMA limit, does this mean the memory 
> below 4G? Just as we have the ZONE_DMA32 in x86_64. (16MB limit seems not
> the case here)

Yeah, I was referring to the 16MB limit, which apparently ceased to
exist.
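For reference, the two limits being discussed can be written down as a small classifier. This is a sketch of the conventional x86_64 layout (first 16 MiB for 24-bit ISA DMA, below 4 GiB for 32-bit DMA); the enum and function names are hypothetical and do not come from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the conventional x86_64 DMA address ranges. */
enum toy_zone { TOY_ZONE_DMA, TOY_ZONE_DMA32, TOY_ZONE_NORMAL };

#define TOY_ISA_DMA_LIMIT (16ULL << 20)  /* 16 MiB: 24-bit ISA DMA addressing */
#define TOY_DMA32_LIMIT   (4ULL << 30)   /* 4 GiB: 32-bit DMA addressing */

/* Classify a physical address by the most restrictive limit it satisfies. */
static enum toy_zone toy_zone_of(uint64_t phys)
{
	if (phys < TOY_ISA_DMA_LIMIT)
		return TOY_ZONE_DMA;
	if (phys < TOY_DMA32_LIMIT)
		return TOY_ZONE_DMA32;
	return TOY_ZONE_NORMAL;
}
```

An address right after a kernel image loaded at, say, 16 MiB falls in the DMA32 range rather than the ISA DMA range, so bottom-up allocations near the image would compete with 32-bit DMA users, not with ISA DMA.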

> 1. introduce bottom up allocation to allocate memory near the kernel before
>    we parse SRAT.
> 2. Since peter have the serious concern about the pagetable setup in bottom-up
>    and Ingo also said we'd better not to touch the current top-down pagetable
>    setup. Could we just put acpi_initrd_override and numa_init related functions
>    before init_mem_mapping()? After numa info is parsed (including SRAT), we
>    reset the allocation direction back to top-down, so we needn't change the
>    page table setup process. And before numa info parsed, we use the bottom-up
>    allocation to make sure all memory allocated by memblock is near the kernel
>    image.
> 
> How do you think?

Let's wait to hear more about Peter's concern.  Peter, the whole thing
is a very specialized, off-by-default thing which is more or less a
kludge no matter which implementation direction we choose and as far
as the cost and risk go, I think the proposed series is pretty small
in its footprint.  What do you think?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 19:10           ` Yinghai Lu
@ 2013-10-09 19:23             ` Tejun Heo
  -1 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2013-10-09 19:23 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Zhang Yanfei, H. Peter Anvin, Andrew Morton, Rafael J . Wysocki,
	Len Brown, Thomas Gleixner, Ingo Molnar, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Jiang Liu, Wen Congyang, Lai Jiangshan,
	Yasuaki Ishimatsu, Taku Izumi, Mel Gorman, Minchan Kim, mina86,
	gong.chen, Vasilis Liaskovitis, lwoodman, Rik van Riel, jweiner

Hello, Yinghai.

On Wed, Oct 09, 2013 at 12:10:34PM -0700, Yinghai Lu wrote:
> > I still feel quite uneasy about pulling SRAT parsing and ACPI initrd
> > overriding into early boot.
> 
> for your reconsidering to parse srat early, I refresh that old patchset
> at
> 
> https://git.kernel.org/cgit/linux/kernel/git/yinghai/linux-yinghai.git/log/?h=for-x86-mm-3.13
> 
> actually looks one-third or haf patches already have your ack.

Yes, but those acks assume that the overall approach is a good idea.
The biggest issue that I have with the approach is that it is invasive
and modifies basic structure for an inherently kludgy solution for a
quite niche problem.  The benefit / cost ratio still seems quite off
to me - we're making a lot of general changes to serve something very
specialized, which might not even stay relevant for a long time.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 19:20             ` Tejun Heo
@ 2013-10-09 19:30               ` Dave Hansen
  -1 siblings, 0 replies; 109+ messages in thread
From: Dave Hansen @ 2013-10-09 19:30 UTC (permalink / raw)
  To: Tejun Heo, Zhang Yanfei
  Cc: H. Peter Anvin, Andrew Morton, Rafael J . Wysocki, lenb,
	Thomas Gleixner, mingo, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc, linux-kernel

On 10/09/2013 12:20 PM, Tejun Heo wrote:
> Realistically tho, why would people be using 4k mappings on 2T
> machines?

CONFIG_DEBUG_PAGEALLOC and CONFIG_KMEMCHECK both disable using >4k
pages.  I actually ran into this on a 1TB machine a few weeks ago:

	https://lkml.org/lkml/2013/8/9/546

So it's not a common case for stuff that customers have, but it sure as
*HECK* is needed for debugging.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 19:30               ` Dave Hansen
@ 2013-10-09 19:47                 ` Tejun Heo
  -1 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2013-10-09 19:47 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Zhang Yanfei, H. Peter Anvin, Andrew Morton, Rafael J . Wysocki,
	lenb, Thomas Gleixner, mingo, Toshi Kani, Wanpeng Li,
	Thomas Renninger, Yinghai Lu, Jiang Liu, Wen Congyang,
	Lai Jiangshan, isimatu.yasuaki, izumi.taku, Mel Gorman,
	Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
	Rik van Riel, jweiner, prarit, x86, linux-doc,
	linux-kernel@vger.kernel.org

Hello,

On Wed, Oct 09, 2013 at 12:30:09PM -0700, Dave Hansen wrote:
> On 10/09/2013 12:20 PM, Tejun Heo wrote:
> > Realistically tho, why would people be using 4k mappings on 2T
> > machines?
> 
> CONFIG_DEBUG_PAGEALLOC and CONFIG_KMEMCHECK both disable using >4k
> pages.  I actually ran into this on a 1TB machine a few weeks ago:
> 
> 	https://lkml.org/lkml/2013/8/9/546
> 
> So it's not a common case for stuff that customers have, but it sure as
> *HECK* is needed for debugging.

But as I said in the same paragraph, parsing SRAT earlier doesn't
solve the problem in itself either.  Ignoring the option if 4k mapping
is required and memory consumption would be prohibitive should work,
no?  Something like that would be necessary if we're gonna worry about
cases like this no matter how we implement it, but, frankly, I'm not
sure this is something worth worrying about.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 19:20             ` Tejun Heo
@ 2013-10-09 20:58               ` Toshi Kani
  -1 siblings, 0 replies; 109+ messages in thread
From: Toshi Kani @ 2013-10-09 20:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zhang Yanfei, H. Peter Anvin, Andrew Morton, Rafael J . Wysocki,
	lenb, Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc, linux-kernel

On Wed, 2013-10-09 at 15:20 -0400, Tejun Heo wrote:
> Hello,
> 
> On Thu, Oct 10, 2013 at 01:14:23AM +0800, Zhang Yanfei wrote:
> > >> You meant that the memory size is about few megs. But here, page tables
> > >> seems to be large enough in big memory machines, so that page tables will
> > > 
> > > Hmmm?  Even with 4k mappings and, say, 16Gigs of memory, it's still
> > > somewhere above 32MiB, right?  And, these physical mappings don't
> > > usually use 4k mappings to begin with.  Unless we're worrying about
> > > ISA DMA limit, I don't think it'd be problematic.
> > 
> > I think Peter meant very huge memory machines, say 2T memory? In the worst
> > case, this may need 2G memory for page tables, seems huge....
> 
> Realistically tho, why would people be using 4k mappings on 2T
> machines?  For the sake of argument, let's say 4k mappings are
> required for some weird reason, even then, doing SRAT parsing early
> doesn't necessarily solve the problem in itself.  It'd still need
> heuristics to avoid occupying too much of 32bit memory because it
> isn't difficult to imagine specific NUMA settings which would drive
> page table allocation into low address.
> 
> No matter what we do, there's no way around the fact that this whole
> effort is mostly an incomplete solution in its nature and that's why I
> think we better keep things isolated and simple.  It isn't a good idea
> to make structural changes to accommodate something which isn't and
> doesn't have much chance of becoming a full solution.  In addition,
> the problem itself is niche to begin with.

Let's not assume that memory hotplug is always a niche feature for huge
& special systems.  It may be a niche to begin with, but it could be
supported on VMs, which allows anyone to use.  Vasilis has been working
on KVM to support memory hotplug.

Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 20:58               ` Toshi Kani
@ 2013-10-09 21:11                 ` Tejun Heo
  -1 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2013-10-09 21:11 UTC (permalink / raw)
  To: Toshi Kani
  Cc: Zhang Yanfei, H. Peter Anvin, Andrew Morton, Rafael J . Wysocki,
	lenb, Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc, linux-kernel

Hello, Toshi.

On Wed, Oct 09, 2013 at 02:58:31PM -0600, Toshi Kani wrote:
> Let's not assume that memory hotplug is always a niche feature for huge
> & special systems.  It may be a niche to begin with, but it could be
> supported on VMs, which allows anyone to use.  Vasilis has been working
> on KVM to support memory hotplug.

I'm not saying hotplug will always be niche.  I'm saying the approach
we're currently taking is.  It seems fairly inflexible to hang the
whole thing on NUMA nodes.  What does the planned kvm support do?
Splitting SRAT nodes so that it can do both actual NUMA node
distribution and hotplug granularity?  IIRC I asked a couple times
what the long term plan was for this feature and there doesn't seem to
be any road map for this thing to become a full solution.  Unless I
misunderstood, this is more of "let's put out the fire as there
already are (or gonna be) machines which can do it" kinda thing, which
is fine too.  My point is that it doesn't make a lot of sense to
change boot sequence invasively to accommodate that.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 21:11                 ` Tejun Heo
@ 2013-10-09 21:14                   ` H. Peter Anvin
  -1 siblings, 0 replies; 109+ messages in thread
From: H. Peter Anvin @ 2013-10-09 21:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Toshi Kani, Zhang Yanfei, Andrew Morton, Rafael J . Wysocki,
	lenb, Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc

On 10/09/2013 02:11 PM, Tejun Heo wrote:
> Hello, Toshi.
> 
> On Wed, Oct 09, 2013 at 02:58:31PM -0600, Toshi Kani wrote:
>> Let's not assume that memory hotplug is always a niche feature for huge
>> & special systems.  It may be a niche to begin with, but it could be
>> supported on VMs, which allows anyone to use.  Vasilis has been working
>> on KVM to support memory hotplug.
> 
> I'm not saying hotplug will always be niche.  I'm saying the approach
> we're currently taking is.  It seems fairly inflexible to hang the
> whole thing on NUMA nodes.  What does the planned kvm support do?
> Splitting SRAT nodes so that it can do both actual NUMA node
> distribution and hotplug granularity?  IIRC I asked a couple times
> what the long term plan was for this feature and there doesn't seem to
> be any road map for this thing to become a full solution.  Unless I
> misunderstood, this is more of "let's put out the fire as there
> already are (or gonna be) machines which can do it" kinda thing, which
> is fine too.  My point is that it doesn't make a lot of sense to
> change boot sequence invasively to accommodate that.
> 

I would also argue that in the VM scenario -- and arguably even in the
hardware scenario -- the right thing is to not expose the flexible
memory in the e820/EFI tables, and instead have it hotadded (possibly
*immediately* so) on boot.  This avoids both the boot time funnies as
well as the scaling issues with metadata.

The whole reason for VMs wanting this is because ballooning doesn't
scale with regards to metadata.

	-hpa


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 19:20             ` Tejun Heo
@ 2013-10-09 21:19               ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-09 21:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: H. Peter Anvin, Andrew Morton, Rafael J . Wysocki, lenb,
	Thomas Gleixner, mingo, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc, linux-kernel

Hi tejun,

On 10/10/2013 03:20 AM, Tejun Heo wrote:
> Hello,
> 
> On Thu, Oct 10, 2013 at 01:14:23AM +0800, Zhang Yanfei wrote:
>>>> You meant that the memory size is about few megs. But here, page tables
>>>> seems to be large enough in big memory machines, so that page tables will
>>>
>>> Hmmm?  Even with 4k mappings and, say, 16Gigs of memory, it's still
>>> somewhere above 32MiB, right?  And, these physical mappings don't
>>> usually use 4k mappings to begin with.  Unless we're worrying about
>>> ISA DMA limit, I don't think it'd be problematic.
>>
>> I think Peter meant very huge memory machines, say 2T memory? In the worst
>> case, this may need 2G memory for page tables, seems huge....
> 
> Realistically tho, why would people be using 4k mappings on 2T
> machines?  For the sake of argument, let's say 4k mappings are
> required for some weird reason, even then, doing SRAT parsing early
> doesn't necessarily solve the problem in itself.  It'd still need
> heuristics to avoid occupying too much of 32bit memory because it
> isn't difficult to imagine specific NUMA settings which would drive
> page table allocation into low address.
> 
> No matter what we do, there's no way around the fact that this whole
> effort is mostly an incomplete solution in its nature and that's why I
> think we better keep things isolated and simple.  It isn't a good idea
> to make structural changes to accommodate something which isn't and
> doesn't have much chance of becoming a full solution.  In addition,
> the problem itself is niche to begin with.
> 
>> And I am not familiar with the ISA DMA limit, does this mean the memory 
>> below 4G? Just as we have the ZONE_DMA32 in x86_64. (16MB limit seems not
>> the case here)
> 
> Yeah, I was referring to the 16MB limit, which apparently ceased to
> exist.

Hmmmm... If we are talking about the 16MB limit here, I don't think it is a
problem, either.  Currently, the default load and run address of the kernel is
16MB, so the kernel itself is above 16MB, and memory allocated in bottom-up
mode is obviously above 16MB too.  As seen on a RHEL6.3 server:

  01000000-01507ff4 : Kernel code
  01507ff5-01c07b2f : Kernel data
  01d4e000-02012023 : Kernel bss

IOW, even if the kernel is loaded and running at 1MB, it will itself occupy
about 16MB, going by the above.
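
The arithmetic behind that estimate can be checked directly from the listing.
A minimal sketch, assuming the three ranges above are inclusive [start, end]
pairs in the style /proc/iomem prints them; struct iomem_range and
image_span() are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct iomem_range {
	uint64_t start;
	uint64_t end;	/* inclusive */
};

/* Values copied from the RHEL6.3 listing quoted above. */
const struct iomem_range kernel_image[] = {
	{ 0x01000000, 0x01507ff4 },	/* Kernel code */
	{ 0x01507ff5, 0x01c07b2f },	/* Kernel data */
	{ 0x01d4e000, 0x02012023 },	/* Kernel bss  */
};

/* Bytes from the lowest start to the highest end, holes included. */
uint64_t image_span(const struct iomem_range *r, size_t n)
{
	uint64_t lo, hi;
	size_t i;

	assert(n > 0);
	lo = r[0].start;
	hi = r[0].end;
	for (i = 1; i < n; i++) {
		if (r[i].start < lo)
			lo = r[i].start;
		if (r[i].end > hi)
			hi = r[i].end;
	}
	return hi - lo + 1;
}
```

With these values the image starts at exactly 16 MiB (0x01000000) and spans
0x1012024 bytes, just over 16 MiB.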

-- 
Thanks.
Zhang Yanfei

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 21:19               ` Zhang Yanfei
@ 2013-10-09 21:22                 ` H. Peter Anvin
  -1 siblings, 0 replies; 109+ messages in thread
From: H. Peter Anvin @ 2013-10-09 21:22 UTC (permalink / raw)
  To: Zhang Yanfei
  Cc: Tejun Heo, Andrew Morton, Rafael J . Wysocki, lenb,
	Thomas Gleixner, mingo, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc

On 10/09/2013 02:19 PM, Zhang Yanfei wrote:
>>
>> Yeah, I was referring to the 16MB limit, which apparently ceased to
>> exist.
> 
> Hmmmm... If we are talking about the 16MB limit here, I don't think it is a
> problem, either.  Currently, the default load and run address of the kernel is
> 16MB, so the kernel itself is above 16MB, and memory allocated in bottom-up
> mode is obviously above 16MB too.  As seen on a RHEL6.3 server:
> 
>   01000000-01507ff4 : Kernel code
>   01507ff5-01c07b2f : Kernel data
>   01d4e000-02012023 : Kernel bss
> 
> IOW, even if the kernel is loaded and running at 1MB, it will itself occupy
> about 16MB, going by the above.
> 

For various DMA devices you can find almost every possible power of 2
being a limitation.  The most common limits are 24, 32, and 40 bits, but
you also see odd ones like 30 bits in the field.  Really.

	-hpa
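
To make those widths concrete, here is a small sketch of how an address-width
limit translates into a highest reachable address, and how an allocation can
be checked against it.  The mask computation mirrors the kernel's
DMA_BIT_MASK() macro; dma_limit() and dma_range_ok() are hypothetical helpers
written for this example, not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Highest physical address a device with `bits` address bits can reach. */
uint64_t dma_limit(unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);
	return bits == 64 ? UINT64_MAX : (1ULL << bits) - 1;
}

/* Nonzero if [start, start + len) lies entirely within the device's reach. */
int dma_range_ok(uint64_t start, uint64_t len, unsigned int bits)
{
	uint64_t limit = dma_limit(bits);

	if (len == 0)
		return 1;
	/* Written to avoid overflow in start + len. */
	return start <= limit && len - 1 <= limit - start;
}
```

So 24 bits caps a device at 16 MiB, 30 bits at 1 GiB, 32 bits at 4 GiB, and
40 bits at 1 TiB.  A page placed at exactly 16 MiB is already out of reach of
a 24-bit device.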

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 21:14                   ` H. Peter Anvin
@ 2013-10-09 21:45                     ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-09 21:45 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Tejun Heo, Toshi Kani, Andrew Morton, Rafael J . Wysocki, lenb,
	Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger, Yinghai Lu,
	Jiang Liu, Wen Congyang, Lai Jiangshan, isimatu.yasuaki,
	izumi.taku, Mel Gorman, Minchan Kim, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner, prarit,
	x86, linux-doc, linux-kernel, Linux MM

Hello Peter,

On 10/10/2013 05:14 AM, H. Peter Anvin wrote:
> On 10/09/2013 02:11 PM, Tejun Heo wrote:
>> Hello, Toshi.
>>
>> On Wed, Oct 09, 2013 at 02:58:31PM -0600, Toshi Kani wrote:
>>> Let's not assume that memory hotplug is always a niche feature for huge
>>> & special systems.  It may be a niche to begin with, but it could be
>>> supported on VMs, which allows anyone to use.  Vasilis has been working
>>> on KVM to support memory hotplug.
>>
>> I'm not saying hotplug will always be niche.  I'm saying the approach
>> we're currently taking is.  It seems fairly inflexible to hang the
>> whole thing on NUMA nodes.  What does the planned kvm support do?
>> Splitting SRAT nodes so that it can do both actual NUMA node
>> distribution and hotplug granularity?  IIRC I asked a couple times
>> what the long term plan was for this feature and there doesn't seem to
>> be any road map for this thing to become a full solution.  Unless I
>> misunderstood, this is more of "let's put out the fire as there
>> already are (or gonna be) machines which can do it" kinda thing, which
>> is fine too.  My point is that it doesn't make a lot of sense to
>> change boot sequence invasively to accommodate that.
>>
> 
> I would also argue that in the VM scenario -- and arguably even in the
> hardware scenario -- the right thing is to not expose the flexible
> memory in the e820/EFI tables, and instead have it hotadded (possibly
> *immediately* so) on boot.  This avoids both the boot time funnies as
> well as the scaling issues with metadata.
> 

So in this kind of scenario, hotpluggable memory will not be detected
at boot time. The admin should not use the movable_node boot option,
and the kernel will act as before, always using top-down allocation.

-- 
Thanks.
Zhang Yanfei


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 21:45                     ` Zhang Yanfei
@ 2013-10-09 23:10                       ` H. Peter Anvin
  -1 siblings, 0 replies; 109+ messages in thread
From: H. Peter Anvin @ 2013-10-09 23:10 UTC (permalink / raw)
  To: Zhang Yanfei
  Cc: Tejun Heo, Toshi Kani, Andrew Morton, Rafael J . Wysocki, lenb,
	Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger, Yinghai Lu,
	Jiang Liu, Wen Congyang, Lai Jiangshan, isimatu.yasuaki,
	izumi.taku, Mel Gorman, Minchan Kim, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner, prarit,
	x86, linux-doc

On 10/09/2013 02:45 PM, Zhang Yanfei wrote:
>>
>> I would also argue that in the VM scenario -- and arguably even in the
>> hardware scenario -- the right thing is to not expose the flexible
>> memory in the e820/EFI tables, and instead have it hotadded (possibly
>> *immediately* so) on boot.  This avoids both the boot time funnies as
>> well as the scaling issues with metadata.
>>
> 
> So in this kind of scenario, hotpluggable memory will not be detected
> at boot time. The admin should not use the movable_node boot option,
> and the kernel will act as before, always using top-down allocation.
> 

Yes.  The idea is that the kernel will boot up without the hotplug
memory, but if desired, will immediately see a hotplug-add event for the
movable memory.

	-hpa


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 23:10                       ` H. Peter Anvin
@ 2013-10-09 23:26                         ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-09 23:26 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Tejun Heo, Toshi Kani, Andrew Morton, Rafael J . Wysocki, lenb,
	Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger, Yinghai Lu,
	Jiang Liu, Wen Congyang, Lai Jiangshan, isimatu.yasuaki,
	izumi.taku, Mel Gorman, Minchan Kim, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner, prarit,
	x86, linux-doc, linux-kernel, Linux MM

Hello Peter,

On 10/10/2013 07:10 AM, H. Peter Anvin wrote:
> On 10/09/2013 02:45 PM, Zhang Yanfei wrote:
>>>
>>> I would also argue that in the VM scenario -- and arguably even in the
>>> hardware scenario -- the right thing is to not expose the flexible
>>> memory in the e820/EFI tables, and instead have it hotadded (possibly
>>> *immediately* so) on boot.  This avoids both the boot time funnies as
>>> well as the scaling issues with metadata.
>>>
>>
>> So in this kind of scenario, hotpluggable memory will not be detected
>> at boot time. The admin should not use the movable_node boot option,
>> and the kernel will act as before, always using top-down allocation.
>>
> 
> Yes.  The idea is that the kernel will boot up without the hotplug
> memory, but if desired, will immediately see a hotplug-add event for the
> movable memory.

Yeah, this is good.

But in the scenario where we boot with hotplug memory, we need the
movable_node option. Since Tejun has explained a lot about this patchset,
do you still have objections to it, or could I ask Andrew to merge it into
the -mm tree for more testing?

-- 
Thanks.
Zhang Yanfei


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 21:22                 ` H. Peter Anvin
@ 2013-10-09 23:30                   ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-09 23:30 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Tejun Heo, Andrew Morton, Rafael J . Wysocki, lenb,
	Thomas Gleixner, mingo, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc, linux-kernel, Linux MM

On 10/10/2013 05:22 AM, H. Peter Anvin wrote:
> On 10/09/2013 02:19 PM, Zhang Yanfei wrote:
>>>
>>> Yeah, I was referring to the 16MB limit, which apparently ceased to
>>> exist.
>>
>> Hmmmm...If we are talking about the 16MB limit here, I don't think it is a
>> problem, either. Currently, the default loading & running address of the
>> kernel is 16MB, so the kernel itself is above 16MB, and memory allocated in
>> bottom-up mode is obviously above 16MB. Just looking at a RHEL6.3 server:
>>
>>   01000000-01507ff4 : Kernel code
>>   01507ff5-01c07b2f : Kernel data
>>   01d4e000-02012023 : Kernel bss
>>
>> IOW, even if the kernel is loaded and running at 1MB, it will itself occupy
>> about 16MB, judging from the above.
>>
> 
> For various DMA devices you can find almost every possible power of 2
> being a limitation.  The most common limits are 24, 32, and 40 bits, but
> you also see odd ones like 30 bits in the field.  Really.
> 

Thanks for this.

I was always curious about what the limit actually was when we talked about
the DMA limit.
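
To put numbers on that: an N-bit DMA limit means the device can only
address the first 2^N bytes, so 24 bits is 16MB, 30 bits is 1GB, 32 bits
is 4GB and 40 bits is 1TB.  A minimal sketch, assuming nothing beyond
plain C; dma_limit() and dma_reachable() are made-up helper names for
illustration and are not kernel interfaces:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Highest byte address a device can reach with an N-bit DMA address.
 * Illustrative helper only; this is not a kernel interface.
 */
static uint64_t dma_limit(unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);
	/* Shifting a 64-bit value by 64 is undefined, so handle it apart. */
	return bits == 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}

/* True if the whole range [start, end] is reachable with N address bits. */
static bool dma_reachable(uint64_t start, uint64_t end, unsigned int bits)
{
	return start <= end && end <= dma_limit(bits);
}
```

With these, dma_reachable(0x01000000, 0x01507ff4, 24) is false while the
same range checked with 30 bits is true, i.e. the "Kernel code" range
quoted above already sits just past what a 24-bit device can address.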

-- 
Thanks.
Zhang Yanfei


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 21:11                 ` Tejun Heo
@ 2013-10-09 23:58                   ` Toshi Kani
  -1 siblings, 0 replies; 109+ messages in thread
From: Toshi Kani @ 2013-10-09 23:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zhang Yanfei, H. Peter Anvin, Andrew Morton, Rafael J . Wysocki,
	lenb, Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc, linux-kernel

Hello Tejun,

On Wed, 2013-10-09 at 17:11 -0400, Tejun Heo wrote:
> On Wed, Oct 09, 2013 at 02:58:31PM -0600, Toshi Kani wrote:
> > Let's not assume that memory hotplug is always a niche feature for huge
> > & special systems.  It may be a niche to begin with, but it could be
> > supported on VMs, which allows anyone to use.  Vasilis has been working
> > on KVM to support memory hotplug.
> 
> I'm not saying hotplug will always be niche.

Great. :)

> I'm saying the approach
> we're currently taking is.  It seems fairly inflexible to hang the
> whole thing on NUMA nodes.  What does the planned kvm support do?
> Splitting SRAT nodes so that it can do both actual NUMA node
> distribution and hotplug granularity?

I agree that using a node as the granularity is inflexible, but we have
to start from some point first, so that we can improve in the future.  SRAT
may have multiple entries per proximity domain, each of which can be set
to hotpluggable or not.  So, using SRAT does not limit us to node
granularity.  The kernel, however, has the limitation that zone type, etc.,
are managed on a per-node basis.
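
As a rough sketch of that point, assuming a simplified entry layout:
struct mem_affinity and domain_fully_hotpluggable() below are
hypothetical names for illustration, not the kernel's SRAT parsing code.
Only the two flag bit positions (enabled = bit 0, hot pluggable = bit 1)
follow the ACPI SRAT memory affinity definition.  A proximity domain can
then carry a mix of hotpluggable and non-hotpluggable ranges:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits of an SRAT memory affinity entry (positions per ACPI). */
#define MEM_ENABLED       (1u << 0)
#define MEM_HOT_PLUGGABLE (1u << 1)

/* Simplified, hypothetical view of one SRAT memory affinity entry. */
struct mem_affinity {
	uint32_t proximity_domain;
	uint64_t base;
	uint64_t length;
	uint32_t flags;
};

/*
 * True only if the domain has at least one enabled entry and every
 * enabled entry in it is marked hotpluggable.
 */
static bool domain_fully_hotpluggable(const struct mem_affinity *e,
				      size_t n, uint32_t domain)
{
	bool seen = false;

	for (size_t i = 0; i < n; i++) {
		if (e[i].proximity_domain != domain ||
		    !(e[i].flags & MEM_ENABLED))
			continue;
		if (!(e[i].flags & MEM_HOT_PLUGGABLE))
			return false;
		seen = true;
	}
	return seen;
}
```

A per-node decision has to treat a mixed domain as not hotpluggable as a
whole, even though the table itself describes it range by range.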

> IIRC I asked a couple times
> what the long term plan was for this feature and there doesn't seem to
> be any road map for this thing to become a full solution.  Unless I
> misunderstood, this is more of "let's put out the fire as there
> already are (or gonna be) machines which can do it" kinda thing, which
> is fine too.  My point is that it doesn't make a lot of sense to
> change boot sequence invasively to accommodate that.

Well, there was a plan before, which considered enhancing it to
memory device granularity at step 3.  But we had a major replan at step
1 per your suggestion.

https://lkml.org/lkml/2013/6/19/73

Thanks,
-Toshi


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 21:14                   ` H. Peter Anvin
@ 2013-10-10  0:25                     ` Toshi Kani
  -1 siblings, 0 replies; 109+ messages in thread
From: Toshi Kani @ 2013-10-10  0:25 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Tejun Heo, Zhang Yanfei, Andrew Morton, Rafael J . Wysocki, lenb,
	Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger, Yinghai Lu,
	Jiang Liu, Wen Congyang, Lai Jiangshan, isimatu.yasuaki,
	izumi.taku, Mel Gorman, Minchan Kim, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner, prarit,
	x86, linux-doc, linux-kernel, Linux MM

On Wed, 2013-10-09 at 14:14 -0700, H. Peter Anvin wrote:
> On 10/09/2013 02:11 PM, Tejun Heo wrote:
> > Hello, Toshi.
> > 
> > On Wed, Oct 09, 2013 at 02:58:31PM -0600, Toshi Kani wrote:
> >> Let's not assume that memory hotplug is always a niche feature for huge
> >> & special systems.  It may be a niche to begin with, but it could be
> >> supported on VMs, which allows anyone to use.  Vasilis has been working
> >> on KVM to support memory hotplug.
> > 
> > I'm not saying hotplug will always be niche.  I'm saying the approach
> > we're currently taking is.  It seems fairly inflexible to hang the
> > whole thing on NUMA nodes.  What does the planned kvm support do?
> > Splitting SRAT nodes so that it can do both actual NUMA node
> > distribution and hotplug granularity?  IIRC I asked a couple times
> > what the long term plan was for this feature and there doesn't seem to
> > be any road map for this thing to become a full solution.  Unless I
> > misunderstood, this is more of "let's put out the fire as there
> > already are (or gonna be) machines which can do it" kinda thing, which
> > is fine too.  My point is that it doesn't make a lot of sense to
> > change boot sequence invasively to accommodate that.
> > 
> 
> I would also argue that in the VM scenario -- and arguably even in the
> hardware scenario -- the right thing is to not expose the flexible
> memory in the e820/EFI tables, and instead have it hotadded (possibly
> *immediately* so) on boot.  This avoids both the boot time funnies as
> well as the scaling issues with metadata.

That's a good idea!  It will work just fine if firmware is written in
such a way.  However, since most (if not all) firmware integrates
and exposes all memory at boot, we still need to support this regular
scenario as well.

> The whole reason for VMs wanting this is because ballooning doesn't
> scale with regards to metadata.

Agreed.

Thanks,
-Toshi



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 23:58                   ` Toshi Kani
@ 2013-10-10  1:00                     ` Tejun Heo
  -1 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2013-10-10  1:00 UTC (permalink / raw)
  To: Toshi Kani
  Cc: Zhang Yanfei, H. Peter Anvin, Andrew Morton, Rafael J . Wysocki,
	lenb, Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc, linux-kernel

Hello, Toshi.

On Wed, Oct 09, 2013 at 05:58:55PM -0600, Toshi Kani wrote:
> Well, there was a plan before, which considered enhancing it to
> memory device granularity at step 3.  But we had a major replan at step
> 1 per your suggestion.
> 
> https://lkml.org/lkml/2013/6/19/73

Where?

 "3. Improve memory hotplug to support local device pagetable."

How can the above possibly be considered as a plan for finer
granularity?  Forget about the "how" part.  The stated goal doesn't
even mention finer granularity.  Are firmware writers gonna be
required to split SRAT entries into multiple sub-nodes to support it?
Is segregating zones further for this even a good idea?  Adding more
NUMA nodes has its own overhead and the mm code isn't written
expecting it to be repurposed for segmenting the same NUMA node for
hotplug underneath it.

Maybe zoning is a viable approach.  Maybe it is not.  I don't know,
but you guys don't seem to be too interested in actual long term
planning while pushing for something invasive which may or may not be
viable in the longer term, which can often lead to silly situations.
It isn't even clear whether SRAT is the right interface for this.  If
it's gonna require firmware writers' cooperation anyway, why not
provide the information as an extended part of e820?  It doesn't seem to
have much to do with NUMA or zones.  The only information the kernel
needs to know is whether certain memory areas should only be used for
page cache.

At this point, at least to me, it doesn't seem reasonably clear how
this is gonna develop and the whole thing feels like a kludge, which
can be fine too, but seriously if you guys wanna push for an invasive
approach, it should really be backed by a longer term plan, vision,
justification and the ability to make the necessary changes in the
various involved layers.  Maybe I'm being too pessimistic but I feel
that there is a lot missing in most of those areas, which makes it
quite risky to commit to invasive changes.

If the zone based kludgy approach is something meaningfully useful,
I'd suggest sticking to it at least for now.  Some of it would be
useful anyway and if it doesn't pan out the added maintenance overhead
is fairly low.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
@ 2013-10-10  1:00                     ` Tejun Heo
  0 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2013-10-10  1:00 UTC (permalink / raw)
  To: Toshi Kani
  Cc: Zhang Yanfei, H. Peter Anvin, Andrew Morton, Rafael J . Wysocki,
	lenb, Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc, linux-kernel, Linux MM, linux-acpi,
	imtangchen, Zhang Yanfei, Tang Chen

Hello, Toshi.

On Wed, Oct 09, 2013 at 05:58:55PM -0600, Toshi Kani wrote:
> Well, there was a plan before, which considered to enhance it to a
> memory device granularity at step 3.  But we had a major replan at step
> 1 per your suggestion.
> 
> https://lkml.org/lkml/2013/6/19/73

Where?

 "3. Improve memory hotplug to support local device pagetable."

How can the above possibly be considered as a plan for finer
granularity?  Forget about the "how" part.  The stated goal doesn't
even mention finer granularity.  Are firmware writers gonna be
required to split SRAT entries into multiple sub-nodes to support it?
Is segregating zones further for this even a good idea?  Adding more
NUMA nodes has its own overhead and the mm code isn't written
expecting it to be repurposed for segmenting the same NUMA node for
hotplug underneath it.

Maybe zoning is a viable approach.  Maybe it is not.  I don't know,
but you guys don't seem to be too interested in actual long term
planning while pushing for something invasive which may or may not be
viable in the longer term, which can often lead to silly situations.
It isn't even clear whether SRAT is the right interface for this.  If
it's gonna require firwmare writer's cooperation anyway, why not
provide the information as extended part of e820?  It doesn't seem to
have much to do with NUMA or zones.  The only information the kernel
needs to know is whether certain memory areas should only be used for
page cache.

At this point, at least to me, it doesn't seem reasonably clear how
this is gonna develop and the whole thing feels like a kludge, which
can be fine too, but seriously if you guys wanna push for an invasive
approach, it should really be backed by longer term plan, vision,
justification and the ability to make the necessary changes in the
various involved layers.  Maybe I'm being too pessimistic but I feel
that there is a lot missing in most of those areas, which makes it
quite risky to commit to invasive changes.

If the zone based kludgy approach is something meaningfully useful,
I'd suggest sticking to it at least for now.  Some of it would be
useful anyway and if it doesn't pan out, the added maintenance overhead
is fairly low.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 23:26                         ` Zhang Yanfei
  (?)
@ 2013-10-10  1:20                           ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-10  1:20 UTC (permalink / raw)
  To: H. Peter Anvin, Tejun Heo
  Cc: Zhang Yanfei, Toshi Kani, Andrew Morton, Rafael J . Wysocki,
	lenb, Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel, jweiner,
	prarit, x86, linux-doc, linux-kernel

Hello guys,

On 10/10/2013 07:26 AM, Zhang Yanfei wrote:
> Hello Peter,
> 
> On 10/10/2013 07:10 AM, H. Peter Anvin wrote:
>> On 10/09/2013 02:45 PM, Zhang Yanfei wrote:
>>>>
>>>> I would also argue that in the VM scenario -- and arguably even in the
>>>> hardware scenario -- the right thing is to not expose the flexible
>>>> memory in the e820/EFI tables, and instead have it hotadded (possibly
>>>> *immediately* so) on boot.  This avoids both the boot time funnies as
>>>> well as the scaling issues with metadata.
>>>>
>>>
>>> So in this kind of scenario, hotpluggable memory will not be detected
>>> at boot time, and admin should not use this movable_node boot option
>>> and the kernel will act as before, using top-down allocation always.
>>>
>>
>> Yes.  The idea is that the kernel will boot up without the hotplug
>> memory, but if desired, will immediately see a hotplug-add event for the
>> movable memory.
> 
> Yeah, this is good.
> 
> But in the scenario where we boot with hotplug memory, we need the
> movable_node option. So as tejun has explained a lot about this patchset,
> do you still have any objection to it or could I ask andrew to merge it
> into the -mm tree for more tests?
> 
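
For readers following the top-down versus bottom-up point quoted above,
here is a minimal sketch of the two allocation directions.  It is not
the memblock code from this patchset: the function name `alloc`, the
`floor` parameter (standing in for the end of the kernel image) and the
example addresses are all assumptions made for illustration.

```python
def alloc(free_ranges, size, bottom_up, floor=0):
    """Return the start address of a `size`-byte block, or None.

    free_ranges: list of (start, end) tuples, end exclusive.
    floor: lowest usable address for bottom-up mode (hypothetical
           stand-in for the end of the kernel image).
    """
    fits = []
    for start, end in free_ranges:
        # Bottom-up never hands out memory below the floor.
        lo = max(start, floor) if bottom_up else start
        if end - lo >= size:
            # Bottom-up takes the lowest fit in the range,
            # top-down takes the highest fit.
            fits.append(lo if bottom_up else end - size)
    if not fits:
        return None
    return min(fits) if bottom_up else max(fits)


# Hypothetical layout: a low range holding the kernel, and a high range.
free = [(0x1000, 0x8000), (0x10000, 0x20000)]
kernel_end = 0x2000

near_kernel = alloc(free, 0x1000, bottom_up=True, floor=kernel_end)
far_from_kernel = alloc(free, 0x1000, bottom_up=False)
```

In this toy layout the bottom-up request lands directly above the
assumed kernel end, while the top-down request lands at the top of the
highest range, which is where hotpluggable memory could sit.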

Since tejun has explained a lot about this approach, could we come to
an agreement on this one?

Peter? If you have no objection, I'll post a new v7 version which will fix
the __pa_symbol problem you pointed out.

-- 
Thanks.
Zhang Yanfei

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-10  1:00                     ` Tejun Heo
@ 2013-10-10 14:36                       ` Toshi Kani
  -1 siblings, 0 replies; 109+ messages in thread
From: Toshi Kani @ 2013-10-10 14:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zhang Yanfei, H. Peter Anvin, Andrew Morton, Rafael J . Wysocki,
	lenb, Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman@redhat.com

On Thu, 2013-10-10 at 01:00 +0000, Tejun Heo wrote:
> Hello, Toshi.
> 
> On Wed, Oct 09, 2013 at 05:58:55PM -0600, Toshi Kani wrote:
> > Well, there was a plan before, which considered to enhance it to a
> > memory device granularity at step 3.  But we had a major replan at step
> > 1 per your suggestion.
> > 
> > https://lkml.org/lkml/2013/6/19/73
> 
> Where?
> 
>  "3. Improve memory hotplug to support local device pagetable."
> 
> How can the above possibly be considered as a plan for finer
> granularity?  Forget about the "how" part.  The stated goal doesn't
> even mention finer granularity.  

The word "device" above refers to memory device level granularity.

> Are firmware writers gonna be
> required to split SRAT entries into multiple sub-nodes to support it?

Yes, and that's part of the ACPI spec.  It's not something the OS
asks for.  If a memory range has a different attribute, firmware has
to put it in a separate entry.

> Is segregating zones further for this even a good idea?  Adding more
> NUMA nodes has its own overhead and the mm code isn't written
> expecting it to be repurposed for segmenting the same NUMA node for
> hotplug underneath it.

I agree.  But my point is that it is an issue today with the current
kernel implementation.  This issue is not introduced by using SRAT.

> Maybe zoning is a viable approach.  Maybe it is not.  I don't know,
> but you guys don't seem to be too interested in actual long term
> planning while pushing for something invasive which may or may not be
> viable in the longer term, which can often lead to silly situations.
> It isn't even clear whether SRAT is the right interface for this.  If
> it's gonna require firmware writers' cooperation anyway, why not
> provide the information as extended part of e820?  It doesn't seem to
> have much to do with NUMA or zones.  The only information the kernel
> needs to know is whether certain memory areas should only be used for
> page cache.

SRAT and the _EJ0 method are the only interfaces that define ejectability
in the standard spec.  Are you suggesting that we change the e820 spec or
not comply with the spec?  I do not think such approaches work.

> At this point, at least to me, it doesn't seem reasonably clear how
> this is gonna develop and the whole thing feels like a kludge, which
> can be fine too, but seriously if you guys wanna push for an invasive
> approach, it should really be backed by longer term plan, vision,
> justification and the ability to make the necessary changes in the
> various involved layers.  Maybe I'm being too pessimistic but I feel
> that there is a lot missing in most of those areas, which makes it
> quite risky to commit to invasive changes.
> 
> If the zone based kludgy approach is something meaningfully useful,
> I'd suggest sticking to it at least for now.  Some of it would be
> useful anyway and if it doesn't pan out, the added maintenance overhead
> is fairly low.

I think memory hotplug was originally implemented on ia64 with node
granularity.  I share your concerns, but that was done a long time
ago.  It's too late to complain about the past.  This SRAT work is not
introducing such a restriction.

Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-10 14:36                       ` Toshi Kani
@ 2013-10-10 15:35                         ` Tejun Heo
  -1 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2013-10-10 15:35 UTC (permalink / raw)
  To: Toshi Kani
  Cc: Zhang Yanfei, H. Peter Anvin, Andrew Morton, Rafael J . Wysocki,
	lenb, Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman@redhat.com

Hello,

On Thu, Oct 10, 2013 at 08:36:49AM -0600, Toshi Kani wrote:
> >  "3. Improve memory hotplug to support local device pagetable."
> > 
> > How can the above possibly be considered as a plan for finer
> > granularity?  Forget about the "how" part.  The stated goal doesn't
> > even mention finer granularity.  
> 
> The word "device" above refers to memory device level granularity.

That's a lot of reading between the lines.

> > Are firmware writers gonna be
> > required to split SRAT entries into multiple sub-nodes to support it?
> 
> Yes, and that's part of the ACPI spec.  It's not something the OS
> asks for.  If a memory range has a different attribute, firmware has
> to put it in a separate entry.

I was referring to having to segment a contiguous hotplug memory area
further to support finer granularity.  This is represented by separate
mem devices rather than segmented SRAT entries, right?  Hmmm... so we
should parse device nodes before setting up page tables?

> SRAT and the _EJ0 method are the only interfaces that define ejectability
> in the standard spec.  Are you suggesting that we change the e820 spec or
> not comply with the spec?  I do not think such approaches work.

It's slower but standards get revised and updated over time.  Have no
idea whether there'd be a sane way to do that for e820 tho.

> I think memory hotplug was originally implemented on ia64 with node
> granularity.  I share your concerns, but that was done a long time
> ago.  It's too late to complain about the past.  This SRAT work is not
> introducing such a restriction.

We're going round and round.  You're saying that using SRAT isn't
worse than what came before while failing to illustrate how committing
to invasive changes would eventually lead to something better.  "it
isn't worse" isn't much of an argument.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-10 15:35                         ` Tejun Heo
@ 2013-10-10 16:24                           ` Toshi Kani
  -1 siblings, 0 replies; 109+ messages in thread
From: Toshi Kani @ 2013-10-10 16:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zhang Yanfei, H. Peter Anvin, Andrew Morton, Rafael J . Wysocki,
	lenb, Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman@redhat.com

Hello Tejun,

On Thu, 2013-10-10 at 11:35 -0400, Tejun Heo wrote:
 :
> > > Are firmware writers gonna be
> > > required to split SRAT entries into multiple sub-nodes to support it?
> > 
> > Yes, and that's part of the ACPI spec.  It's not something the OS
> > asks for.  If a memory range has a different attribute, firmware has
> > to put it in a separate entry.
> 
> I was referring to having to segment a contiguous hotplug memory area
> further to support finer granularity.  This is represented by separate
> mem devices rather than segmented SRAT entries, right?  Hmmm... so we
> should parse device nodes before setting up page tables?

Yes, a memory device object is the finest granularity at which memory
hotplug can be performed on ACPI-based platforms.  SRAT must be
consistent with the memory device object info, but its entries do not
have to be segmented at device granularity.  An entry only needs to be
split where the memory attribute differs.  For instance, SRAT may have
a single entry for Case A), but Case B) must have two separate entries.
In both cases, MEMA & MEMB represent a contiguous memory range.

Case A) Both MEMA and MEMB devices are hotpluggable

 MEMA:  _CRS: 0x0000-0x0fff  _EJ0: hotpluggable
 MEMB:  _CRS: 0x1000-0x1fff  _EJ0: hotpluggable

 SRAT: 0x0000-0x1fff  hotpluggable

Case B) Only MEMB is hotpluggable

 MEMA:  _CRS: 0x0000-0x0fff
 MEMB:  _CRS: 0x1000-0x1fff  _EJ0: hotpluggable

 SRAT: 0x0000-0x0fff
       0x1000-0x1fff  hotpluggable
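
The segmentation rule in the two cases above can be sketched roughly as
follows.  This is only an illustration, not firmware or kernel code:
`MemRange`, `srat_entries` and the `hotpluggable` flag (standing in for
"the device has _EJ0") are hypothetical names, and merging adjacent
ranges is something SRAT may do, not something it has to do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemRange:
    start: int          # first byte address (inclusive)
    end: int            # last byte address (inclusive)
    hotpluggable: bool  # stands in for "device has _EJ0"


def srat_entries(devices):
    """Merge contiguous device ranges that share the same attribute."""
    entries = []
    for dev in sorted(devices, key=lambda d: d.start):
        last = entries[-1] if entries else None
        if (last is not None
                and last.end + 1 == dev.start
                and last.hotpluggable == dev.hotpluggable):
            # Same attribute and contiguous: one entry may cover both.
            entries[-1] = MemRange(last.start, dev.end, last.hotpluggable)
        else:
            # Attribute changes (or there is a gap): a new entry is needed.
            entries.append(MemRange(dev.start, dev.end, dev.hotpluggable))
    return entries


# Case A: MEMA and MEMB are both hotpluggable.
case_a = [MemRange(0x0000, 0x0fff, True), MemRange(0x1000, 0x1fff, True)]
# Case B: only MEMB is hotpluggable.
case_b = [MemRange(0x0000, 0x0fff, False), MemRange(0x1000, 0x1fff, True)]

entries_a = srat_entries(case_a)
entries_b = srat_entries(case_b)
```

With these inputs Case A collapses into a single entry and Case B stays
as two.  The merged Case A entry no longer shows where MEMA ends and
MEMB begins, so the device boundaries have to come from the memory
device objects themselves.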

> > SRAT and the _EJ0 method are the only interfaces that define ejectability
> > in the standard spec.  Are you suggesting that we change the e820 spec or
> > not comply with the spec?  I do not think such approaches work.
> 
> It's slower but standards get revised and updated over time.  Have no
> idea whether there'd be a sane way to do that for e820 tho.

I am familiar with the process.  Yes, it is slow, but most importantly,
it needs some standards group or company to actively maintain the spec in
order to update it.  I do not think e820 is in such a state.

> > I think memory hotplug was originally implemented on ia64 with node
> > granularity.  I share your concerns, but that was done a long time
> > ago.  It's too late to complain about the past.  This SRAT work is not
> > introducing such a restriction.
> 
> We're going round and round.  You're saying that using SRAT isn't
> worse than what came before while failing to illustrate how committing
> to invasive changes would eventually lead to something better.  "it
> isn't worse" isn't much of an argument.

We did avoid moving up the ACPI table init function per your suggestion.
I guess I do not understand why you are still concerned about using SRAT...

Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-10 16:24                           ` Toshi Kani
@ 2013-10-10 16:46                             ` Tejun Heo
  -1 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2013-10-10 16:46 UTC (permalink / raw)
  To: Toshi Kani
  Cc: Zhang Yanfei, H. Peter Anvin, Andrew Morton, Rafael J . Wysocki,
	lenb, Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman@redhat.com

Hey,

On Thu, Oct 10, 2013 at 10:24:09AM -0600, Toshi Kani wrote:
> > We're going round and round.  You're saying that using SRAT isn't
> > worse than what came before while failing to illustrate how committing
> > to invasive changes would eventually lead to something better.  "it
> > isn't worse" isn't much of an argument.
> 
> We did avoid moving up the ACPI table init function per your suggestion.
> I guess I do not understand why you are still concerned about using SRAT...

As you wrote above, SRAT is not enough to support device granularity.
We need to parse the device hierarchy too before setting up page
tables and one of the previous arguments was "it's only SRAT".  It
doesn't instill confidence when there doesn't seem to be much long
term planning going on especially as the general quality of the
patches isn't particularly high.  I find it difficult to believe that
this effort as it currently stands is likely to reach a full solution
and as such it feels much safer to opt for a simpler, less dangerous
approach for immediate use, for which either approach doesn't make much
of a difference.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-10 16:46                             ` Tejun Heo
@ 2013-10-10 16:50                               ` Toshi Kani
  -1 siblings, 0 replies; 109+ messages in thread
From: Toshi Kani @ 2013-10-10 16:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zhang Yanfei, H. Peter Anvin, Andrew Morton, Rafael J . Wysocki,
	lenb, Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman@redhat.com

Hello,

On Thu, 2013-10-10 at 12:46 -0400, Tejun Heo wrote:
> On Thu, Oct 10, 2013 at 10:24:09AM -0600, Toshi Kani wrote:
> > > We're going round and round.  You're saying that using SRAT isn't
> > > worse than what came before while failing to illustrate how committing
> > > to invasive changes would eventually lead to something better.  "it
> > > isn't worse" isn't much of an argument.
> > 
> > We did avoid moving up the ACPI table init function per your suggestion.
> > I guess I do not understand why you still concerned about using SRAT...
> 
> As you wrote above, SRAT is not enough to support device granularity.
> We need to parse the device hierarchy too before setting up page
> tables and one of the previous arguments was "it's only SRAT".  It
> doesn't instill confidence when there doesn't seem to be much long
> term planning going on especially as the general quality of the
> patches isn't particularly high.  I find it difficult to believe that
> this effort as it currently stands is likely to reach full solution
> and as such it feels much safer to opt for a simpler, less dangerous
> approach for immediate use, for which either approach doesn't make much
> of difference.

Can you elaborate why we need to parse the device hierarchy before
setting up page tables?

Thanks,
-Toshi



* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-10 16:50                               ` Toshi Kani
@ 2013-10-10 16:55                                 ` Tejun Heo
  -1 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2013-10-10 16:55 UTC (permalink / raw)
  To: Toshi Kani
  Cc: Zhang Yanfei, H. Peter Anvin, Andrew Morton, Rafael J . Wysocki,
	lenb, Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman@redhat.com

On Thu, Oct 10, 2013 at 10:50:40AM -0600, Toshi Kani wrote:
> Can you elaborate why we need to parse the device hierarchy before
> setting up page tables?

How else can one put the page tables on the "local device"?  Am I
missing something?

-- 
tejun


* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-10 16:55                                 ` Tejun Heo
@ 2013-10-10 16:59                                   ` Toshi Kani
  -1 siblings, 0 replies; 109+ messages in thread
From: Toshi Kani @ 2013-10-10 16:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zhang Yanfei, H. Peter Anvin, Andrew Morton, Rafael J . Wysocki,
	lenb, Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman@redhat.com

On Thu, 2013-10-10 at 12:55 -0400, Tejun Heo wrote:
> On Thu, Oct 10, 2013 at 10:50:40AM -0600, Toshi Kani wrote:
> > Can you elaborate why we need to parse the device hierarchy before
> > setting up page tables?
> 
> How else can one put the page tables on the "local device"?  Am I
> missing something?

The local page table item is gone under the current plan as you
suggested...

Thanks,
-Toshi



* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-10 16:59                                   ` Toshi Kani
@ 2013-10-10 17:12                                     ` H. Peter Anvin
  -1 siblings, 0 replies; 109+ messages in thread
From: H. Peter Anvin @ 2013-10-10 17:12 UTC (permalink / raw)
  To: Toshi Kani, Tejun Heo
  Cc: Zhang Yanfei, Andrew Morton, Rafael J . Wysocki, lenb,
	Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger, Yinghai Lu,
	Jiang Liu, Wen Congyang, Lai Jiangshan, isimatu.yasuaki,
	izumi.taku, Mel Gorman, Minchan Kim, mina86, gong.chen,
	vasilis.liaskovitis@profitbricks.com

On 10/10/2013 09:59 AM, Toshi Kani wrote:
> On Thu, 2013-10-10 at 12:55 -0400, Tejun Heo wrote:
>> On Thu, Oct 10, 2013 at 10:50:40AM -0600, Toshi Kani wrote:
>>> Can you elaborate why we need to parse the device hierarchy before
>>> setting up page tables?
>>
>> How else can one put the page tables on the "local device"?  Am I
>> missing something?
> 
> The local page table item is gone under the current plan as you
> suggested...
> 

That would be a significant performance regression.

	-hpa



* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-10 17:12                                     ` H. Peter Anvin
@ 2013-10-10 19:17                                       ` Toshi Kani
  -1 siblings, 0 replies; 109+ messages in thread
From: Toshi Kani @ 2013-10-10 19:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Tejun Heo, Zhang Yanfei, Andrew Morton, Rafael J . Wysocki, lenb,
	Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger, Yinghai Lu,
	Jiang Liu, Wen Congyang, Lai Jiangshan, isimatu.yasuaki,
	izumi.taku, Mel Gorman, Minchan Kim, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman

On Thu, 2013-10-10 at 10:12 -0700, H. Peter Anvin wrote:
> On 10/10/2013 09:59 AM, Toshi Kani wrote:
> > On Thu, 2013-10-10 at 12:55 -0400, Tejun Heo wrote:
> >> On Thu, Oct 10, 2013 at 10:50:40AM -0600, Toshi Kani wrote:
> >>> Can you elaborate why we need to parse the device hierarchy before
> >>> setting up page tables?
> >>
> >> How else can one put the page tables on the "local device"?  Am I
> >> missing something?
> > 
> > The local page table item is gone under the current plan as you
> > suggested...
> > 
> 
> That would be a significant performance regression.

In earlier discussions, Tejun pointed out that huge mappings dismiss the
benefit of local page tables.

https://lkml.org/lkml/2013/8/23/245

Thanks,
-Toshi


* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-10 19:17                                       ` Toshi Kani
@ 2013-10-10 22:19                                         ` Tejun Heo
  -1 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2013-10-10 22:19 UTC (permalink / raw)
  To: Toshi Kani
  Cc: H. Peter Anvin, Zhang Yanfei, Andrew Morton, Rafael J . Wysocki,
	lenb, Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman@redhat.com

On Thu, Oct 10, 2013 at 01:17:10PM -0600, Toshi Kani wrote:
> In earlier discussions, Tejun pointed out that huge mappings dismiss the
> benefit of local page tables.
> 
> https://lkml.org/lkml/2013/8/23/245

This is going nowhere.  If we're assuming use of large mappings, none
of this matters.  The pagetable is gonna be small no matter what and
locating it near kernel image doesn't really impact anything whether
hotplug is gonna be per-node or per-device.  Short of the ability to
relocate kernel image itself, parsing or not parsing SRAT early
doesn't lead to anything of consequence.  What are we even arguing
about?  That's what bothers me about this effort.  Nobody seems to
have actually thought it through.

To summarize,

* To do local page table, full ACPI device hierarchy should be parsed.

* Local page table is pointless if you assume huge mappings and the
  plan is to assume huge mappings so that only SRAT is necessary
  before allocating page tables.

* But if you assume huge mappings, it doesn't make material difference
  whether the page table is after the kernel image or near the top of
  non-hotpluggable memory.  It's tiny anyway.

* So, what's the point of pulling SRAT parsing into early boot?  If we
  assume huge mappings, it doesn't make any material difference for
  either per-node or per-device unplug - it's tiny.  If we don't
  assume huge mappings, we're talking about parsing full ACPI device
  tree before building pagetable.  Let's say that's something we can
  accept.  Is the benefit worthwhile?  Doing all that just for debug
  configs?  Is that something people are actually arguing for?  Sure,
  if it works without too much effort, it's great, but do we really
  wanna do all that and update page table allocation so that
  everything is per-device just to support debug configs, for real?

I'm not asking for super concrete plan but right now people working on
this don't seem to have much idea of what the goals are or why they
want certain things and the discussions naturally repeat themselves.
FWIW, I'm getting to a point where I think nacking the whole series is
the right thing to do here.

Thanks.

-- 
tejun


* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-10 22:19                                         ` Tejun Heo
@ 2013-10-10 23:00                                           ` Toshi Kani
  -1 siblings, 0 replies; 109+ messages in thread
From: Toshi Kani @ 2013-10-10 23:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: H. Peter Anvin, Zhang Yanfei, Andrew Morton, Rafael J . Wysocki,
	lenb, Thomas Gleixner, mingo, Wanpeng Li, Thomas Renninger,
	Yinghai Lu, Jiang Liu, Wen Congyang, Lai Jiangshan,
	isimatu.yasuaki, izumi.taku, Mel Gorman, Minchan Kim, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman@redhat.com

On Thu, 2013-10-10 at 18:19 -0400, Tejun Heo wrote:
> On Thu, Oct 10, 2013 at 01:17:10PM -0600, Toshi Kani wrote:
> > In earlier discussions, Tejun pointed out that huge mappings dismiss the
> > benefit of local page tables.
> > 
> > https://lkml.org/lkml/2013/8/23/245
> 
> This is going nowhere.  If we're assuming use of large mappings, none
> of this matters.  The pagetable is gonna be small no matter what and
> locating it near kernel image doesn't really impact anything whether
> hotplug is gonna be per-node or per-device.  Short of the ability to
> relocate kernel image itself, parsing or not parsing SRAT early
> doesn't lead to anything of consequence.  What are we even arguing
> about?  That's what bothers me about this effort.  Nobody seems to
> have actually thought it through.

Calm down, please.  I simply referred to the thread where we had
discussed this matter and reached agreement, so that we do not have to
repeat the same discussion again.

> To summarize,
> 
> * To do local page table, full ACPI device hierarchy should be parsed.
> 
> * Local page table is pointless if you assume huge mappings and the
>   plan is to assume huge mappings so that only SRAT is necessary
>   before allocating page tables.
> 
> * But if you assume huge mappings, it doesn't make material difference
>   whether the page table is after the kernel image or near the top of
>   non-hotpluggable memory.  It's tiny anyway.
> 
> * So, what's the point of pulling SRAT parsing into early boot?  If we
>   assume huge mappings, it doesn't make any material difference for
>   either per-node or per-device unplug - it's tiny.  If we don't
>   assume huge mappings, we're talking about parsing full ACPI device
>   tree before building pagetable.  Let's say that's something we can
>   accept.  Is the benefit worthwhile?  Doing all that just for debug
>   configs?  Is that something people are actually arguing for?  Sure,
>   if it works without too much effort, it's great, but do we really
>   wanna do all that and update page table allocation so that
>   everything is per-device just to support debug configs, for real?
>
> I'm not asking for super concrete plan but right now people working on
> this don't seem to have much idea of what the goals are or why they
> want certain things and the discussions naturally repeat themselves.
> FWIW, I'm getting to a point where I think nacking the whole series is
> the right thing to do here.

The patchset out for reviewing does not pull SRAT parsing into early
boot.

Thanks,
-Toshi


* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-09 19:23             ` Tejun Heo
@ 2013-10-11  5:27               ` Yinghai Lu
  -1 siblings, 0 replies; 109+ messages in thread
From: Yinghai Lu @ 2013-10-11  5:27 UTC (permalink / raw)
  To: Tejun Heo, Andrew Morton, H. Peter Anvin, Ingo Molnar
  Cc: Zhang Yanfei, Rafael J . Wysocki, Len Brown, Thomas Gleixner,
	Toshi Kani, Wanpeng Li, Thomas Renninger, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi,
	Mel Gorman, Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis,
	lwoodman, Rik van Riel, jweiner, Prarit Bhargava, x86, linux-d

On Wed, Oct 9, 2013 at 12:23 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Yinghai.
>
> On Wed, Oct 09, 2013 at 12:10:34PM -0700, Yinghai Lu wrote:
>> > I still feel quite uneasy about pulling SRAT parsing and ACPI initrd
>> > overriding into early boot.
>>
>> for your reconsidering to parse srat early, I refresh that old patchset
>> at
>>
>> https://git.kernel.org/cgit/linux/kernel/git/yinghai/linux-yinghai.git/log/?h=for-x86-mm-3.13
>>
>> actually it looks like one-third or half of the patches already have your ack.
>
> Yes, but those acks assume that the overall approach is a good idea.
> The biggest issue that I have with the approach is that it is invasive
> and modifies basic structure for an inherently kludgy solution for a
> quite niche problem.  The benefit / cost ratio still seems quite off
> to me - we're making a lot of general changes to serve something very
> specialized, which might not even stay relevant for long time.
>

I really hate adding another code path.

Now with v7 from Yanfei, we will have a movable_node boot command-line
parameter, and if it is specified the kernel will allocate RAM early in
a different way.

The parse-SRAT-early patchset adds about 217 lines (from "x86, ACPI,
NUMA, ia64: split SLIT handling out"):

 arch/ia64/kernel/setup.c                |   4 +-
 arch/x86/include/asm/acpi.h             |   3 +-
 arch/x86/include/asm/page_types.h       |   2 +-
 arch/x86/include/asm/pgtable.h          |   2 +-
 arch/x86/include/asm/setup.h            |   9 ++
 arch/x86/kernel/head64.c                |   2 +
 arch/x86/kernel/head_32.S               |   4 +
 arch/x86/kernel/microcode_intel_early.c |   8 +-
 arch/x86/kernel/setup.c                 |  86 ++++++-----
 arch/x86/mm/init.c                      | 101 ++++++++-----
 arch/x86/mm/numa.c                      | 244 +++++++++++++++++++++++++-------
 arch/x86/mm/numa_emulation.c            |   2 +-
 arch/x86/mm/numa_internal.h             |   2 +
 arch/x86/mm/srat.c                      |  11 +-
 drivers/acpi/numa.c                     |  13 +-
 drivers/acpi/osl.c                      | 131 ++++++++++++-----
 include/linux/acpi.h                    |  20 +--
 include/linux/mm.h                      |   3 -
 mm/page_alloc.c                         |  52 +------
 19 files changed, 458 insertions(+), 241 deletions(-)

If I drop the last two, i.e. do not allocate page tables on the local node
and only keep page tables on the first node, it only needs to add 137 lines:

 arch/ia64/kernel/setup.c                |   4 +-
 arch/x86/include/asm/acpi.h             |   3 +-
 arch/x86/include/asm/page_types.h       |   2 +-
 arch/x86/include/asm/setup.h            |   9 ++
 arch/x86/kernel/head64.c                |   2 +
 arch/x86/kernel/head_32.S               |   4 +
 arch/x86/kernel/microcode_intel_early.c |   8 +-
 arch/x86/kernel/setup.c                 |  85 +++++++++------
 arch/x86/mm/init.c                      |  10 +-
 arch/x86/mm/numa.c                      | 188 +++++++++++++++++++++++---------
 arch/x86/mm/numa_emulation.c            |   2 +-
 arch/x86/mm/numa_internal.h             |   2 +
 arch/x86/mm/srat.c                      |  11 +-
 drivers/acpi/numa.c                     |  13 ++-
 drivers/acpi/osl.c                      | 131 +++++++++++++++-------
 include/linux/acpi.h                    |  20 ++--
 include/linux/mm.h                      |   3 -
 mm/page_alloc.c                         |  52 +--------
 18 files changed, 343 insertions(+), 206 deletions(-)

and Yanfei's adds about 265 lines:

 Documentation/kernel-parameters.txt |    3 +
 arch/x86/kernel/setup.c             |    9 ++-
 arch/x86/mm/init.c                  |  122 ++++++++++++++++++++++++++++------
 arch/x86/mm/numa.c                  |   11 +++
 include/linux/memblock.h            |   24 +++++++
 include/linux/mm.h                  |    4 +
 mm/Kconfig                          |   17 +++--
 mm/memblock.c                       |  126 +++++++++++++++++++++++++++++++----
 mm/memory_hotplug.c                 |   31 +++++++++
 9 files changed, 306 insertions(+), 41 deletions(-)
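
The "about N lines" figures quoted above are net counts, i.e. insertions minus
deletions from each diffstat summary line. A small sketch that reproduces them,
assuming the standard `git diff --stat` summary format (the function name
`net_lines_added` is made up for illustration):

```python
import re

# Matches a diffstat summary line such as
# "19 files changed, 458 insertions(+), 241 deletions(-)".
SUMMARY_RE = re.compile(
    r"(\d+) files? changed"
    r"(?:, (\d+) insertions?\(\+\))?"
    r"(?:, (\d+) deletions?\(-\))?"
)


def net_lines_added(summary: str) -> int:
    """Return insertions minus deletions for one diffstat summary line."""
    match = SUMMARY_RE.search(summary)
    if match is None:
        raise ValueError(f"not a diffstat summary: {summary!r}")
    insertions = int(match.group(2) or 0)
    deletions = int(match.group(3) or 0)
    return insertions - deletions


if __name__ == "__main__":
    summaries = {
        "parse SRAT early (full)":
            "19 files changed, 458 insertions(+), 241 deletions(-)",
        "parse SRAT early (last two dropped)":
            "18 files changed, 343 insertions(+), 206 deletions(-)",
        "Yanfei's series":
            "9 files changed, 306 insertions(+), 41 deletions(-)",
    }
    for label, summary in summaries.items():
        print(f"{label}: {net_lines_added(summary)}")
```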

For the long term, to keep the code more maintainable, we really should go
through with parsing the SRAT table early.

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-11  5:27               ` Yinghai Lu
@ 2013-10-11  5:47                 ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-11  5:47 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Tejun Heo, Andrew Morton, H. Peter Anvin, Ingo Molnar,
	Zhang Yanfei, Rafael J . Wysocki, Len Brown, Thomas Gleixner,
	Toshi Kani, Wanpeng Li, Thomas Renninger, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Taku Izumi,
	Mel Gorman, Minchan Kim, mina86, gong.chen, Vasilis Liaskovitis

Hello Yinghai,

I understand your opinion, but using the amount of code modification as the
argument doesn't seem to stand. More code doesn't mean more complexity...

On 10/11/2013 01:27 PM, Yinghai Lu wrote:
> On Wed, Oct 9, 2013 at 12:23 PM, Tejun Heo <tj@kernel.org> wrote:
>> Hello, Yinghai.
>>
>> On Wed, Oct 09, 2013 at 12:10:34PM -0700, Yinghai Lu wrote:
>>>> I still feel quite uneasy about pulling SRAT parsing and ACPI initrd
>>>> overriding into early boot.
>>>
>>> for your reconsidering to parse srat early, I refresh that old patchset
>>> at
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/yinghai/linux-yinghai.git/log/?h=for-x86-mm-3.13
>>>
>>> actually, it looks like one-third or half of the patches already have your ack.
>>
>> Yes, but those acks assume that the overall approach is a good idea.
>> The biggest issue that I have with the approach is that it is invasive
>> and modifies basic structure for an inherently kludgy solution for a
>> quite niche problem.  The benefit / cost ratio still seems quite off
>> to me - we're making a lot of general changes to serve something very
>> specialized, which might not even stay relevant for long time.
>>
> 
> I really hate adding another code path.
> 
> Now with v7 from Yanfei, we will have a movable_node boot command-line parameter,
> and if it is specified the kernel will allocate RAM early in a different way.
> 
> The parse-SRAT-early patchset adds about 217 lines (from "x86, ACPI, NUMA,
> ia64: split SLIT handling out"):
> 
>  arch/ia64/kernel/setup.c                |   4 +-
>  arch/x86/include/asm/acpi.h             |   3 +-
>  arch/x86/include/asm/page_types.h       |   2 +-
>  arch/x86/include/asm/pgtable.h          |   2 +-
>  arch/x86/include/asm/setup.h            |   9 ++
>  arch/x86/kernel/head64.c                |   2 +
>  arch/x86/kernel/head_32.S               |   4 +
>  arch/x86/kernel/microcode_intel_early.c |   8 +-
>  arch/x86/kernel/setup.c                 |  86 ++++++-----
>  arch/x86/mm/init.c                      | 101 ++++++++-----
>  arch/x86/mm/numa.c                      | 244 +++++++++++++++++++++++++-------
>  arch/x86/mm/numa_emulation.c            |   2 +-
>  arch/x86/mm/numa_internal.h             |   2 +
>  arch/x86/mm/srat.c                      |  11 +-
>  drivers/acpi/numa.c                     |  13 +-
>  drivers/acpi/osl.c                      | 131 ++++++++++++-----
>  include/linux/acpi.h                    |  20 +--
>  include/linux/mm.h                      |   3 -
>  mm/page_alloc.c                         |  52 +------
>  19 files changed, 458 insertions(+), 241 deletions(-)
> 
> If I drop the last two, i.e. do not allocate page tables on the local node
> and only keep page tables on the first node, it only needs to add 137 lines:
> 
>  arch/ia64/kernel/setup.c                |   4 +-
>  arch/x86/include/asm/acpi.h             |   3 +-
>  arch/x86/include/asm/page_types.h       |   2 +-
>  arch/x86/include/asm/setup.h            |   9 ++
>  arch/x86/kernel/head64.c                |   2 +
>  arch/x86/kernel/head_32.S               |   4 +
>  arch/x86/kernel/microcode_intel_early.c |   8 +-
>  arch/x86/kernel/setup.c                 |  85 +++++++++------
>  arch/x86/mm/init.c                      |  10 +-
>  arch/x86/mm/numa.c                      | 188 +++++++++++++++++++++++---------
>  arch/x86/mm/numa_emulation.c            |   2 +-
>  arch/x86/mm/numa_internal.h             |   2 +
>  arch/x86/mm/srat.c                      |  11 +-
>  drivers/acpi/numa.c                     |  13 ++-
>  drivers/acpi/osl.c                      | 131 +++++++++++++++-------
>  include/linux/acpi.h                    |  20 ++--
>  include/linux/mm.h                      |   3 -
>  mm/page_alloc.c                         |  52 +--------
>  18 files changed, 343 insertions(+), 206 deletions(-)
> 
> and Yanfei's adds about 265 lines:
> 
>  Documentation/kernel-parameters.txt |    3 +
>  arch/x86/kernel/setup.c             |    9 ++-
>  arch/x86/mm/init.c                  |  122 ++++++++++++++++++++++++++++------
>  arch/x86/mm/numa.c                  |   11 +++
>  include/linux/memblock.h            |   24 +++++++
>  include/linux/mm.h                  |    4 +
>  mm/Kconfig                          |   17 +++--
>  mm/memblock.c                       |  126 +++++++++++++++++++++++++++++++----
>  mm/memory_hotplug.c                 |   31 +++++++++
>  9 files changed, 306 insertions(+), 41 deletions(-)
> 
> For the long term, to keep the code more maintainable, we really should go
> through with parsing the SRAT table early.
> 
> Thanks
> 
> Yinghai
> 


-- 
Thanks.
Zhang Yanfei

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-11  5:47                 ` Zhang Yanfei
@ 2013-10-11  6:33                   ` Ingo Molnar
  -1 siblings, 0 replies; 109+ messages in thread
From: Ingo Molnar @ 2013-10-11  6:33 UTC (permalink / raw)
  To: Zhang Yanfei
  Cc: Yinghai Lu, Tejun Heo, Andrew Morton, H. Peter Anvin,
	Ingo Molnar, Zhang Yanfei, Rafael J . Wysocki, Len Brown,
	Thomas Gleixner, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Jiang Liu, Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu,
	Taku Izumi, Mel Gorman, Minchan Kim, mina86, gong.chen,
	Vasilis Liaskovitis, lwoodman@redhat.com


* Zhang Yanfei <zhangyanfei@cn.fujitsu.com> wrote:

> Hello Yinghai,
> 
> I understand your opinion, but using the amount of code modification as the
> argument doesn't seem to stand. More code doesn't mean more complexity...

I think you forgot to reply to this point:

> > For the long term, to keep the code more maintainable, we really should go
> > through with parsing the SRAT table early.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up
  2013-10-11  6:33                   ` Ingo Molnar
@ 2013-10-11  6:46                     ` Zhang Yanfei
  -1 siblings, 0 replies; 109+ messages in thread
From: Zhang Yanfei @ 2013-10-11  6:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Yinghai Lu, Tejun Heo, Andrew Morton, H. Peter Anvin,
	Ingo Molnar, Zhang Yanfei, Rafael J . Wysocki, Len Brown,
	Thomas Gleixner, Toshi Kani, Wanpeng Li, Thomas Renninger,
	Jiang Liu, Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu,
	Taku Izumi, Mel Gorman, Minchan Kim, mina86, gong.chen

Hello Ingo,

On 10/11/2013 02:33 PM, Ingo Molnar wrote:
> 
> * Zhang Yanfei <zhangyanfei@cn.fujitsu.com> wrote:
> 
>> Hello Yinghai,
>>
>> I understand your opinion, but using the amount of code modification as the
>> argument doesn't seem to stand. More code doesn't mean more complexity...
> 
> I think you forgot to reply to this point:
> 
>>> For the long term, to keep the code more maintainable, we really should go
>>> through with parsing the SRAT table early.
> 

Both ways (the approach of this patchset and the approach of parsing
SRAT earlier) could let us realize the functionality that we want. So
as long as this point can convince Tejun, I am OK with that.

-- 
Thanks.
Zhang Yanfei

^ permalink raw reply	[flat|nested] 109+ messages in thread

end of thread, other threads:[~2013-10-11  6:47 UTC | newest]

Thread overview: 109+ messages
2013-10-04  1:56 [PATCH part1 v6 0/6] x86, memblock: Allocate memory near kernel image before SRAT parsed Zhang Yanfei
2013-10-04  1:57 ` [PATCH part1 v6 1/6] memblock: Factor out of top-down allocation Zhang Yanfei
2013-10-04  1:58 ` [PATCH part1 v6 2/6] memblock: Introduce bottom-up allocation mode Zhang Yanfei
2013-10-05 21:30   ` Toshi Kani
2013-10-04  1:59 ` [PATCH part1 v6 3/6] x86/mm: Factor out of top-down direct mapping setup Zhang Yanfei
2013-10-04  2:00 ` [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up Zhang Yanfei
2013-10-05 22:09   ` Toshi Kani
2013-10-07  0:00   ` H. Peter Anvin
2013-10-07 14:17     ` Zhang Yanfei
2013-10-08 17:36     ` Zhang Yanfei
2013-10-09 16:44       ` Tejun Heo
2013-10-09 17:14         ` Zhang Yanfei
2013-10-09 19:20           ` Tejun Heo
2013-10-09 19:30             ` Dave Hansen
2013-10-09 19:47               ` Tejun Heo
2013-10-09 20:58             ` Toshi Kani
2013-10-09 21:11               ` Tejun Heo
2013-10-09 21:14                 ` H. Peter Anvin
2013-10-09 21:45                   ` Zhang Yanfei
2013-10-09 23:10                     ` H. Peter Anvin
2013-10-09 23:26                       ` Zhang Yanfei
2013-10-10  1:20                         ` Zhang Yanfei
2013-10-10  0:25                   ` Toshi Kani
2013-10-09 23:58                 ` Toshi Kani
2013-10-10  1:00                   ` Tejun Heo
2013-10-10 14:36                     ` Toshi Kani
2013-10-10 15:35                       ` Tejun Heo
2013-10-10 16:24                         ` Toshi Kani
2013-10-10 16:46                           ` Tejun Heo
2013-10-10 16:50                             ` Toshi Kani
2013-10-10 16:55                               ` Tejun Heo
2013-10-10 16:59                                 ` Toshi Kani
2013-10-10 17:12                                   ` H. Peter Anvin
2013-10-10 19:17                                     ` Toshi Kani
2013-10-10 22:19                                       ` Tejun Heo
2013-10-10 23:00                                         ` Toshi Kani
2013-10-09 21:19             ` Zhang Yanfei
2013-10-09 21:22               ` H. Peter Anvin
2013-10-09 23:30                 ` Zhang Yanfei
2013-10-09 19:10         ` Yinghai Lu
2013-10-09 19:23           ` Tejun Heo
2013-10-11  5:27             ` Yinghai Lu
2013-10-11  5:47               ` Zhang Yanfei
2013-10-11  6:33                 ` Ingo Molnar
2013-10-11  6:46                   ` Zhang Yanfei
2013-10-04  2:01 ` [PATCH part1 v6 5/6] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed Zhang Yanfei
2013-10-05 22:10   ` Toshi Kani
2013-10-04  2:02 ` [PATCH part1 v6 6/6] mem-hotplug: Introduce movable_node boot option Zhang Yanfei
2013-10-05 22:28   ` Toshi Kani
2013-10-06 14:43     ` [PATCH part1 v6 update " Zhang Yanfei
2013-10-06 23:03       ` Toshi Kani
2013-10-08  4:23 ` [PATCH part1 v6 0/6] x86, memblock: Allocate memory near kernel image before SRAT parsed Ingo Molnar
2013-10-08 15:28   ` Zhang Yanfei
