* [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed.
@ 2013-08-27  9:37 Tang Chen
  2013-08-27  9:37 ` [PATCH 01/11] memblock: Rename current_limit to current_limit_high in memblock Tang Chen
                   ` (14 more replies)
  0 siblings, 15 replies; 32+ messages in thread
From: Tang Chen @ 2013-08-27  9:37 UTC (permalink / raw)
  To: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

This patch-set is based on tj's suggestion, and is not fully tested.
It is just for review and discussion.


[Problem]

The current Linux kernel cannot migrate pages used by the kernel because
of the kernel direct mapping: in kernel space, va = pa + PAGE_OFFSET.
When the pa changes, we cannot simply update the pagetable and keep the
va unmodified, so kernel pages are not migratable.

There are also other issues that make kernel pages unmigratable. For
example, a physical address may be cached somewhere and used later, and
it is not feasible to update all such caches.
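The fixed linear relation can be sketched in a few lines of C; the
PAGE_OFFSET value and helper name below are illustrative, not the real
kernel definitions:

```c
#include <stdint.h>

/* Illustrative constant; the real x86_64 PAGE_OFFSET depends on config. */
#define PAGE_OFFSET_SKETCH 0xffff880000000000ULL

/* Direct mapping: every physical address has one fixed virtual address. */
static uint64_t direct_map_va(uint64_t pa)
{
	return pa + PAGE_OFFSET_SKETCH;
}
```

Because the relation is fixed, moving a page's contents to a new pa
necessarily changes the va at which they appear, so every kernel pointer
into the old va would dangle.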

When doing memory hotplug in Linux, we first migrate all the pages of a
memory device elsewhere, and then remove the device. But pages used by
the kernel are not migratable. As a result, memory used by the kernel
cannot be hot-removed.

Modifying the kernel direct mapping mechanism is too difficult, and it
could degrade kernel performance and stability. So we take the following
approach to memory hotplug.


[What we are doing]

In Linux, memory in a NUMA node is divided into several zones. One of
these zones is ZONE_MOVABLE, which the kernel won't use.

In order to implement memory hotplug in Linux, we are going to arrange
all hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this
memory. To do this, we need ACPI's help.

In ACPI, the SRAT (System Resource Affinity Table) contains NUMA info.
The memory affinity structures in SRAT record every memory range in the
system, along with flags specifying whether each range is hotpluggable.
(Please refer to ACPI spec 5.0, section 5.2.16.)
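As a rough sketch of the information a memory affinity entry carries
(field layout abbreviated and names illustrative, not the kernel's ACPI
definitions; per the spec, flag bit 0 is "enabled" and bit 1 is "hot
pluggable"):

```c
#include <stdint.h>

/* Abbreviated sketch of an SRAT Memory Affinity structure (ACPI 5.0,
 * section 5.2.16); reserved fields and the common header are omitted. */
struct srat_mem_sketch {
	uint32_t proximity_domain;	/* NUMA node of the range */
	uint64_t base_address;		/* start of the memory range */
	uint64_t length;		/* size of the memory range */
	uint32_t flags;			/* bit 0: enabled, bit 1: hot pluggable */
};

#define SRAT_MEM_ENABLED	(1U << 0)
#define SRAT_MEM_HOT_PLUGGABLE	(1U << 1)

/* A range only matters for hotplug if it is both enabled and flagged. */
static int range_is_hotpluggable(const struct srat_mem_sketch *m)
{
	return (m->flags & SRAT_MEM_ENABLED) &&
	       (m->flags & SRAT_MEM_HOT_PLUGGABLE);
}
```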

With the help of SRAT, we have to do the following two things to achieve our
goal:

1. When doing memory hot-add, allow users to arrange hotpluggable memory
   as ZONE_MOVABLE.
   (This has already been done by the MOVABLE_NODE functionality in Linux.)

2. When the system is booting, prevent the bootmem allocator from
   allocating hotpluggable memory for the kernel before memory
   initialization finishes.

Problem 2 is the key problem we are going to solve. But before solving
it, we need some preparation. Please see below.


[Preparation]

The bootloader has to load the kernel image into memory, and this memory
cannot be hotpluggable; there is no way to prevent that. So on a memory
hotplug system, we can assume that any node the kernel image resides in
is not hotpluggable.

Before SRAT is parsed, we don't know which memory ranges are hotpluggable. But
memblock has already started to work. In the current kernel, memblock allocates 
the following memory before SRAT is parsed:

setup_arch()
 |->memblock_x86_fill()            /* memblock is ready */
 |......
 |->early_reserve_e820_mpc_new()   /* allocate memory under 1MB */
 |->reserve_real_mode()            /* allocate memory under 1MB */
 |->init_mem_mapping()             /* allocate page tables, about 2MB to map 1GB memory */
 |->dma_contiguous_reserve()       /* specified by user, should be low */
 |->setup_log_buf()                /* specified by user, several mega bytes */
 |->relocate_initrd()              /* could be large, but will be freed after boot, should reorder */
 |->acpi_initrd_override()         /* several mega bytes */
 |->reserve_crashkernel()          /* could be large, should reorder */
 |......
 |->initmem_init()                 /* Parse SRAT */

Following Tejun's advice, before SRAT is parsed we should try our best
to allocate memory near the kernel image. Since the whole node the kernel
resides in won't be hotpluggable, and a node on a modern server may have
at least 16GB of memory, allocating several megabytes around the kernel
image is very unlikely to cross into hotpluggable memory.
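The idea can be pictured with a toy bottom-up allocator that carves
memory upward from the end of the kernel image (the names and the
bump-pointer scheme are illustrative only; real memblock searches its
free-range list and honors the current limits):

```c
#include <stdint.h>

static uint64_t alloc_cursor;	/* next candidate physical address */

/* Start handing out memory right after the kernel image. */
static void bottom_up_init(uint64_t kernel_end_pa)
{
	alloc_cursor = kernel_end_pa;
}

/* Bump upward: early allocations stay near (and thus very likely in the
 * same non-hotpluggable node as) the kernel image. */
static uint64_t bottom_up_alloc(uint64_t size, uint64_t align)
{
	uint64_t base = (alloc_cursor + align - 1) & ~(align - 1);

	alloc_cursor = base + size;
	return base;
}
```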


[About this patch-set]

So this patch-set does the following:

1. Make memblock able to allocate memory from low addresses to high.
   Also introduce a lower limit to prevent memblock from allocating
   memory at too low an address.

2. Improve init_mem_mapping() to support allocating page tables from
   low addresses to high.

3. Introduce "movablenode" boot option to enable and disable this functionality.

PS: The reordering of relocate_initrd() and reserve_crashkernel() has not
    been done yet. acpi_initrd_override() needs to access the initrd via
    virtual addresses, so relocate_initrd() must be done before
    acpi_initrd_override().


Tang Chen (11):
  memblock: Rename current_limit to current_limit_high in memblock.
  memblock: Rename memblock_set_current_limit() to
    memblock_set_current_limit_high().
  memblock: Introduce lowest limit in memblock.
  memblock: Introduce memblock_set_current_limit_low() to set lower
    limit of memblock.
  memblock: Introduce allocation order to memblock.
  memblock: Improve memblock to support allocation from lower address.
  x86, memblock: Set lowest limit for memblock_alloc_base_nid().
  x86, acpi, memblock: Use __memblock_alloc_base() in
    acpi_initrd_override()
  mem-hotplug: Introduce movablenode boot option to {en|dis}able using
    SRAT.
  x86, mem-hotplug: Support initialize page tables from low to high.
  x86, mem_hotplug: Allocate memory near kernel image before SRAT is
    parsed.

 Documentation/kernel-parameters.txt |   15 ++++
 arch/arm/mm/mmu.c                   |    2 +-
 arch/arm64/mm/mmu.c                 |    4 +-
 arch/microblaze/mm/init.c           |    2 +-
 arch/powerpc/mm/40x_mmu.c           |    4 +-
 arch/powerpc/mm/44x_mmu.c           |    2 +-
 arch/powerpc/mm/fsl_booke_mmu.c     |    4 +-
 arch/powerpc/mm/hash_utils_64.c     |    4 +-
 arch/powerpc/mm/init_32.c           |    4 +-
 arch/powerpc/mm/ppc_mmu_32.c        |    4 +-
 arch/powerpc/mm/tlb_nohash.c        |    4 +-
 arch/unicore32/mm/mmu.c             |    2 +-
 arch/x86/kernel/setup.c             |   41 ++++++++++-
 arch/x86/mm/init.c                  |  119 ++++++++++++++++++++++++--------
 drivers/acpi/osl.c                  |    4 +-
 include/linux/memblock.h            |   33 ++++++++--
 include/linux/memory_hotplug.h      |    5 ++
 mm/memblock.c                       |  131 +++++++++++++++++++++++++++++-----
 mm/memory_hotplug.c                 |    9 +++
 mm/nobootmem.c                      |    4 +-
 20 files changed, 320 insertions(+), 77 deletions(-)


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 01/11] memblock: Rename current_limit to current_limit_high in memblock.
  2013-08-27  9:37 [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed Tang Chen
@ 2013-08-27  9:37 ` Tang Chen
  2013-08-27  9:37 ` [PATCH 02/11] memblock: Rename memblock_set_current_limit() to memblock_set_current_limit_high() Tang Chen
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Tang Chen @ 2013-08-27  9:37 UTC (permalink / raw)
  To: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

memblock.current_limit specifies the highest address that memblock can
allocate. Upcoming patches will introduce a lowest limit to memblock,
so rename it to current_limit_high.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 include/linux/memblock.h |    2 +-
 mm/memblock.c            |   10 +++++-----
 mm/nobootmem.c           |    4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f388203..f0c0a91 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -35,7 +35,7 @@ struct memblock_type {
 };
 
 struct memblock {
-	phys_addr_t current_limit;
+	phys_addr_t current_limit_high;	/* upper boundary of accessible range */
 	struct memblock_type memory;
 	struct memblock_type reserved;
 };
diff --git a/mm/memblock.c b/mm/memblock.c
index a847bfe..ff2226f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -32,7 +32,7 @@ struct memblock memblock __initdata_memblock = {
 	.reserved.cnt		= 1,	/* empty dummy entry */
 	.reserved.max		= INIT_MEMBLOCK_REGIONS,
 
-	.current_limit		= MEMBLOCK_ALLOC_ANYWHERE,
+	.current_limit_high	= MEMBLOCK_ALLOC_ANYWHERE,
 };
 
 int memblock_debug __initdata_memblock;
@@ -104,7 +104,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 
 	/* pump up @end */
 	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
-		end = memblock.current_limit;
+		end = memblock.current_limit_high;
 
 	/* avoid allocating the first page */
 	start = max_t(phys_addr_t, start, PAGE_SIZE);
@@ -240,11 +240,11 @@ static int __init_memblock memblock_double_array(struct memblock_type *type,
 			new_area_start = new_area_size = 0;
 
 		addr = memblock_find_in_range(new_area_start + new_area_size,
-						memblock.current_limit,
+						memblock.current_limit_high,
 						new_alloc_size, PAGE_SIZE);
 		if (!addr && new_area_size)
 			addr = memblock_find_in_range(0,
-				min(new_area_start, memblock.current_limit),
+				min(new_area_start, memblock.current_limit_high),
 				new_alloc_size, PAGE_SIZE);
 
 		new_array = addr ? __va(addr) : NULL;
@@ -979,7 +979,7 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)
 
 void __init_memblock memblock_set_current_limit(phys_addr_t limit)
 {
-	memblock.current_limit = limit;
+	memblock.current_limit_high = limit;
 }
 
 static void __init_memblock memblock_dump(struct memblock_type *type, char *name)
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 61107cf..8cc163c 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -38,8 +38,8 @@ static void * __init __alloc_memory_core_early(int nid, u64 size, u64 align,
 	void *ptr;
 	u64 addr;
 
-	if (limit > memblock.current_limit)
-		limit = memblock.current_limit;
+	if (limit > memblock.current_limit_high)
+		limit = memblock.current_limit_high;
 
 	addr = memblock_find_in_range_node(goal, limit, size, align, nid);
 	if (!addr)
-- 
1.7.1



* [PATCH 02/11] memblock: Rename memblock_set_current_limit() to memblock_set_current_limit_high().
  2013-08-27  9:37 [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed Tang Chen
  2013-08-27  9:37 ` [PATCH 01/11] memblock: Rename current_limit to current_limit_high in memblock Tang Chen
@ 2013-08-27  9:37 ` Tang Chen
  2013-08-27  9:37 ` [PATCH 03/11] memblock: Introduce lowest limit in memblock Tang Chen
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Tang Chen @ 2013-08-27  9:37 UTC (permalink / raw)
  To: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

Since we renamed memblock.current_limit to current_limit_high, rename
memblock_set_current_limit() to memblock_set_current_limit_high() to
match.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 arch/arm/mm/mmu.c               |    2 +-
 arch/arm64/mm/mmu.c             |    4 ++--
 arch/microblaze/mm/init.c       |    2 +-
 arch/powerpc/mm/40x_mmu.c       |    4 ++--
 arch/powerpc/mm/44x_mmu.c       |    2 +-
 arch/powerpc/mm/fsl_booke_mmu.c |    4 ++--
 arch/powerpc/mm/hash_utils_64.c |    4 ++--
 arch/powerpc/mm/init_32.c       |    4 ++--
 arch/powerpc/mm/ppc_mmu_32.c    |    4 ++--
 arch/powerpc/mm/tlb_nohash.c    |    4 ++--
 arch/unicore32/mm/mmu.c         |    2 +-
 arch/x86/kernel/setup.c         |    4 ++--
 include/linux/memblock.h        |    8 ++++----
 mm/memblock.c                   |    2 +-
 14 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 53cdbd3..121565e 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -1114,7 +1114,7 @@ void __init sanity_check_meminfo(void)
 	if (!memblock_limit)
 		memblock_limit = arm_lowmem_limit;
 
-	memblock_set_current_limit(memblock_limit);
+	memblock_set_current_limit_high(memblock_limit);
 }
 
 static inline void prepare_page_table(void)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index a8d1059..7f27451 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -305,7 +305,7 @@ static void __init map_mem(void)
 	 * The initial direct kernel mapping, located at swapper_pg_dir,
 	 * gives us PGDIR_SIZE memory starting from PHYS_OFFSET (aligned).
 	 */
-	memblock_set_current_limit((PHYS_OFFSET & PGDIR_MASK) + PGDIR_SIZE);
+	memblock_set_current_limit_high((PHYS_OFFSET & PGDIR_MASK) + PGDIR_SIZE);
 
 	/* map all the memory banks */
 	for_each_memblock(memory, reg) {
@@ -319,7 +319,7 @@ static void __init map_mem(void)
 	}
 
 	/* Limit no longer required. */
-	memblock_set_current_limit(MEMBLOCK_ALLOC_ANYWHERE);
+	memblock_set_current_limit_high(MEMBLOCK_ALLOC_ANYWHERE);
 }
 
 /*
diff --git a/arch/microblaze/mm/init.c b/arch/microblaze/mm/init.c
index 74c7bcc..554b61d 100644
--- a/arch/microblaze/mm/init.c
+++ b/arch/microblaze/mm/init.c
@@ -391,7 +391,7 @@ asmlinkage void __init mmu_init(void)
 	/* Shortly after that, the entire linear mapping will be available */
 	/* This will also cause that unflatten device tree will be allocated
 	 * inside 768MB limit */
-	memblock_set_current_limit(memory_start + lowmem_size - 1);
+	memblock_set_current_limit_high(memory_start + lowmem_size - 1);
 }
 
 /* This is only called until mem_init is done. */
diff --git a/arch/powerpc/mm/40x_mmu.c b/arch/powerpc/mm/40x_mmu.c
index 5810967..9ce26d3 100644
--- a/arch/powerpc/mm/40x_mmu.c
+++ b/arch/powerpc/mm/40x_mmu.c
@@ -141,7 +141,7 @@ unsigned long __init mmu_mapin_ram(unsigned long top)
 	 * coverage with normal-sized pages (or other reasons) do not
 	 * attempt to allocate outside the allowed range.
 	 */
-	memblock_set_current_limit(mapped);
+	memblock_set_current_limit_high(mapped);
 
 	return mapped;
 }
@@ -155,5 +155,5 @@ void setup_initial_memory_limit(phys_addr_t first_memblock_base,
 	BUG_ON(first_memblock_base != 0);
 
 	/* 40x can only access 16MB at the moment (see head_40x.S) */
-	memblock_set_current_limit(min_t(u64, first_memblock_size, 0x00800000));
+	memblock_set_current_limit_high(min_t(u64, first_memblock_size, 0x00800000));
 }
diff --git a/arch/powerpc/mm/44x_mmu.c b/arch/powerpc/mm/44x_mmu.c
index 82b1ff7..c4eb6f6 100644
--- a/arch/powerpc/mm/44x_mmu.c
+++ b/arch/powerpc/mm/44x_mmu.c
@@ -225,7 +225,7 @@ void setup_initial_memory_limit(phys_addr_t first_memblock_base,
 
 	/* 44x has a 256M TLB entry pinned at boot */
 	size = (min_t(u64, first_memblock_size, PPC_PIN_SIZE));
-	memblock_set_current_limit(first_memblock_base + size);
+	memblock_set_current_limit_high(first_memblock_base + size);
 }
 
 #ifdef CONFIG_SMP
diff --git a/arch/powerpc/mm/fsl_booke_mmu.c b/arch/powerpc/mm/fsl_booke_mmu.c
index 07ba45b..c3d0662 100644
--- a/arch/powerpc/mm/fsl_booke_mmu.c
+++ b/arch/powerpc/mm/fsl_booke_mmu.c
@@ -230,7 +230,7 @@ void __init adjust_total_lowmem(void)
 	pr_cont("%lu Mb, residual: %dMb\n", tlbcam_sz(tlbcam_index - 1) >> 20,
 	        (unsigned int)((total_lowmem - __max_low_memory) >> 20));
 
-	memblock_set_current_limit(memstart_addr + __max_low_memory);
+	memblock_set_current_limit_high(memstart_addr + __max_low_memory);
 }
 
 void setup_initial_memory_limit(phys_addr_t first_memblock_base,
@@ -239,6 +239,6 @@ void setup_initial_memory_limit(phys_addr_t first_memblock_base,
 	phys_addr_t limit = first_memblock_base + first_memblock_size;
 
 	/* 64M mapped initially according to head_fsl_booke.S */
-	memblock_set_current_limit(min_t(u64, limit, 0x04000000));
+	memblock_set_current_limit_high(min_t(u64, limit, 0x04000000));
 }
 #endif
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 6ecc38b..550c890 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -759,7 +759,7 @@ static void __init htab_initialize(void)
 		BUG_ON(htab_bolt_mapping(base, base + size, __pa(base),
 				prot, mmu_linear_psize, mmu_kernel_ssize));
 	}
-	memblock_set_current_limit(MEMBLOCK_ALLOC_ANYWHERE);
+	memblock_set_current_limit_high(MEMBLOCK_ALLOC_ANYWHERE);
 
 	/*
 	 * If we have a memory_limit and we've allocated TCEs then we need to
@@ -1432,5 +1432,5 @@ void setup_initial_memory_limit(phys_addr_t first_memblock_base,
 	ppc64_rma_size = min_t(u64, first_memblock_size, 0x40000000);
 
 	/* Finally limit subsequent allocations */
-	memblock_set_current_limit(ppc64_rma_size);
+	memblock_set_current_limit_high(ppc64_rma_size);
 }
diff --git a/arch/powerpc/mm/init_32.c b/arch/powerpc/mm/init_32.c
index 01e2db9..992728d 100644
--- a/arch/powerpc/mm/init_32.c
+++ b/arch/powerpc/mm/init_32.c
@@ -192,7 +192,7 @@ void __init MMU_init(void)
 #endif
 
 	/* Shortly after that, the entire linear mapping will be available */
-	memblock_set_current_limit(lowmem_end_addr);
+	memblock_set_current_limit_high(lowmem_end_addr);
 }
 
 /* This is only called until mem_init is done. */
@@ -214,6 +214,6 @@ void setup_initial_memory_limit(phys_addr_t first_memblock_base,
 	BUG_ON(first_memblock_base != 0);
 
 	/* 8xx can only access 8MB at the moment */
-	memblock_set_current_limit(min_t(u64, first_memblock_size, 0x00800000));
+	memblock_set_current_limit_high(min_t(u64, first_memblock_size, 0x00800000));
 }
 #endif /* CONFIG_8xx */
diff --git a/arch/powerpc/mm/ppc_mmu_32.c b/arch/powerpc/mm/ppc_mmu_32.c
index 11571e1..815dbe1 100644
--- a/arch/powerpc/mm/ppc_mmu_32.c
+++ b/arch/powerpc/mm/ppc_mmu_32.c
@@ -282,7 +282,7 @@ void setup_initial_memory_limit(phys_addr_t first_memblock_base,
 
 	/* 601 can only access 16MB at the moment */
 	if (PVR_VER(mfspr(SPRN_PVR)) == 1)
-		memblock_set_current_limit(min_t(u64, first_memblock_size, 0x01000000));
+		memblock_set_current_limit_high(min_t(u64, first_memblock_size, 0x01000000));
 	else /* Anything else has 256M mapped */
-		memblock_set_current_limit(min_t(u64, first_memblock_size, 0x10000000));
+		memblock_set_current_limit_high(min_t(u64, first_memblock_size, 0x10000000));
 }
diff --git a/arch/powerpc/mm/tlb_nohash.c b/arch/powerpc/mm/tlb_nohash.c
index 41cd68d..5e41488 100644
--- a/arch/powerpc/mm/tlb_nohash.c
+++ b/arch/powerpc/mm/tlb_nohash.c
@@ -640,7 +640,7 @@ static void __early_init_mmu(int boot_cpu)
 	 */
 	mb();
 
-	memblock_set_current_limit(linear_map_top);
+	memblock_set_current_limit_high(linear_map_top);
 }
 
 void __init early_init_mmu(void)
@@ -680,7 +680,7 @@ void setup_initial_memory_limit(phys_addr_t first_memblock_base,
 		ppc64_rma_size = min_t(u64, first_memblock_size, 0x40000000);
 
 	/* Finally limit subsequent allocations */
-	memblock_set_current_limit(first_memblock_base + ppc64_rma_size);
+	memblock_set_current_limit_high(first_memblock_base + ppc64_rma_size);
 }
 #else /* ! CONFIG_PPC64 */
 void __init early_init_mmu(void)
diff --git a/arch/unicore32/mm/mmu.c b/arch/unicore32/mm/mmu.c
index 4f5a532..278f7e3 100644
--- a/arch/unicore32/mm/mmu.c
+++ b/arch/unicore32/mm/mmu.c
@@ -287,7 +287,7 @@ static void __init sanity_check_meminfo(void)
 	int i, j;
 
 	lowmem_limit = __pa(vmalloc_min - 1) + 1;
-	memblock_set_current_limit(lowmem_limit);
+	memblock_set_current_limit_high(lowmem_limit);
 
 	for (i = 0, j = 0; i < meminfo.nr_banks; i++) {
 		struct membank *bank = &meminfo.bank[j];
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 382e20b..fa7b5f0 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1060,7 +1060,7 @@ void __init setup_arch(char **cmdline_p)
 
 	cleanup_highmap();
 
-	memblock_set_current_limit(ISA_END_ADDRESS);
+	memblock_set_current_limit_high(ISA_END_ADDRESS);
 	memblock_x86_fill();
 
 	/*
@@ -1093,7 +1093,7 @@ void __init setup_arch(char **cmdline_p)
 
 	setup_real_mode();
 
-	memblock_set_current_limit(get_max_mapped());
+	memblock_set_current_limit_high(get_max_mapped());
 	dma_contiguous_reserve(0);
 
 	/*
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f0c0a91..c28cd6b 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -173,12 +173,12 @@ static inline void memblock_dump_all(void)
 }
 
 /**
- * memblock_set_current_limit - Set the current allocation limit to allow
- *                         limiting allocations to what is currently
+ * memblock_set_current_limit_high - Set the current allocation upper limit to
+ *                         allow limiting allocations to what is currently
  *                         accessible during boot
- * @limit: New limit value (physical address)
+ * @limit: New upper limit value (physical address)
  */
-void memblock_set_current_limit(phys_addr_t limit);
+void memblock_set_current_limit_high(phys_addr_t limit);
 
 
 /*
diff --git a/mm/memblock.c b/mm/memblock.c
index ff2226f..d351911 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -977,7 +977,7 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)
 	}
 }
 
-void __init_memblock memblock_set_current_limit(phys_addr_t limit)
+void __init_memblock memblock_set_current_limit_high(phys_addr_t limit)
 {
 	memblock.current_limit_high = limit;
 }
-- 
1.7.1



* [PATCH 03/11] memblock: Introduce lowest limit in memblock.
  2013-08-27  9:37 [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed Tang Chen
  2013-08-27  9:37 ` [PATCH 01/11] memblock: Rename current_limit to current_limit_high in memblock Tang Chen
  2013-08-27  9:37 ` [PATCH 02/11] memblock: Rename memblock_set_current_limit() to memblock_set_current_limit_high() Tang Chen
@ 2013-08-27  9:37 ` Tang Chen
  2013-08-27  9:37 ` [PATCH 04/11] memblock: Introduce memblock_set_current_limit_low() to set lower limit of memblock Tang Chen
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Tang Chen @ 2013-08-27  9:37 UTC (permalink / raw)
  To: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

memblock currently allocates memory from high addresses to low, and it
has an upper limit.

Upcoming patches will improve memblock to be able to allocate memory
from low addresses to high, so we need a lowest limit.

Introduce current_limit_low to memblock. When users specify the start
address as MEMBLOCK_ALLOC_ACCESSIBLE, memblock will use
current_limit_low as the lower bound of the allocation.
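Conceptually this makes the range-resolution step symmetric. A sketch of
the semantics (the sentinel value and variable names are illustrative
stand-ins for MEMBLOCK_ALLOC_ACCESSIBLE and the current limits):

```c
#include <stdint.h>

#define ALLOC_ACCESSIBLE_SKETCH 0ULL	/* illustrative sentinel */

static uint64_t limit_low;		/* stand-in for current_limit_low */
static uint64_t limit_high = ~0ULL;	/* stand-in for current_limit_high */

/* "Pump up" @start and @end before searching for a free range:
 * the sentinel on either side resolves to the corresponding limit. */
static void resolve_range(uint64_t *start, uint64_t *end)
{
	if (*start == ALLOC_ACCESSIBLE_SKETCH)
		*start = limit_low;
	if (*end == ALLOC_ACCESSIBLE_SKETCH)
		*end = limit_high;
}
```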

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 include/linux/memblock.h |    1 +
 mm/memblock.c            |   18 +++++++++++++++---
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index c28cd6b..40eb18e 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -35,6 +35,7 @@ struct memblock_type {
 };
 
 struct memblock {
+	phys_addr_t current_limit_low;	/* lower boundary of accessible range */
 	phys_addr_t current_limit_high;	/* upper boundary of accessible range */
 	struct memblock_type memory;
 	struct memblock_type reserved;
diff --git a/mm/memblock.c b/mm/memblock.c
index d351911..0dd5387 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -32,6 +32,7 @@ struct memblock memblock __initdata_memblock = {
 	.reserved.cnt		= 1,	/* empty dummy entry */
 	.reserved.max		= INIT_MEMBLOCK_REGIONS,
 
+	.current_limit_low	= 0,
 	.current_limit_high	= MEMBLOCK_ALLOC_ANYWHERE,
 };
 
@@ -84,7 +85,7 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
 
 /**
  * memblock_find_in_range_node - find free area in given range and node
- * @start: start of candidate range
+ * @start: start of candidate range, can be %MEMBLOCK_ALLOC_ACCESSIBLE
  * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
  * @size: size of free area to find
  * @align: alignment of free area to find
@@ -92,6 +93,15 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
  *
  * Find @size free area aligned to @align in the specified range and node.
  *
+ * If @start is %MEMBLOCK_ALLOC_ACCESSIBLE, then set @start to
+ * memblock.current_limit_low, which limits the lowest address memblock
+ * could access. %MEMBLOCK_ALLOC_ANYWHERE means nothing to @start.
+ *
+ * If @end is %MEMBLOCK_ALLOC_ACCESSIBLE, then set @end to
+ * memblock.current_limit_high, which limits the highest address memblock
+ * could access. @end can also be %MEMBLOCK_ALLOC_ANYWHERE, which is the
+ * maximum physical address.
+ *
  * RETURNS:
  * Found address on success, %0 on failure.
  */
@@ -102,7 +112,9 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 	phys_addr_t this_start, this_end, cand;
 	u64 i;
 
-	/* pump up @end */
+	/* pump up @start and @end */
+	if (start == MEMBLOCK_ALLOC_ACCESSIBLE)
+		start = memblock.current_limit_low;
 	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
 		end = memblock.current_limit_high;
 
@@ -126,7 +138,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 
 /**
  * memblock_find_in_range - find free area in given range
- * @start: start of candidate range
+ * @start: start of candidate range, can be %MEMBLOCK_ALLOC_ACCESSIBLE
  * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
  * @size: size of free area to find
  * @align: alignment of free area to find
-- 
1.7.1



* [PATCH 04/11] memblock: Introduce memblock_set_current_limit_low() to set lower limit of memblock.
  2013-08-27  9:37 [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed Tang Chen
                   ` (2 preceding siblings ...)
  2013-08-27  9:37 ` [PATCH 03/11] memblock: Introduce lowest limit in memblock Tang Chen
@ 2013-08-27  9:37 ` Tang Chen
  2013-08-27  9:37 ` [PATCH 05/11] memblock: Introduce allocation order to memblock Tang Chen
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Tang Chen @ 2013-08-27  9:37 UTC (permalink / raw)
  To: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

Corresponding to memblock_set_current_limit_high(), introduce
memblock_set_current_limit_low() to set the lowest limit for memblock.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 include/linux/memblock.h |    9 ++++++++-
 mm/memblock.c            |    5 +++++
 2 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 40eb18e..cabd685 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -174,6 +174,14 @@ static inline void memblock_dump_all(void)
 }
 
 /**
+ * memblock_set_current_limit_low - Set the current allocation lower limit to
+ *                         allow limiting allocations to what is currently
+ *                         accessible during boot
+ * @limit: New lower limit value (physical address)
+ */
+void memblock_set_current_limit_low(phys_addr_t limit);
+
+/**
  * memblock_set_current_limit_high - Set the current allocation upper limit to
  *                         allow limiting allocations to what is currently
  *                         accessible during boot
@@ -181,7 +189,6 @@ static inline void memblock_dump_all(void)
  */
 void memblock_set_current_limit_high(phys_addr_t limit);
 
-
 /*
  * pfn conversion functions
  *
diff --git a/mm/memblock.c b/mm/memblock.c
index 0dd5387..54c1c2e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -989,6 +989,11 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)
 	}
 }
 
+void __init_memblock memblock_set_current_limit_low(phys_addr_t limit)
+{
+	memblock.current_limit_low = limit;
+}
+
 void __init_memblock memblock_set_current_limit_high(phys_addr_t limit)
 {
 	memblock.current_limit_high = limit;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 05/11] memblock: Introduce allocation order to memblock.
  2013-08-27  9:37 [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed Tang Chen
                   ` (3 preceding siblings ...)
  2013-08-27  9:37 ` [PATCH 04/11] memblock: Introduce memblock_set_current_limit_low() to set lower limit of memblock Tang Chen
@ 2013-08-27  9:37 ` Tang Chen
       [not found]   ` <20130905091615.GB15294@hacker.(null)>
  2013-08-27  9:37 ` [PATCH 06/11] memblock: Improve memblock to support allocation from lower address Tang Chen
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Tang Chen @ 2013-08-27  9:37 UTC (permalink / raw)
  To: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

The Linux kernel cannot migrate pages used by the kernel. As a result,
kernel pages cannot be hot-removed, so we must not allocate hotpluggable
memory for the kernel.

The ACPI SRAT (System Resource Affinity Table) contains the memory
hotplug info. But before SRAT is parsed, memblock has already started
to allocate memory for the kernel. So we need to prevent memblock from
doing this.

In a memory hotplug system, any NUMA node the kernel resides in should
be unhotpluggable. And on a modern server, each node could have at
least 16GB of memory, so memory around the kernel image is very likely
not hotpluggable.

So the basic idea is: allocate memory upward from the end of the kernel
image toward higher addresses. Since not much memory is allocated
before SRAT is parsed, it will very likely stay in the same node as the
kernel image.

memblock can currently only allocate memory from high addresses to low.
So this patch introduces an allocation order to memblock, which can be
used to tell memblock to allocate memory from high to low or from low
to high.
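What the order changes can be shown with a one-range sketch: given a
single free range, top-down picks a candidate base at its high end while
bottom-up picks the low end (the constants and the single-range
simplification are illustrative; real memblock iterates all free ranges):

```c
#include <stdint.h>

#define ORDER_HIGH_TO_LOW	0	/* memblock's traditional behavior */
#define ORDER_LOW_TO_HIGH	1	/* wanted before SRAT is parsed */

/* Choose a candidate base for @size bytes inside one free range. */
static uint64_t pick_base(int order, uint64_t free_start,
			  uint64_t free_end, uint64_t size)
{
	if (order == ORDER_LOW_TO_HIGH)
		return free_start;	/* grow upward from the low end */
	return free_end - size;		/* classic top-down candidate */
}
```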

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 include/linux/memblock.h |   15 +++++++++++++++
 mm/memblock.c            |   13 +++++++++++++
 2 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index cabd685..f233c1f 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -19,6 +19,11 @@
 
 #define INIT_MEMBLOCK_REGIONS	128
 
+/* Allocation order. */
+#define MEMBLOCK_ORDER_HIGH_TO_LOW	0
+#define MEMBLOCK_ORDER_LOW_TO_HIGH	1
+#define MEMBLOCK_ORDER_DEFAULT		MEMBLOCK_ORDER_HIGH_TO_LOW
+
 struct memblock_region {
 	phys_addr_t base;
 	phys_addr_t size;
@@ -35,6 +40,7 @@ struct memblock_type {
 };
 
 struct memblock {
+	int current_order;	/* allocate from higher or lower address */
 	phys_addr_t current_limit_low;	/* lower boundary of accessable range */
 	phys_addr_t current_limit_high;	/* upper boundary of accessable range */
 	struct memblock_type memory;
@@ -174,6 +180,15 @@ static inline void memblock_dump_all(void)
 }
 
 /**
+ * memblock_set_current_order - Set the current allocation order to allow
+ *                         allocating memory from higher to lower address or
+ *                         from lower to higher address
+ * @order: In which order to allocate memory. Could be
+ *         MEMBLOCK_ORDER_{HIGH_TO_LOW|LOW_TO_HIGH}
+ */
+void memblock_set_current_order(int order);
+
+/**
  * memblock_set_current_limit_low - Set the current allocation lower limit to
  *                         allow limiting allocations to what is currently
  *                         accessible during boot
diff --git a/mm/memblock.c b/mm/memblock.c
index 54c1c2e..8f1e2d4 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -32,6 +32,7 @@ struct memblock memblock __initdata_memblock = {
 	.reserved.cnt		= 1,	/* empty dummy entry */
 	.reserved.max		= INIT_MEMBLOCK_REGIONS,
 
+	.current_order		= MEMBLOCK_ORDER_DEFAULT,
 	.current_limit_low	= 0,
 	.current_limit_high	= MEMBLOCK_ALLOC_ANYWHERE,
 };
@@ -989,6 +990,18 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)
 	}
 }
 
+void __init_memblock memblock_set_current_order(int order)
+{
+	if (order != MEMBLOCK_ORDER_HIGH_TO_LOW &&
+	    order != MEMBLOCK_ORDER_LOW_TO_HIGH) {
+		pr_warn("memblock: Failed to set allocation order. "
+			"Invalid order type: %d\n", order);
+		return;
+	}
+
+	memblock.current_order = order;
+}
+
 void __init_memblock memblock_set_current_limit_low(phys_addr_t limit)
 {
 	memblock.current_limit_low = limit;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 06/11] memblock: Improve memblock to support allocation from lower address.
  2013-08-27  9:37 [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed Tang Chen
                   ` (4 preceding siblings ...)
  2013-08-27  9:37 ` [PATCH 05/11] memblock: Introduce allocation order to memblock Tang Chen
@ 2013-08-27  9:37 ` Tang Chen
  2013-09-04  0:24   ` Toshi Kani
  2013-08-27  9:37 ` [PATCH 07/11] x86, memblock: Set lowest limit for memblock_alloc_base_nid() Tang Chen
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Tang Chen @ 2013-08-27  9:37 UTC (permalink / raw)
  To: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

This patch modifies memblock_find_in_range_node() to support the two
allocation orders. After this patch, memblock checks
memblock.current_order to decide in which order to allocate memory.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 mm/memblock.c |   90 +++++++++++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 75 insertions(+), 15 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 8f1e2d4..961d4a5 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -85,6 +85,77 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
 }
 
 /**
+ * __memblock_find_range - find free area utility
+ * @start: start of candidate range, can be %MEMBLOCK_ALLOC_ACCESSIBLE
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ * @nid: nid of the free area to find, %MAX_NUMNODES for any node
+ *
+ * Utility called from memblock_find_in_range_node(); finds a free area from
+ * lower address to higher address.
+ *
+ * RETURNS:
+ * Found address on success, %0 on failure.
+ */
+phys_addr_t __init_memblock
+__memblock_find_range(phys_addr_t start, phys_addr_t end,
+		      phys_addr_t size, phys_addr_t align, int nid)
+{
+	phys_addr_t this_start, this_end, cand;
+	u64 i;
+
+	for_each_free_mem_range(i, nid, &this_start, &this_end, NULL) {
+		this_start = clamp(this_start, start, end);
+		this_end = clamp(this_end, start, end);
+
+		cand = round_up(this_start, align);
+		if (cand < this_end && this_end - cand >= size)
+			return cand;
+	}
+	return 0;
+}
+
+/**
+ * __memblock_find_range_rev - find free area utility, in reverse order
+ * @start: start of candidate range, can be %MEMBLOCK_ALLOC_ACCESSIBLE
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ * @nid: nid of the free area to find, %MAX_NUMNODES for any node
+ *
+ * Utility called from memblock_find_in_range_node(); finds a free area from
+ * higher address to lower address.
+ *
+ * RETURNS:
+ * Found address on success, %0 on failure.
+ */
+phys_addr_t __init_memblock
+__memblock_find_range_rev(phys_addr_t start, phys_addr_t end,
+			  phys_addr_t size, phys_addr_t align, int nid)
+{
+	phys_addr_t this_start, this_end, cand;
+	u64 i;
+
+	for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
+		this_start = clamp(this_start, start, end);
+		this_end = clamp(this_end, start, end);
+
+		/*
+		 * Just in case (this_end - size) underflows and causes
+		 * (cand >= this_start) to be incorrectly true.
+		 */
+		if (this_end < size)
+			break;
+
+		cand = round_down(this_end - size, align);
+		if (cand >= this_start)
+			return cand;
+	}
+	return 0;
+}
+
+/**
  * memblock_find_in_range_node - find free area in given range and node
  * @start: start of candidate range, can be %MEMBLOCK_ALLOC_ACCESSIBLE
  * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
@@ -110,9 +181,6 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 					phys_addr_t end, phys_addr_t size,
 					phys_addr_t align, int nid)
 {
-	phys_addr_t this_start, this_end, cand;
-	u64 i;
-
 	/* pump up @start and @end */
 	if (start == MEMBLOCK_ALLOC_ACCESSIBLE)
 		start = memblock.current_limit_low;
@@ -123,18 +191,10 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 	start = max_t(phys_addr_t, start, PAGE_SIZE);
 	end = max(start, end);
 
-	for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
-		this_start = clamp(this_start, start, end);
-		this_end = clamp(this_end, start, end);
-
-		if (this_end < size)
-			continue;
-
-		cand = round_down(this_end - size, align);
-		if (cand >= this_start)
-			return cand;
-	}
-	return 0;
+	if (memblock.current_order == MEMBLOCK_ORDER_DEFAULT)
+		return __memblock_find_range_rev(start, end, size, align, nid);
+	else
+		return __memblock_find_range(start, end, size, align, nid);
 }
 
 /**
-- 
1.7.1



* [PATCH 07/11] x86, memblock: Set lowest limit for memblock_alloc_base_nid().
  2013-08-27  9:37 [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed Tang Chen
                   ` (5 preceding siblings ...)
  2013-08-27  9:37 ` [PATCH 06/11] memblock: Improve memblock to support allocation from lower address Tang Chen
@ 2013-08-27  9:37 ` Tang Chen
  2013-09-04  0:37   ` Toshi Kani
  2013-08-27  9:37 ` [PATCH 08/11] x86, acpi, memblock: Use __memblock_alloc_base() in acpi_initrd_override() Tang Chen
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Tang Chen @ 2013-08-27  9:37 UTC (permalink / raw)
  To: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

memblock_alloc_base_nid() is a common memblock API. It calls
memblock_find_in_range_node() with %start = 0, which means there is no
limit on the lowest address by default.

	memblock_find_in_range_node(0, max_addr, size, align, nid);

Since we have introduced current_limit_low to memblock, when we have no
limit on the lowest address, or we are not sure, we should pass
MEMBLOCK_ALLOC_ACCESSIBLE as %start so that the allocation is bounded by
the default low limit.

dma_contiguous_reserve() and setup_log_buf() will eventually call
memblock_alloc_base_nid() to allocate memory. So if the allocation order
is from low to high, they will allocate memory from the lowest limit
to higher memory.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 mm/memblock.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 961d4a5..be8c4d1 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -851,7 +851,8 @@ static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
 	/* align @size to avoid excessive fragmentation on reserved array */
 	size = round_up(size, align);
 
-	found = memblock_find_in_range_node(0, max_addr, size, align, nid);
+	found = memblock_find_in_range_node(MEMBLOCK_ALLOC_ACCESSIBLE,
+					    max_addr, size, align, nid);
 	if (found && !memblock_reserve(found, size))
 		return found;
 
-- 
1.7.1



* [PATCH 08/11] x86, acpi, memblock: Use __memblock_alloc_base() in acpi_initrd_override()
  2013-08-27  9:37 [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed Tang Chen
                   ` (6 preceding siblings ...)
  2013-08-27  9:37 ` [PATCH 07/11] x86, memblock: Set lowest limit for memblock_alloc_base_nid() Tang Chen
@ 2013-08-27  9:37 ` Tang Chen
  2013-08-28  0:04   ` Rafael J. Wysocki
  2013-08-27  9:37 ` [PATCH 09/11] mem-hotplug: Introduce movablenode boot option to {en|dis}able using SRAT Tang Chen
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Tang Chen @ 2013-08-27  9:37 UTC (permalink / raw)
  To: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

The current acpi_initrd_override() calls memblock_find_in_range() to allocate
memory and passes 0 as %start, so the allocation is not limited by
current_limit_low.

acpi_initrd_override()
 |->memblock_find_in_range(0, ...)
     |->memblock_find_in_range_node(0, ...)

When we want to allocate memory from the end of the kernel image to higher
memory, we need to limit the lowest address to the end of the kernel image.

We have modified memblock_alloc_base_nid() to call memblock_find_in_range_node()
with %start = MEMBLOCK_ALLOC_ACCESSIBLE, which means it will be limited by
current_limit_low. And __memblock_alloc_base() calls memblock_alloc_base_nid().

__memblock_alloc_base()
 |->memblock_alloc_base_nid()
     |->memblock_find_in_range_node(MEMBLOCK_ALLOC_ACCESSIBLE, ...)

So use __memblock_alloc_base() to allocate memory in acpi_initrd_override().

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 drivers/acpi/osl.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index fece767..1d68fc0 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -629,8 +629,8 @@ void __init acpi_initrd_override(void *data, size_t size)
 		return;
 
 	/* under 4G at first, then above 4G */
-	acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
-					all_tables_size, PAGE_SIZE);
+	acpi_tables_addr = __memblock_alloc_base(all_tables_size,
+						 PAGE_SIZE, (1ULL<<32) - 1);
 	if (!acpi_tables_addr) {
 		WARN_ON(1);
 		return;
-- 
1.7.1



* [PATCH 09/11] mem-hotplug: Introduce movablenode boot option to {en|dis}able using SRAT.
  2013-08-27  9:37 [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed Tang Chen
                   ` (7 preceding siblings ...)
  2013-08-27  9:37 ` [PATCH 08/11] x86, acpi, memblock: Use __memblock_alloc_base() in acpi_initrd_override() Tang Chen
@ 2013-08-27  9:37 ` Tang Chen
  2013-08-27  9:37 ` [PATCH 10/11] x86, mem-hotplug: Support initialize page tables from low to high Tang Chen
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Tang Chen @ 2013-08-27  9:37 UTC (permalink / raw)
  To: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

The Hot-Pluggable field in SRAT specifies which memory ranges are
hotpluggable. As mentioned before, if hotpluggable memory is used by the
kernel, it cannot be hot-removed. So memory hotplug users may want to place
all hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE, so doing this means the
kernel cannot use memory in movable nodes. This degrades NUMA performance,
and some users may be unhappy with that.

So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movablenode boot option to allow users to
choose to reserve hotpluggable memory and set it as ZONE_MOVABLE or not.

Users can specify "movablenode" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.

Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 Documentation/kernel-parameters.txt |   15 +++++++++++++++
 include/linux/memory_hotplug.h      |    5 +++++
 mm/memory_hotplug.c                 |    9 +++++++++
 3 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 15356ac..7349d1f 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1718,6 +1718,21 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			that the amount of memory usable for all allocations
 			is not too small.
 
+	movablenode		[KNL,X86] This parameter enables/disables the
+			kernel arranging hotpluggable memory ranges recorded
+			in the ACPI SRAT (System Resource Affinity Table) as
+			ZONE_MOVABLE, so that this memory can be hot-removed
+			when the system is up.
+			When this option is specified, all hotpluggable memory
+			is placed in ZONE_MOVABLE, which the kernel cannot use.
+			This degrades NUMA performance. Users who care about
+			NUMA performance should not use it.
+			If all the memory ranges in the system are hotpluggable,
+			then the ones used by the kernel at early boot, such as
+			kernel code and data segments and the initrd file,
+			won't be set as ZONE_MOVABLE and won't be hotpluggable.
+			Otherwise the kernel won't have enough memory to boot.
+
 	MTD_Partition=	[MTD]
 			Format: <name>,<region-number>,<size>,<offset>
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index dd38e62..5d2c07b 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -33,6 +33,11 @@ enum {
 	ONLINE_MOVABLE,
 };
 
+#ifdef CONFIG_MOVABLE_NODE
+/* Enable/disable SRAT in movablenode boot option */
+extern bool movablenode_enable_srat;
+#endif /* CONFIG_MOVABLE_NODE */
+
 /*
  * pgdat resizing functions
  */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ca1dd3a..7252a7d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1345,6 +1345,15 @@ static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
 {
 	return true;
 }
+
+bool __initdata movablenode_enable_srat;
+
+static int __init cmdline_parse_movablenode(char *p)
+{
+	movablenode_enable_srat = true;
+	return 0;
+}
+early_param("movablenode", cmdline_parse_movablenode);
 #else /* CONFIG_MOVABLE_NODE */
 /* ensure the node has NORMAL memory if it is still online */
 static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
-- 
1.7.1



* [PATCH 10/11] x86, mem-hotplug: Support initialize page tables from low to high.
  2013-08-27  9:37 [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed Tang Chen
                   ` (8 preceding siblings ...)
  2013-08-27  9:37 ` [PATCH 09/11] mem-hotplug: Introduce movablenode boot option to {en|dis}able using SRAT Tang Chen
@ 2013-08-27  9:37 ` Tang Chen
       [not found]   ` <20130905133027.GA23038@hacker.(null)>
  2013-08-27  9:37 ` [PATCH 11/11] x86, mem_hotplug: Allocate memory near kernel image before SRAT is parsed Tang Chen
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Tang Chen @ 2013-08-27  9:37 UTC (permalink / raw)
  To: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

init_mem_mapping() is called before SRAT is parsed, and memblock allocates
memory for page tables. To prevent page tables from being allocated within
hotpluggable memory, we allocate them from the end of the kernel image
upwards.

The order of page table allocation is controlled by the movablenode boot
option. Since the default behavior of the page table initialization
procedure is to allocate page tables from the top of memory downwards, the
kernel behaves as before if users don't specify the movablenode boot option.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 arch/x86/mm/init.c |  119 +++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 91 insertions(+), 28 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 793204b..f004d8e 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -407,13 +407,77 @@ static unsigned long __init init_range_memory_mapping(
 
 /* (PUD_SHIFT-PMD_SHIFT)/2 */
 #define STEP_SIZE_SHIFT 5
-void __init init_mem_mapping(void)
+
+#ifdef CONFIG_MOVABLE_NODE
+/**
+ * memory_map_from_low - Map [start, end) from low to high
+ * @start: start address of the target memory range
+ * @end: end address of the target memory range
+ *
+ * This function sets up the direct mapping for memory range [start, end) in a
+ * heuristic way: step_size starts small, and the more memory gets mapped in
+ * one step, the larger step_size becomes for the next loop iteration.
+ */
+static void __init memory_map_from_low(unsigned long start, unsigned long end)
+{
+	unsigned long next, new_mapped_ram_size;
+	unsigned long mapped_ram_size = 0;
+	/* step_size needs to be small so pgt_buf from BRK could cover it */
+	unsigned long step_size = PMD_SIZE;
+
+	while (start < end) {
+		if (end - start > step_size) {
+			next = round_up(start + 1, step_size);
+			if (next > end)
+				next = end;
+		} else
+			next = end;
+
+		new_mapped_ram_size = init_range_memory_mapping(start, next);
+		start = next;
+
+		if (new_mapped_ram_size > mapped_ram_size)
+			step_size <<= STEP_SIZE_SHIFT;
+		mapped_ram_size += new_mapped_ram_size;
+	}
+}
+#endif /* CONFIG_MOVABLE_NODE */
+
+/**
+ * memory_map_from_high - Map [start, end) from high to low
+ * @start: start address of the target memory range
+ * @end: end address of the target memory range
+ *
+ * This function is similar to memory_map_from_low() except it maps memory
+ * from high to low.
+ */
+static void __init memory_map_from_high(unsigned long start, unsigned long end)
 {
-	unsigned long end, real_end, start, last_start;
-	unsigned long step_size;
-	unsigned long addr;
+	unsigned long prev, new_mapped_ram_size;
 	unsigned long mapped_ram_size = 0;
-	unsigned long new_mapped_ram_size;
+	/* step_size needs to be small so pgt_buf from BRK could cover it */
+	unsigned long step_size = PMD_SIZE;
+
+	while (start < end) {
+		if (end > step_size) {
+			prev = round_down(end - 1, step_size);
+			if (prev < start)
+				prev = start;
+		} else
+			prev = start;
+
+		new_mapped_ram_size = init_range_memory_mapping(prev, end);
+		end = prev;
+
+		if (new_mapped_ram_size > mapped_ram_size)
+			step_size <<= STEP_SIZE_SHIFT;
+		mapped_ram_size += new_mapped_ram_size;
+	}
+}
+
+void __init init_mem_mapping(void)
+{
+	unsigned long end;
 
 	probe_page_size_mask();
 
@@ -423,44 +487,43 @@ void __init init_mem_mapping(void)
 	end = max_low_pfn << PAGE_SHIFT;
 #endif
 
-	/* the ISA range is always mapped regardless of memory holes */
-	init_memory_mapping(0, ISA_END_ADDRESS);
+	max_pfn_mapped = 0; /* will get exact value next */
+	min_pfn_mapped = end >> PAGE_SHIFT;
+
+#ifdef CONFIG_MOVABLE_NODE
+	unsigned long kernel_end;
+
+	if (movablenode_enable_srat &&
+	    memblock.current_order == MEMBLOCK_ORDER_LOW_TO_HIGH) {
+		kernel_end = round_up(__pa_symbol(_end), PMD_SIZE);
+
+		memory_map_from_low(kernel_end, end);
+		memory_map_from_low(ISA_END_ADDRESS, kernel_end);
+		goto out;
+	}
+#endif /* CONFIG_MOVABLE_NODE */
+
+	unsigned long addr, real_end;
 
 	/* xen has big range in reserved near end of ram, skip it at first.*/
 	addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
 	real_end = addr + PMD_SIZE;
 
-	/* step_size need to be small so pgt_buf from BRK could cover it */
-	step_size = PMD_SIZE;
-	max_pfn_mapped = 0; /* will get exact value next */
-	min_pfn_mapped = real_end >> PAGE_SHIFT;
-	last_start = start = real_end;
-
 	/*
 	 * We start from the top (end of memory) and go to the bottom.
 	 * The memblock_find_in_range() gets us a block of RAM from the
 	 * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
 	 * for page table.
 	 */
-	while (last_start > ISA_END_ADDRESS) {
-		if (last_start > step_size) {
-			start = round_down(last_start - 1, step_size);
-			if (start < ISA_END_ADDRESS)
-				start = ISA_END_ADDRESS;
-		} else
-			start = ISA_END_ADDRESS;
-		new_mapped_ram_size = init_range_memory_mapping(start,
-							last_start);
-		last_start = start;
-		/* only increase step_size after big range get mapped */
-		if (new_mapped_ram_size > mapped_ram_size)
-			step_size <<= STEP_SIZE_SHIFT;
-		mapped_ram_size += new_mapped_ram_size;
-	}
+	memory_map_from_high(ISA_END_ADDRESS, real_end);
 
 	if (real_end < end)
 		init_range_memory_mapping(real_end, end);
 
+out:
+	/* the ISA range is always mapped regardless of memory holes */
+	init_memory_mapping(0, ISA_END_ADDRESS);
+
 #ifdef CONFIG_X86_64
 	if (max_pfn > max_low_pfn) {
 		/* can we preseve max_low_pfn ?*/
-- 
1.7.1



* [PATCH 11/11] x86, mem_hotplug: Allocate memory near kernel image before SRAT is parsed.
  2013-08-27  9:37 [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed Tang Chen
                   ` (9 preceding siblings ...)
  2013-08-27  9:37 ` [PATCH 10/11] x86, mem-hotplug: Support initialize page tables from low to high Tang Chen
@ 2013-08-27  9:37 ` Tang Chen
  2013-09-04 19:40   ` Toshi Kani
       [not found] ` <20130828080311.GA608@hacker.(null)>
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Tang Chen @ 2013-08-27  9:37 UTC (permalink / raw)
  To: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

After memblock is ready, before SRAT is parsed, we should allocate memory
near the kernel image. So this patch does the following:

1. After memblock is ready, make memblock allocate memory from low address
   to high, and set the lowest limit to the end of kernel image.
2. After SRAT is parsed, make memblock behave as default, allocate memory
   from high address to low, and reset the lowest limit to 0.

This behavior is controlled by movablenode boot option.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 arch/x86/kernel/setup.c |   37 +++++++++++++++++++++++++++++++++++++
 1 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fa7b5f0..0b35bbd 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1087,6 +1087,31 @@ void __init setup_arch(char **cmdline_p)
 	trim_platform_memory_ranges();
 	trim_low_memory_range();
 
+#ifdef CONFIG_MOVABLE_NODE
+	if (movablenode_enable_srat) {
+		/*
+		 * Memory used by the kernel cannot be hot-removed because Linux cannot
+		 * migrate the kernel pages. When memory hotplug is enabled, we should
+		 * prevent memblock from allocating memory for the kernel.
+		 *
+		 * ACPI SRAT records all hotpluggable memory ranges. But before SRAT is
+		 * parsed, we don't know about it.
+		 *
+		 * The kernel image is loaded into memory at very early time. We cannot
+		 * prevent this anyway. So on NUMA system, we set any node the kernel
+		 * resides in as un-hotpluggable.
+		 *
+		 * Since on modern servers, one node could have double-digit gigabytes
+		 * memory, we can assume the memory around the kernel image is also
+		 * un-hotpluggable. So before SRAT is parsed, just allocate memory near
+		 * the kernel image to try our best to keep the kernel away from
+		 * hotpluggable memory.
+		 */
+		memblock_set_current_order(MEMBLOCK_ORDER_LOW_TO_HIGH);
+		memblock_set_current_limit_low(__pa_symbol(_end));
+	}
+#endif /* CONFIG_MOVABLE_NODE */
+
 	init_mem_mapping();
 
 	early_trap_pf_init();
@@ -1127,6 +1152,18 @@ void __init setup_arch(char **cmdline_p)
 	early_acpi_boot_init();
 
 	initmem_init();
+
+#ifdef CONFIG_MOVABLE_NODE
+	if (movablenode_enable_srat) {
+		/*
+		 * ACPI SRAT has been parsed by now (in initmem_init()), so set
+		 * memblock back to the default behavior.
+		 */
+		memblock_set_current_order(MEMBLOCK_ORDER_DEFAULT);
+		memblock_set_current_limit_low(0);
+	}
+#endif /* CONFIG_MOVABLE_NODE */
+
 	memblock_find_dma_reserve();
 
 #ifdef CONFIG_KVM_GUEST
-- 
1.7.1



* Re: [PATCH 08/11] x86, acpi, memblock: Use __memblock_alloc_base() in acpi_initrd_override()
  2013-08-27  9:37 ` [PATCH 08/11] x86, acpi, memblock: Use __memblock_alloc_base() in acpi_initrd_override() Tang Chen
@ 2013-08-28  0:04   ` Rafael J. Wysocki
  0 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2013-08-28  0:04 UTC (permalink / raw)
  To: Tang Chen
  Cc: lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Tuesday, August 27, 2013 05:37:45 PM Tang Chen wrote:
> The current acpi_initrd_override() calls memblock_find_in_range() to allocate
> memory, and pass 0 to %start, which will not limited by the current_limit_low.
> 
> acpi_initrd_override()
>  |->memblock_find_in_range(0, ...)
>      |->memblock_find_in_range_node(0, ...)
> 
> When we want to allocate memory from the end of kernel image to higher memory,
> we need to limit the lowest address to the end of kernel image.
> 
> We have modified memblock_alloc_base_nid() to call memblock_find_in_range_node()
> with %start = MEMBLOCK_ALLOC_ACCESSIBLE, which means it will be limited by
> current_limit_low. And __memblock_alloc_base() calls memblock_alloc_base_nid().
> 
> __memblock_alloc_base()
>  |->memblock_alloc_base_nid()
>      |->memblock_find_in_range_node(MEMBLOCK_ALLOC_ACCESSIBLE, ...)
> 
> So use __memblock_alloc_base() to allocate memory in acpi_initrd_override().
> 
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

Looks OK to me.

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  drivers/acpi/osl.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> index fece767..1d68fc0 100644
> --- a/drivers/acpi/osl.c
> +++ b/drivers/acpi/osl.c
> @@ -629,8 +629,8 @@ void __init acpi_initrd_override(void *data, size_t size)
>  		return;
>  
>  	/* under 4G at first, then above 4G */
> -	acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
> -					all_tables_size, PAGE_SIZE);
> +	acpi_tables_addr = __memblock_alloc_base(all_tables_size,
> +						 PAGE_SIZE, (1ULL<<32) - 1);
>  	if (!acpi_tables_addr) {
>  		WARN_ON(1);
>  		return;
> 
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* Re: [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed.
       [not found] ` <20130828080311.GA608@hacker.(null)>
@ 2013-08-28  9:34   ` Tang Chen
  0 siblings, 0 replies; 32+ messages in thread
From: Tang Chen @ 2013-08-28  9:34 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hi Wanpeng

On 08/28/2013 04:03 PM, Wanpeng Li wrote:
> Hi Tang,
......
>> [About this patch-set]
>>
>> So this patch-set does the following:
>>
>> 1. Make memblock be able to allocate memory from low address to high address.
>
> I want to know if there is fragmentation degree difference here?
>

Before this patch-set, we mapped memory like this:

1. [0, ISA_END_ADDRESS),
2. [ISA_END_ADDRESS, round_down(max_addr, PMD_SIZE)), from top downwards,
3. [round_down(max_addr, PMD_SIZE), max_addr)


After this patch-set, when movablenode is enabled, it is like:

1. [round_up(_end, PMD_SIZE), max_addr), from _end upwards,
2. [ISA_END_ADDRESS, round_up(_end, PMD_SIZE)),
3. [0, ISA_END_ADDRESS)


All the boundaries are aligned with PMD_SIZE. I think it is the same.

Thanks.



* Re: [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed.
  2013-08-27  9:37 [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed Tang Chen
                   ` (11 preceding siblings ...)
       [not found] ` <20130828080311.GA608@hacker.(null)>
@ 2013-08-28 15:19 ` Tejun Heo
  2013-08-29  1:30   ` Tang Chen
  2013-09-02  1:03 ` Tang Chen
  2013-09-04 19:22 ` Tejun Heo
  14 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2013-08-28 15:19 UTC (permalink / raw)
  To: Tang Chen
  Cc: rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Tue, Aug 27, 2013 at 05:37:37PM +0800, Tang Chen wrote:
> Tang Chen (11):
>   memblock: Rename current_limit to current_limit_high in memblock.
>   memblock: Rename memblock_set_current_limit() to
>     memblock_set_current_limit_high().
>   memblock: Introduce lowest limit in memblock.
>   memblock: Introduce memblock_set_current_limit_low() to set lower
>     limit of memblock.
>   memblock: Introduce allocation order to memblock.
>   memblock: Improve memblock to support allocation from lower address.
>   x86, memblock: Set lowest limit for memblock_alloc_base_nid().
>   x86, acpi, memblock: Use __memblock_alloc_base() in
>     acpi_initrd_override()
>   mem-hotplug: Introduce movablenode boot option to {en|dis}able using
>     SRAT.
>   x86, mem-hotplug: Support initialize page tables from low to high.
>   x86, mem_hotplug: Allocate memory near kernel image before SRAT is
>     parsed.

Doesn't apply to -master, -next or tip.  Again, can you please include
which tree and git commit the patches are against in the patch
description?  How is one supposed to know on top of which tree you're
working?  It is to your benefit to make things easier for prospective
reviewers.  Trying to guess and apply the patches to different devel
branches and failing isn't productive, and it frustrates your prospective
reviewers, who will then have a negative preconception going into the
review.  This isn't the first time this issue has been raised, either.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed.
  2013-08-28 15:19 ` Tejun Heo
@ 2013-08-29  1:30   ` Tang Chen
       [not found]     ` <20130829013657.GA22599@hacker.(null)>
  0 siblings, 1 reply; 32+ messages in thread
From: Tang Chen @ 2013-08-29  1:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On 08/28/2013 11:19 PM, Tejun Heo wrote:
......
> Doesn't apply to -master, -next or tip.  Again, can you please include
> which tree and git commit the patches are against in the patch
> description?  How is one supposed to know on top of which tree you're
> working?  It is to your benefit to make things easier for prospective
> reviewers.  Trying to guess and apply the patches to different devel
> branches and failing isn't productive, and it frustrates your prospective
> reviewers, who will then have a negative preconception going into the
> review.  This isn't the first time this issue has been raised, either.
>

Hi tj,

Sorry for the trouble. Please refer to the following branch:

https://github.com/imtangchen/linux.git  movablenode-boot-option

Thanks.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed.
       [not found]     ` <20130829013657.GA22599@hacker.(null)>
@ 2013-08-29  1:53       ` Tang Chen
  0 siblings, 0 replies; 32+ messages in thread
From: Tang Chen @ 2013-08-29  1:53 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Tejun Heo, rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai,
	jiang.liu, wency, laijs, isimatu.yasuaki, izumi.taku, mgorman,
	minchan, mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel,
	jweiner, prarit, zhangyanfei, x86, linux-doc, linux-kernel,
	linux-mm, linux-acpi

Hi Wanpeng,

On 08/29/2013 09:36 AM, Wanpeng Li wrote:
......
>> Hi tj,
>>
>> Sorry for the trouble. Please refer to the following branch:
>>
>> https://github.com/imtangchen/linux.git  movablenode-boot-option
>>
>
> Could you post your testcase? So I can test it on x86 and powerpc machines.
>

Sure. Some simple test cases:

1. Boot the kernel without the movablenode boot option, and check that
    the memory mapping is initialized as before, from high to low.
2. Boot the kernel with the movablenode boot option, and check that the
    memory mapping is initialized from low to high.
3. With movablenode, check that memory allocation is from high to low
    after SRAT is parsed.
4. Check that acpi_initrd_override() works normally with and without
    movablenode, and that its memory allocation is from low to high,
    near the end of the kernel image.
5. With movablenode, check that the crashkernel boot option works normally.
    (This may consume a lot of memory, but should work normally.)
6. With movablenode, check that relocate_initrd() works normally.
    (This may consume a lot of memory, but should work normally.)
7. With movablenode, check that kexec can relocate the kernel to higher
    memory.
    (This may consume hotpluggable memory if the higher memory is
    hotpluggable, but should work normally.)


Please run the above tests with and without the following config options:

1. CONFIG_MOVABLE_NODE
2. CONFIG_ACPI_INITRD_OVERRIDE


Thanks for the testing.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed.
  2013-08-27  9:37 [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed Tang Chen
                   ` (12 preceding siblings ...)
  2013-08-28 15:19 ` Tejun Heo
@ 2013-09-02  1:03 ` Tang Chen
  2013-09-04 19:22 ` Tejun Heo
  14 siblings, 0 replies; 32+ messages in thread
From: Tang Chen @ 2013-09-02  1:03 UTC (permalink / raw)
  To: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei
  Cc: x86, linux-doc, linux-kernel, linux-mm, linux-acpi

Hi guys,

Any comments on this patch-set?  And shall we agree on using the solution
suggested by Tejun?

Thanks.

On 08/27/2013 05:37 PM, Tang Chen wrote:
> This patch-set is based on tj's suggestion, and not fully tested.
> Just for review and discussion.
>
>
> [Problem]
>
> Currently, Linux cannot migrate pages used by the kernel because of the
> kernel direct mapping: in kernel space, va = pa + PAGE_OFFSET. When the
> pa changes, we cannot simply update the page table and keep the va
> unmodified. So kernel pages are not migratable.
>
> There are also other issues that make kernel pages non-migratable. For
> example, a physical address may be cached somewhere for later use, and
> it is not feasible to update all such caches.
>
> When doing memory hotplug in Linux, we first migrate all the pages of a
> memory device somewhere else, and then remove the device. But if pages
> are used by the kernel, they are not migratable. As a result, memory
> used by the kernel cannot be hot-removed.
>
> Modifying the kernel direct mapping mechanism is too difficult, and it
> could degrade kernel performance and stability. So we use the following
> approach to do memory hotplug.
>
>
> [What we are doing]
>
> In Linux, memory in one NUMA node is divided into several zones. One of
> the zones is ZONE_MOVABLE, which the kernel won't use.
>
> In order to implement memory hotplug in Linux, we are going to arrange
> all hotpluggable memory in ZONE_MOVABLE so that the kernel won't use
> this memory.
> To do this, we need ACPI's help.
>
> In ACPI, the SRAT (System Resource Affinity Table) contains NUMA info.
> The memory affinities in SRAT record every memory range in the system,
> along with flags specifying whether each memory range is hotpluggable.
> (Please refer to ACPI spec 5.0, section 5.2.16.)
>
> With the help of SRAT, we have to do the following two things to achieve our
> goal:
>
> 1. When doing memory hot-add, allow users to arrange hotpluggable memory
>     as ZONE_MOVABLE.
>     (This has been done by the MOVABLE_NODE functionality in Linux.)
>
> 2. When the system is booting, prevent the bootmem allocator from
>     allocating hotpluggable memory for the kernel before memory
>     initialization finishes.
>
> Problem 2 is the key problem we are going to solve. But before solving
> it, we need some preparation. Please see below.
>
>
> [Preparation]
>
> The bootloader has to load the kernel image into memory, and this memory
> must not be hot-removed. We cannot prevent this anyway. So on a memory
> hotplug system, we can assume any node the kernel resides in is not
> hotpluggable.
>
> Before SRAT is parsed, we don't know which memory ranges are hotpluggable.
> But memblock has already started to work. In the current kernel, memblock
> allocates the following memory before SRAT is parsed:
>
> setup_arch()
>   |->memblock_x86_fill()            /* memblock is ready */
>   |......
>   |->early_reserve_e820_mpc_new()   /* allocate memory under 1MB */
>   |->reserve_real_mode()            /* allocate memory under 1MB */
>   |->init_mem_mapping()             /* allocate page tables, about 2MB to map 1GB memory */
>   |->dma_contiguous_reserve()       /* specified by user, should be low */
>   |->setup_log_buf()                /* specified by user, several mega bytes */
>   |->relocate_initrd()              /* could be large, but will be freed after boot, should reorder */
>   |->acpi_initrd_override()         /* several mega bytes */
>   |->reserve_crashkernel()          /* could be large, should reorder */
>   |......
>   |->initmem_init()                 /* Parse SRAT */
>
> According to Tejun's advice, before SRAT is parsed, we should try our best
> to allocate memory near the kernel image. Since the whole node the kernel
> resides in won't be hotpluggable, and a node on a modern server may have
> at least 16GB of memory, allocating several megabytes of memory around
> the kernel image won't spill into hotpluggable memory.
>
>
> [About this patch-set]
>
> So this patch-set does the following:
>
> 1. Make memblock be able to allocate memory from low address to high address.
>     Also introduce low limit to prevent memblock allocating memory too low.
>
> 2. Improve init_mem_mapping() to support allocate page tables from low address
>     to high address.
>
> 3. Introduce "movablenode" boot option to enable and disable this functionality.
>
> PS: Reordering of relocate_initrd() and reserve_crashkernel() has not been done
>      yet. acpi_initrd_override() needs to access initrd with virtual address. So
>      relocate_initrd() must be done before acpi_initrd_override().
>
>
> Tang Chen (11):
>    memblock: Rename current_limit to current_limit_high in memblock.
>    memblock: Rename memblock_set_current_limit() to
>      memblock_set_current_limit_high().
>    memblock: Introduce lowest limit in memblock.
>    memblock: Introduce memblock_set_current_limit_low() to set lower
>      limit of memblock.
>    memblock: Introduce allocation order to memblock.
>    memblock: Improve memblock to support allocation from lower address.
>    x86, memblock: Set lowest limit for memblock_alloc_base_nid().
>    x86, acpi, memblock: Use __memblock_alloc_base() in
>      acpi_initrd_override()
>    mem-hotplug: Introduce movablenode boot option to {en|dis}able using
>      SRAT.
>    x86, mem-hotplug: Support initialize page tables from low to high.
>    x86, mem_hotplug: Allocate memory near kernel image before SRAT is
>      parsed.
>
>   Documentation/kernel-parameters.txt |   15 ++++
>   arch/arm/mm/mmu.c                   |    2 +-
>   arch/arm64/mm/mmu.c                 |    4 +-
>   arch/microblaze/mm/init.c           |    2 +-
>   arch/powerpc/mm/40x_mmu.c           |    4 +-
>   arch/powerpc/mm/44x_mmu.c           |    2 +-
>   arch/powerpc/mm/fsl_booke_mmu.c     |    4 +-
>   arch/powerpc/mm/hash_utils_64.c     |    4 +-
>   arch/powerpc/mm/init_32.c           |    4 +-
>   arch/powerpc/mm/ppc_mmu_32.c        |    4 +-
>   arch/powerpc/mm/tlb_nohash.c        |    4 +-
>   arch/unicore32/mm/mmu.c             |    2 +-
>   arch/x86/kernel/setup.c             |   41 ++++++++++-
>   arch/x86/mm/init.c                  |  119 ++++++++++++++++++++++++--------
>   drivers/acpi/osl.c                  |    4 +-
>   include/linux/memblock.h            |   33 ++++++++--
>   include/linux/memory_hotplug.h      |    5 ++
>   mm/memblock.c                       |  131 +++++++++++++++++++++++++++++-----
>   mm/memory_hotplug.c                 |    9 +++
>   mm/nobootmem.c                      |    4 +-
>   20 files changed, 320 insertions(+), 77 deletions(-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 06/11] memblock: Improve memblock to support allocation from lower address.
  2013-08-27  9:37 ` [PATCH 06/11] memblock: Improve memblock to support allocation from lower address Tang Chen
@ 2013-09-04  0:24   ` Toshi Kani
  2013-09-04  1:00     ` Tang Chen
  0 siblings, 1 reply; 32+ messages in thread
From: Toshi Kani @ 2013-09-04  0:24 UTC (permalink / raw)
  To: Tang Chen
  Cc: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Tue, 2013-08-27 at 17:37 +0800, Tang Chen wrote:
> This patch modifies the memblock_find_in_range_node() to support two
> different allocation orders. After this patch, memblock will check
> memblock.current_order, and decide in which order to allocate memory.
> 
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> ---
>  mm/memblock.c |   90 +++++++++++++++++++++++++++++++++++++++++++++++---------
>  1 files changed, 75 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 8f1e2d4..961d4a5 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -85,6 +85,77 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
>  }
>  
>  /**
> + * __memblock_find_range - find free area utility
> + * @start: start of candidate range, can be %MEMBLOCK_ALLOC_ACCESSIBLE
> + * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
> + * @size: size of free area to find
> + * @align: alignment of free area to find
> + * @nid: nid of the free area to find, %MAX_NUMNODES for any node
> + *
> + * Utility called from memblock_find_in_range_node(), find free area from
> + * lower address to higher address.
> + *
> + * RETURNS:
> + * Found address on success, %0 on failure.
> + */
> +phys_addr_t __init_memblock
> +__memblock_find_range(phys_addr_t start, phys_addr_t end,
> +		      phys_addr_t size, phys_addr_t align, int nid)

This func should be static as it must be an internal func.

> +{
> +	phys_addr_t this_start, this_end, cand;
> +	u64 i;
> +
> +	for_each_free_mem_range(i, nid, &this_start, &this_end, NULL) {
> +		this_start = clamp(this_start, start, end);
> +		this_end = clamp(this_end, start, end);
> +
> +		cand = round_up(this_start, align);
> +		if (cand < this_end && this_end - cand >= size)
> +			return cand;
> +	}
> +	return 0;
> +}
> +
> +/**
> + * __memblock_find_range_rev - find free area utility, in reverse order
> + * @start: start of candidate range, can be %MEMBLOCK_ALLOC_ACCESSIBLE
> + * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
> + * @size: size of free area to find
> + * @align: alignment of free area to find
> + * @nid: nid of the free area to find, %MAX_NUMNODES for any node
> + *
> + * Utility called from memblock_find_in_range_node(), find free area from
> + * higher address to lower address.
> + *
> + * RETURNS:
> + * Found address on success, %0 on failure.
> + */
> +phys_addr_t __init_memblock
> +__memblock_find_range_rev(phys_addr_t start, phys_addr_t end,
> +			  phys_addr_t size, phys_addr_t align, int nid)

Ditto.

> +{
> +	phys_addr_t this_start, this_end, cand;
> +	u64 i;
> +
> +	for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
> +		this_start = clamp(this_start, start, end);
> +		this_end = clamp(this_end, start, end);
> +
> +		/*
> +		 * Just in case that (this_end - size) underflows and cause
> +		 * (cand >= this_start) to be true incorrectly.
> +		 */
> +		if (this_end < size)
> +			break;
> +
> +		cand = round_down(this_end - size, align);
> +		if (cand >= this_start)
> +			return cand;
> +	}
> +	return 0;
> +}
> +
> +/**
>   * memblock_find_in_range_node - find free area in given range and node
>   * @start: start of candidate range, can be %MEMBLOCK_ALLOC_ACCESSIBLE
>   * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
> @@ -110,9 +181,6 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>  					phys_addr_t end, phys_addr_t size,
>  					phys_addr_t align, int nid)
>  {
> -	phys_addr_t this_start, this_end, cand;
> -	u64 i;
> -
>  	/* pump up @start and @end */
>  	if (start == MEMBLOCK_ALLOC_ACCESSIBLE)
>  		start = memblock.current_limit_low;
> @@ -123,18 +191,10 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
>  	start = max_t(phys_addr_t, start, PAGE_SIZE);
>  	end = max(start, end);
>  
> -	for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
> -		this_start = clamp(this_start, start, end);
> -		this_end = clamp(this_end, start, end);
> -
> -		if (this_end < size)
> -			continue;
> -
> -		cand = round_down(this_end - size, align);
> -		if (cand >= this_start)
> -			return cand;
> -	}
> -	return 0;
> +	if (memblock.current_order == MEMBLOCK_ORDER_DEFAULT)

This needs to use MEMBLOCK_ORDER_HIGH_TO_LOW since the code should be
independent from the value of MEMBLOCK_ORDER_DEFAULT.

Thanks,
-Toshi



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 07/11] x86, memblock: Set lowest limit for memblock_alloc_base_nid().
  2013-08-27  9:37 ` [PATCH 07/11] x86, memblock: Set lowest limit for memblock_alloc_base_nid() Tang Chen
@ 2013-09-04  0:37   ` Toshi Kani
  2013-09-04  2:05     ` Tang Chen
  0 siblings, 1 reply; 32+ messages in thread
From: Toshi Kani @ 2013-09-04  0:37 UTC (permalink / raw)
  To: Tang Chen
  Cc: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Tue, 2013-08-27 at 17:37 +0800, Tang Chen wrote:
> memblock_alloc_base_nid() is a common API of memblock. And it calls
> memblock_find_in_range_node() with %start = 0, which means it has no
> limit for the lowest address by default.
> 
> 	memblock_find_in_range_node(0, max_addr, size, align, nid);
> 
> Since we introduced current_limit_low to memblock, if we have no limit
> for the lowest address or we are not sure, we should pass
> MEMBLOCK_ALLOC_ACCESSIBLE to %start so that it will be limited by the
> default low limit.
> 
> dma_contiguous_reserve() and setup_log_buf() will eventually call
> memblock_alloc_base_nid() to allocate memory. So if the allocation order
> is from low to high, they will allocate memory from the lowest limit
> to higher memory.

This requires the callers to use MEMBLOCK_ALLOC_ACCESSIBLE instead of 0.
Is there a good way to make sure that all callers will follow this rule
going forward?  Perhaps, memblock_find_in_range_node() should emit some
message if 0 is passed when current_order is low to high and the boot
option is specified?

Similarly, I wonder if we should have a check to the allocation size to
make sure that all allocations will stay small in this case.

Thanks,
-Toshi


> 
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> ---
>  mm/memblock.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 961d4a5..be8c4d1 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -851,7 +851,8 @@ static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
>  	/* align @size to avoid excessive fragmentation on reserved array */
>  	size = round_up(size, align);
>  
> -	found = memblock_find_in_range_node(0, max_addr, size, align, nid);
> +	found = memblock_find_in_range_node(MEMBLOCK_ALLOC_ACCESSIBLE,
> +					    max_addr, size, align, nid);
>  	if (found && !memblock_reserve(found, size))
>  		return found;
>  



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 06/11] memblock: Improve memblock to support allocation from lower address.
  2013-09-04  0:24   ` Toshi Kani
@ 2013-09-04  1:00     ` Tang Chen
  0 siblings, 0 replies; 32+ messages in thread
From: Tang Chen @ 2013-09-04  1:00 UTC (permalink / raw)
  To: Toshi Kani
  Cc: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On 09/04/2013 08:24 AM, Toshi Kani wrote:
......
>> +phys_addr_t __init_memblock
>> +__memblock_find_range(phys_addr_t start, phys_addr_t end,
>> +		      phys_addr_t size, phys_addr_t align, int nid)
>
> This func should be static as it must be an internal func.
>
......
>> +phys_addr_t __init_memblock
>> +__memblock_find_range_rev(phys_addr_t start, phys_addr_t end,
>> +			  phys_addr_t size, phys_addr_t align, int nid)
>
> Ditto.
......
>> +	if (memblock.current_order == MEMBLOCK_ORDER_DEFAULT)
>
> This needs to use MEMBLOCK_ORDER_HIGH_TO_LOW since the code should be
> independent from the value of MEMBLOCK_ORDER_DEFAULT.
>

OK, followed.

Thanks.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 07/11] x86, memblock: Set lowest limit for memblock_alloc_base_nid().
  2013-09-04  0:37   ` Toshi Kani
@ 2013-09-04  2:05     ` Tang Chen
  2013-09-04 15:22       ` Toshi Kani
  0 siblings, 1 reply; 32+ messages in thread
From: Tang Chen @ 2013-09-04  2:05 UTC (permalink / raw)
  To: Toshi Kani
  Cc: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On 09/04/2013 08:37 AM, Toshi Kani wrote:
> On Tue, 2013-08-27 at 17:37 +0800, Tang Chen wrote:
>> memblock_alloc_base_nid() is a common API of memblock. And it calls
>> memblock_find_in_range_node() with %start = 0, which means it has no
>> limit for the lowest address by default.
>>
>> 	memblock_find_in_range_node(0, max_addr, size, align, nid);
>>
>> Since we introduced current_limit_low to memblock, if we have no limit
>> for the lowest address or we are not sure, we should pass
>> MEMBLOCK_ALLOC_ACCESSIBLE to %start so that it will be limited by the
>> default low limit.
>>
>> dma_contiguous_reserve() and setup_log_buf() will eventually call
>> memblock_alloc_base_nid() to allocate memory. So if the allocation order
>> is from low to high, they will allocate memory from the lowest limit
>> to higher memory.
>
> This requires the callers to use MEMBLOCK_ALLOC_ACCESSIBLE instead of 0.
> Is there a good way to make sure that all callers will follow this rule
> going forward?  Perhaps, memblock_find_in_range_node() should emit some
> message if 0 is passed when current_order is low to high and the boot
> option is specified?

How about setting this as the default rule:

	When using low-to-high order, always allocate memory starting
	from current_limit_low.

So far, I think only the movablenode boot option will use this order.

>
> Similarly, I wonder if we should have a check to the allocation size to
> make sure that all allocations will stay small in this case.
>

We can check the size. But what is the strategy after we find that the
size is too large?  Do we refuse to allocate memory?  I don't think so.

I think only relocate_initrd() and reserve_crashkernel() could allocate
large amounts of memory. reserve_crashkernel() is easy to reorder, but
reordering relocate_initrd() is difficult because acpi_initrd_override()
needs to access the initrd via its virtual address.

I think on most servers, we don't need to do relocate_initrd(). The
initrd will be loaded into mapped memory in the normal situation. Can we
just leave it there?

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 07/11] x86, memblock: Set lowest limit for memblock_alloc_base_nid().
  2013-09-04  2:05     ` Tang Chen
@ 2013-09-04 15:22       ` Toshi Kani
  0 siblings, 0 replies; 32+ messages in thread
From: Toshi Kani @ 2013-09-04 15:22 UTC (permalink / raw)
  To: Tang Chen
  Cc: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Wed, 2013-09-04 at 10:05 +0800, Tang Chen wrote:
> On 09/04/2013 08:37 AM, Toshi Kani wrote:
> > On Tue, 2013-08-27 at 17:37 +0800, Tang Chen wrote:
> >> memblock_alloc_base_nid() is a common API of memblock. And it calls
> >> memblock_find_in_range_node() with %start = 0, which means it has no
> >> limit for the lowest address by default.
> >>
> >> 	memblock_find_in_range_node(0, max_addr, size, align, nid);
> >>
> >> Since we introduced current_limit_low to memblock, if we have no limit
> >> for the lowest address or we are not sure, we should pass
> >> MEMBLOCK_ALLOC_ACCESSIBLE to %start so that it will be limited by the
> >> default low limit.
> >>
> >> dma_contiguous_reserve() and setup_log_buf() will eventually call
> >> memblock_alloc_base_nid() to allocate memory. So if the allocation order
> >> is from low to high, they will allocate memory from the lowest limit
> >> to higher memory.
> >
> > This requires the callers to use MEMBLOCK_ALLOC_ACCESSIBLE instead of 0.
> > Is there a good way to make sure that all callers will follow this rule
> > going forward?  Perhaps, memblock_find_in_range_node() should emit some
> > message if 0 is passed when current_order is low to high and the boot
> > option is specified?
> 
> How about setting this as the default rule:
> 
> 	When using low-to-high order, always allocate memory starting
> 	from current_limit_low.
> 
> So far, I think only the movablenode boot option will use this order.

Sounds good to me.

> > Similarly, I wonder if we should have a check to the allocation size to
> > make sure that all allocations will stay small in this case.
> >
> 
> We can check the size. But what is the strategy after we find that the
> size is too large?  Do we refuse to allocate memory?  I don't think so.

We can just add a log message.  No need to fail.

> I think only relocate_initrd() and reserve_crashkernel() could allocate
> large amounts of memory. reserve_crashkernel() is easy to reorder, but
> reordering relocate_initrd() is difficult because acpi_initrd_override()
> needs to access the initrd via its virtual address.
> 
> I think on most servers, we don't need to do relocate_initrd(). The
> initrd will be loaded into mapped memory in the normal situation. Can we
> just leave it there?

Since this approach relies on the assumption that all allocations are
small enough, it would be nice to have a way to verify if it remains
true.  How about we measure a total amount of allocations while the
order is low to high, and log it when switched to high to low?  This
way, we can easily monitor the usage.

Thanks,
-Toshi


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed.
  2013-08-27  9:37 [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed Tang Chen
                   ` (13 preceding siblings ...)
  2013-09-02  1:03 ` Tang Chen
@ 2013-09-04 19:22 ` Tejun Heo
  2013-09-05  9:01   ` Tang Chen
       [not found]   ` <52299935.0302450a.26c9.ffffb240SMTPIN_ADDED_BROKEN@mx.google.com>
  14 siblings, 2 replies; 32+ messages in thread
From: Tejun Heo @ 2013-09-04 19:22 UTC (permalink / raw)
  To: Tang Chen
  Cc: rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hello,

On Tue, Aug 27, 2013 at 05:37:37PM +0800, Tang Chen wrote:
> 1. Make memblock be able to allocate memory from low address to high address.
>    Also introduce low limit to prevent memblock allocating memory too low.
> 
> 2. Improve init_mem_mapping() to support allocate page tables from low address 
>    to high address.
> 
> 3. Introduce "movablenode" boot option to enable and disable this functionality.
> 
> PS: Reordering of relocate_initrd() and reserve_crashkernel() has not been done 
>     yet. acpi_initrd_override() needs to access initrd with virtual address. So 
>     relocate_initrd() must be done before acpi_initrd_override().

I'm expectedly happier with this approach but some overall review
points.

* I think patch splitting went a bit too far.  e.g. it doesn't make
  much sense or help anything to split "introduction of a param" from
  "the param doing something".

* I think it's a lot more complex than necessary.  Just implement a
  single function - memblock_alloc_bottom_up(@start) where specifying
  MEMBLOCK_ALLOC_ANYWHERE restores top down behavior and do
  memblock_alloc_bottom_up(end_of_kernel) early during boot.  If the
  bottom up mode is set, just try allocating bottom up from the
  specified address and if that fails do normal top down allocation.
  No need to meddle with the callers.  The only change necessary
  (well, aside from the reordering) outside memblock is adding two
  calls to the above function.

* I don't think "order" is the right word here.  "direction" probably
  fits a lot better.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 11/11] x86, mem_hotplug: Allocate memory near kernel image before SRAT is parsed.
  2013-08-27  9:37 ` [PATCH 11/11] x86, mem_hotplug: Allocate memory near kernel image before SRAT is parsed Tang Chen
@ 2013-09-04 19:40   ` Toshi Kani
  0 siblings, 0 replies; 32+ messages in thread
From: Toshi Kani @ 2013-09-04 19:40 UTC (permalink / raw)
  To: Tang Chen
  Cc: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

On Tue, 2013-08-27 at 17:37 +0800, Tang Chen wrote:
> After memblock is ready, before SRAT is parsed, we should allocate memory
> near the kernel image. So this patch does the following:
> 
> 1. After memblock is ready, make memblock allocate memory from low address
>    to high, and set the lowest limit to the end of kernel image.
> 2. After SRAT is parsed, make memblock behave as default, allocate memory
>    from high address to low, and reset the lowest limit to 0.
> 
> This behavior is controlled by movablenode boot option.
> 
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> ---
>  arch/x86/kernel/setup.c |   37 +++++++++++++++++++++++++++++++++++++
>  1 files changed, 37 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index fa7b5f0..0b35bbd 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1087,6 +1087,31 @@ void __init setup_arch(char **cmdline_p)
>  	trim_platform_memory_ranges();
>  	trim_low_memory_range();
>  
> +#ifdef CONFIG_MOVABLE_NODE
> +	if (movablenode_enable_srat) {
> +		/*
> +		 * Memory used by the kernel cannot be hot-removed because Linux cannot
> +		 * migrate the kernel pages. When memory hotplug is enabled, we should
> +		 * prevent memblock from allocating memory for the kernel.
> +		 *
> +		 * ACPI SRAT records all hotpluggable memory ranges. But before SRAT is
> +		 * parsed, we don't know about it.
> +		 *
> +		 * The kernel image is loaded into memory very early. We cannot
> +		 * prevent this anyway. So on a NUMA system, we mark any node the
> +		 * kernel resides in as un-hotpluggable.
> +		 *
> +		 * Since one node on a modern server could have double-digit gigabytes
> +		 * of memory, we can assume the memory around the kernel image is also

Memory hotplug can be supported in virtualized environments, and we
should allow using SRAT on them as a next step.  In such environments,
memory hotplug will be performed per memory device object for
workload balancing, and double-digit gigabytes is unlikely to be the
case for now.  So, I'd suggest it instead state that all allocations
are kept small until SRAT is parsed.

> +		 * un-hotpluggable. So before SRAT is parsed, just allocate memory near
> +		 * the kernel image to do our best to keep the kernel away from
> +		 * hotpluggable memory.
> +		 */
> +		 */
> +		memblock_set_current_order(MEMBLOCK_ORDER_LOW_TO_HIGH);
> +		memblock_set_current_limit_low(__pa_symbol(_end));
> +	}
> +#endif /* CONFIG_MOVABLE_NODE */

Should the above block be put into init_mem_mapping() since it is
memblock initialization?  It is good to have some concise comments here,
though.

> +
>  	init_mem_mapping();
>  
>  	early_trap_pf_init();
> @@ -1127,6 +1152,18 @@ void __init setup_arch(char **cmdline_p)
>  	early_acpi_boot_init();
>  
>  	initmem_init();
> +
> +#ifdef CONFIG_MOVABLE_NODE
> +	if (movablenode_enable_srat) {
> +		/*
> +		 * When ACPI SRAT is parsed, which is done in initmem_init(), set
> +		 * memblock back to the default behavior.
> +		 */
> +		memblock_set_current_order(MEMBLOCK_ORDER_DEFAULT);
> +		memblock_set_current_limit_low(0);
> +	}
> +#endif /* CONFIG_MOVABLE_NODE */

Similarly, should this block be put into initmem_init() with some
comment here?

Thanks,
-Toshi


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed.
  2013-09-04 19:22 ` Tejun Heo
@ 2013-09-05  9:01   ` Tang Chen
       [not found]   ` <52299935.0302450a.26c9.ffffb240SMTPIN_ADDED_BROKEN@mx.google.com>
  1 sibling, 0 replies; 32+ messages in thread
From: Tang Chen @ 2013-09-05  9:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hi tj,

On 09/05/2013 03:22 AM, Tejun Heo wrote:
......
> I'm expectedly happier with this approach but some overall review
> points.
>
> * I think patch splitting went a bit too far.  e.g. it doesn't make
>    much sense or help anything to split "introduction of a param" from
>    "the param doing something".
>
> * I think it's a lot more complex than necessary.  Just implement a
>    single function - memblock_alloc_bottom_up(@start) where specifying
>    MEMBLOCK_ALLOC_ANYWHERE restores top down behavior and do
>    memblock_alloc_bottom_up(end_of_kernel) early during boot.  If the
>    bottom up mode is set, just try allocating bottom up from the
>    specified address and if that fails do normal top down allocation.
>    No need to meddle with the callers.  The only change necessary
>    (well, aside from the reordering) outside memblock is adding two
>    calls to the above function.
>
> * I don't think "order" is the right word here.  "direction" probably
>    fits a lot better.
>

Thanks for the advice. I'll try to simplify the code and send a new
patch-set soon.

Thanks.




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/11] memblock: Introduce allocation order to memblock.
       [not found]   ` <20130905091615.GB15294@hacker.(null)>
@ 2013-09-05  9:21     ` Tang Chen
  0 siblings, 0 replies; 32+ messages in thread
From: Tang Chen @ 2013-09-05  9:21 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hi Wanpeng,

On 09/05/2013 05:16 PM, Wanpeng Li wrote:
......
>>
>> +/* Allocation order. */
>
> How about replacing "Allocation order" with "Allocation sequence"?
>
> "Allocation order" is ambiguous.
>

Yes, "order" is ambiguous. But as tj suggested, I think "direction"
is better.

Thanks. :)


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 10/11] x86, mem-hotplug: Support initialize page tables from low to high.
       [not found]   ` <20130905133027.GA23038@hacker.(null)>
@ 2013-09-06  1:34     ` Tang Chen
       [not found]       ` <20130906021653.GA1062@hacker.(null)>
  0 siblings, 1 reply; 32+ messages in thread
From: Tang Chen @ 2013-09-06  1:34 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hi Wanpeng,

Thank you for reviewing. See below, please.

On 09/05/2013 09:30 PM, Wanpeng Li wrote:
......
>> +#ifdef CONFIG_MOVABLE_NODE
>> +	unsigned long kernel_end;
>> +
>> +	if (movablenode_enable_srat&&
>> +	    memblock.current_order == MEMBLOCK_ORDER_LOW_TO_HIGH) {
>
> I think memblock.current_order == MEMBLOCK_ORDER_LOW_TO_HIGH is always
> true if MOVABLE_NODE is configured and movablenode_enable_srat == true,
> once PATCH 11/11 is applied.

memblock.current_order == MEMBLOCK_ORDER_LOW_TO_HIGH is true here if
MOVABLE_NODE is configured, and it will be reset after SRAT is parsed.
But movablenode_enable_srat can only be true when users specify the
movablenode boot option on the kernel command line.

Please refer to patch 9/11.

>
>> +		kernel_end = round_up(__pa_symbol(_end), PMD_SIZE);
>> +
>> +		memory_map_from_low(kernel_end, end);
>> +		memory_map_from_low(ISA_END_ADDRESS, kernel_end);
>
> Why split ISA_END_ADDRESS ~ end?

The first 5 pages for the page tables come from brk; please refer to
alloc_low_pages(). They are able to map about 2MB of memory, and that
2MB is then used to store page tables for the pages mapped next.

Here, we split [ISA_END_ADDRESS, end) into [ISA_END_ADDRESS, _end) and
[_end, end), and map [_end, end) first. This is because memory in
[ISA_END_ADDRESS, _end) may already be in use, in which case we would
not have enough memory for the upcoming page tables. We map
[_end, end) first because that memory is very likely unused.

>
......
>
> I think the variables sorted by address are:
> ISA_END_ADDRESS ->  _end ->  real_end ->  end

Yes.

>
>> +	memory_map_from_high(ISA_END_ADDRESS, real_end);
>
> Does this overlap with the work done between #ifdef CONFIG_MOVABLE_NODE
> and #endif?
>

I don't think so. As you can see from my code, if the work between
#ifdef CONFIG_MOVABLE_NODE and #endif is done, it will goto out, right?

Thanks.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 10/11] x86, mem-hotplug: Support initialize page tables from low to high.
       [not found]       ` <20130906021653.GA1062@hacker.(null)>
@ 2013-09-06  3:09         ` Tang Chen
  0 siblings, 0 replies; 32+ messages in thread
From: Tang Chen @ 2013-09-06  3:09 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: rjw, lenb, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hi Wanpeng,

On 09/06/2013 10:16 AM, Wanpeng Li wrote:
......
>>>> +#ifdef CONFIG_MOVABLE_NODE
>>>> +	unsigned long kernel_end;
>>>> +
>>>> +	if (movablenode_enable_srat&&
>>>> +	    memblock.current_order == MEMBLOCK_ORDER_LOW_TO_HIGH) {
>>>
>>> I think memblock.current_order == MEMBLOCK_ORDER_LOW_TO_HIGH is always
>>> true if MOVABLE_NODE is configured and movablenode_enable_srat == true,
>>> once PATCH 11/11 is applied.
>>
>> memblock.current_order == MEMBLOCK_ORDER_LOW_TO_HIGH is true here if
>> MOVABLE_NODE is configured, and it will be reset after SRAT is parsed.
>> But movablenode_enable_srat can only be true when users specify the
>> movablenode boot option on the kernel command line.
>
> You are right.
>
> I mean the change should be:
>
> +#ifdef CONFIG_MOVABLE_NODE
> +       unsigned long kernel_end;
> +
> +       if (movablenode_enable_srat) {
>
> It is unnecessary to check memblock.current_order since it is always true
> if MOVABLE_NODE is configured and movablenode_enable_srat is true.
>

But memblock.current_order is set outside init_mem_mapping(), and the
path in the if statement can only run when the current order is
low-to-high. So I think it is safe to check it here.

I prefer to keep it, at least in the next version of the patch-set. If
others also think it is unnecessary, I'm OK with removing the check. :)

Thanks. :)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed.
       [not found]   ` <52299935.0302450a.26c9.ffffb240SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2013-09-06 15:15     ` Tejun Heo
  2013-09-06 15:47       ` H. Peter Anvin
       [not found]       ` <522db781.22ab440a.41b1.ffffd825SMTPIN_ADDED_BROKEN@mx.google.com>
  0 siblings, 2 replies; 32+ messages in thread
From: Tejun Heo @ 2013-09-06 15:15 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hello, Wanpeng.

On Fri, Sep 06, 2013 at 04:58:11PM +0800, Wanpeng Li wrote:
> What's the root reason memblock allocates from high to low? To reduce
> fragmentation, or ...

Because low memory tends to be more precious, it's just easier to pack
everything towards the top so that we don't have to worry about which
zone to use for allocation and fallback logic.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed.
  2013-09-06 15:15     ` Tejun Heo
@ 2013-09-06 15:47       ` H. Peter Anvin
       [not found]       ` <522db781.22ab440a.41b1.ffffd825SMTPIN_ADDED_BROKEN@mx.google.com>
  1 sibling, 0 replies; 32+ messages in thread
From: H. Peter Anvin @ 2013-09-06 15:47 UTC (permalink / raw)
  To: Tejun Heo, Wanpeng Li
  Cc: rjw, lenb, tglx, mingo, akpm, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	zhangyanfei, x86, linux-doc, linux-kernel, linux-mm, linux-acpi

Specifically there are a bunch of things which need to be below a certain address (which one varies.)

Tejun Heo <tj@kernel.org> wrote:
>Hello, Wanpeng.
>
>On Fri, Sep 06, 2013 at 04:58:11PM +0800, Wanpeng Li wrote:
>> What's the root reason memblock allocates from high to low? To reduce
>> fragmentation, or ...
>
>Because low memory tends to be more precious, it's just easier to pack
>everything towards the top so that we don't have to worry about which
>zone to use for allocation and fallback logic.
>
>Thanks.

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed.
       [not found]       ` <522db781.22ab440a.41b1.ffffd825SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2013-09-09 13:58         ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2013-09-09 13:58 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: rjw, lenb, tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, izumi.taku, mgorman, minchan,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner,
	prarit, zhangyanfei, x86, linux-doc, linux-kernel, linux-mm,
	linux-acpi

Hello,

On Mon, Sep 09, 2013 at 07:56:34PM +0800, Wanpeng Li wrote:
> Will allocating from low to high, as this patch-set does, occupy the
> precious memory you mentioned?

Yeah, and that'd be the reason why this behavior is dependent on a
kernel option.  That said, allocating some megs on top of the kernel
a big deal.  The wretched ISA DMA is mostly gone now and some megs
isn't gonna hurt 32bit DMAs in any noticeable way.  I wouldn't be too
surprised if nobody notices after switching the default behavior to
allocate early mem close to kernel.  Maybe the only case which might
be impacted is 32bit highmem configs, but they're messed up no matter
what anyway and even they shouldn't be affected noticeably if large
mapping is in use.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2013-09-09 13:58 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-27  9:37 [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed Tang Chen
2013-08-27  9:37 ` [PATCH 01/11] memblock: Rename current_limit to current_limit_high in memblock Tang Chen
2013-08-27  9:37 ` [PATCH 02/11] memblock: Rename memblock_set_current_limit() to memblock_set_current_limit_high() Tang Chen
2013-08-27  9:37 ` [PATCH 03/11] memblock: Introduce lowest limit in memblock Tang Chen
2013-08-27  9:37 ` [PATCH 04/11] memblock: Introduce memblock_set_current_limit_low() to set lower limit of memblock Tang Chen
2013-08-27  9:37 ` [PATCH 05/11] memblock: Introduce allocation order to memblock Tang Chen
     [not found]   ` <20130905091615.GB15294@hacker.(null)>
2013-09-05  9:21     ` Tang Chen
2013-08-27  9:37 ` [PATCH 06/11] memblock: Improve memblock to support allocation from lower address Tang Chen
2013-09-04  0:24   ` Toshi Kani
2013-09-04  1:00     ` Tang Chen
2013-08-27  9:37 ` [PATCH 07/11] x86, memblock: Set lowest limit for memblock_alloc_base_nid() Tang Chen
2013-09-04  0:37   ` Toshi Kani
2013-09-04  2:05     ` Tang Chen
2013-09-04 15:22       ` Toshi Kani
2013-08-27  9:37 ` [PATCH 08/11] x86, acpi, memblock: Use __memblock_alloc_base() in acpi_initrd_override() Tang Chen
2013-08-28  0:04   ` Rafael J. Wysocki
2013-08-27  9:37 ` [PATCH 09/11] mem-hotplug: Introduce movablenode boot option to {en|dis}able using SRAT Tang Chen
2013-08-27  9:37 ` [PATCH 10/11] x86, mem-hotplug: Support initialize page tables from low to high Tang Chen
     [not found]   ` <20130905133027.GA23038@hacker.(null)>
2013-09-06  1:34     ` Tang Chen
     [not found]       ` <20130906021653.GA1062@hacker.(null)>
2013-09-06  3:09         ` Tang Chen
2013-08-27  9:37 ` [PATCH 11/11] x86, mem_hotplug: Allocate memory near kernel image before SRAT is parsed Tang Chen
2013-09-04 19:40   ` Toshi Kani
     [not found] ` <20130828080311.GA608@hacker.(null)>
2013-08-28  9:34   ` [PATCH 00/11] x86, memblock: Allocate memory near kernel image before SRAT parsed Tang Chen
2013-08-28 15:19 ` Tejun Heo
2013-08-29  1:30   ` Tang Chen
     [not found]     ` <20130829013657.GA22599@hacker.(null)>
2013-08-29  1:53       ` Tang Chen
2013-09-02  1:03 ` Tang Chen
2013-09-04 19:22 ` Tejun Heo
2013-09-05  9:01   ` Tang Chen
     [not found]   ` <52299935.0302450a.26c9.ffffb240SMTPIN_ADDED_BROKEN@mx.google.com>
2013-09-06 15:15     ` Tejun Heo
2013-09-06 15:47       ` H. Peter Anvin
     [not found]       ` <522db781.22ab440a.41b1.ffffd825SMTPIN_ADDED_BROKEN@mx.google.com>
2013-09-09 13:58         ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).