linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
@ 2013-06-13 13:02 Tang Chen
  2013-06-13 13:02 ` [Part1 PATCH v5 01/22] x86: Change get_ramdisk_{image|size}() to global Tang Chen
                   ` (24 more replies)
  0 siblings, 25 replies; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:02 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm

From: Yinghai Lu <yinghai@kernel.org>

No offence, just rebase and resend the patches from Yinghai to help
to push this functionality faster.
Also improve the comments in the patches' log.


One commit that tried to parse SRAT early get reverted before v3.9-rc1.

| commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
| Author: Tang Chen <tangchen@cn.fujitsu.com>
| Date:   Fri Feb 22 16:33:44 2013 -0800
|
|    acpi, memory-hotplug: parse SRAT before memblock is ready

It broke several things, like acpi override and fall back path etc.

This patchset is clean implementation that will parse numa info early.
1. keep the acpi table initrd override working by split finding with copying.
   finding is done at head_32.S and head64.c stage,
        in head_32.S, initrd is accessed in 32bit flat mode with phys addr.
        in head64.c, initrd is accessed via kernel low mapping address
        with help of #PF set page table.
   copying is done with early_ioremap just after memblock is setup.
2. keep fallback path working. numaq and ACPI and amd_nmua and dummy.
   seperate initmem_init to two stages.
   early_initmem_init will only extract numa info early into numa_meminfo.
   initmem_init will keep slit and emulation handling.
3. keep other old code flow untouched like relocate_initrd and initmem_init.
   early_initmem_init will take old init_mem_mapping position.
   it call early_x86_numa_init and init_mem_mapping for every nodes.
   For 64bit, we avoid having size limit on initrd, as relocate_initrd
   is still after init_mem_mapping for all memory.
4. last patch will try to put page table on local node, so that memory
   hotplug will be happy.

In short, early_initmem_init will parse numa info early and call
init_mem_mapping to set page table for every nodes's mem.

could be found at:
        git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm

and it is based on today's Linus tree.

-v2: Address tj's review and split patches to small ones.
-v3: Add some Acked-by from tj, also stop abusing cpio_data for acpi_files info
-v4: fix one typo found by Tang Chen.
     Also added tested-by from Thomas Renninger and Tony.
-v5: Rebase to Linux-3.10.0-rc5 (patch 5 and 21 has been rebased)
     Improve comments in patches' log.
     Improve comment in init_mem_mapping() in patch21.

Yinghai Lu (22):
  x86: Change get_ramdisk_{image|size}() to global
  x86, microcode: Use common get_ramdisk_{image|size}()
  x86, ACPI, mm: Kill max_low_pfn_mapped
  x86, ACPI: Search buffer above 4GB in a second try for acpi initrd
    table override
  x86, ACPI: Increase acpi initrd override tables number limit
  x86, ACPI: Split acpi_initrd_override() into find/copy two steps
  x86, ACPI: Store override acpi tables phys addr in cpio files info
    array
  x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode
  x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c
  x86, mm, numa: Move two functions calling on successful path later
  x86, mm, numa: Call numa_meminfo_cover_memory() checking early
  x86, mm, numa: Move node_map_pfn_alignment() to x86
  x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment
  x86, mm, numa: Set memblock nid later
  x86, mm, numa: Move node_possible_map setting later
  x86, mm, numa: Move numa emulation handling down.
  x86, ACPI, numa, ia64: split SLIT handling out
  x86, mm, numa: Add early_initmem_init() stub
  x86, mm: Parse numa info earlier
  x86, mm: Add comments for step_size shift
  x86, mm: Make init_mem_mapping be able to be called several times
  x86, mm, numa: Put pagetable on local node ram for 64bit

 arch/ia64/kernel/setup.c                |    4 +-
 arch/x86/include/asm/acpi.h             |    3 +-
 arch/x86/include/asm/page_types.h       |    2 +-
 arch/x86/include/asm/pgtable.h          |    2 +-
 arch/x86/include/asm/setup.h            |    9 ++
 arch/x86/kernel/head64.c                |    2 +
 arch/x86/kernel/head_32.S               |    4 +
 arch/x86/kernel/microcode_intel_early.c |    8 +-
 arch/x86/kernel/setup.c                 |   86 +++++++-----
 arch/x86/mm/init.c                      |  121 +++++++++++-----
 arch/x86/mm/numa.c                      |  240 ++++++++++++++++++++++++-------
 arch/x86/mm/numa_emulation.c            |    2 +-
 arch/x86/mm/numa_internal.h             |    2 +
 arch/x86/mm/srat.c                      |   11 +-
 drivers/acpi/numa.c                     |   13 +-
 drivers/acpi/osl.c                      |  138 +++++++++++++------
 include/linux/acpi.h                    |   20 ++--
 include/linux/mm.h                      |    3 -
 mm/page_alloc.c                         |   52 +-------
 19 files changed, 476 insertions(+), 246 deletions(-)


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 01/22] x86: Change get_ramdisk_{image|size}() to global
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
@ 2013-06-13 13:02 ` Tang Chen
  2013-06-14 21:30   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-13 13:02 ` [Part1 PATCH v5 02/22] x86, microcode: Use common get_ramdisk_{image|size}() Tang Chen
                   ` (23 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:02 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm

From: Yinghai Lu <yinghai@kernel.org>

This patch does two things:
1. Change get_ramdisk_image() and get_ramdisk_size() to global.
2. Make get_ramdisk_image() and get_ramdisk_size() take a
   boot_params pointer parameter.

The whole patch-set tries to split ACPI initrd table override
procedure into two steps: finding and copying.
The finding step is done at head_32.S and head64.c stage. So we
need to call get_ramdisk_image() and get_ramdisk_size() in these
two files.

And also, in head_32.S, it can only access boot_params via physical
address during 32bit flat mode, so make get_ramdisk_image() and
get_ramdisk_size() take a boot_params pointer, so that we can pass
a physical address to code in head_32.S.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/include/asm/setup.h |    3 +++
 arch/x86/kernel/setup.c      |   28 ++++++++++++++--------------
 2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index b7bf350..4f71d48 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -106,6 +106,9 @@ void *extend_brk(size_t size, size_t align);
 	RESERVE_BRK(name, sizeof(type) * entries)
 
 extern void probe_roms(void);
+u64 get_ramdisk_image(struct boot_params *bp);
+u64 get_ramdisk_size(struct boot_params *bp);
+
 #ifdef __i386__
 
 void __init i386_start_kernel(void);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 56f7fcf..66ab495 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -297,19 +297,19 @@ static void __init reserve_brk(void)
 
 #ifdef CONFIG_BLK_DEV_INITRD
 
-static u64 __init get_ramdisk_image(void)
+u64 __init get_ramdisk_image(struct boot_params *bp)
 {
-	u64 ramdisk_image = boot_params.hdr.ramdisk_image;
+	u64 ramdisk_image = bp->hdr.ramdisk_image;
 
-	ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;
+	ramdisk_image |= (u64)bp->ext_ramdisk_image << 32;
 
 	return ramdisk_image;
 }
-static u64 __init get_ramdisk_size(void)
+u64 __init get_ramdisk_size(struct boot_params *bp)
 {
-	u64 ramdisk_size = boot_params.hdr.ramdisk_size;
+	u64 ramdisk_size = bp->hdr.ramdisk_size;
 
-	ramdisk_size |= (u64)boot_params.ext_ramdisk_size << 32;
+	ramdisk_size |= (u64)bp->ext_ramdisk_size << 32;
 
 	return ramdisk_size;
 }
@@ -318,8 +318,8 @@ static u64 __init get_ramdisk_size(void)
 static void __init relocate_initrd(void)
 {
 	/* Assume only end is not page aligned */
-	u64 ramdisk_image = get_ramdisk_image();
-	u64 ramdisk_size  = get_ramdisk_size();
+	u64 ramdisk_image = get_ramdisk_image(&boot_params);
+	u64 ramdisk_size  = get_ramdisk_size(&boot_params);
 	u64 area_size     = PAGE_ALIGN(ramdisk_size);
 	u64 ramdisk_here;
 	unsigned long slop, clen, mapaddr;
@@ -358,8 +358,8 @@ static void __init relocate_initrd(void)
 		ramdisk_size  -= clen;
 	}
 
-	ramdisk_image = get_ramdisk_image();
-	ramdisk_size  = get_ramdisk_size();
+	ramdisk_image = get_ramdisk_image(&boot_params);
+	ramdisk_size  = get_ramdisk_size(&boot_params);
 	printk(KERN_INFO "Move RAMDISK from [mem %#010llx-%#010llx] to"
 		" [mem %#010llx-%#010llx]\n",
 		ramdisk_image, ramdisk_image + ramdisk_size - 1,
@@ -369,8 +369,8 @@ static void __init relocate_initrd(void)
 static void __init early_reserve_initrd(void)
 {
 	/* Assume only end is not page aligned */
-	u64 ramdisk_image = get_ramdisk_image();
-	u64 ramdisk_size  = get_ramdisk_size();
+	u64 ramdisk_image = get_ramdisk_image(&boot_params);
+	u64 ramdisk_size  = get_ramdisk_size(&boot_params);
 	u64 ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);
 
 	if (!boot_params.hdr.type_of_loader ||
@@ -382,8 +382,8 @@ static void __init early_reserve_initrd(void)
 static void __init reserve_initrd(void)
 {
 	/* Assume only end is not page aligned */
-	u64 ramdisk_image = get_ramdisk_image();
-	u64 ramdisk_size  = get_ramdisk_size();
+	u64 ramdisk_image = get_ramdisk_image(&boot_params);
+	u64 ramdisk_size  = get_ramdisk_size(&boot_params);
 	u64 ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);
 	u64 mapped_size;
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 02/22] x86, microcode: Use common get_ramdisk_{image|size}()
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
  2013-06-13 13:02 ` [Part1 PATCH v5 01/22] x86: Change get_ramdisk_{image|size}() to global Tang Chen
@ 2013-06-13 13:02 ` Tang Chen
  2013-06-14 21:31   ` [tip:x86/mm] x86, microcode: Use common get_ramdisk_{image|size}( ) tip-bot for Yinghai Lu
  2013-06-13 13:02 ` [Part1 PATCH v5 03/22] x86, ACPI, mm: Kill max_low_pfn_mapped Tang Chen
                   ` (22 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:02 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm, Fenghua Yu

From: Yinghai Lu <yinghai@kernel.org>

In patch1, we change get_ramdisk_image() and get_ramdisk_size()
to global, so we can use them instead of using global variable
boot_params.

We need this to get correct ramdisk adress for 64bits bzImage
that initrd can be loaded above 4G by kexec-tools.

-v2: fix one typo that is found by Tang Chen

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/kernel/microcode_intel_early.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/microcode_intel_early.c b/arch/x86/kernel/microcode_intel_early.c
index 2e9e128..54575a9 100644
--- a/arch/x86/kernel/microcode_intel_early.c
+++ b/arch/x86/kernel/microcode_intel_early.c
@@ -743,8 +743,8 @@ load_ucode_intel_bsp(void)
 	struct boot_params *boot_params_p;
 
 	boot_params_p = (struct boot_params *)__pa_nodebug(&boot_params);
-	ramdisk_image = boot_params_p->hdr.ramdisk_image;
-	ramdisk_size  = boot_params_p->hdr.ramdisk_size;
+	ramdisk_image = get_ramdisk_image(boot_params_p);
+	ramdisk_size  = get_ramdisk_size(boot_params_p);
 	initrd_start_early = ramdisk_image;
 	initrd_end_early = initrd_start_early + ramdisk_size;
 
@@ -753,8 +753,8 @@ load_ucode_intel_bsp(void)
 		(unsigned long *)__pa_nodebug(&mc_saved_in_initrd),
 		initrd_start_early, initrd_end_early, &uci);
 #else
-	ramdisk_image = boot_params.hdr.ramdisk_image;
-	ramdisk_size  = boot_params.hdr.ramdisk_size;
+	ramdisk_image = get_ramdisk_image(&boot_params);
+	ramdisk_size  = get_ramdisk_size(&boot_params);
 	initrd_start_early = ramdisk_image + PAGE_OFFSET;
 	initrd_end_early = initrd_start_early + ramdisk_size;
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 03/22] x86, ACPI, mm: Kill max_low_pfn_mapped
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
  2013-06-13 13:02 ` [Part1 PATCH v5 01/22] x86: Change get_ramdisk_{image|size}() to global Tang Chen
  2013-06-13 13:02 ` [Part1 PATCH v5 02/22] x86, microcode: Use common get_ramdisk_{image|size}() Tang Chen
@ 2013-06-13 13:02 ` Tang Chen
  2013-06-14 21:31   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-17 21:04   ` [Part1 PATCH v5 03/22] " Tejun Heo
  2013-06-13 13:02 ` [Part1 PATCH v5 04/22] x86, ACPI: Search buffer above 4GB in a second try for acpi initrd table override Tang Chen
                   ` (21 subsequent siblings)
  24 siblings, 2 replies; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:02 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm, Rafael J. Wysocki,
	Jacob Shin, Pekka Enberg, linux-acpi

From: Yinghai Lu <yinghai@kernel.org>

Now we have pfn_mapped[] array, and max_low_pfn_mapped should not
be used anymore. Users should use pfn_mapped[] or just
1UL<<(32-PAGE_SHIFT) instead.

The only user of max_low_pfn_mapped is ACPI_INITRD_TABLE_OVERRIDE.
We could change to use 1U<<(32_PAGE_SHIFT) with it, aka under 4G.

Known problem:
There is another user of max_low_pfn_mapped: i915 device driver.
But the code is commented out by a pair of "#if 0 ... #endif".
Not sure why the driver developers want to do that.

-v2: Leave alone max_low_pfn_mapped in i915 code according to tj.

Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Jacob Shin <jacob.shin@amd.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-acpi@vger.kernel.org
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/include/asm/page_types.h |    1 -
 arch/x86/kernel/setup.c           |    4 +---
 arch/x86/mm/init.c                |    4 ----
 drivers/acpi/osl.c                |    6 +++---
 4 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index 54c9787..b012b82 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -43,7 +43,6 @@
 
 extern int devmem_is_allowed(unsigned long pagenr);
 
-extern unsigned long max_low_pfn_mapped;
 extern unsigned long max_pfn_mapped;
 
 static inline phys_addr_t get_max_mapped(void)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 66ab495..6ca5f2c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -112,13 +112,11 @@
 #include <asm/prom.h>
 
 /*
- * max_low_pfn_mapped: highest direct mapped pfn under 4GB
- * max_pfn_mapped:     highest direct mapped pfn over 4GB
+ * max_pfn_mapped:     highest direct mapped pfn
  *
  * The direct mapping only covers E820_RAM regions, so the ranges and gaps are
  * represented by pfn_mapped
  */
-unsigned long max_low_pfn_mapped;
 unsigned long max_pfn_mapped;
 
 #ifdef CONFIG_DMI
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index eaac174..8554656 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -313,10 +313,6 @@ static void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
 	nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_X_MAX);
 
 	max_pfn_mapped = max(max_pfn_mapped, end_pfn);
-
-	if (start_pfn < (1UL<<(32-PAGE_SHIFT)))
-		max_low_pfn_mapped = max(max_low_pfn_mapped,
-					 min(end_pfn, 1UL<<(32-PAGE_SHIFT)));
 }
 
 bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index e721863..93e3194 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -624,9 +624,9 @@ void __init acpi_initrd_override(void *data, size_t size)
 	if (table_nr == 0)
 		return;
 
-	acpi_tables_addr =
-		memblock_find_in_range(0, max_low_pfn_mapped << PAGE_SHIFT,
-				       all_tables_size, PAGE_SIZE);
+	/* under 4G at first, then above 4G */
+	acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
+					all_tables_size, PAGE_SIZE);
 	if (!acpi_tables_addr) {
 		WARN_ON(1);
 		return;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 04/22] x86, ACPI: Search buffer above 4GB in a second try for acpi initrd table override
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (2 preceding siblings ...)
  2013-06-13 13:02 ` [Part1 PATCH v5 03/22] x86, ACPI, mm: Kill max_low_pfn_mapped Tang Chen
@ 2013-06-13 13:02 ` Tang Chen
  2013-06-14 21:31   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-17 21:06   ` [Part1 PATCH v5 04/22] " Tejun Heo
  2013-06-13 13:02 ` [Part1 PATCH v5 05/22] x86, ACPI: Increase acpi initrd override tables number limit Tang Chen
                   ` (20 subsequent siblings)
  24 siblings, 2 replies; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:02 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm, Rafael J. Wysocki, linux-acpi

From: Yinghai Lu <yinghai@kernel.org>

Now we only search buffer for new acpi tables in initrd under
4GB. In some case, like user use memmap to exclude all low ram,
we may not find range for it under 4GB. So do second try to
search for buffer above 4GB.

Since later accessing to the tables is using early_ioremap(),
using memory above 4GB is OK.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: linux-acpi@vger.kernel.org
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 drivers/acpi/osl.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 93e3194..42c48fc 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -627,6 +627,10 @@ void __init acpi_initrd_override(void *data, size_t size)
 	/* under 4G at first, then above 4G */
 	acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
 					all_tables_size, PAGE_SIZE);
+	if (!acpi_tables_addr)
+		acpi_tables_addr = memblock_find_in_range(0,
+					~(phys_addr_t)0,
+					all_tables_size, PAGE_SIZE);
 	if (!acpi_tables_addr) {
 		WARN_ON(1);
 		return;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 05/22] x86, ACPI: Increase acpi initrd override tables number limit
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (3 preceding siblings ...)
  2013-06-13 13:02 ` [Part1 PATCH v5 04/22] x86, ACPI: Search buffer above 4GB in a second try for acpi initrd table override Tang Chen
@ 2013-06-13 13:02 ` Tang Chen
  2013-06-14 21:31   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-13 13:02 ` [Part1 PATCH v5 06/22] x86, ACPI: Split acpi_initrd_override() into find/copy two steps Tang Chen
                   ` (19 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:02 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm, Rafael J. Wysocki, linux-acpi

From: Yinghai Lu <yinghai@kernel.org>

Current number of acpi tables in initrd is limited to 10, which is
too small. 64 would be good enough as we have 35 sigs and could
have several SSDTs.

Two problems in current code prevent us from increasing the 10 tables limit:
1. cpio file info array is put in stack, as every element is 32 bytes, we
   could run out of stack if we increase the array size to 64.
   So we can move it out from stack, and make it global and put it in
   __initdata section.
2. early_ioremap only can remap 256kb one time. Current code is mapping
   10 tables one time. If we increase that limit, the whole size could be
   more than 256kb, and early_ioremap will fail.
   So we can map the tables one by one during copying, instead of mapping
   all of them at one time.

-v2: According to tj, split it out to separated patch, also
     rename array name to acpi_initrd_files.
-v3: Add some comments about mapping table one by one during copying
     per tj.

Signed-off-by: Yinghai <yinghai@kernel.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: linux-acpi@vger.kernel.org
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 drivers/acpi/osl.c |   26 +++++++++++++++-----------
 1 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 42c48fc..53dd490 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -569,8 +569,8 @@ static const char * const table_sigs[] = {
 
 #define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)
 
-/* Must not increase 10 or needs code modification below */
-#define ACPI_OVERRIDE_TABLES 10
+#define ACPI_OVERRIDE_TABLES 64
+static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
 
 void __init acpi_initrd_override(void *data, size_t size)
 {
@@ -579,7 +579,6 @@ void __init acpi_initrd_override(void *data, size_t size)
 	struct acpi_table_header *table;
 	char cpio_path[32] = "kernel/firmware/acpi/";
 	struct cpio_data file;
-	struct cpio_data early_initrd_files[ACPI_OVERRIDE_TABLES];
 	char *p;
 
 	if (data == NULL || size == 0)
@@ -617,8 +616,8 @@ void __init acpi_initrd_override(void *data, size_t size)
 			table->signature, cpio_path, file.name, table->length);
 
 		all_tables_size += table->length;
-		early_initrd_files[table_nr].data = file.data;
-		early_initrd_files[table_nr].size = file.size;
+		acpi_initrd_files[table_nr].data = file.data;
+		acpi_initrd_files[table_nr].size = file.size;
 		table_nr++;
 	}
 	if (table_nr == 0)
@@ -648,14 +647,19 @@ void __init acpi_initrd_override(void *data, size_t size)
 	memblock_reserve(acpi_tables_addr, all_tables_size);
 	arch_reserve_mem_area(acpi_tables_addr, all_tables_size);
 
-	p = early_ioremap(acpi_tables_addr, all_tables_size);
-
+	/*
+	 * early_ioremap can only remap 256KB at one time. If we map all the
+	 * tables at one time, we will hit the limit. So we need to map tables
+	 * one by one during copying.
+	 */
 	for (no = 0; no < table_nr; no++) {
-		memcpy(p + total_offset, early_initrd_files[no].data,
-		       early_initrd_files[no].size);
-		total_offset += early_initrd_files[no].size;
+		phys_addr_t size = acpi_initrd_files[no].size;
+
+		p = early_ioremap(acpi_tables_addr + total_offset, size);
+		memcpy(p, acpi_initrd_files[no].data, size);
+		early_iounmap(p, size);
+		total_offset += size;
 	}
-	early_iounmap(p, all_tables_size);
 }
 #endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 06/22] x86, ACPI: Split acpi_initrd_override() into find/copy two steps
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (4 preceding siblings ...)
  2013-06-13 13:02 ` [Part1 PATCH v5 05/22] x86, ACPI: Increase acpi initrd override tables number limit Tang Chen
@ 2013-06-13 13:02 ` Tang Chen
  2013-06-14 21:31   ` [tip:x86/mm] x86, ACPI: Split acpi_initrd_override() into find/ copy " tip-bot for Yinghai Lu
  2013-06-13 13:02 ` [Part1 PATCH v5 07/22] x86, ACPI: Store override acpi tables phys addr in cpio files info array Tang Chen
                   ` (18 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:02 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm, Pekka Enberg, Jacob Shin,
	Rafael J. Wysocki, linux-acpi

From: Yinghai Lu <yinghai@kernel.org>

To parse SRAT before memblock starts to work, we need to move acpi table
probing procedure earlier. But acpi_initrd_table_override procedure must
be executed before acpi table probing. So we need to move it earlier too,
which means to move acpi_initrd_table_override procedure before memblock
starts to work.

But acpi_initrd_table_override procedure needs memblock to allocate buffer
for ACPI tables. To solve this problem, we need to split acpi_initrd_override()
procedure into two steps: finding and copying.
Find should be as early as possible. Copy should be after memblock is ready.

Currently, acpi_initrd_table_override procedure is executed after
init_mem_mapping() and relocate_initrd(), so it can scan initrd and copy
acpi tables with kernel virtual addresses of initrd.

Once we split it into finding and copying steps, it could be done like the
following:

Finding could be done in head_32.S and head64.c, just like microcode early
scanning. In head_32.S, it is 32bit flat mode, we don't need to setup page
table to access it. In head64.c, #PF set page table could help us to access
initrd with kernel low mapping addresses.

Copying need to be done just after memblock is ready, because it needs to
allocate buffer for new acpi tables with memblock.
Also it should be done before probing acpi tables, and we need early_ioremap
to access source and target ranges, as init_mem_mapping is not called yet.

While a dummy version of acpi_initrd_override() was defined when
!CONFIG_ACPI_INITRD_TABLE_OVERRIDE, the prototype and dummy version were
conditionalized inside CONFIG_ACPI. This forced setup_arch() to have its own
#ifdefs around acpi_initrd_override() as otherwise build would fail when
!CONFIG_ACPI. Move the prototypes and dummy implementations of the newly
split functions out of CONFIG_ACPI block in acpi.h so that we can throw away
the #ifdefs from its users.

-v2: Split one patch out according to tj.
     also don't pass table_nr around.
-v3: Add Tj's changelog about moving down to #idef in acpi.h to
     avoid #idef in setup.c

Signed-off-by: Yinghai <yinghai@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Jacob Shin <jacob.shin@amd.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: linux-acpi@vger.kernel.org
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/kernel/setup.c |    6 +++---
 drivers/acpi/osl.c      |   18 +++++++++++++-----
 include/linux/acpi.h    |   16 ++++++++--------
 3 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 6ca5f2c..42f584c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1119,9 +1119,9 @@ void __init setup_arch(char **cmdline_p)
 
 	reserve_initrd();
 
-#if defined(CONFIG_ACPI) && defined(CONFIG_BLK_DEV_INITRD)
-	acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
-#endif
+	acpi_initrd_override_find((void *)initrd_start,
+					initrd_end - initrd_start);
+	acpi_initrd_override_copy();
 
 	reserve_crashkernel();
 
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 53dd490..6ab6c54 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -572,14 +572,13 @@ static const char * const table_sigs[] = {
 #define ACPI_OVERRIDE_TABLES 64
 static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
 
-void __init acpi_initrd_override(void *data, size_t size)
+void __init acpi_initrd_override_find(void *data, size_t size)
 {
-	int sig, no, table_nr = 0, total_offset = 0;
+	int sig, no, table_nr = 0;
 	long offset = 0;
 	struct acpi_table_header *table;
 	char cpio_path[32] = "kernel/firmware/acpi/";
 	struct cpio_data file;
-	char *p;
 
 	if (data == NULL || size == 0)
 		return;
@@ -620,7 +619,14 @@ void __init acpi_initrd_override(void *data, size_t size)
 		acpi_initrd_files[table_nr].size = file.size;
 		table_nr++;
 	}
-	if (table_nr == 0)
+}
+
+void __init acpi_initrd_override_copy(void)
+{
+	int no, total_offset = 0;
+	char *p;
+
+	if (!all_tables_size)
 		return;
 
 	/* under 4G at first, then above 4G */
@@ -652,9 +658,11 @@ void __init acpi_initrd_override(void *data, size_t size)
 	 * tables at one time, we will hit the limit. So we need to map tables
 	 * one by one during copying.
 	 */
-	for (no = 0; no < table_nr; no++) {
+	for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
 		phys_addr_t size = acpi_initrd_files[no].size;
 
+		if (!size)
+			break;
 		p = early_ioremap(acpi_tables_addr + total_offset, size);
 		memcpy(p, acpi_initrd_files[no].data, size);
 		early_iounmap(p, size);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 17b5b59..8dd917b 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -79,14 +79,6 @@ typedef int (*acpi_tbl_table_handler)(struct acpi_table_header *table);
 typedef int (*acpi_tbl_entry_handler)(struct acpi_subtable_header *header,
 				      const unsigned long end);
 
-#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
-void acpi_initrd_override(void *data, size_t size);
-#else
-static inline void acpi_initrd_override(void *data, size_t size)
-{
-}
-#endif
-
 char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
 void __acpi_unmap_table(char *map, unsigned long size);
 int early_acpi_boot_init(void);
@@ -476,6 +468,14 @@ static inline bool acpi_driver_match_device(struct device *dev,
 
 #endif	/* !CONFIG_ACPI */
 
+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void acpi_initrd_override_find(void *data, size_t size);
+void acpi_initrd_override_copy(void);
+#else
+static inline void acpi_initrd_override_find(void *data, size_t size) { }
+static inline void acpi_initrd_override_copy(void) { }
+#endif
+
 #ifdef CONFIG_ACPI
 void acpi_os_set_prepare_sleep(int (*func)(u8 sleep_state,
 			       u32 pm1a_ctrl,  u32 pm1b_ctrl));
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 07/22] x86, ACPI: Store override acpi tables phys addr in cpio files info array
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (5 preceding siblings ...)
  2013-06-13 13:02 ` [Part1 PATCH v5 06/22] x86, ACPI: Split acpi_initrd_override() into find/copy two steps Tang Chen
@ 2013-06-13 13:02 ` Tang Chen
  2013-06-14 21:31   ` [tip:x86/mm] " tip-bot for Yinghai Lu
                     ` (2 more replies)
  2013-06-13 13:02 ` [Part1 PATCH v5 08/22] x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode Tang Chen
                   ` (17 subsequent siblings)
  24 siblings, 3 replies; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:02 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm, Rafael J. Wysocki, linux-acpi

From: Yinghai Lu <yinghai@kernel.org>

This patch introduces a file_pos struct to store physaddr. And then changes
acpi_initrd_files[] to file_pos type. Then store physaddr of ACPI tables
in acpi_initrd_files[].

For finding, we will find ACPI tables with physaddr during 32bit flat mode
in head_32.S, because at that time we don't need to setup page table to
access initrd.

For copying, we could use early_ioremap() with physaddr directly before
memory mapping is set.

To keep 32bit and 64bit platforms consistent, use phys_addr for all.

-v2: introduce file_pos to save physaddr instead of abusing cpio_data
     which tj is not happy with.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: linux-acpi@vger.kernel.org
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 drivers/acpi/osl.c |   15 +++++++++++----
 1 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 6ab6c54..42f79e3 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -570,7 +570,11 @@ static const char * const table_sigs[] = {
 #define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)
 
 #define ACPI_OVERRIDE_TABLES 64
-static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
+struct file_pos {
+	phys_addr_t data;
+	phys_addr_t size;
+};
+static struct file_pos __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
 
 void __init acpi_initrd_override_find(void *data, size_t size)
 {
@@ -615,7 +619,7 @@ void __init acpi_initrd_override_find(void *data, size_t size)
 			table->signature, cpio_path, file.name, table->length);
 
 		all_tables_size += table->length;
-		acpi_initrd_files[table_nr].data = file.data;
+		acpi_initrd_files[table_nr].data = __pa_nodebug(file.data);
 		acpi_initrd_files[table_nr].size = file.size;
 		table_nr++;
 	}
@@ -624,7 +628,7 @@ void __init acpi_initrd_override_find(void *data, size_t size)
 void __init acpi_initrd_override_copy(void)
 {
 	int no, total_offset = 0;
-	char *p;
+	char *p, *q;
 
 	if (!all_tables_size)
 		return;
@@ -659,12 +663,15 @@ void __init acpi_initrd_override_copy(void)
 	 * one by one during copying.
 	 */
 	for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
+		phys_addr_t addr = acpi_initrd_files[no].data;
 		phys_addr_t size = acpi_initrd_files[no].size;
 
 		if (!size)
 			break;
+		q = early_ioremap(addr, size);
 		p = early_ioremap(acpi_tables_addr + total_offset, size);
-		memcpy(p, acpi_initrd_files[no].data, size);
+		memcpy(p, q, size);
+		early_iounmap(q, size);
 		early_iounmap(p, size);
 		total_offset += size;
 	}
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 08/22] x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (6 preceding siblings ...)
  2013-06-13 13:02 ` [Part1 PATCH v5 07/22] x86, ACPI: Store override acpi tables phys addr in cpio files info array Tang Chen
@ 2013-06-13 13:02 ` Tang Chen
  2013-06-14 21:31   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-18  0:07   ` [Part1 PATCH v5 08/22] " Tejun Heo
  2013-06-13 13:02 ` [Part1 PATCH v5 09/22] x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c Tang Chen
                   ` (16 subsequent siblings)
  24 siblings, 2 replies; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:02 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm, Pekka Enberg, Jacob Shin,
	Rafael J. Wysocki, linux-acpi

From: Yinghai Lu <yinghai@kernel.org>

For finding procedure, it would be easy to access initrd in 32bit flat
mode, as we don't need to setup page table. That is from head_32.S, and
microcode updating already use this trick.

This patch does the following:

1. Change acpi_initrd_override_find to use phys to access global variables.

2. Pass a bool parameter "is_phys" to acpi_initrd_override_find() because
   we cannot tell if it is a pa or a va through the address itself with
   32bit. Boot loader could load initrd above max_low_pfn.

3. Put table_sigs[] on stack, otherwise it is too messy to change string
   array to physaddr and still keep offset calculating correct. The size is
   about 36x4 bytes, and it is small to settle in stack.

4. Also rewrite the MACRO INVALID_TABLE to be in a do {...} while(0) loop
   so that it is more readable.

NOTE: Don't call printk as it uses global variables, so delay print
      during copying.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Jacob Shin <jacob.shin@amd.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: linux-acpi@vger.kernel.org
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/kernel/setup.c |    2 +-
 drivers/acpi/osl.c      |   85 ++++++++++++++++++++++++++++++++--------------
 include/linux/acpi.h    |    5 ++-
 3 files changed, 63 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 42f584c..142e042 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1120,7 +1120,7 @@ void __init setup_arch(char **cmdline_p)
 	reserve_initrd();
 
 	acpi_initrd_override_find((void *)initrd_start,
-					initrd_end - initrd_start);
+					initrd_end - initrd_start, false);
 	acpi_initrd_override_copy();
 
 	reserve_crashkernel();
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 42f79e3..23578e8 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -551,21 +551,9 @@ u8 __init acpi_table_checksum(u8 *buffer, u32 length)
 	return sum;
 }
 
-/* All but ACPI_SIG_RSDP and ACPI_SIG_FACS: */
-static const char * const table_sigs[] = {
-	ACPI_SIG_BERT, ACPI_SIG_CPEP, ACPI_SIG_ECDT, ACPI_SIG_EINJ,
-	ACPI_SIG_ERST, ACPI_SIG_HEST, ACPI_SIG_MADT, ACPI_SIG_MSCT,
-	ACPI_SIG_SBST, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_ASF,
-	ACPI_SIG_BOOT, ACPI_SIG_DBGP, ACPI_SIG_DMAR, ACPI_SIG_HPET,
-	ACPI_SIG_IBFT, ACPI_SIG_IVRS, ACPI_SIG_MCFG, ACPI_SIG_MCHI,
-	ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
-	ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
-	ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
-	ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };
-
 /* Non-fatal errors: Affected tables/files are ignored */
 #define INVALID_TABLE(x, path, name)					\
-	{ pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); continue; }
+	do { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); } while (0)
 
 #define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)
 
@@ -576,17 +564,45 @@ struct file_pos {
 };
 static struct file_pos __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
 
-void __init acpi_initrd_override_find(void *data, size_t size)
+/*
+ * acpi_initrd_override_find() is called from head_32.S and head64.c.
+ * head_32.S calling path is with 32bit flat mode, so we can access
+ * initrd early without setting pagetable or relocating initrd. For
+ * global variables accessing, we need to use phys address instead of
+ * kernel virtual address, try to put table_sigs string array in stack,
+ * so avoid switching for it.
+ * Also don't call printk as it uses global variables.
+ */
+void __init acpi_initrd_override_find(void *data, size_t size, bool is_phys)
 {
 	int sig, no, table_nr = 0;
 	long offset = 0;
 	struct acpi_table_header *table;
 	char cpio_path[32] = "kernel/firmware/acpi/";
 	struct cpio_data file;
+	struct file_pos *files = acpi_initrd_files;
+	int *all_tables_size_p = &all_tables_size;
+
+	/* All but ACPI_SIG_RSDP and ACPI_SIG_FACS: */
+	char *table_sigs[] = {
+		ACPI_SIG_BERT, ACPI_SIG_CPEP, ACPI_SIG_ECDT, ACPI_SIG_EINJ,
+		ACPI_SIG_ERST, ACPI_SIG_HEST, ACPI_SIG_MADT, ACPI_SIG_MSCT,
+		ACPI_SIG_SBST, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_ASF,
+		ACPI_SIG_BOOT, ACPI_SIG_DBGP, ACPI_SIG_DMAR, ACPI_SIG_HPET,
+		ACPI_SIG_IBFT, ACPI_SIG_IVRS, ACPI_SIG_MCFG, ACPI_SIG_MCHI,
+		ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
+		ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
+		ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
+		ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };
 
 	if (data == NULL || size == 0)
 		return;
 
+	if (is_phys) {
+		files = (struct file_pos *)__pa_symbol(acpi_initrd_files);
+		all_tables_size_p = (int *)__pa_symbol(&all_tables_size);
+	}
+
 	for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
 		file = find_cpio_data(cpio_path, data, size, &offset);
 		if (!file.data)
@@ -595,9 +611,12 @@ void __init acpi_initrd_override_find(void *data, size_t size)
 		data += offset;
 		size -= offset;
 
-		if (file.size < sizeof(struct acpi_table_header))
-			INVALID_TABLE("Table smaller than ACPI header",
+		if (file.size < sizeof(struct acpi_table_header)) {
+			if (!is_phys)
+				INVALID_TABLE("Table smaller than ACPI header",
 				      cpio_path, file.name);
+			continue;
+		}
 
 		table = file.data;
 
@@ -605,22 +624,33 @@ void __init acpi_initrd_override_find(void *data, size_t size)
 			if (!memcmp(table->signature, table_sigs[sig], 4))
 				break;
 
-		if (!table_sigs[sig])
-			INVALID_TABLE("Unknown signature",
+		if (!table_sigs[sig]) {
+			if (!is_phys)
+				 INVALID_TABLE("Unknown signature",
 				      cpio_path, file.name);
-		if (file.size != table->length)
-			INVALID_TABLE("File length does not match table length",
+			continue;
+		}
+		if (file.size != table->length) {
+			if (!is_phys)
+				INVALID_TABLE("File length does not match table length",
 				      cpio_path, file.name);
-		if (acpi_table_checksum(file.data, table->length))
-			INVALID_TABLE("Bad table checksum",
+			continue;
+		}
+		if (acpi_table_checksum(file.data, table->length)) {
+			if (!is_phys)
+				INVALID_TABLE("Bad table checksum",
 				      cpio_path, file.name);
+			continue;
+		}
 
-		pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
+		if (!is_phys)
+			pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
 			table->signature, cpio_path, file.name, table->length);
 
-		all_tables_size += table->length;
-		acpi_initrd_files[table_nr].data = __pa_nodebug(file.data);
-		acpi_initrd_files[table_nr].size = file.size;
+		(*all_tables_size_p) += table->length;
+		files[table_nr].data = is_phys ? (phys_addr_t)file.data :
+						  __pa_nodebug(file.data);
+		files[table_nr].size = file.size;
 		table_nr++;
 	}
 }
@@ -670,6 +700,9 @@ void __init acpi_initrd_override_copy(void)
 			break;
 		q = early_ioremap(addr, size);
 		p = early_ioremap(acpi_tables_addr + total_offset, size);
+		pr_info("%4.4s ACPI table found in initrd [%#010llx-%#010llx]\n",
+				((struct acpi_table_header *)q)->signature,
+				(u64)addr, (u64)(addr + size - 1));
 		memcpy(p, q, size);
 		early_iounmap(q, size);
 		early_iounmap(p, size);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 8dd917b..4e3731b 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -469,10 +469,11 @@ static inline bool acpi_driver_match_device(struct device *dev,
 #endif	/* !CONFIG_ACPI */
 
 #ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
-void acpi_initrd_override_find(void *data, size_t size);
+void acpi_initrd_override_find(void *data, size_t size, bool is_phys);
 void acpi_initrd_override_copy(void);
 #else
-static inline void acpi_initrd_override_find(void *data, size_t size) { }
+static inline void acpi_initrd_override_find(void *data, size_t size,
+						 bool is_phys) { }
 static inline void acpi_initrd_override_copy(void) { }
 #endif
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 09/22] x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (7 preceding siblings ...)
  2013-06-13 13:02 ` [Part1 PATCH v5 08/22] x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode Tang Chen
@ 2013-06-13 13:02 ` Tang Chen
  2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-18  0:33   ` [Part1 PATCH v5 09/22] " Tejun Heo
  2013-06-13 13:02 ` [Part1 PATCH v5 10/22] x86, mm, numa: Move two functions calling on successful path later Tang Chen
                   ` (15 subsequent siblings)
  24 siblings, 2 replies; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:02 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm, Pekka Enberg, Jacob Shin,
	Rafael J. Wysocki, linux-acpi

From: Yinghai Lu <yinghai@kernel.org>

head64.c could use #PF handler setup page table to access initrd before
init mem mapping and initrd relocating.

head_32.S could use 32bit flat mode to access initrd before init mem
mapping initrd relocating.

This patch introduces x86_acpi_override_find(), which is called from
head_32.S/head64.c, to replace acpi_initrd_override_find(). So that we
can makes 32bit and 64 bit more consistent.

-v2: use inline function in header file instead according to tj.
     also still need to keep #idef head_32.S to avoid compiling error.
-v3: need to move down reserve_initrd() after acpi_initrd_override_copy(),
     to make sure we are using right address.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Jacob Shin <jacob.shin@amd.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: linux-acpi@vger.kernel.org
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/include/asm/setup.h |    6 ++++++
 arch/x86/kernel/head64.c     |    2 ++
 arch/x86/kernel/head_32.S    |    4 ++++
 arch/x86/kernel/setup.c      |   34 ++++++++++++++++++++++++++++++----
 4 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 4f71d48..6f885b7 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -42,6 +42,12 @@ extern void visws_early_detect(void);
 static inline void visws_early_detect(void) { }
 #endif
 
+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void x86_acpi_override_find(void);
+#else
+static inline void x86_acpi_override_find(void) { }
+#endif
+
 extern unsigned long saved_video_mode;
 
 extern void reserve_standard_io_resources(void);
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 55b6761..229b281 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -175,6 +175,8 @@ void __init x86_64_start_kernel(char * real_mode_data)
 	if (console_loglevel == 10)
 		early_printk("Kernel alive\n");
 
+	x86_acpi_override_find();
+
 	clear_page(init_level4_pgt);
 	/* set init_level4_pgt kernel high mapping*/
 	init_level4_pgt[511] = early_level4_pgt[511];
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 73afd11..ca08f0e 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -149,6 +149,10 @@ ENTRY(startup_32)
 	call load_ucode_bsp
 #endif
 
+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+	call x86_acpi_override_find
+#endif
+
 /*
  * Initialize page tables.  This creates a PDE and a set of page
  * tables, which are located immediately beyond __brk_base.  The variable
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 142e042..d11b1b7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -421,6 +421,34 @@ static void __init reserve_initrd(void)
 }
 #endif /* CONFIG_BLK_DEV_INITRD */
 
+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void __init x86_acpi_override_find(void)
+{
+	unsigned long ramdisk_image, ramdisk_size;
+	unsigned char *p = NULL;
+
+#ifdef CONFIG_X86_32
+	struct boot_params *boot_params_p;
+
+	/*
+	 * 32bit is from head_32.S, and it is 32bit flat mode.
+	 * So need to use phys address to access global variables.
+	 */
+	boot_params_p = (struct boot_params *)__pa_nodebug(&boot_params);
+	ramdisk_image = get_ramdisk_image(boot_params_p);
+	ramdisk_size  = get_ramdisk_size(boot_params_p);
+	p = (unsigned char *)ramdisk_image;
+	acpi_initrd_override_find(p, ramdisk_size, true);
+#else
+	ramdisk_image = get_ramdisk_image(&boot_params);
+	ramdisk_size  = get_ramdisk_size(&boot_params);
+	if (ramdisk_image)
+		p = __va(ramdisk_image);
+	acpi_initrd_override_find(p, ramdisk_size, false);
+#endif
+}
+#endif
+
 static void __init parse_setup_data(void)
 {
 	struct setup_data *data;
@@ -1117,12 +1145,10 @@ void __init setup_arch(char **cmdline_p)
 	/* Allocate bigger log buffer */
 	setup_log_buf(1);
 
-	reserve_initrd();
-
-	acpi_initrd_override_find((void *)initrd_start,
-					initrd_end - initrd_start, false);
 	acpi_initrd_override_copy();
 
+	reserve_initrd();
+
 	reserve_crashkernel();
 
 	vsmp_init();
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 10/22] x86, mm, numa: Move two functions calling on successful path later
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (8 preceding siblings ...)
  2013-06-13 13:02 ` [Part1 PATCH v5 09/22] x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c Tang Chen
@ 2013-06-13 13:02 ` Tang Chen
  2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-18  0:53   ` [Part1 PATCH v5 10/22] " Tejun Heo
  2013-06-13 13:02 ` [Part1 PATCH v5 11/22] x86, mm, numa: Call numa_meminfo_cover_memory() checking early Tang Chen
                   ` (14 subsequent siblings)
  24 siblings, 2 replies; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:02 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm

From: Yinghai Lu <yinghai@kernel.org>

We need to have numa info ready before init_mem_mappingi(), so that we
can call init_mem_mapping per node, and alse trim node memory ranges to
big alignment.

Currently, parsing numa info needs to allocate some buffer and need to be
called after init_mem_mapping. So try to split parsing numa info procedure
into two steps:
	- The first step will be called before init_mem_mapping, and it
	  should not need allocate buffers.
	- The second step will cantain all the buffer related code and be
	  executed later.

At last we will have early_initmem_init() and initmem_init().

This patch implements only the first step.

setup_node_data() and numa_init_array() are only called for successful
path, so we can move these two callings to x86_numa_init(). That will also
make numa_init() smaller and more readable.

-v2: remove online_node_map clear in numa_init(), as it is only
     set in setup_node_data() at last in successful path.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/numa.c |   69 +++++++++++++++++++++++++++++----------------------
 1 files changed, 39 insertions(+), 30 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index a71c4e2..07ae800 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -477,7 +477,7 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
 	unsigned long uninitialized_var(pfn_align);
-	int i, nid;
+	int i;
 
 	/* Account for nodes with cpus and no memory */
 	node_possible_map = numa_nodes_parsed;
@@ -506,24 +506,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 	if (!numa_meminfo_cover_memory(mi))
 		return -EINVAL;
 
-	/* Finally register nodes. */
-	for_each_node_mask(nid, node_possible_map) {
-		u64 start = PFN_PHYS(max_pfn);
-		u64 end = 0;
-
-		for (i = 0; i < mi->nr_blks; i++) {
-			if (nid != mi->blk[i].nid)
-				continue;
-			start = min(mi->blk[i].start, start);
-			end = max(mi->blk[i].end, end);
-		}
-
-		if (start < end)
-			setup_node_data(nid, start, end);
-	}
-
-	/* Dump memblock with node info and return. */
-	memblock_dump_all();
 	return 0;
 }
 
@@ -559,7 +541,6 @@ static int __init numa_init(int (*init_func)(void))
 
 	nodes_clear(numa_nodes_parsed);
 	nodes_clear(node_possible_map);
-	nodes_clear(node_online_map);
 	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
 	WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
 	numa_reset_distance();
@@ -577,15 +558,6 @@ static int __init numa_init(int (*init_func)(void))
 	if (ret < 0)
 		return ret;
 
-	for (i = 0; i < nr_cpu_ids; i++) {
-		int nid = early_cpu_to_node(i);
-
-		if (nid == NUMA_NO_NODE)
-			continue;
-		if (!node_online(nid))
-			numa_clear_node(i);
-	}
-	numa_init_array();
 	return 0;
 }
 
@@ -618,7 +590,7 @@ static int __init dummy_numa_init(void)
  * last fallback is dummy single node config encomapssing whole memory and
  * never fails.
  */
-void __init x86_numa_init(void)
+static void __init early_x86_numa_init(void)
 {
 	if (!numa_off) {
 #ifdef CONFIG_X86_NUMAQ
@@ -638,6 +610,43 @@ void __init x86_numa_init(void)
 	numa_init(dummy_numa_init);
 }
 
+void __init x86_numa_init(void)
+{
+	int i, nid;
+	struct numa_meminfo *mi = &numa_meminfo;
+
+	early_x86_numa_init();
+
+	/* Finally register nodes. */
+	for_each_node_mask(nid, node_possible_map) {
+		u64 start = PFN_PHYS(max_pfn);
+		u64 end = 0;
+
+		for (i = 0; i < mi->nr_blks; i++) {
+			if (nid != mi->blk[i].nid)
+				continue;
+			start = min(mi->blk[i].start, start);
+			end = max(mi->blk[i].end, end);
+		}
+
+		if (start < end)
+			setup_node_data(nid, start, end); /* online is set */
+	}
+
+	/* Dump memblock with node info */
+	memblock_dump_all();
+
+	for (i = 0; i < nr_cpu_ids; i++) {
+		int nid = early_cpu_to_node(i);
+
+		if (nid == NUMA_NO_NODE)
+			continue;
+		if (!node_online(nid))
+			numa_clear_node(i);
+	}
+	numa_init_array();
+}
+
 static __init int find_near_online_node(int node)
 {
 	int n, val;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 11/22] x86, mm, numa: Call numa_meminfo_cover_memory() checking early
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (9 preceding siblings ...)
  2013-06-13 13:02 ` [Part1 PATCH v5 10/22] x86, mm, numa: Move two functions calling on successful path later Tang Chen
@ 2013-06-13 13:02 ` Tang Chen
  2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-18  1:05   ` [Part1 PATCH v5 11/22] " Tejun Heo
  2013-06-13 13:02 ` [Part1 PATCH v5 12/22] x86, mm, numa: Move node_map_pfn_alignment() to x86 Tang Chen
                   ` (13 subsequent siblings)
  24 siblings, 2 replies; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:02 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm

From: Yinghai Lu <yinghai@kernel.org>

In order to seperate parsing numa info procedure into two steps,
we need to set memblock nid later, as it could change memblock
array, and possible doube memblock.memory array which will need
to allocate buffer.

We do not need to use nid in memblock to find out absent pages.
So we can move that numa_meminfo_cover_memory() early.

Also we could change __absent_pages_in_range() to static and use
absent_pages_in_range() directly.

Later we will set memblock nid only once on successful path.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/numa.c |    7 ++++---
 include/linux/mm.h |    2 --
 mm/page_alloc.c    |    2 +-
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 07ae800..1bb565d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -457,7 +457,7 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
 		u64 s = mi->blk[i].start >> PAGE_SHIFT;
 		u64 e = mi->blk[i].end >> PAGE_SHIFT;
 		numaram += e - s;
-		numaram -= __absent_pages_in_range(mi->blk[i].nid, s, e);
+		numaram -= absent_pages_in_range(s, e);
 		if ((s64)numaram < 0)
 			numaram = 0;
 	}
@@ -485,6 +485,9 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 	if (WARN_ON(nodes_empty(node_possible_map)))
 		return -EINVAL;
 
+	if (!numa_meminfo_cover_memory(mi))
+		return -EINVAL;
+
 	for (i = 0; i < mi->nr_blks; i++) {
 		struct numa_memblk *mb = &mi->blk[i];
 		memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
@@ -503,8 +506,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 		return -EINVAL;
 	}
 #endif
-	if (!numa_meminfo_cover_memory(mi))
-		return -EINVAL;
 
 	return 0;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e0c8528..28e9470 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1385,8 +1385,6 @@ static inline unsigned long free_initmem_default(int poison)
  */
 extern void free_area_init_nodes(unsigned long *max_zone_pfn);
 unsigned long node_map_pfn_alignment(void);
-unsigned long __absent_pages_in_range(int nid, unsigned long start_pfn,
-						unsigned long end_pfn);
 extern unsigned long absent_pages_in_range(unsigned long start_pfn,
 						unsigned long end_pfn);
 extern void get_pfn_range_for_nid(unsigned int nid,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3edb62..74e3428 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4397,7 +4397,7 @@ static unsigned long __meminit zone_spanned_pages_in_node(int nid,
  * Return the number of holes in a range on a node. If nid is MAX_NUMNODES,
  * then all holes in the requested range will be accounted for.
  */
-unsigned long __meminit __absent_pages_in_range(int nid,
+static unsigned long __meminit __absent_pages_in_range(int nid,
 				unsigned long range_start_pfn,
 				unsigned long range_end_pfn)
 {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 12/22] x86, mm, numa: Move node_map_pfn_alignment() to x86
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (10 preceding siblings ...)
  2013-06-13 13:02 ` [Part1 PATCH v5 11/22] x86, mm, numa: Call numa_meminfo_cover_memory() checking early Tang Chen
@ 2013-06-13 13:02 ` Tang Chen
  2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-18  1:08   ` [Part1 PATCH v5 12/22] " Tejun Heo
  2013-06-13 13:03 ` [Part1 PATCH v5 13/22] x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment Tang Chen
                   ` (12 subsequent siblings)
  24 siblings, 2 replies; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:02 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm

From: Yinghai Lu <yinghai@kernel.org>

Move node_map_pfn_alignment() to arch/x86/mm as there is no
other user for it.

Will update it to use numa_meminfo instead of memblock.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/numa.c |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h |    1 -
 mm/page_alloc.c    |   50 --------------------------------------------------
 3 files changed, 50 insertions(+), 51 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1bb565d..10c6240 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -474,6 +474,56 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
 	return true;
 }
 
+/**
+ * node_map_pfn_alignment - determine the maximum internode alignment
+ *
+ * This function should be called after node map is populated and sorted.
+ * It calculates the maximum power of two alignment which can distinguish
+ * all the nodes.
+ *
+ * For example, if all nodes are 1GiB and aligned to 1GiB, the return value
+ * would indicate 1GiB alignment with (1 << (30 - PAGE_SHIFT)).  If the
+ * nodes are shifted by 256MiB, 256MiB.  Note that if only the last node is
+ * shifted, 1GiB is enough and this function will indicate so.
+ *
+ * This is used to test whether pfn -> nid mapping of the chosen memory
+ * model has fine enough granularity to avoid incorrect mapping for the
+ * populated node map.
+ *
+ * Returns the determined alignment in pfn's.  0 if there is no alignment
+ * requirement (single node).
+ */
+unsigned long __init node_map_pfn_alignment(void)
+{
+	unsigned long accl_mask = 0, last_end = 0;
+	unsigned long start, end, mask;
+	int last_nid = -1;
+	int i, nid;
+
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
+		if (!start || last_nid < 0 || last_nid == nid) {
+			last_nid = nid;
+			last_end = end;
+			continue;
+		}
+
+		/*
+		 * Start with a mask granular enough to pin-point to the
+		 * start pfn and tick off bits one-by-one until it becomes
+		 * too coarse to separate the current node from the last.
+		 */
+		mask = ~((1 << __ffs(start)) - 1);
+		while (mask && last_end <= (start & (mask << 1)))
+			mask <<= 1;
+
+		/* accumulate all internode masks */
+		accl_mask |= mask;
+	}
+
+	/* convert mask to number of pages */
+	return ~accl_mask + 1;
+}
+
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
 	unsigned long uninitialized_var(pfn_align);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 28e9470..b827743 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1384,7 +1384,6 @@ static inline unsigned long free_initmem_default(int poison)
  * CONFIG_HAVE_MEMBLOCK_NODE_MAP.
  */
 extern void free_area_init_nodes(unsigned long *max_zone_pfn);
-unsigned long node_map_pfn_alignment(void);
 extern unsigned long absent_pages_in_range(unsigned long start_pfn,
 						unsigned long end_pfn);
 extern void get_pfn_range_for_nid(unsigned int nid,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 74e3428..7ba7703 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4762,56 +4762,6 @@ void __init setup_nr_node_ids(void)
 }
 #endif
 
-/**
- * node_map_pfn_alignment - determine the maximum internode alignment
- *
- * This function should be called after node map is populated and sorted.
- * It calculates the maximum power of two alignment which can distinguish
- * all the nodes.
- *
- * For example, if all nodes are 1GiB and aligned to 1GiB, the return value
- * would indicate 1GiB alignment with (1 << (30 - PAGE_SHIFT)).  If the
- * nodes are shifted by 256MiB, 256MiB.  Note that if only the last node is
- * shifted, 1GiB is enough and this function will indicate so.
- *
- * This is used to test whether pfn -> nid mapping of the chosen memory
- * model has fine enough granularity to avoid incorrect mapping for the
- * populated node map.
- *
- * Returns the determined alignment in pfn's.  0 if there is no alignment
- * requirement (single node).
- */
-unsigned long __init node_map_pfn_alignment(void)
-{
-	unsigned long accl_mask = 0, last_end = 0;
-	unsigned long start, end, mask;
-	int last_nid = -1;
-	int i, nid;
-
-	for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
-		if (!start || last_nid < 0 || last_nid == nid) {
-			last_nid = nid;
-			last_end = end;
-			continue;
-		}
-
-		/*
-		 * Start with a mask granular enough to pin-point to the
-		 * start pfn and tick off bits one-by-one until it becomes
-		 * too coarse to separate the current node from the last.
-		 */
-		mask = ~((1 << __ffs(start)) - 1);
-		while (mask && last_end <= (start & (mask << 1)))
-			mask <<= 1;
-
-		/* accumulate all internode masks */
-		accl_mask |= mask;
-	}
-
-	/* convert mask to number of pages */
-	return ~accl_mask + 1;
-}
-
 /* Find the lowest pfn for a node */
 static unsigned long __init find_min_pfn_for_node(int nid)
 {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 13/22] x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (11 preceding siblings ...)
  2013-06-13 13:02 ` [Part1 PATCH v5 12/22] x86, mm, numa: Move node_map_pfn_alignment() to x86 Tang Chen
@ 2013-06-13 13:03 ` Tang Chen
  2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-18  1:40   ` [Part1 PATCH v5 13/22] " Tejun Heo
  2013-06-13 13:03 ` [Part1 PATCH v5 14/22] x86, mm, numa: Set memblock nid later Tang Chen
                   ` (11 subsequent siblings)
  24 siblings, 2 replies; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:03 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm

From: Yinghai Lu <yinghai@kernel.org>

We could use numa_meminfo directly instead of memblock nid in
node_map_pfn_alignment().

So we could do setting memblock nid later and only do it once
for successful path.

-v2: according to tj, separate moving to another patch.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/numa.c |   30 +++++++++++++++++++-----------
 1 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 10c6240..cff565a 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -493,14 +493,18 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
  * Returns the determined alignment in pfn's.  0 if there is no alignment
  * requirement (single node).
  */
-unsigned long __init node_map_pfn_alignment(void)
+#ifdef NODE_NOT_IN_PAGE_FLAGS
+static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
 {
 	unsigned long accl_mask = 0, last_end = 0;
 	unsigned long start, end, mask;
 	int last_nid = -1;
 	int i, nid;
 
-	for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
+	for (i = 0; i < mi->nr_blks; i++) {
+		start = mi->blk[i].start >> PAGE_SHIFT;
+		end = mi->blk[i].end >> PAGE_SHIFT;
+		nid = mi->blk[i].nid;
 		if (!start || last_nid < 0 || last_nid == nid) {
 			last_nid = nid;
 			last_end = end;
@@ -523,10 +527,16 @@ unsigned long __init node_map_pfn_alignment(void)
 	/* convert mask to number of pages */
 	return ~accl_mask + 1;
 }
+#else
+static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
+{
+	return 0;
+}
+#endif
 
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
-	unsigned long uninitialized_var(pfn_align);
+	unsigned long pfn_align;
 	int i;
 
 	/* Account for nodes with cpus and no memory */
@@ -538,24 +548,22 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 	if (!numa_meminfo_cover_memory(mi))
 		return -EINVAL;
 
-	for (i = 0; i < mi->nr_blks; i++) {
-		struct numa_memblk *mb = &mi->blk[i];
-		memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
-	}
-
 	/*
 	 * If sections array is gonna be used for pfn -> nid mapping, check
 	 * whether its granularity is fine enough.
 	 */
-#ifdef NODE_NOT_IN_PAGE_FLAGS
-	pfn_align = node_map_pfn_alignment();
+	pfn_align = node_map_pfn_alignment(mi);
 	if (pfn_align && pfn_align < PAGES_PER_SECTION) {
 		printk(KERN_WARNING "Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
 		       PFN_PHYS(pfn_align) >> 20,
 		       PFN_PHYS(PAGES_PER_SECTION) >> 20);
 		return -EINVAL;
 	}
-#endif
+
+	for (i = 0; i < mi->nr_blks; i++) {
+		struct numa_memblk *mb = &mi->blk[i];
+		memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
+	}
 
 	return 0;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 14/22] x86, mm, numa: Set memblock nid later
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (12 preceding siblings ...)
  2013-06-13 13:03 ` [Part1 PATCH v5 13/22] x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment Tang Chen
@ 2013-06-13 13:03 ` Tang Chen
  2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-18  1:45   ` [Part1 PATCH v5 14/22] " Tejun Heo
  2013-06-13 13:03 ` [Part1 PATCH v5 15/22] x86, mm, numa: Move node_possible_map setting later Tang Chen
                   ` (10 subsequent siblings)
  24 siblings, 2 replies; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:03 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm

From: Yinghai Lu <yinghai@kernel.org>

In order to seperate parsing numa info procedure into two steps,
we need to set memblock nid later because it could change memblock
array, and possible doube memblock.memory array which will allocate
buffer.

Only set memblock nid once for successful path.

Also rename numa_register_memblks to numa_check_memblks() after
moving out code of setting memblock nid.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/numa.c |   16 +++++++---------
 1 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index cff565a..e448b6f 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -534,10 +534,9 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
 }
 #endif
 
-static int __init numa_register_memblks(struct numa_meminfo *mi)
+static int __init numa_check_memblks(struct numa_meminfo *mi)
 {
 	unsigned long pfn_align;
-	int i;
 
 	/* Account for nodes with cpus and no memory */
 	node_possible_map = numa_nodes_parsed;
@@ -560,11 +559,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 		return -EINVAL;
 	}
 
-	for (i = 0; i < mi->nr_blks; i++) {
-		struct numa_memblk *mb = &mi->blk[i];
-		memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
-	}
-
 	return 0;
 }
 
@@ -601,7 +595,6 @@ static int __init numa_init(int (*init_func)(void))
 	nodes_clear(numa_nodes_parsed);
 	nodes_clear(node_possible_map);
 	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
-	WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
 	numa_reset_distance();
 
 	ret = init_func();
@@ -613,7 +606,7 @@ static int __init numa_init(int (*init_func)(void))
 
 	numa_emulation(&numa_meminfo, numa_distance_cnt);
 
-	ret = numa_register_memblks(&numa_meminfo);
+	ret = numa_check_memblks(&numa_meminfo);
 	if (ret < 0)
 		return ret;
 
@@ -676,6 +669,11 @@ void __init x86_numa_init(void)
 
 	early_x86_numa_init();
 
+	for (i = 0; i < mi->nr_blks; i++) {
+		struct numa_memblk *mb = &mi->blk[i];
+		memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
+	}
+
 	/* Finally register nodes. */
 	for_each_node_mask(nid, node_possible_map) {
 		u64 start = PFN_PHYS(max_pfn);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 15/22] x86, mm, numa: Move node_possible_map setting later
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (13 preceding siblings ...)
  2013-06-13 13:03 ` [Part1 PATCH v5 14/22] x86, mm, numa: Set memblock nid later Tang Chen
@ 2013-06-13 13:03 ` Tang Chen
  2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-13 13:03 ` [Part1 PATCH v5 16/22] x86, mm, numa: Move numa emulation handling down Tang Chen
                   ` (9 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:03 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm

From: Yinghai Lu <yinghai@kernel.org>

Move node_possible_map handling out of numa_check_memblks()
to avoid side effect when changing numa_check_memblks().

Only set node_possible_map once for successful path instead
of resetting in numa_init() every time.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/numa.c |   11 +++++++----
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e448b6f..da2ebab 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -536,12 +536,13 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
 
 static int __init numa_check_memblks(struct numa_meminfo *mi)
 {
+	nodemask_t nodes_parsed;
 	unsigned long pfn_align;
 
 	/* Account for nodes with cpus and no memory */
-	node_possible_map = numa_nodes_parsed;
-	numa_nodemask_from_meminfo(&node_possible_map, mi);
-	if (WARN_ON(nodes_empty(node_possible_map)))
+	nodes_parsed = numa_nodes_parsed;
+	numa_nodemask_from_meminfo(&nodes_parsed, mi);
+	if (WARN_ON(nodes_empty(nodes_parsed)))
 		return -EINVAL;
 
 	if (!numa_meminfo_cover_memory(mi))
@@ -593,7 +594,6 @@ static int __init numa_init(int (*init_func)(void))
 		set_apicid_to_node(i, NUMA_NO_NODE);
 
 	nodes_clear(numa_nodes_parsed);
-	nodes_clear(node_possible_map);
 	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
 	numa_reset_distance();
 
@@ -669,6 +669,9 @@ void __init x86_numa_init(void)
 
 	early_x86_numa_init();
 
+	node_possible_map = numa_nodes_parsed;
+	numa_nodemask_from_meminfo(&node_possible_map, mi);
+
 	for (i = 0; i < mi->nr_blks; i++) {
 		struct numa_memblk *mb = &mi->blk[i];
 		memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 16/22] x86, mm, numa: Move numa emulation handling down.
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (14 preceding siblings ...)
  2013-06-13 13:03 ` [Part1 PATCH v5 15/22] x86, mm, numa: Move node_possible_map setting later Tang Chen
@ 2013-06-13 13:03 ` Tang Chen
  2013-06-14 21:33   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-18  1:58   ` [Part1 PATCH v5 16/22] " Tejun Heo
  2013-06-13 13:03 ` [Part1 PATCH v5 17/22] x86, ACPI, numa, ia64: split SLIT handling out Tang Chen
                   ` (8 subsequent siblings)
  24 siblings, 2 replies; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:03 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm, David Rientjes

From: Yinghai Lu <yinghai@kernel.org>

numa_emulation() needs to allocate buffer for new numa_meminfo
and distance matrix, so execute it later in x86_numa_init().

Also we change the behavoir:
	- before this patch, if user input wrong data in command
	  line, it will fall back to next numa probing or disabling
	  numa.
	- after this patch, if user input wrong data in command line,
	  it will stay with numa info probed from previous probing,
	  like ACPI SRAT or amd_numa.

We need to call numa_check_memblks to reject wrong user inputs early
so that we can keep the original numa_meminfo not changed.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/numa.c           |    6 +++---
 arch/x86/mm/numa_emulation.c |    2 +-
 arch/x86/mm/numa_internal.h  |    2 ++
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index da2ebab..3254f22 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -534,7 +534,7 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
 }
 #endif
 
-static int __init numa_check_memblks(struct numa_meminfo *mi)
+int __init numa_check_memblks(struct numa_meminfo *mi)
 {
 	nodemask_t nodes_parsed;
 	unsigned long pfn_align;
@@ -604,8 +604,6 @@ static int __init numa_init(int (*init_func)(void))
 	if (ret < 0)
 		return ret;
 
-	numa_emulation(&numa_meminfo, numa_distance_cnt);
-
 	ret = numa_check_memblks(&numa_meminfo);
 	if (ret < 0)
 		return ret;
@@ -669,6 +667,8 @@ void __init x86_numa_init(void)
 
 	early_x86_numa_init();
 
+	numa_emulation(&numa_meminfo, numa_distance_cnt);
+
 	node_possible_map = numa_nodes_parsed;
 	numa_nodemask_from_meminfo(&node_possible_map, mi);
 
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index dbbbb47..5a0433d 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -348,7 +348,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 	if (ret < 0)
 		goto no_emu;
 
-	if (numa_cleanup_meminfo(&ei) < 0) {
+	if (numa_cleanup_meminfo(&ei) < 0 || numa_check_memblks(&ei) < 0) {
 		pr_warning("NUMA: Warning: constructed meminfo invalid, disabling emulation\n");
 		goto no_emu;
 	}
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index ad86ec9..bb2fbcc 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -21,6 +21,8 @@ void __init numa_reset_distance(void);
 
 void __init x86_numa_init(void);
 
+int __init numa_check_memblks(struct numa_meminfo *mi);
+
 #ifdef CONFIG_NUMA_EMU
 void __init numa_emulation(struct numa_meminfo *numa_meminfo,
 			   int numa_dist_cnt);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 17/22] x86, ACPI, numa, ia64: split SLIT handling out
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (15 preceding siblings ...)
  2013-06-13 13:03 ` [Part1 PATCH v5 16/22] x86, mm, numa: Move numa emulation handling down Tang Chen
@ 2013-06-13 13:03 ` Tang Chen
  2013-06-14 21:33   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-13 13:03 ` [Part1 PATCH v5 18/22] x86, mm, numa: Add early_initmem_init() stub Tang Chen
                   ` (7 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:03 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm, Rafael J. Wysocki,
	linux-acpi, Tony Luck, Fenghua Yu, linux-ia64

From: Yinghai Lu <yinghai@kernel.org>

We need to handle slit later, as it need to allocate buffer for distance
matrix. Also we do not need SLIT info before init_mem_mapping. So move
SLIT parsing procedure later.

x86_acpi_numa_init() will be splited into x86_acpi_numa_init_srat() and
x86_acpi_numa_init_slit().

It should not break ia64 by replacing acpi_numa_init with
acpi_numa_init_srat/acpi_numa_init_slit/acpi_num_arch_fixup.

-v2: Change name to acpi_numa_init_srat/acpi_numa_init_slit according tj.
     remove the reset_numa_distance() in numa_init(), as get we only set
     distance in slit handling.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: linux-acpi@vger.kernel.org
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: linux-ia64@vger.kernel.org
Tested-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/ia64/kernel/setup.c    |    4 +++-
 arch/x86/include/asm/acpi.h |    3 ++-
 arch/x86/mm/numa.c          |   14 ++++++++++++--
 arch/x86/mm/srat.c          |   11 +++++++----
 drivers/acpi/numa.c         |   13 +++++++------
 include/linux/acpi.h        |    3 ++-
 6 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
index 13bfdd2..5f7db4a 100644
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -558,7 +558,9 @@ setup_arch (char **cmdline_p)
 	acpi_table_init();
 	early_acpi_boot_init();
 # ifdef CONFIG_ACPI_NUMA
-	acpi_numa_init();
+	acpi_numa_init_srat();
+	acpi_numa_init_slit();
+	acpi_numa_arch_fixup();
 #  ifdef CONFIG_ACPI_HOTPLUG_CPU
 	prefill_possible_map();
 #  endif
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index b31bf97..651db0b 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -178,7 +178,8 @@ static inline void disable_acpi(void) { }
 
 #ifdef CONFIG_ACPI_NUMA
 extern int acpi_numa;
-extern int x86_acpi_numa_init(void);
+int x86_acpi_numa_init_srat(void);
+void x86_acpi_numa_init_slit(void);
 #endif /* CONFIG_ACPI_NUMA */
 
 #define acpi_unlazy_tlb(x)	leave_mm(x)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 3254f22..630e09f 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -595,7 +595,6 @@ static int __init numa_init(int (*init_func)(void))
 
 	nodes_clear(numa_nodes_parsed);
 	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
-	numa_reset_distance();
 
 	ret = init_func();
 	if (ret < 0)
@@ -633,6 +632,10 @@ static int __init dummy_numa_init(void)
 	return 0;
 }
 
+#ifdef CONFIG_ACPI_NUMA
+static bool srat_used __initdata;
+#endif
+
 /**
  * x86_numa_init - Initialize NUMA
  *
@@ -648,8 +651,10 @@ static void __init early_x86_numa_init(void)
 			return;
 #endif
 #ifdef CONFIG_ACPI_NUMA
-		if (!numa_init(x86_acpi_numa_init))
+		if (!numa_init(x86_acpi_numa_init_srat)) {
+			srat_used = true;
 			return;
+		}
 #endif
 #ifdef CONFIG_AMD_NUMA
 		if (!numa_init(amd_numa_init))
@@ -667,6 +672,11 @@ void __init x86_numa_init(void)
 
 	early_x86_numa_init();
 
+#ifdef CONFIG_ACPI_NUMA
+	if (srat_used)
+		x86_acpi_numa_init_slit();
+#endif
+
 	numa_emulation(&numa_meminfo, numa_distance_cnt);
 
 	node_possible_map = numa_nodes_parsed;
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index cdd0da9..443f9ef 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -185,14 +185,17 @@ out_err:
 	return -1;
 }
 
-void __init acpi_numa_arch_fixup(void) {}
-
-int __init x86_acpi_numa_init(void)
+int __init x86_acpi_numa_init_srat(void)
 {
 	int ret;
 
-	ret = acpi_numa_init();
+	ret = acpi_numa_init_srat();
 	if (ret < 0)
 		return ret;
 	return srat_disabled() ? -EINVAL : 0;
 }
+
+void __init x86_acpi_numa_init_slit(void)
+{
+	acpi_numa_init_slit();
+}
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 33e609f..6460db4 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -282,7 +282,7 @@ acpi_table_parse_srat(enum acpi_srat_type id,
 					    handler, max_entries);
 }
 
-int __init acpi_numa_init(void)
+int __init acpi_numa_init_srat(void)
 {
 	int cnt = 0;
 
@@ -303,11 +303,6 @@ int __init acpi_numa_init(void)
 					    NR_NODE_MEMBLKS);
 	}
 
-	/* SLIT: System Locality Information Table */
-	acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);
-
-	acpi_numa_arch_fixup();
-
 	if (cnt < 0)
 		return cnt;
 	else if (!parsed_numa_memblks)
@@ -315,6 +310,12 @@ int __init acpi_numa_init(void)
 	return 0;
 }
 
+void __init acpi_numa_init_slit(void)
+{
+	/* SLIT: System Locality Information Table */
+	acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);
+}
+
 int acpi_get_pxm(acpi_handle h)
 {
 	unsigned long long pxm;
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 4e3731b..92463b5 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -85,7 +85,8 @@ int early_acpi_boot_init(void);
 int acpi_boot_init (void);
 void acpi_boot_table_init (void);
 int acpi_mps_check (void);
-int acpi_numa_init (void);
+int acpi_numa_init_srat(void);
+void acpi_numa_init_slit(void);
 
 int acpi_table_init (void);
 int acpi_table_parse(char *id, acpi_tbl_table_handler handler);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 18/22] x86, mm, numa: Add early_initmem_init() stub
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (16 preceding siblings ...)
  2013-06-13 13:03 ` [Part1 PATCH v5 17/22] x86, ACPI, numa, ia64: split SLIT handling out Tang Chen
@ 2013-06-13 13:03 ` Tang Chen
  2013-06-14 21:33   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-13 13:03 ` [Part1 PATCH v5 19/22] x86, mm: Parse numa info earlier Tang Chen
                   ` (6 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:03 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm, Pekka Enberg, Jacob Shin

From: Yinghai Lu <yinghai@kernel.org>

Introduce early_initmem_init() to call early_x86_numa_init(),
which will be used to parse numa info earlier.

Later will call init_mem_mapping for all the nodes.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Jacob Shin <jacob.shin@amd.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/include/asm/page_types.h |    1 +
 arch/x86/kernel/setup.c           |    1 +
 arch/x86/mm/init.c                |    6 ++++++
 arch/x86/mm/numa.c                |    7 +++++--
 4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index b012b82..d04dd8c 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -55,6 +55,7 @@ bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
 extern unsigned long init_memory_mapping(unsigned long start,
 					 unsigned long end);
 
+void early_initmem_init(void);
 extern void initmem_init(void);
 
 #endif	/* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d11b1b7..301165e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1162,6 +1162,7 @@ void __init setup_arch(char **cmdline_p)
 
 	early_acpi_boot_init();
 
+	early_initmem_init();
 	initmem_init();
 	memblock_find_dma_reserve();
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 8554656..3c21f16 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -467,6 +467,12 @@ void __init init_mem_mapping(void)
 	early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
 }
 
+#ifndef CONFIG_NUMA
+void __init early_initmem_init(void)
+{
+}
+#endif
+
 /*
  * devmem_is_allowed() checks to see if /dev/mem access to a certain address
  * is valid. The argument is a physical page number.
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 630e09f..7d76936 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -665,13 +665,16 @@ static void __init early_x86_numa_init(void)
 	numa_init(dummy_numa_init);
 }
 
+void __init early_initmem_init(void)
+{
+	early_x86_numa_init();
+}
+
 void __init x86_numa_init(void)
 {
 	int i, nid;
 	struct numa_meminfo *mi = &numa_meminfo;
 
-	early_x86_numa_init();
-
 #ifdef CONFIG_ACPI_NUMA
 	if (srat_used)
 		x86_acpi_numa_init_slit();
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 19/22] x86, mm: Parse numa info earlier
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (17 preceding siblings ...)
  2013-06-13 13:03 ` [Part1 PATCH v5 18/22] x86, mm, numa: Add early_initmem_init() stub Tang Chen
@ 2013-06-13 13:03 ` Tang Chen
  2013-06-14 21:33   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-13 13:03 ` [Part1 PATCH v5 20/22] x86, mm: Add comments for step_size shift Tang Chen
                   ` (5 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:03 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm, Pekka Enberg, Jacob Shin

From: Yinghai Lu <yinghai@kernel.org>

Parsing numa info has been separated into two steps now.

early_initmem_info() only parses info in numa_meminfo and
nodes_parsed. still keep numaq, acpi_numa, amd_numa, dummy
fall back sequence working.

SLIT and numa emulation handling are still left in initmem_init().

Call early_initmem_init before init_mem_mapping() to prepare
to use numa_info with it.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Jacob Shin <jacob.shin@amd.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/kernel/setup.c |   24 ++++++++++--------------
 1 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 301165e..fd0d5be 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1125,13 +1125,21 @@ void __init setup_arch(char **cmdline_p)
 	trim_platform_memory_ranges();
 	trim_low_memory_range();
 
+	/*
+	 * Parse the ACPI tables for possible boot-time SMP configuration.
+	 */
+	acpi_initrd_override_copy();
+	acpi_boot_table_init();
+	early_acpi_boot_init();
+	early_initmem_init();
 	init_mem_mapping();
-
+	memblock.current_limit = get_max_mapped();
 	early_trap_pf_init();
 
+	reserve_initrd();
+
 	setup_real_mode();
 
-	memblock.current_limit = get_max_mapped();
 	dma_contiguous_reserve(0);
 
 	/*
@@ -1145,24 +1153,12 @@ void __init setup_arch(char **cmdline_p)
 	/* Allocate bigger log buffer */
 	setup_log_buf(1);
 
-	acpi_initrd_override_copy();
-
-	reserve_initrd();
-
 	reserve_crashkernel();
 
 	vsmp_init();
 
 	io_delay_init();
 
-	/*
-	 * Parse the ACPI tables for possible boot-time SMP configuration.
-	 */
-	acpi_boot_table_init();
-
-	early_acpi_boot_init();
-
-	early_initmem_init();
 	initmem_init();
 	memblock_find_dma_reserve();
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 20/22] x86, mm: Add comments for step_size shift
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (18 preceding siblings ...)
  2013-06-13 13:03 ` [Part1 PATCH v5 19/22] x86, mm: Parse numa info earlier Tang Chen
@ 2013-06-13 13:03 ` Tang Chen
  2013-06-14 21:33   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-13 13:03 ` [Part1 PATCH v5 21/22] x86, mm: Make init_mem_mapping be able to be called several times Tang Chen
                   ` (4 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:03 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm

From: Yinghai Lu <yinghai@kernel.org>

As requested by hpa, add comments for why we choose 5 to be
the step size shift.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/init.c |   21 ++++++++++++++++++---
 1 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 3c21f16..5f38e72 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -395,8 +395,23 @@ static unsigned long __init init_range_memory_mapping(
 	return mapped_ram_size;
 }
 
-/* (PUD_SHIFT-PMD_SHIFT)/2 */
-#define STEP_SIZE_SHIFT 5
+static unsigned long __init get_new_step_size(unsigned long step_size)
+{
+	/*
+	 * initial mapped size is PMD_SIZE, aka 2M.
+	 * We can not set step_size to be PUD_SIZE aka 1G yet.
+	 * In worse case, when 1G is cross the 1G boundary, and
+	 * PG_LEVEL_2M is not set, we will need 1+1+512 pages (aka 2M + 8k)
+	 * to map 1G range with PTE. Use 5 as shift for now.
+	 */
+	unsigned long new_step_size = step_size << 5;
+
+	if (new_step_size > step_size)
+		step_size = new_step_size;
+
+	return  step_size;
+}
+
 void __init init_mem_mapping(void)
 {
 	unsigned long end, real_end, start, last_start;
@@ -445,7 +460,7 @@ void __init init_mem_mapping(void)
 		min_pfn_mapped = last_start >> PAGE_SHIFT;
 		/* only increase step_size after big range get mapped */
 		if (new_mapped_ram_size > mapped_ram_size)
-			step_size <<= STEP_SIZE_SHIFT;
+			step_size = get_new_step_size(step_size);
 		mapped_ram_size += new_mapped_ram_size;
 	}
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 21/22] x86, mm: Make init_mem_mapping be able to be called several times
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (19 preceding siblings ...)
  2013-06-13 13:03 ` [Part1 PATCH v5 20/22] x86, mm: Add comments for step_size shift Tang Chen
@ 2013-06-13 13:03 ` Tang Chen
  2013-06-13 18:35   ` Konrad Rzeszutek Wilk
  2013-06-14 21:33   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-13 13:03 ` [Part1 PATCH v5 22/22] x86, mm, numa: Put pagetable on local node ram for 64bit Tang Chen
                   ` (3 subsequent siblings)
  24 siblings, 2 replies; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:03 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm, Pekka Enberg, Jacob Shin,
	Konrad Rzeszutek Wilk

From: Yinghai Lu <yinghai@kernel.org>

Prepare to put page table on local nodes.

Move calling of init_mem_mapping() to early_initmem_init().

Rework alloc_low_pages to allocate page table in following order:
	BRK, local node, low range

Still only load_cr3 one time, otherwise we would break xen 64bit again.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Jacob Shin <jacob.shin@amd.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/include/asm/pgtable.h |    2 +-
 arch/x86/kernel/setup.c        |    1 -
 arch/x86/mm/init.c             |  100 +++++++++++++++++++++++++---------------
 arch/x86/mm/numa.c             |   24 ++++++++++
 4 files changed, 88 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1e67223..868687c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -621,7 +621,7 @@ static inline int pgd_none(pgd_t pgd)
 #ifndef __ASSEMBLY__
 
 extern int direct_gbpages;
-void init_mem_mapping(void);
+void init_mem_mapping(unsigned long begin, unsigned long end);
 void early_alloc_pgt_buf(void);
 
 /* local pte updates need not use xchg for locking */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fd0d5be..9ccbd60 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1132,7 +1132,6 @@ void __init setup_arch(char **cmdline_p)
 	acpi_boot_table_init();
 	early_acpi_boot_init();
 	early_initmem_init();
-	init_mem_mapping();
 	memblock.current_limit = get_max_mapped();
 	early_trap_pf_init();
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 5f38e72..9ff71ff 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -24,7 +24,10 @@ static unsigned long __initdata pgt_buf_start;
 static unsigned long __initdata pgt_buf_end;
 static unsigned long __initdata pgt_buf_top;
 
-static unsigned long min_pfn_mapped;
+static unsigned long low_min_pfn_mapped;
+static unsigned long low_max_pfn_mapped;
+static unsigned long local_min_pfn_mapped;
+static unsigned long local_max_pfn_mapped;
 
 static bool __initdata can_use_brk_pgt = true;
 
@@ -52,10 +55,17 @@ __ref void *alloc_low_pages(unsigned int num)
 
 	if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) {
 		unsigned long ret;
-		if (min_pfn_mapped >= max_pfn_mapped)
-			panic("alloc_low_page: ran out of memory");
-		ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
-					max_pfn_mapped << PAGE_SHIFT,
+		if (local_min_pfn_mapped >= local_max_pfn_mapped) {
+			if (low_min_pfn_mapped >= low_max_pfn_mapped)
+				panic("alloc_low_page: ran out of memory");
+			ret = memblock_find_in_range(
+					low_min_pfn_mapped << PAGE_SHIFT,
+					low_max_pfn_mapped << PAGE_SHIFT,
+					PAGE_SIZE * num , PAGE_SIZE);
+		} else
+			ret = memblock_find_in_range(
+					local_min_pfn_mapped << PAGE_SHIFT,
+					local_max_pfn_mapped << PAGE_SHIFT,
 					PAGE_SIZE * num , PAGE_SIZE);
 		if (!ret)
 			panic("alloc_low_page: can not alloc memory");
@@ -412,67 +422,88 @@ static unsigned long __init get_new_step_size(unsigned long step_size)
 	return  step_size;
 }
 
-void __init init_mem_mapping(void)
+void __init init_mem_mapping(unsigned long begin, unsigned long end)
 {
-	unsigned long end, real_end, start, last_start;
+	unsigned long real_end, start, last_start;
 	unsigned long step_size;
 	unsigned long addr;
 	unsigned long mapped_ram_size = 0;
 	unsigned long new_mapped_ram_size;
+	bool is_low = false;
+
+	if (!begin) {
+		probe_page_size_mask();
+		/* the ISA range is always mapped regardless of memory holes */
+		init_memory_mapping(0, ISA_END_ADDRESS);
+		begin = ISA_END_ADDRESS;
+		is_low = true;
+	}
 
-	probe_page_size_mask();
-
-#ifdef CONFIG_X86_64
-	end = max_pfn << PAGE_SHIFT;
-#else
-	end = max_low_pfn << PAGE_SHIFT;
-#endif
-
-	/* the ISA range is always mapped regardless of memory holes */
-	init_memory_mapping(0, ISA_END_ADDRESS);
+	if (begin >= end)
+		return;
 
 	/* xen has big range in reserved near end of ram, skip it at first.*/
-	addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
+	addr = memblock_find_in_range(begin, end, PMD_SIZE, PMD_SIZE);
 	real_end = addr + PMD_SIZE;
 
 	/* step_size need to be small so pgt_buf from BRK could cover it */
 	step_size = PMD_SIZE;
-	max_pfn_mapped = 0; /* will get exact value next */
-	min_pfn_mapped = real_end >> PAGE_SHIFT;
+	local_max_pfn_mapped = begin >> PAGE_SHIFT;
+	local_min_pfn_mapped = real_end >> PAGE_SHIFT;
 	last_start = start = real_end;
 
 	/*
-	 * We start from the top (end of memory) and go to the bottom.
-	 * The memblock_find_in_range() gets us a block of RAM from the
-	 * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
-	 * for page table.
+	 * alloc_low_pages() will allocate pagetable pages in the following
+	 * order:
+	 *	BRK, local node, low range
+	 *
+	 * That means it will first use up all the BRK memory, then try to get
+	 * us a block of RAM from [local_min_pfn_mapped, local_max_pfn_mapped)
+	 * used as new pagetable pages. If no memory on the local node has
+	 * been mapped, it will allocate memory from
+	 * [low_min_pfn_mapped, low_max_pfn_mapped).
 	 */
-	while (last_start > ISA_END_ADDRESS) {
+	while (last_start > begin) {
 		if (last_start > step_size) {
 			start = round_down(last_start - 1, step_size);
-			if (start < ISA_END_ADDRESS)
-				start = ISA_END_ADDRESS;
+			if (start < begin)
+				start = begin;
 		} else
-			start = ISA_END_ADDRESS;
+			start = begin;
 		new_mapped_ram_size = init_range_memory_mapping(start,
 							last_start);
+		if ((last_start >> PAGE_SHIFT) > local_max_pfn_mapped)
+			local_max_pfn_mapped = last_start >> PAGE_SHIFT;
+		local_min_pfn_mapped = start >> PAGE_SHIFT;
 		last_start = start;
-		min_pfn_mapped = last_start >> PAGE_SHIFT;
 		/* only increase step_size after big range get mapped */
 		if (new_mapped_ram_size > mapped_ram_size)
 			step_size = get_new_step_size(step_size);
 		mapped_ram_size += new_mapped_ram_size;
 	}
 
-	if (real_end < end)
+	if (real_end < end) {
 		init_range_memory_mapping(real_end, end);
+		if ((end >> PAGE_SHIFT) > local_max_pfn_mapped)
+			local_max_pfn_mapped = end >> PAGE_SHIFT;
+	}
 
+	if (is_low) {
+		low_min_pfn_mapped = local_min_pfn_mapped;
+		low_max_pfn_mapped = local_max_pfn_mapped;
+	}
+}
+
+#ifndef CONFIG_NUMA
+void __init early_initmem_init(void)
+{
 #ifdef CONFIG_X86_64
-	if (max_pfn > max_low_pfn) {
-		/* can we preseve max_low_pfn ?*/
+	init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+	if (max_pfn > max_low_pfn)
 		max_low_pfn = max_pfn;
 	}
 #else
+	init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
 	early_ioremap_page_table_range_init();
 #endif
 
@@ -481,11 +512,6 @@ void __init init_mem_mapping(void)
 
 	early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
 }
-
-#ifndef CONFIG_NUMA
-void __init early_initmem_init(void)
-{
-}
 #endif
 
 /*
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 7d76936..9b18ee8 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -17,8 +17,10 @@
 #include <asm/dma.h>
 #include <asm/acpi.h>
 #include <asm/amd_nb.h>
+#include <asm/tlbflush.h>
 
 #include "numa_internal.h"
+#include "mm_internal.h"
 
 int __initdata numa_off;
 nodemask_t numa_nodes_parsed __initdata;
@@ -665,9 +667,31 @@ static void __init early_x86_numa_init(void)
 	numa_init(dummy_numa_init);
 }
 
+#ifdef CONFIG_X86_64
+static void __init early_x86_numa_init_mapping(void)
+{
+	init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+	if (max_pfn > max_low_pfn)
+		max_low_pfn = max_pfn;
+}
+#else
+static void __init early_x86_numa_init_mapping(void)
+{
+	init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
+	early_ioremap_page_table_range_init();
+}
+#endif
+
 void __init early_initmem_init(void)
 {
 	early_x86_numa_init();
+
+	early_x86_numa_init_mapping();
+
+	load_cr3(swapper_pg_dir);
+	__flush_tlb_all();
+
+	early_memtest(0, max_pfn_mapped<<PAGE_SHIFT);
 }
 
 void __init x86_numa_init(void)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [Part1 PATCH v5 22/22] x86, mm, numa: Put pagetable on local node ram for 64bit
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (20 preceding siblings ...)
  2013-06-13 13:03 ` [Part1 PATCH v5 21/22] x86, mm: Make init_mem_mapping be able to be called several times Tang Chen
@ 2013-06-13 13:03 ` Tang Chen
  2013-06-14 21:34   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-18  2:03 ` [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tejun Heo
                   ` (2 subsequent siblings)
  24 siblings, 1 reply; 87+ messages in thread
From: Tang Chen @ 2013-06-13 13:03 UTC (permalink / raw)
  To: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm, Pekka Enberg, Jacob Shin,
	Konrad Rzeszutek Wilk

From: Yinghai Lu <yinghai@kernel.org>

If node with ram is hotplugable, memory for local node page
table and vmemmap should be on the local node ram.

This patch is some kind of refreshment of
| commit 1411e0ec3123ae4c4ead6bfc9fe3ee5a3ae5c327
| Date:   Mon Dec 27 16:48:17 2010 -0800
|
|    x86-64, numa: Put pgtable to local node memory
That was reverted before.

We have reason to reintroduce it to improve performance when
using memory hotplug.

Calling init_mem_mapping() in early_initmem_init() for each
node. alloc_low_pages() will allocate page table in following
order:
	BRK, local node, low range

So page table will be on low range or local nodes.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Jacob Shin <jacob.shin@amd.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/numa.c |   34 +++++++++++++++++++++++++++++++++-
 1 files changed, 33 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 9b18ee8..5adf803 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -670,7 +670,39 @@ static void __init early_x86_numa_init(void)
 #ifdef CONFIG_X86_64
 static void __init early_x86_numa_init_mapping(void)
 {
-	init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+	unsigned long last_start = 0, last_end = 0;
+	struct numa_meminfo *mi = &numa_meminfo;
+	unsigned long start, end;
+	int last_nid = -1;
+	int i, nid;
+
+	for (i = 0; i < mi->nr_blks; i++) {
+		nid   = mi->blk[i].nid;
+		start = mi->blk[i].start;
+		end   = mi->blk[i].end;
+
+		if (last_nid == nid) {
+			last_end = end;
+			continue;
+		}
+
+		/* other nid now */
+		if (last_nid >= 0) {
+			printk(KERN_DEBUG "Node %d: [mem %#016lx-%#016lx]\n",
+					last_nid, last_start, last_end - 1);
+			init_mem_mapping(last_start, last_end);
+		}
+
+		/* for next nid */
+		last_nid   = nid;
+		last_start = start;
+		last_end   = end;
+	}
+	/* last one */
+	printk(KERN_DEBUG "Node %d: [mem %#016lx-%#016lx]\n",
+			last_nid, last_start, last_end - 1);
+	init_mem_mapping(last_start, last_end);
+
 	if (max_pfn > max_low_pfn)
 		max_low_pfn = max_pfn;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 21/22] x86, mm: Make init_mem_mapping be able to be called several times
  2013-06-13 13:03 ` [Part1 PATCH v5 21/22] x86, mm: Make init_mem_mapping be able to be called several times Tang Chen
@ 2013-06-13 18:35   ` Konrad Rzeszutek Wilk
  2013-06-13 22:47     ` Yinghai Lu
  2013-06-14 21:33   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  1 sibling, 1 reply; 87+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-06-13 18:35 UTC (permalink / raw)
  To: Tang Chen, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, mgorman, minchan, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner, prarit
  Cc: x86, linux-doc, linux-kernel, linux-mm, Pekka Enberg, Jacob Shin

Tang Chen <tangchen@cn.fujitsu.com> wrote:

>From: Yinghai Lu <yinghai@kernel.org>
>
>Prepare to put page table on local nodes.
>
>Move calling of init_mem_mapping() to early_initmem_init().
>
>Rework alloc_low_pages to allocate page table in following order:
>	BRK, local node, low range
>
>Still only load_cr3 one time, otherwise we would break xen 64bit again.
>



Sigh..  Can that comment on Xen be removed please.  The issue was fixed last release  and I believe I already asked to remove that comment as it is not true anymore. 
-- 
Sent from my Android phone. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 21/22] x86, mm: Make init_mem_mapping be able to be called several times
  2013-06-13 18:35   ` Konrad Rzeszutek Wilk
@ 2013-06-13 22:47     ` Yinghai Lu
  2013-06-14  5:08       ` Tang Chen
  0 siblings, 1 reply; 87+ messages in thread
From: Yinghai Lu @ 2013-06-13 22:47 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Tang Chen, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Morton, Tejun Heo, Thomas Renninger, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Mel Gorman,
	Minchan Kim, mina86, gong.chen, vasilis.liaskovitis, lwoodman,
	Rik van Riel, jweiner, Prarit Bhargava, the arch/x86 maintainers,
	linux-doc, Linux Kernel Mailing List, Linux MM, Pekka Enberg,
	Jacob Shin

On Thu, Jun 13, 2013 at 11:35 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> Tang Chen <tangchen@cn.fujitsu.com> wrote:
>
>>From: Yinghai Lu <yinghai@kernel.org>
>>
>>Prepare to put page table on local nodes.
>>
>>Move calling of init_mem_mapping() to early_initmem_init().
>>
>>Rework alloc_low_pages to allocate page table in following order:
>>       BRK, local node, low range
>>
>>Still only load_cr3 one time, otherwise we would break xen 64bit again.
>>
>
>
>
> Sigh..  Can that comment on Xen be removed please.  The issue was fixed last release  and I believe I already asked to remove that comment as it is not true anymore.

Sorry about that again, I thought I removed that already.

Yinghai

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 21/22] x86, mm: Make init_mem_mapping be able to be called several times
  2013-06-13 22:47     ` Yinghai Lu
@ 2013-06-14  5:08       ` Tang Chen
  0 siblings, 0 replies; 87+ messages in thread
From: Tang Chen @ 2013-06-14  5:08 UTC (permalink / raw)
  To: Yinghai Lu, Konrad Rzeszutek Wilk
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Tejun Heo, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Mel Gorman, Minchan Kim,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel,
	jweiner, Prarit Bhargava, the arch/x86 maintainers, linux-doc,
	Linux Kernel Mailing List, Linux MM, Pekka Enberg, Jacob Shin

On 06/14/2013 06:47 AM, Yinghai Lu wrote:
> On Thu, Jun 13, 2013 at 11:35 AM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com>  wrote:
>> Tang Chen<tangchen@cn.fujitsu.com>  wrote:
>>
>>> From: Yinghai Lu<yinghai@kernel.org>
>>>
>>> Prepare to put page table on local nodes.
>>>
>>> Move calling of init_mem_mapping() to early_initmem_init().
>>>
>>> Rework alloc_low_pages to allocate page table in following order:
>>>        BRK, local node, low range
>>>
>>> Still only load_cr3 one time, otherwise we would break xen 64bit again.
>>>
>>
>>
>>
>> Sigh..  Can that comment on Xen be removed please.  The issue was fixed last release  and I believe I already asked to remove that comment as it is not true anymore.
>
> Sorry about that again, I thought I removed that already.

Sorry I didn't notice that. Will remove it if Yinghai or I resend this 
patch-set.

Thanks.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86: Change get_ramdisk_{image|size}() to global
  2013-06-13 13:02 ` [Part1 PATCH v5 01/22] x86: Change get_ramdisk_{image|size}() to global Tang Chen
@ 2013-06-14 21:30   ` tip-bot for Yinghai Lu
  0 siblings, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, yinghai, tj, tangchen, tglx, trenn, hpa

Commit-ID:  d9518cb78d6d5ee6b24eb7ee2f4b108ec30e174e
Gitweb:     http://git.kernel.org/tip/d9518cb78d6d5ee6b24eb7ee2f4b108ec30e174e
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:02:48 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:03:26 -0700

x86: Change get_ramdisk_{image|size}() to global

This patch does two things:
1. Change get_ramdisk_image() and get_ramdisk_size() to global.
2. Make get_ramdisk_image() and get_ramdisk_size() take a
   boot_params pointer parameter.

The whole patch-set tries to split ACPI initrd table override
procedure into two steps: finding and copying.
The finding step is done at head_32.S and head64.c stage. So we
need to call get_ramdisk_image() and get_ramdisk_size() in these
two files.

And also, in head_32.S, it can only access boot_params via physical
address during 32bit flat mode, so make get_ramdisk_image() and
get_ramdisk_size() take a boot_params pointer, so that we can pass
a physical address to code in head_32.S.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-2-git-send-email-tangchen@cn.fujitsu.com
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/include/asm/setup.h |  3 +++
 arch/x86/kernel/setup.c      | 28 ++++++++++++++--------------
 2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index b7bf350..4f71d48 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -106,6 +106,9 @@ void *extend_brk(size_t size, size_t align);
 	RESERVE_BRK(name, sizeof(type) * entries)
 
 extern void probe_roms(void);
+u64 get_ramdisk_image(struct boot_params *bp);
+u64 get_ramdisk_size(struct boot_params *bp);
+
 #ifdef __i386__
 
 void __init i386_start_kernel(void);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 56f7fcf..66ab495 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -297,19 +297,19 @@ static void __init reserve_brk(void)
 
 #ifdef CONFIG_BLK_DEV_INITRD
 
-static u64 __init get_ramdisk_image(void)
+u64 __init get_ramdisk_image(struct boot_params *bp)
 {
-	u64 ramdisk_image = boot_params.hdr.ramdisk_image;
+	u64 ramdisk_image = bp->hdr.ramdisk_image;
 
-	ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;
+	ramdisk_image |= (u64)bp->ext_ramdisk_image << 32;
 
 	return ramdisk_image;
 }
-static u64 __init get_ramdisk_size(void)
+u64 __init get_ramdisk_size(struct boot_params *bp)
 {
-	u64 ramdisk_size = boot_params.hdr.ramdisk_size;
+	u64 ramdisk_size = bp->hdr.ramdisk_size;
 
-	ramdisk_size |= (u64)boot_params.ext_ramdisk_size << 32;
+	ramdisk_size |= (u64)bp->ext_ramdisk_size << 32;
 
 	return ramdisk_size;
 }
@@ -318,8 +318,8 @@ static u64 __init get_ramdisk_size(void)
 static void __init relocate_initrd(void)
 {
 	/* Assume only end is not page aligned */
-	u64 ramdisk_image = get_ramdisk_image();
-	u64 ramdisk_size  = get_ramdisk_size();
+	u64 ramdisk_image = get_ramdisk_image(&boot_params);
+	u64 ramdisk_size  = get_ramdisk_size(&boot_params);
 	u64 area_size     = PAGE_ALIGN(ramdisk_size);
 	u64 ramdisk_here;
 	unsigned long slop, clen, mapaddr;
@@ -358,8 +358,8 @@ static void __init relocate_initrd(void)
 		ramdisk_size  -= clen;
 	}
 
-	ramdisk_image = get_ramdisk_image();
-	ramdisk_size  = get_ramdisk_size();
+	ramdisk_image = get_ramdisk_image(&boot_params);
+	ramdisk_size  = get_ramdisk_size(&boot_params);
 	printk(KERN_INFO "Move RAMDISK from [mem %#010llx-%#010llx] to"
 		" [mem %#010llx-%#010llx]\n",
 		ramdisk_image, ramdisk_image + ramdisk_size - 1,
@@ -369,8 +369,8 @@ static void __init relocate_initrd(void)
 static void __init early_reserve_initrd(void)
 {
 	/* Assume only end is not page aligned */
-	u64 ramdisk_image = get_ramdisk_image();
-	u64 ramdisk_size  = get_ramdisk_size();
+	u64 ramdisk_image = get_ramdisk_image(&boot_params);
+	u64 ramdisk_size  = get_ramdisk_size(&boot_params);
 	u64 ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);
 
 	if (!boot_params.hdr.type_of_loader ||
@@ -382,8 +382,8 @@ static void __init early_reserve_initrd(void)
 static void __init reserve_initrd(void)
 {
 	/* Assume only end is not page aligned */
-	u64 ramdisk_image = get_ramdisk_image();
-	u64 ramdisk_size  = get_ramdisk_size();
+	u64 ramdisk_image = get_ramdisk_image(&boot_params);
+	u64 ramdisk_size  = get_ramdisk_size(&boot_params);
 	u64 ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);
 	u64 mapped_size;
 

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, microcode: Use common get_ramdisk_{image|size}( )
  2013-06-13 13:02 ` [Part1 PATCH v5 02/22] x86, microcode: Use common get_ramdisk_{image|size}() Tang Chen
@ 2013-06-14 21:31   ` tip-bot for Yinghai Lu
  0 siblings, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, yinghai, fenghua.yu, tangchen, tj,
	tglx, trenn, hpa

Commit-ID:  a795ab2d9c2113c63d2c9a0677012db13e746121
Gitweb:     http://git.kernel.org/tip/a795ab2d9c2113c63d2c9a0677012db13e746121
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:02:49 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:03:30 -0700

x86, microcode: Use common get_ramdisk_{image|size}()

In patch1, we change get_ramdisk_image() and get_ramdisk_size()
to global, so we can use them instead of using global variable
boot_params.

We need this to get correct ramdisk adress for 64bits bzImage
that initrd can be loaded above 4G by kexec-tools.

-v2: fix one typo that is found by Tang Chen

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-3-git-send-email-tangchen@cn.fujitsu.com
Cc: Fenghua Yu <fenghua.yu@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/kernel/microcode_intel_early.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/microcode_intel_early.c b/arch/x86/kernel/microcode_intel_early.c
index 2e9e128..54575a9 100644
--- a/arch/x86/kernel/microcode_intel_early.c
+++ b/arch/x86/kernel/microcode_intel_early.c
@@ -743,8 +743,8 @@ load_ucode_intel_bsp(void)
 	struct boot_params *boot_params_p;
 
 	boot_params_p = (struct boot_params *)__pa_nodebug(&boot_params);
-	ramdisk_image = boot_params_p->hdr.ramdisk_image;
-	ramdisk_size  = boot_params_p->hdr.ramdisk_size;
+	ramdisk_image = get_ramdisk_image(boot_params_p);
+	ramdisk_size  = get_ramdisk_size(boot_params_p);
 	initrd_start_early = ramdisk_image;
 	initrd_end_early = initrd_start_early + ramdisk_size;
 
@@ -753,8 +753,8 @@ load_ucode_intel_bsp(void)
 		(unsigned long *)__pa_nodebug(&mc_saved_in_initrd),
 		initrd_start_early, initrd_end_early, &uci);
 #else
-	ramdisk_image = boot_params.hdr.ramdisk_image;
-	ramdisk_size  = boot_params.hdr.ramdisk_size;
+	ramdisk_image = get_ramdisk_image(&boot_params);
+	ramdisk_size  = get_ramdisk_size(&boot_params);
 	initrd_start_early = ramdisk_image + PAGE_OFFSET;
 	initrd_end_early = initrd_start_early + ramdisk_size;
 

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, ACPI, mm: Kill max_low_pfn_mapped
  2013-06-13 13:02 ` [Part1 PATCH v5 03/22] x86, ACPI, mm: Kill max_low_pfn_mapped Tang Chen
@ 2013-06-14 21:31   ` tip-bot for Yinghai Lu
  2013-06-17 21:04   ` [Part1 PATCH v5 03/22] " Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, yinghai, penberg, tangchen, jacob.shin,
	trenn, tglx, hpa, rjw

Commit-ID:  b19feb388cdee35bf991e4977d1936f6d23c75a8
Gitweb:     http://git.kernel.org/tip/b19feb388cdee35bf991e4977d1936f6d23c75a8
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:02:50 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:03:37 -0700

x86, ACPI, mm: Kill max_low_pfn_mapped

Now we have pfn_mapped[] array, and max_low_pfn_mapped should not
be used anymore. Users should use pfn_mapped[] or just
1UL<<(32-PAGE_SHIFT) instead.

The only user of max_low_pfn_mapped is ACPI_INITRD_TABLE_OVERRIDE.
We could change to use 1U<<(32_PAGE_SHIFT) with it, aka under 4G.

Known problem:
There is another user of max_low_pfn_mapped: i915 device driver.
But the code is commented out by a pair of "#if 0 ... #endif".
Not sure why the driver developers want to do that.

-v2: Leave alone max_low_pfn_mapped in i915 code according to tj.

Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-4-git-send-email-tangchen@cn.fujitsu.com
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Jacob Shin <jacob.shin@amd.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-acpi@vger.kernel.org
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/include/asm/page_types.h | 1 -
 arch/x86/kernel/setup.c           | 4 +---
 arch/x86/mm/init.c                | 4 ----
 drivers/acpi/osl.c                | 6 +++---
 4 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index 54c9787..b012b82 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -43,7 +43,6 @@
 
 extern int devmem_is_allowed(unsigned long pagenr);
 
-extern unsigned long max_low_pfn_mapped;
 extern unsigned long max_pfn_mapped;
 
 static inline phys_addr_t get_max_mapped(void)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 66ab495..6ca5f2c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -112,13 +112,11 @@
 #include <asm/prom.h>
 
 /*
- * max_low_pfn_mapped: highest direct mapped pfn under 4GB
- * max_pfn_mapped:     highest direct mapped pfn over 4GB
+ * max_pfn_mapped:     highest direct mapped pfn
  *
  * The direct mapping only covers E820_RAM regions, so the ranges and gaps are
  * represented by pfn_mapped
  */
-unsigned long max_low_pfn_mapped;
 unsigned long max_pfn_mapped;
 
 #ifdef CONFIG_DMI
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index eaac174..8554656 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -313,10 +313,6 @@ static void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
 	nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_X_MAX);
 
 	max_pfn_mapped = max(max_pfn_mapped, end_pfn);
-
-	if (start_pfn < (1UL<<(32-PAGE_SHIFT)))
-		max_low_pfn_mapped = max(max_low_pfn_mapped,
-					 min(end_pfn, 1UL<<(32-PAGE_SHIFT)));
 }
 
 bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index e721863..93e3194 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -624,9 +624,9 @@ void __init acpi_initrd_override(void *data, size_t size)
 	if (table_nr == 0)
 		return;
 
-	acpi_tables_addr =
-		memblock_find_in_range(0, max_low_pfn_mapped << PAGE_SHIFT,
-				       all_tables_size, PAGE_SIZE);
+	/* under 4G at first, then above 4G */
+	acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
+					all_tables_size, PAGE_SIZE);
 	if (!acpi_tables_addr) {
 		WARN_ON(1);
 		return;

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, ACPI: Search buffer above 4GB in a second try for acpi initrd table override
  2013-06-13 13:02 ` [Part1 PATCH v5 04/22] x86, ACPI: Search buffer above 4GB in a second try for acpi initrd table override Tang Chen
@ 2013-06-14 21:31   ` tip-bot for Yinghai Lu
  2013-06-17 21:06   ` [Part1 PATCH v5 04/22] " Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, yinghai, tangchen, tglx, trenn, hpa, rjw

Commit-ID:  5a2c7ccc51a2bc42d96e05dd3d920ef0c09eb730
Gitweb:     http://git.kernel.org/tip/5a2c7ccc51a2bc42d96e05dd3d920ef0c09eb730
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:02:51 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:03:49 -0700

x86, ACPI: Search buffer above 4GB in a second try for acpi initrd table override

Now we only search buffer for new acpi tables in initrd under
4GB. In some case, like user use memmap to exclude all low ram,
we may not find range for it under 4GB. So do second try to
search for buffer above 4GB.

Since later accessing to the tables is using early_ioremap(),
using memory above 4GB is OK.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-5-git-send-email-tangchen@cn.fujitsu.com
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: linux-acpi@vger.kernel.org
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 drivers/acpi/osl.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 93e3194..42c48fc 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -627,6 +627,10 @@ void __init acpi_initrd_override(void *data, size_t size)
 	/* under 4G at first, then above 4G */
 	acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
 					all_tables_size, PAGE_SIZE);
+	if (!acpi_tables_addr)
+		acpi_tables_addr = memblock_find_in_range(0,
+					~(phys_addr_t)0,
+					all_tables_size, PAGE_SIZE);
 	if (!acpi_tables_addr) {
 		WARN_ON(1);
 		return;

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, ACPI: Increase acpi initrd override tables number limit
  2013-06-13 13:02 ` [Part1 PATCH v5 05/22] x86, ACPI: Increase acpi initrd override tables number limit Tang Chen
@ 2013-06-14 21:31   ` tip-bot for Yinghai Lu
  0 siblings, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, yinghai, tangchen, tj, tglx, trenn, hpa, rjw

Commit-ID:  7a309b8608958c40bb7f82ac83532a44b09deae2
Gitweb:     http://git.kernel.org/tip/7a309b8608958c40bb7f82ac83532a44b09deae2
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:02:52 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:03:57 -0700

x86, ACPI: Increase acpi initrd override tables number limit

Current number of acpi tables in initrd is limited to 10, which is
too small. 64 would be good enough as we have 35 sigs and could
have several SSDTs.

Two problems in current code prevent us from increasing the 10 tables limit:
1. cpio file info array is put in stack, as every element is 32 bytes, we
   could run out of stack if we increase the array size to 64.
   So we can move it out from stack, and make it global and put it in
   __initdata section.
2. early_ioremap only can remap 256kb one time. Current code is mapping
   10 tables one time. If we increase that limit, the whole size could be
   more than 256kb, and early_ioremap will fail.
   So we can map the tables one by one during copying, instead of mapping
   all of them at one time.

-v2: According to tj, split it out to separated patch, also
     rename array name to acpi_initrd_files.
-v3: Add some comments about mapping table one by one during copying
     per tj.

Signed-off-by: Yinghai <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-6-git-send-email-tangchen@cn.fujitsu.com
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: linux-acpi@vger.kernel.org
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 drivers/acpi/osl.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 42c48fc..53dd490 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -569,8 +569,8 @@ static const char * const table_sigs[] = {
 
 #define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)
 
-/* Must not increase 10 or needs code modification below */
-#define ACPI_OVERRIDE_TABLES 10
+#define ACPI_OVERRIDE_TABLES 64
+static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
 
 void __init acpi_initrd_override(void *data, size_t size)
 {
@@ -579,7 +579,6 @@ void __init acpi_initrd_override(void *data, size_t size)
 	struct acpi_table_header *table;
 	char cpio_path[32] = "kernel/firmware/acpi/";
 	struct cpio_data file;
-	struct cpio_data early_initrd_files[ACPI_OVERRIDE_TABLES];
 	char *p;
 
 	if (data == NULL || size == 0)
@@ -617,8 +616,8 @@ void __init acpi_initrd_override(void *data, size_t size)
 			table->signature, cpio_path, file.name, table->length);
 
 		all_tables_size += table->length;
-		early_initrd_files[table_nr].data = file.data;
-		early_initrd_files[table_nr].size = file.size;
+		acpi_initrd_files[table_nr].data = file.data;
+		acpi_initrd_files[table_nr].size = file.size;
 		table_nr++;
 	}
 	if (table_nr == 0)
@@ -648,14 +647,19 @@ void __init acpi_initrd_override(void *data, size_t size)
 	memblock_reserve(acpi_tables_addr, all_tables_size);
 	arch_reserve_mem_area(acpi_tables_addr, all_tables_size);
 
-	p = early_ioremap(acpi_tables_addr, all_tables_size);
-
+	/*
+	 * early_ioremap can only remap 256KB at one time. If we map all the
+	 * tables at one time, we will hit the limit. So we need to map tables
+	 * one by one during copying.
+	 */
 	for (no = 0; no < table_nr; no++) {
-		memcpy(p + total_offset, early_initrd_files[no].data,
-		       early_initrd_files[no].size);
-		total_offset += early_initrd_files[no].size;
+		phys_addr_t size = acpi_initrd_files[no].size;
+
+		p = early_ioremap(acpi_tables_addr + total_offset, size);
+		memcpy(p, acpi_initrd_files[no].data, size);
+		early_iounmap(p, size);
+		total_offset += size;
 	}
-	early_iounmap(p, all_tables_size);
 }
 #endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */
 

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, ACPI: Split acpi_initrd_override() into find/ copy two steps
  2013-06-13 13:02 ` [Part1 PATCH v5 06/22] x86, ACPI: Split acpi_initrd_override() into find/copy two steps Tang Chen
@ 2013-06-14 21:31   ` tip-bot for Yinghai Lu
  0 siblings, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, yinghai, penberg, tangchen, tj,
	jacob.shin, trenn, tglx, hpa, rjw

Commit-ID:  29206daa5831dc5b435a06387fd702875401c6bd
Gitweb:     http://git.kernel.org/tip/29206daa5831dc5b435a06387fd702875401c6bd
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:02:53 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:04:01 -0700

x86, ACPI: Split acpi_initrd_override() into find/copy two steps

To parse SRAT before memblock starts to work, we need to move acpi table
probing procedure earlier. But acpi_initrd_table_override procedure must
be executed before acpi table probing. So we need to move it earlier too,
which means to move acpi_initrd_table_override procedure before memblock
starts to work.

But acpi_initrd_table_override procedure needs memblock to allocate buffer
for ACPI tables. To solve this problem, we need to split acpi_initrd_override()
procedure into two steps: finding and copying.
Find should be as early as possible. Copy should be after memblock is ready.

Currently, acpi_initrd_table_override procedure is executed after
init_mem_mapping() and relocate_initrd(), so it can scan initrd and copy
acpi tables with kernel virtual addresses of initrd.

Once we split it into finding and copying steps, it could be done like the
following:

Finding could be done in head_32.S and head64.c, just like microcode early
scanning. In head_32.S, it is 32bit flat mode, we don't need to setup page
table to access it. In head64.c, #PF set page table could help us to access
initrd with kernel low mapping addresses.

Copying need to be done just after memblock is ready, because it needs to
allocate buffer for new acpi tables with memblock.
Also it should be done before probing acpi tables, and we need early_ioremap
to access source and target ranges, as init_mem_mapping is not called yet.

While a dummy version of acpi_initrd_override() was defined when
!CONFIG_ACPI_INITRD_TABLE_OVERRIDE, the prototype and dummy version were
conditionalized inside CONFIG_ACPI. This forced setup_arch() to have its own
#ifdefs around acpi_initrd_override() as otherwise build would fail when
!CONFIG_ACPI. Move the prototypes and dummy implementations of the newly
split functions out of CONFIG_ACPI block in acpi.h so that we can throw away
the #ifdefs from its users.

-v2: Split one patch out according to tj.
     also don't pass table_nr around.
-v3: Add Tj's changelog about moving down to #idef in acpi.h to
     avoid #idef in setup.c

Signed-off-by: Yinghai <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-7-git-send-email-tangchen@cn.fujitsu.com
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Jacob Shin <jacob.shin@amd.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: linux-acpi@vger.kernel.org
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/kernel/setup.c |  6 +++---
 drivers/acpi/osl.c      | 18 +++++++++++++-----
 include/linux/acpi.h    | 16 ++++++++--------
 3 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 6ca5f2c..42f584c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1119,9 +1119,9 @@ void __init setup_arch(char **cmdline_p)
 
 	reserve_initrd();
 
-#if defined(CONFIG_ACPI) && defined(CONFIG_BLK_DEV_INITRD)
-	acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
-#endif
+	acpi_initrd_override_find((void *)initrd_start,
+					initrd_end - initrd_start);
+	acpi_initrd_override_copy();
 
 	reserve_crashkernel();
 
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 53dd490..6ab6c54 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -572,14 +572,13 @@ static const char * const table_sigs[] = {
 #define ACPI_OVERRIDE_TABLES 64
 static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
 
-void __init acpi_initrd_override(void *data, size_t size)
+void __init acpi_initrd_override_find(void *data, size_t size)
 {
-	int sig, no, table_nr = 0, total_offset = 0;
+	int sig, no, table_nr = 0;
 	long offset = 0;
 	struct acpi_table_header *table;
 	char cpio_path[32] = "kernel/firmware/acpi/";
 	struct cpio_data file;
-	char *p;
 
 	if (data == NULL || size == 0)
 		return;
@@ -620,7 +619,14 @@ void __init acpi_initrd_override(void *data, size_t size)
 		acpi_initrd_files[table_nr].size = file.size;
 		table_nr++;
 	}
-	if (table_nr == 0)
+}
+
+void __init acpi_initrd_override_copy(void)
+{
+	int no, total_offset = 0;
+	char *p;
+
+	if (!all_tables_size)
 		return;
 
 	/* under 4G at first, then above 4G */
@@ -652,9 +658,11 @@ void __init acpi_initrd_override(void *data, size_t size)
 	 * tables at one time, we will hit the limit. So we need to map tables
 	 * one by one during copying.
 	 */
-	for (no = 0; no < table_nr; no++) {
+	for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
 		phys_addr_t size = acpi_initrd_files[no].size;
 
+		if (!size)
+			break;
 		p = early_ioremap(acpi_tables_addr + total_offset, size);
 		memcpy(p, acpi_initrd_files[no].data, size);
 		early_iounmap(p, size);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 17b5b59..8dd917b 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -79,14 +79,6 @@ typedef int (*acpi_tbl_table_handler)(struct acpi_table_header *table);
 typedef int (*acpi_tbl_entry_handler)(struct acpi_subtable_header *header,
 				      const unsigned long end);
 
-#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
-void acpi_initrd_override(void *data, size_t size);
-#else
-static inline void acpi_initrd_override(void *data, size_t size)
-{
-}
-#endif
-
 char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
 void __acpi_unmap_table(char *map, unsigned long size);
 int early_acpi_boot_init(void);
@@ -476,6 +468,14 @@ static inline bool acpi_driver_match_device(struct device *dev,
 
 #endif	/* !CONFIG_ACPI */
 
+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void acpi_initrd_override_find(void *data, size_t size);
+void acpi_initrd_override_copy(void);
+#else
+static inline void acpi_initrd_override_find(void *data, size_t size) { }
+static inline void acpi_initrd_override_copy(void) { }
+#endif
+
 #ifdef CONFIG_ACPI
 void acpi_os_set_prepare_sleep(int (*func)(u8 sleep_state,
 			       u32 pm1a_ctrl,  u32 pm1b_ctrl));

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, ACPI: Store override acpi tables phys addr in cpio files info array
  2013-06-13 13:02 ` [Part1 PATCH v5 07/22] x86, ACPI: Store override acpi tables phys addr in cpio files info array Tang Chen
@ 2013-06-14 21:31   ` tip-bot for Yinghai Lu
  2013-06-17 23:38   ` [Part1 PATCH v5 07/22] " Tejun Heo
  2013-06-17 23:52   ` Tejun Heo
  2 siblings, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, yinghai, tangchen, tglx, trenn, hpa, rjw

Commit-ID:  8ec3ffdf3921675aeae8e9c2b42be3c0b700f153
Gitweb:     http://git.kernel.org/tip/8ec3ffdf3921675aeae8e9c2b42be3c0b700f153
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:02:54 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:04:04 -0700

x86, ACPI: Store override acpi tables phys addr in cpio files info array

This patch introduces a file_pos struct to store physaddr. And then changes
acpi_initrd_files[] to file_pos type. Then store physaddr of ACPI tables
in acpi_initrd_files[].

For finding, we will find ACPI tables with physaddr during 32bit flat mode
in head_32.S, because at that time we don't need to setup page table to
access initrd.

For copying, we could use early_ioremap() with physaddr directly before
memory mapping is set.

To keep 32bit and 64bit platforms consistent, use phys_addr for all.

-v2: introduce file_pos to save physaddr instead of abusing cpio_data
     which tj is not happy with.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-8-git-send-email-tangchen@cn.fujitsu.com
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: linux-acpi@vger.kernel.org
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 drivers/acpi/osl.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 6ab6c54..42f79e3 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -570,7 +570,11 @@ static const char * const table_sigs[] = {
 #define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)
 
 #define ACPI_OVERRIDE_TABLES 64
-static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
+struct file_pos {
+	phys_addr_t data;
+	phys_addr_t size;
+};
+static struct file_pos __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
 
 void __init acpi_initrd_override_find(void *data, size_t size)
 {
@@ -615,7 +619,7 @@ void __init acpi_initrd_override_find(void *data, size_t size)
 			table->signature, cpio_path, file.name, table->length);
 
 		all_tables_size += table->length;
-		acpi_initrd_files[table_nr].data = file.data;
+		acpi_initrd_files[table_nr].data = __pa_nodebug(file.data);
 		acpi_initrd_files[table_nr].size = file.size;
 		table_nr++;
 	}
@@ -624,7 +628,7 @@ void __init acpi_initrd_override_find(void *data, size_t size)
 void __init acpi_initrd_override_copy(void)
 {
 	int no, total_offset = 0;
-	char *p;
+	char *p, *q;
 
 	if (!all_tables_size)
 		return;
@@ -659,12 +663,15 @@ void __init acpi_initrd_override_copy(void)
 	 * one by one during copying.
 	 */
 	for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
+		phys_addr_t addr = acpi_initrd_files[no].data;
 		phys_addr_t size = acpi_initrd_files[no].size;
 
 		if (!size)
 			break;
+		q = early_ioremap(addr, size);
 		p = early_ioremap(acpi_tables_addr + total_offset, size);
-		memcpy(p, acpi_initrd_files[no].data, size);
+		memcpy(p, q, size);
+		early_iounmap(q, size);
 		early_iounmap(p, size);
 		total_offset += size;
 	}

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode
  2013-06-13 13:02 ` [Part1 PATCH v5 08/22] x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode Tang Chen
@ 2013-06-14 21:31   ` tip-bot for Yinghai Lu
  2013-06-18  0:07   ` [Part1 PATCH v5 08/22] " Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, yinghai, penberg, tangchen, jacob.shin,
	trenn, tglx, hpa, rjw

Commit-ID:  56cb257fee5a6e381452bc11fe47357b04cd085e
Gitweb:     http://git.kernel.org/tip/56cb257fee5a6e381452bc11fe47357b04cd085e
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:02:55 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:04:06 -0700

x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode

For finding procedure, it would be easy to access initrd in 32bit flat
mode, as we don't need to setup page table. That is from head_32.S, and
microcode updating already use this trick.

This patch does the following:

1. Change acpi_initrd_override_find to use phys to access global variables.

2. Pass a bool parameter "is_phys" to acpi_initrd_override_find() because
   we cannot tell if it is a pa or a va through the address itself with
   32bit. Boot loader could load initrd above max_low_pfn.

3. Put table_sigs[] on stack, otherwise it is too messy to change string
   array to physaddr and still keep offset calculating correct. The size is
   about 36x4 bytes, and it is small to settle in stack.

4. Also rewrite the MACRO INVALID_TABLE to be in a do {...} while(0) loop
   so that it is more readable.

NOTE: Don't call printk as it uses global variables, so delay print
      during copying.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-9-git-send-email-tangchen@cn.fujitsu.com
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Jacob Shin <jacob.shin@amd.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: linux-acpi@vger.kernel.org
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/kernel/setup.c |  2 +-
 drivers/acpi/osl.c      | 85 ++++++++++++++++++++++++++++++++++---------------
 include/linux/acpi.h    |  5 +--
 3 files changed, 63 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 42f584c..142e042 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1120,7 +1120,7 @@ void __init setup_arch(char **cmdline_p)
 	reserve_initrd();
 
 	acpi_initrd_override_find((void *)initrd_start,
-					initrd_end - initrd_start);
+					initrd_end - initrd_start, false);
 	acpi_initrd_override_copy();
 
 	reserve_crashkernel();
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 42f79e3..23578e8 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -551,21 +551,9 @@ u8 __init acpi_table_checksum(u8 *buffer, u32 length)
 	return sum;
 }
 
-/* All but ACPI_SIG_RSDP and ACPI_SIG_FACS: */
-static const char * const table_sigs[] = {
-	ACPI_SIG_BERT, ACPI_SIG_CPEP, ACPI_SIG_ECDT, ACPI_SIG_EINJ,
-	ACPI_SIG_ERST, ACPI_SIG_HEST, ACPI_SIG_MADT, ACPI_SIG_MSCT,
-	ACPI_SIG_SBST, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_ASF,
-	ACPI_SIG_BOOT, ACPI_SIG_DBGP, ACPI_SIG_DMAR, ACPI_SIG_HPET,
-	ACPI_SIG_IBFT, ACPI_SIG_IVRS, ACPI_SIG_MCFG, ACPI_SIG_MCHI,
-	ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
-	ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
-	ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
-	ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };
-
 /* Non-fatal errors: Affected tables/files are ignored */
 #define INVALID_TABLE(x, path, name)					\
-	{ pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); continue; }
+	do { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); } while (0)
 
 #define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)
 
@@ -576,17 +564,45 @@ struct file_pos {
 };
 static struct file_pos __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
 
-void __init acpi_initrd_override_find(void *data, size_t size)
+/*
+ * acpi_initrd_override_find() is called from head_32.S and head64.c.
+ * head_32.S calling path is with 32bit flat mode, so we can access
+ * initrd early without setting pagetable or relocating initrd. For
+ * global variables accessing, we need to use phys address instead of
+ * kernel virtual address, try to put table_sigs string array in stack,
+ * so avoid switching for it.
+ * Also don't call printk as it uses global variables.
+ */
+void __init acpi_initrd_override_find(void *data, size_t size, bool is_phys)
 {
 	int sig, no, table_nr = 0;
 	long offset = 0;
 	struct acpi_table_header *table;
 	char cpio_path[32] = "kernel/firmware/acpi/";
 	struct cpio_data file;
+	struct file_pos *files = acpi_initrd_files;
+	int *all_tables_size_p = &all_tables_size;
+
+	/* All but ACPI_SIG_RSDP and ACPI_SIG_FACS: */
+	char *table_sigs[] = {
+		ACPI_SIG_BERT, ACPI_SIG_CPEP, ACPI_SIG_ECDT, ACPI_SIG_EINJ,
+		ACPI_SIG_ERST, ACPI_SIG_HEST, ACPI_SIG_MADT, ACPI_SIG_MSCT,
+		ACPI_SIG_SBST, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_ASF,
+		ACPI_SIG_BOOT, ACPI_SIG_DBGP, ACPI_SIG_DMAR, ACPI_SIG_HPET,
+		ACPI_SIG_IBFT, ACPI_SIG_IVRS, ACPI_SIG_MCFG, ACPI_SIG_MCHI,
+		ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
+		ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
+		ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
+		ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };
 
 	if (data == NULL || size == 0)
 		return;
 
+	if (is_phys) {
+		files = (struct file_pos *)__pa_symbol(acpi_initrd_files);
+		all_tables_size_p = (int *)__pa_symbol(&all_tables_size);
+	}
+
 	for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
 		file = find_cpio_data(cpio_path, data, size, &offset);
 		if (!file.data)
@@ -595,9 +611,12 @@ void __init acpi_initrd_override_find(void *data, size_t size)
 		data += offset;
 		size -= offset;
 
-		if (file.size < sizeof(struct acpi_table_header))
-			INVALID_TABLE("Table smaller than ACPI header",
+		if (file.size < sizeof(struct acpi_table_header)) {
+			if (!is_phys)
+				INVALID_TABLE("Table smaller than ACPI header",
 				      cpio_path, file.name);
+			continue;
+		}
 
 		table = file.data;
 
@@ -605,22 +624,33 @@ void __init acpi_initrd_override_find(void *data, size_t size)
 			if (!memcmp(table->signature, table_sigs[sig], 4))
 				break;
 
-		if (!table_sigs[sig])
-			INVALID_TABLE("Unknown signature",
+		if (!table_sigs[sig]) {
+			if (!is_phys)
+				 INVALID_TABLE("Unknown signature",
 				      cpio_path, file.name);
-		if (file.size != table->length)
-			INVALID_TABLE("File length does not match table length",
+			continue;
+		}
+		if (file.size != table->length) {
+			if (!is_phys)
+				INVALID_TABLE("File length does not match table length",
 				      cpio_path, file.name);
-		if (acpi_table_checksum(file.data, table->length))
-			INVALID_TABLE("Bad table checksum",
+			continue;
+		}
+		if (acpi_table_checksum(file.data, table->length)) {
+			if (!is_phys)
+				INVALID_TABLE("Bad table checksum",
 				      cpio_path, file.name);
+			continue;
+		}
 
-		pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
+		if (!is_phys)
+			pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
 			table->signature, cpio_path, file.name, table->length);
 
-		all_tables_size += table->length;
-		acpi_initrd_files[table_nr].data = __pa_nodebug(file.data);
-		acpi_initrd_files[table_nr].size = file.size;
+		(*all_tables_size_p) += table->length;
+		files[table_nr].data = is_phys ? (phys_addr_t)file.data :
+						  __pa_nodebug(file.data);
+		files[table_nr].size = file.size;
 		table_nr++;
 	}
 }
@@ -670,6 +700,9 @@ void __init acpi_initrd_override_copy(void)
 			break;
 		q = early_ioremap(addr, size);
 		p = early_ioremap(acpi_tables_addr + total_offset, size);
+		pr_info("%4.4s ACPI table found in initrd [%#010llx-%#010llx]\n",
+				((struct acpi_table_header *)q)->signature,
+				(u64)addr, (u64)(addr + size - 1));
 		memcpy(p, q, size);
 		early_iounmap(q, size);
 		early_iounmap(p, size);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 8dd917b..4e3731b 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -469,10 +469,11 @@ static inline bool acpi_driver_match_device(struct device *dev,
 #endif	/* !CONFIG_ACPI */
 
 #ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
-void acpi_initrd_override_find(void *data, size_t size);
+void acpi_initrd_override_find(void *data, size_t size, bool is_phys);
 void acpi_initrd_override_copy(void);
 #else
-static inline void acpi_initrd_override_find(void *data, size_t size) { }
+static inline void acpi_initrd_override_find(void *data, size_t size,
+						 bool is_phys) { }
 static inline void acpi_initrd_override_copy(void) { }
 #endif
 

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c
  2013-06-13 13:02 ` [Part1 PATCH v5 09/22] x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c Tang Chen
@ 2013-06-14 21:32   ` tip-bot for Yinghai Lu
  2013-06-18  0:33   ` [Part1 PATCH v5 09/22] " Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, yinghai, penberg, tangchen, jacob.shin,
	trenn, tglx, hpa, rjw

Commit-ID:  88168dcb255f44892bcf9f6fac6aeb424471ffaa
Gitweb:     http://git.kernel.org/tip/88168dcb255f44892bcf9f6fac6aeb424471ffaa
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:02:56 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:04:50 -0700

x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c

head64.c can use the #PF handler automatic page tables to access initrd before
init mem mapping and initrd relocating.

head_32.S can use 32-bit flat mode to access initrd before init mem
mapping initrd relocating.

This patch introduces x86_acpi_override_find(), which is called from
head_32.S/head64.c, to replace acpi_initrd_override_find(). So that we
can makes 32bit and 64 bit more consistent.

-v2: use inline function in header file instead according to tj.
     also still need to keep #idef head_32.S to avoid compiling error.
-v3: need to move down reserve_initrd() after acpi_initrd_override_copy(),
     to make sure we are using right address.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-10-git-send-email-tangchen@cn.fujitsu.com
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Jacob Shin <jacob.shin@amd.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: linux-acpi@vger.kernel.org
Tested-by: Thomas Renninger <trenn@suse.de>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/include/asm/setup.h |  6 ++++++
 arch/x86/kernel/head64.c     |  2 ++
 arch/x86/kernel/head_32.S    |  4 ++++
 arch/x86/kernel/setup.c      | 34 ++++++++++++++++++++++++++++++----
 4 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 4f71d48..6f885b7 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -42,6 +42,12 @@ extern void visws_early_detect(void);
 static inline void visws_early_detect(void) { }
 #endif
 
+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void x86_acpi_override_find(void);
+#else
+static inline void x86_acpi_override_find(void) { }
+#endif
+
 extern unsigned long saved_video_mode;
 
 extern void reserve_standard_io_resources(void);
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 55b6761..229b281 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -175,6 +175,8 @@ void __init x86_64_start_kernel(char * real_mode_data)
 	if (console_loglevel == 10)
 		early_printk("Kernel alive\n");
 
+	x86_acpi_override_find();
+
 	clear_page(init_level4_pgt);
 	/* set init_level4_pgt kernel high mapping*/
 	init_level4_pgt[511] = early_level4_pgt[511];
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 73afd11..ca08f0e 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -149,6 +149,10 @@ ENTRY(startup_32)
 	call load_ucode_bsp
 #endif
 
+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+	call x86_acpi_override_find
+#endif
+
 /*
  * Initialize page tables.  This creates a PDE and a set of page
  * tables, which are located immediately beyond __brk_base.  The variable
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 142e042..d11b1b7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -421,6 +421,34 @@ static void __init reserve_initrd(void)
 }
 #endif /* CONFIG_BLK_DEV_INITRD */
 
+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void __init x86_acpi_override_find(void)
+{
+	unsigned long ramdisk_image, ramdisk_size;
+	unsigned char *p = NULL;
+
+#ifdef CONFIG_X86_32
+	struct boot_params *boot_params_p;
+
+	/*
+	 * 32bit is from head_32.S, and it is 32bit flat mode.
+	 * So need to use phys address to access global variables.
+	 */
+	boot_params_p = (struct boot_params *)__pa_nodebug(&boot_params);
+	ramdisk_image = get_ramdisk_image(boot_params_p);
+	ramdisk_size  = get_ramdisk_size(boot_params_p);
+	p = (unsigned char *)ramdisk_image;
+	acpi_initrd_override_find(p, ramdisk_size, true);
+#else
+	ramdisk_image = get_ramdisk_image(&boot_params);
+	ramdisk_size  = get_ramdisk_size(&boot_params);
+	if (ramdisk_image)
+		p = __va(ramdisk_image);
+	acpi_initrd_override_find(p, ramdisk_size, false);
+#endif
+}
+#endif
+
 static void __init parse_setup_data(void)
 {
 	struct setup_data *data;
@@ -1117,12 +1145,10 @@ void __init setup_arch(char **cmdline_p)
 	/* Allocate bigger log buffer */
 	setup_log_buf(1);
 
-	reserve_initrd();
-
-	acpi_initrd_override_find((void *)initrd_start,
-					initrd_end - initrd_start, false);
 	acpi_initrd_override_copy();
 
+	reserve_initrd();
+
 	reserve_crashkernel();
 
 	vsmp_init();

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, mm, numa: Move two functions calling on successful path later
  2013-06-13 13:02 ` [Part1 PATCH v5 10/22] x86, mm, numa: Move two functions calling on successful path later Tang Chen
@ 2013-06-14 21:32   ` tip-bot for Yinghai Lu
  2013-06-18  0:53   ` [Part1 PATCH v5 10/22] " Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:32 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, yinghai, tangchen, tglx, hpa

Commit-ID:  f5127d18677d45bdd17bb3d34e21c2a3f6b0eef6
Gitweb:     http://git.kernel.org/tip/f5127d18677d45bdd17bb3d34e21c2a3f6b0eef6
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:02:57 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:04:53 -0700

x86, mm, numa: Move two functions calling on successful path later

We need to have numa info ready before init_mem_mappingi(), so that we
can call init_mem_mapping per node, and alse trim node memory ranges to
big alignment.

Currently, parsing numa info needs to allocate some buffer and need to be
called after init_mem_mapping. So try to split parsing numa info procedure
into two steps:
	- The first step will be called before init_mem_mapping, and it
	  should not need allocate buffers.
	- The second step will cantain all the buffer related code and be
	  executed later.

At last we will have early_initmem_init() and initmem_init().

This patch implements only the first step.

setup_node_data() and numa_init_array() are only called for successful
path, so we can move these two callings to x86_numa_init(). That will also
make numa_init() smaller and more readable.

-v2: remove online_node_map clear in numa_init(), as it is only
     set in setup_node_data() at last in successful path.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-11-git-send-email-tangchen@cn.fujitsu.com
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/mm/numa.c | 69 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 39 insertions(+), 30 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index a71c4e2..07ae800 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -477,7 +477,7 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
 	unsigned long uninitialized_var(pfn_align);
-	int i, nid;
+	int i;
 
 	/* Account for nodes with cpus and no memory */
 	node_possible_map = numa_nodes_parsed;
@@ -506,24 +506,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 	if (!numa_meminfo_cover_memory(mi))
 		return -EINVAL;
 
-	/* Finally register nodes. */
-	for_each_node_mask(nid, node_possible_map) {
-		u64 start = PFN_PHYS(max_pfn);
-		u64 end = 0;
-
-		for (i = 0; i < mi->nr_blks; i++) {
-			if (nid != mi->blk[i].nid)
-				continue;
-			start = min(mi->blk[i].start, start);
-			end = max(mi->blk[i].end, end);
-		}
-
-		if (start < end)
-			setup_node_data(nid, start, end);
-	}
-
-	/* Dump memblock with node info and return. */
-	memblock_dump_all();
 	return 0;
 }
 
@@ -559,7 +541,6 @@ static int __init numa_init(int (*init_func)(void))
 
 	nodes_clear(numa_nodes_parsed);
 	nodes_clear(node_possible_map);
-	nodes_clear(node_online_map);
 	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
 	WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
 	numa_reset_distance();
@@ -577,15 +558,6 @@ static int __init numa_init(int (*init_func)(void))
 	if (ret < 0)
 		return ret;
 
-	for (i = 0; i < nr_cpu_ids; i++) {
-		int nid = early_cpu_to_node(i);
-
-		if (nid == NUMA_NO_NODE)
-			continue;
-		if (!node_online(nid))
-			numa_clear_node(i);
-	}
-	numa_init_array();
 	return 0;
 }
 
@@ -618,7 +590,7 @@ static int __init dummy_numa_init(void)
  * last fallback is dummy single node config encomapssing whole memory and
  * never fails.
  */
-void __init x86_numa_init(void)
+static void __init early_x86_numa_init(void)
 {
 	if (!numa_off) {
 #ifdef CONFIG_X86_NUMAQ
@@ -638,6 +610,43 @@ void __init x86_numa_init(void)
 	numa_init(dummy_numa_init);
 }
 
+void __init x86_numa_init(void)
+{
+	int i, nid;
+	struct numa_meminfo *mi = &numa_meminfo;
+
+	early_x86_numa_init();
+
+	/* Finally register nodes. */
+	for_each_node_mask(nid, node_possible_map) {
+		u64 start = PFN_PHYS(max_pfn);
+		u64 end = 0;
+
+		for (i = 0; i < mi->nr_blks; i++) {
+			if (nid != mi->blk[i].nid)
+				continue;
+			start = min(mi->blk[i].start, start);
+			end = max(mi->blk[i].end, end);
+		}
+
+		if (start < end)
+			setup_node_data(nid, start, end); /* online is set */
+	}
+
+	/* Dump memblock with node info */
+	memblock_dump_all();
+
+	for (i = 0; i < nr_cpu_ids; i++) {
+		int nid = early_cpu_to_node(i);
+
+		if (nid == NUMA_NO_NODE)
+			continue;
+		if (!node_online(nid))
+			numa_clear_node(i);
+	}
+	numa_init_array();
+}
+
 static __init int find_near_online_node(int node)
 {
 	int n, val;

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, mm, numa: Call numa_meminfo_cover_memory() checking early
  2013-06-13 13:02 ` [Part1 PATCH v5 11/22] x86, mm, numa: Call numa_meminfo_cover_memory() checking early Tang Chen
@ 2013-06-14 21:32   ` tip-bot for Yinghai Lu
  2013-06-18  1:05   ` [Part1 PATCH v5 11/22] " Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:32 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, yinghai, tangchen, tglx, hpa

Commit-ID:  3c5d8f9640b0c7c512434d7047c34bab976e1f9a
Gitweb:     http://git.kernel.org/tip/3c5d8f9640b0c7c512434d7047c34bab976e1f9a
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:02:58 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:04:56 -0700

x86, mm, numa: Call numa_meminfo_cover_memory() checking early

In order to seperate parsing numa info procedure into two steps,
we need to set memblock nid later, as it could change memblock
array, and possible doube memblock.memory array which will need
to allocate buffer.

We do not need to use nid in memblock to find out absent pages.
So we can move that numa_meminfo_cover_memory() early.

Also we could change __absent_pages_in_range() to static and use
absent_pages_in_range() directly.

Later we will set memblock nid only once on successful path.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-12-git-send-email-tangchen@cn.fujitsu.com
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/mm/numa.c | 7 ++++---
 include/linux/mm.h | 2 --
 mm/page_alloc.c    | 2 +-
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 07ae800..1bb565d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -457,7 +457,7 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
 		u64 s = mi->blk[i].start >> PAGE_SHIFT;
 		u64 e = mi->blk[i].end >> PAGE_SHIFT;
 		numaram += e - s;
-		numaram -= __absent_pages_in_range(mi->blk[i].nid, s, e);
+		numaram -= absent_pages_in_range(s, e);
 		if ((s64)numaram < 0)
 			numaram = 0;
 	}
@@ -485,6 +485,9 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 	if (WARN_ON(nodes_empty(node_possible_map)))
 		return -EINVAL;
 
+	if (!numa_meminfo_cover_memory(mi))
+		return -EINVAL;
+
 	for (i = 0; i < mi->nr_blks; i++) {
 		struct numa_memblk *mb = &mi->blk[i];
 		memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
@@ -503,8 +506,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 		return -EINVAL;
 	}
 #endif
-	if (!numa_meminfo_cover_memory(mi))
-		return -EINVAL;
 
 	return 0;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e0c8528..28e9470 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1385,8 +1385,6 @@ static inline unsigned long free_initmem_default(int poison)
  */
 extern void free_area_init_nodes(unsigned long *max_zone_pfn);
 unsigned long node_map_pfn_alignment(void);
-unsigned long __absent_pages_in_range(int nid, unsigned long start_pfn,
-						unsigned long end_pfn);
 extern unsigned long absent_pages_in_range(unsigned long start_pfn,
 						unsigned long end_pfn);
 extern void get_pfn_range_for_nid(unsigned int nid,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 378a15b..c427f46 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4395,7 +4395,7 @@ static unsigned long __meminit zone_spanned_pages_in_node(int nid,
  * Return the number of holes in a range on a node. If nid is MAX_NUMNODES,
  * then all holes in the requested range will be accounted for.
  */
-unsigned long __meminit __absent_pages_in_range(int nid,
+static unsigned long __meminit __absent_pages_in_range(int nid,
 				unsigned long range_start_pfn,
 				unsigned long range_end_pfn)
 {

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, mm, numa: Move node_map_pfn_alignment() to x86
  2013-06-13 13:02 ` [Part1 PATCH v5 12/22] x86, mm, numa: Move node_map_pfn_alignment() to x86 Tang Chen
@ 2013-06-14 21:32   ` tip-bot for Yinghai Lu
  2013-06-18  1:08   ` [Part1 PATCH v5 12/22] " Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:32 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, yinghai, tangchen, tglx, hpa

Commit-ID:  076d2bd696f8fc47c881a92dd1e5b203ef556f51
Gitweb:     http://git.kernel.org/tip/076d2bd696f8fc47c881a92dd1e5b203ef556f51
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:02:59 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:04:58 -0700

x86, mm, numa: Move node_map_pfn_alignment() to x86

Move node_map_pfn_alignment() to arch/x86/mm as there is no
other user for it.

Will update it to use numa_meminfo instead of memblock.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-13-git-send-email-tangchen@cn.fujitsu.com
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/mm/numa.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h |  1 -
 mm/page_alloc.c    | 50 --------------------------------------------------
 3 files changed, 50 insertions(+), 51 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1bb565d..10c6240 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -474,6 +474,56 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
 	return true;
 }
 
+/**
+ * node_map_pfn_alignment - determine the maximum internode alignment
+ *
+ * This function should be called after node map is populated and sorted.
+ * It calculates the maximum power of two alignment which can distinguish
+ * all the nodes.
+ *
+ * For example, if all nodes are 1GiB and aligned to 1GiB, the return value
+ * would indicate 1GiB alignment with (1 << (30 - PAGE_SHIFT)).  If the
+ * nodes are shifted by 256MiB, 256MiB.  Note that if only the last node is
+ * shifted, 1GiB is enough and this function will indicate so.
+ *
+ * This is used to test whether pfn -> nid mapping of the chosen memory
+ * model has fine enough granularity to avoid incorrect mapping for the
+ * populated node map.
+ *
+ * Returns the determined alignment in pfn's.  0 if there is no alignment
+ * requirement (single node).
+ */
+unsigned long __init node_map_pfn_alignment(void)
+{
+	unsigned long accl_mask = 0, last_end = 0;
+	unsigned long start, end, mask;
+	int last_nid = -1;
+	int i, nid;
+
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
+		if (!start || last_nid < 0 || last_nid == nid) {
+			last_nid = nid;
+			last_end = end;
+			continue;
+		}
+
+		/*
+		 * Start with a mask granular enough to pin-point to the
+		 * start pfn and tick off bits one-by-one until it becomes
+		 * too coarse to separate the current node from the last.
+		 */
+		mask = ~((1 << __ffs(start)) - 1);
+		while (mask && last_end <= (start & (mask << 1)))
+			mask <<= 1;
+
+		/* accumulate all internode masks */
+		accl_mask |= mask;
+	}
+
+	/* convert mask to number of pages */
+	return ~accl_mask + 1;
+}
+
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
 	unsigned long uninitialized_var(pfn_align);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 28e9470..b827743 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1384,7 +1384,6 @@ static inline unsigned long free_initmem_default(int poison)
  * CONFIG_HAVE_MEMBLOCK_NODE_MAP.
  */
 extern void free_area_init_nodes(unsigned long *max_zone_pfn);
-unsigned long node_map_pfn_alignment(void);
 extern unsigned long absent_pages_in_range(unsigned long start_pfn,
 						unsigned long end_pfn);
 extern void get_pfn_range_for_nid(unsigned int nid,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c427f46..28c4a97 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4760,56 +4760,6 @@ void __init setup_nr_node_ids(void)
 }
 #endif
 
-/**
- * node_map_pfn_alignment - determine the maximum internode alignment
- *
- * This function should be called after node map is populated and sorted.
- * It calculates the maximum power of two alignment which can distinguish
- * all the nodes.
- *
- * For example, if all nodes are 1GiB and aligned to 1GiB, the return value
- * would indicate 1GiB alignment with (1 << (30 - PAGE_SHIFT)).  If the
- * nodes are shifted by 256MiB, 256MiB.  Note that if only the last node is
- * shifted, 1GiB is enough and this function will indicate so.
- *
- * This is used to test whether pfn -> nid mapping of the chosen memory
- * model has fine enough granularity to avoid incorrect mapping for the
- * populated node map.
- *
- * Returns the determined alignment in pfn's.  0 if there is no alignment
- * requirement (single node).
- */
-unsigned long __init node_map_pfn_alignment(void)
-{
-	unsigned long accl_mask = 0, last_end = 0;
-	unsigned long start, end, mask;
-	int last_nid = -1;
-	int i, nid;
-
-	for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
-		if (!start || last_nid < 0 || last_nid == nid) {
-			last_nid = nid;
-			last_end = end;
-			continue;
-		}
-
-		/*
-		 * Start with a mask granular enough to pin-point to the
-		 * start pfn and tick off bits one-by-one until it becomes
-		 * too coarse to separate the current node from the last.
-		 */
-		mask = ~((1 << __ffs(start)) - 1);
-		while (mask && last_end <= (start & (mask << 1)))
-			mask <<= 1;
-
-		/* accumulate all internode masks */
-		accl_mask |= mask;
-	}
-
-	/* convert mask to number of pages */
-	return ~accl_mask + 1;
-}
-
 /* Find the lowest pfn for a node */
 static unsigned long __init find_min_pfn_for_node(int nid)
 {

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment
  2013-06-13 13:03 ` [Part1 PATCH v5 13/22] x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment Tang Chen
@ 2013-06-14 21:32   ` tip-bot for Yinghai Lu
  2013-06-18  1:40   ` [Part1 PATCH v5 13/22] " Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:32 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, yinghai, tangchen, tglx, hpa

Commit-ID:  052b6965a153de6c46203c574c5ad3161e829898
Gitweb:     http://git.kernel.org/tip/052b6965a153de6c46203c574c5ad3161e829898
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:03:00 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:05:00 -0700

x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment

We could use numa_meminfo directly instead of memblock nid in
node_map_pfn_alignment().

So we could do setting memblock nid later and only do it once
for successful path.

-v2: according to tj, separate moving to another patch.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-14-git-send-email-tangchen@cn.fujitsu.com
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/mm/numa.c | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 10c6240..cff565a 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -493,14 +493,18 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
  * Returns the determined alignment in pfn's.  0 if there is no alignment
  * requirement (single node).
  */
-unsigned long __init node_map_pfn_alignment(void)
+#ifdef NODE_NOT_IN_PAGE_FLAGS
+static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
 {
 	unsigned long accl_mask = 0, last_end = 0;
 	unsigned long start, end, mask;
 	int last_nid = -1;
 	int i, nid;
 
-	for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
+	for (i = 0; i < mi->nr_blks; i++) {
+		start = mi->blk[i].start >> PAGE_SHIFT;
+		end = mi->blk[i].end >> PAGE_SHIFT;
+		nid = mi->blk[i].nid;
 		if (!start || last_nid < 0 || last_nid == nid) {
 			last_nid = nid;
 			last_end = end;
@@ -523,10 +527,16 @@ unsigned long __init node_map_pfn_alignment(void)
 	/* convert mask to number of pages */
 	return ~accl_mask + 1;
 }
+#else
+static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
+{
+	return 0;
+}
+#endif
 
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
-	unsigned long uninitialized_var(pfn_align);
+	unsigned long pfn_align;
 	int i;
 
 	/* Account for nodes with cpus and no memory */
@@ -538,24 +548,22 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 	if (!numa_meminfo_cover_memory(mi))
 		return -EINVAL;
 
-	for (i = 0; i < mi->nr_blks; i++) {
-		struct numa_memblk *mb = &mi->blk[i];
-		memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
-	}
-
 	/*
 	 * If sections array is gonna be used for pfn -> nid mapping, check
 	 * whether its granularity is fine enough.
 	 */
-#ifdef NODE_NOT_IN_PAGE_FLAGS
-	pfn_align = node_map_pfn_alignment();
+	pfn_align = node_map_pfn_alignment(mi);
 	if (pfn_align && pfn_align < PAGES_PER_SECTION) {
 		printk(KERN_WARNING "Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
 		       PFN_PHYS(pfn_align) >> 20,
 		       PFN_PHYS(PAGES_PER_SECTION) >> 20);
 		return -EINVAL;
 	}
-#endif
+
+	for (i = 0; i < mi->nr_blks; i++) {
+		struct numa_memblk *mb = &mi->blk[i];
+		memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
+	}
 
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, mm, numa: Set memblock nid later
  2013-06-13 13:03 ` [Part1 PATCH v5 14/22] x86, mm, numa: Set memblock nid later Tang Chen
@ 2013-06-14 21:32   ` tip-bot for Yinghai Lu
  2013-06-18  1:45   ` [Part1 PATCH v5 14/22] " Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:32 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, yinghai, tangchen, tglx, hpa

Commit-ID:  1b74b2fd7fa0b4b1493a4921eefd560f9ff67963
Gitweb:     http://git.kernel.org/tip/1b74b2fd7fa0b4b1493a4921eefd560f9ff67963
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:03:01 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:05:02 -0700

x86, mm, numa: Set memblock nid later

In order to seperate parsing numa info procedure into two steps,
we need to set memblock nid later because it could change memblock
array, and possible doube memblock.memory array which will allocate
buffer.

Only set memblock nid once for successful path.

Also rename numa_register_memblks to numa_check_memblks() after
moving out code of setting memblock nid.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-15-git-send-email-tangchen@cn.fujitsu.com
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/mm/numa.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index cff565a..e448b6f 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -534,10 +534,9 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
 }
 #endif
 
-static int __init numa_register_memblks(struct numa_meminfo *mi)
+static int __init numa_check_memblks(struct numa_meminfo *mi)
 {
 	unsigned long pfn_align;
-	int i;
 
 	/* Account for nodes with cpus and no memory */
 	node_possible_map = numa_nodes_parsed;
@@ -560,11 +559,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 		return -EINVAL;
 	}
 
-	for (i = 0; i < mi->nr_blks; i++) {
-		struct numa_memblk *mb = &mi->blk[i];
-		memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
-	}
-
 	return 0;
 }
 
@@ -601,7 +595,6 @@ static int __init numa_init(int (*init_func)(void))
 	nodes_clear(numa_nodes_parsed);
 	nodes_clear(node_possible_map);
 	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
-	WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
 	numa_reset_distance();
 
 	ret = init_func();
@@ -613,7 +606,7 @@ static int __init numa_init(int (*init_func)(void))
 
 	numa_emulation(&numa_meminfo, numa_distance_cnt);
 
-	ret = numa_register_memblks(&numa_meminfo);
+	ret = numa_check_memblks(&numa_meminfo);
 	if (ret < 0)
 		return ret;
 
@@ -676,6 +669,11 @@ void __init x86_numa_init(void)
 
 	early_x86_numa_init();
 
+	for (i = 0; i < mi->nr_blks; i++) {
+		struct numa_memblk *mb = &mi->blk[i];
+		memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
+	}
+
 	/* Finally register nodes. */
 	for_each_node_mask(nid, node_possible_map) {
 		u64 start = PFN_PHYS(max_pfn);

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, mm, numa: Move node_possible_map setting later
  2013-06-13 13:03 ` [Part1 PATCH v5 15/22] x86, mm, numa: Move node_possible_map setting later Tang Chen
@ 2013-06-14 21:32   ` tip-bot for Yinghai Lu
  0 siblings, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, yinghai, tj, tangchen, tglx, hpa

Commit-ID:  052f56f9b1ffa4b1d1fffb7beb43511e0c630305
Gitweb:     http://git.kernel.org/tip/052f56f9b1ffa4b1d1fffb7beb43511e0c630305
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:03:02 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:05:03 -0700

x86, mm, numa: Move node_possible_map setting later

Move node_possible_map handling out of numa_check_memblks()
to avoid side effect when changing numa_check_memblks().

Only set node_possible_map once for successful path instead
of resetting in numa_init() every time.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-16-git-send-email-tangchen@cn.fujitsu.com
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/mm/numa.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e448b6f..da2ebab 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -536,12 +536,13 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
 
 static int __init numa_check_memblks(struct numa_meminfo *mi)
 {
+	nodemask_t nodes_parsed;
 	unsigned long pfn_align;
 
 	/* Account for nodes with cpus and no memory */
-	node_possible_map = numa_nodes_parsed;
-	numa_nodemask_from_meminfo(&node_possible_map, mi);
-	if (WARN_ON(nodes_empty(node_possible_map)))
+	nodes_parsed = numa_nodes_parsed;
+	numa_nodemask_from_meminfo(&nodes_parsed, mi);
+	if (WARN_ON(nodes_empty(nodes_parsed)))
 		return -EINVAL;
 
 	if (!numa_meminfo_cover_memory(mi))
@@ -593,7 +594,6 @@ static int __init numa_init(int (*init_func)(void))
 		set_apicid_to_node(i, NUMA_NO_NODE);
 
 	nodes_clear(numa_nodes_parsed);
-	nodes_clear(node_possible_map);
 	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
 	numa_reset_distance();
 
@@ -669,6 +669,9 @@ void __init x86_numa_init(void)
 
 	early_x86_numa_init();
 
+	node_possible_map = numa_nodes_parsed;
+	numa_nodemask_from_meminfo(&node_possible_map, mi);
+
 	for (i = 0; i < mi->nr_blks; i++) {
 		struct numa_memblk *mb = &mi->blk[i];
 		memblock_set_node(mb->start, mb->end - mb->start, mb->nid);

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, mm, numa: Move numa emulation handling down.
  2013-06-13 13:03 ` [Part1 PATCH v5 16/22] x86, mm, numa: Move numa emulation handling down Tang Chen
@ 2013-06-14 21:33   ` tip-bot for Yinghai Lu
  2013-06-18  1:58   ` [Part1 PATCH v5 16/22] " Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, yinghai, tangchen, tglx, rientjes, hpa

Commit-ID:  1169f9b1e7bfb609264544bf3581f038722eb10a
Gitweb:     http://git.kernel.org/tip/1169f9b1e7bfb609264544bf3581f038722eb10a
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:03:03 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:05:05 -0700

x86, mm, numa: Move numa emulation handling down.

numa_emulation() needs to allocate buffer for new numa_meminfo
and distance matrix, so execute it later in x86_numa_init().

Also we change the behavoir:
	- before this patch, if user input wrong data in command
	  line, it will fall back to next numa probing or disabling
	  numa.
	- after this patch, if user input wrong data in command line,
	  it will stay with numa info probed from previous probing,
	  like ACPI SRAT or amd_numa.

We need to call numa_check_memblks to reject wrong user inputs early
so that we can keep the original numa_meminfo not changed.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-17-git-send-email-tangchen@cn.fujitsu.com
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/mm/numa.c           | 6 +++---
 arch/x86/mm/numa_emulation.c | 2 +-
 arch/x86/mm/numa_internal.h  | 2 ++
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index da2ebab..3254f22 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -534,7 +534,7 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
 }
 #endif
 
-static int __init numa_check_memblks(struct numa_meminfo *mi)
+int __init numa_check_memblks(struct numa_meminfo *mi)
 {
 	nodemask_t nodes_parsed;
 	unsigned long pfn_align;
@@ -604,8 +604,6 @@ static int __init numa_init(int (*init_func)(void))
 	if (ret < 0)
 		return ret;
 
-	numa_emulation(&numa_meminfo, numa_distance_cnt);
-
 	ret = numa_check_memblks(&numa_meminfo);
 	if (ret < 0)
 		return ret;
@@ -669,6 +667,8 @@ void __init x86_numa_init(void)
 
 	early_x86_numa_init();
 
+	numa_emulation(&numa_meminfo, numa_distance_cnt);
+
 	node_possible_map = numa_nodes_parsed;
 	numa_nodemask_from_meminfo(&node_possible_map, mi);
 
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index dbbbb47..5a0433d 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -348,7 +348,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 	if (ret < 0)
 		goto no_emu;
 
-	if (numa_cleanup_meminfo(&ei) < 0) {
+	if (numa_cleanup_meminfo(&ei) < 0 || numa_check_memblks(&ei) < 0) {
 		pr_warning("NUMA: Warning: constructed meminfo invalid, disabling emulation\n");
 		goto no_emu;
 	}
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index ad86ec9..bb2fbcc 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -21,6 +21,8 @@ void __init numa_reset_distance(void);
 
 void __init x86_numa_init(void);
 
+int __init numa_check_memblks(struct numa_meminfo *mi);
+
 #ifdef CONFIG_NUMA_EMU
 void __init numa_emulation(struct numa_meminfo *numa_meminfo,
 			   int numa_dist_cnt);

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, ACPI, numa, ia64: split SLIT handling out
  2013-06-13 13:03 ` [Part1 PATCH v5 17/22] x86, ACPI, numa, ia64: split SLIT handling out Tang Chen
@ 2013-06-14 21:33   ` tip-bot for Yinghai Lu
  0 siblings, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, yinghai, tony.luck, fenghua.yu,
	tangchen, tglx, hpa, rjw

Commit-ID:  f8e2d4e7235c816cf0a23aa2d32c57c0d4f8a3f2
Gitweb:     http://git.kernel.org/tip/f8e2d4e7235c816cf0a23aa2d32c57c0d4f8a3f2
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:03:04 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:05:08 -0700

x86, ACPI, numa, ia64: split SLIT handling out

We need to handle slit later, as it need to allocate buffer for distance
matrix. Also we do not need SLIT info before init_mem_mapping. So move
SLIT parsing procedure later.

x86_acpi_numa_init() will be splited into x86_acpi_numa_init_srat() and
x86_acpi_numa_init_slit().

It should not break ia64 by replacing acpi_numa_init with
acpi_numa_init_srat/acpi_numa_init_slit/acpi_num_arch_fixup.

-v2: Change name to acpi_numa_init_srat/acpi_numa_init_slit according tj.
     remove the reset_numa_distance() in numa_init(), as get we only set
     distance in slit handling.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-18-git-send-email-tangchen@cn.fujitsu.com
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: linux-acpi@vger.kernel.org
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: linux-ia64@vger.kernel.org
Tested-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/ia64/kernel/setup.c    |  4 +++-
 arch/x86/include/asm/acpi.h |  3 ++-
 arch/x86/mm/numa.c          | 14 ++++++++++++--
 arch/x86/mm/srat.c          | 11 +++++++----
 drivers/acpi/numa.c         | 13 +++++++------
 include/linux/acpi.h        |  3 ++-
 6 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
index 13bfdd2..5f7db4a 100644
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -558,7 +558,9 @@ setup_arch (char **cmdline_p)
 	acpi_table_init();
 	early_acpi_boot_init();
 # ifdef CONFIG_ACPI_NUMA
-	acpi_numa_init();
+	acpi_numa_init_srat();
+	acpi_numa_init_slit();
+	acpi_numa_arch_fixup();
 #  ifdef CONFIG_ACPI_HOTPLUG_CPU
 	prefill_possible_map();
 #  endif
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index b31bf97..651db0b 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -178,7 +178,8 @@ static inline void disable_acpi(void) { }
 
 #ifdef CONFIG_ACPI_NUMA
 extern int acpi_numa;
-extern int x86_acpi_numa_init(void);
+int x86_acpi_numa_init_srat(void);
+void x86_acpi_numa_init_slit(void);
 #endif /* CONFIG_ACPI_NUMA */
 
 #define acpi_unlazy_tlb(x)	leave_mm(x)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 3254f22..630e09f 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -595,7 +595,6 @@ static int __init numa_init(int (*init_func)(void))
 
 	nodes_clear(numa_nodes_parsed);
 	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
-	numa_reset_distance();
 
 	ret = init_func();
 	if (ret < 0)
@@ -633,6 +632,10 @@ static int __init dummy_numa_init(void)
 	return 0;
 }
 
+#ifdef CONFIG_ACPI_NUMA
+static bool srat_used __initdata;
+#endif
+
 /**
  * x86_numa_init - Initialize NUMA
  *
@@ -648,8 +651,10 @@ static void __init early_x86_numa_init(void)
 			return;
 #endif
 #ifdef CONFIG_ACPI_NUMA
-		if (!numa_init(x86_acpi_numa_init))
+		if (!numa_init(x86_acpi_numa_init_srat)) {
+			srat_used = true;
 			return;
+		}
 #endif
 #ifdef CONFIG_AMD_NUMA
 		if (!numa_init(amd_numa_init))
@@ -667,6 +672,11 @@ void __init x86_numa_init(void)
 
 	early_x86_numa_init();
 
+#ifdef CONFIG_ACPI_NUMA
+	if (srat_used)
+		x86_acpi_numa_init_slit();
+#endif
+
 	numa_emulation(&numa_meminfo, numa_distance_cnt);
 
 	node_possible_map = numa_nodes_parsed;
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index cdd0da9..443f9ef 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -185,14 +185,17 @@ out_err:
 	return -1;
 }
 
-void __init acpi_numa_arch_fixup(void) {}
-
-int __init x86_acpi_numa_init(void)
+int __init x86_acpi_numa_init_srat(void)
 {
 	int ret;
 
-	ret = acpi_numa_init();
+	ret = acpi_numa_init_srat();
 	if (ret < 0)
 		return ret;
 	return srat_disabled() ? -EINVAL : 0;
 }
+
+void __init x86_acpi_numa_init_slit(void)
+{
+	acpi_numa_init_slit();
+}
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 33e609f..6460db4 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -282,7 +282,7 @@ acpi_table_parse_srat(enum acpi_srat_type id,
 					    handler, max_entries);
 }
 
-int __init acpi_numa_init(void)
+int __init acpi_numa_init_srat(void)
 {
 	int cnt = 0;
 
@@ -303,11 +303,6 @@ int __init acpi_numa_init(void)
 					    NR_NODE_MEMBLKS);
 	}
 
-	/* SLIT: System Locality Information Table */
-	acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);
-
-	acpi_numa_arch_fixup();
-
 	if (cnt < 0)
 		return cnt;
 	else if (!parsed_numa_memblks)
@@ -315,6 +310,12 @@ int __init acpi_numa_init(void)
 	return 0;
 }
 
+void __init acpi_numa_init_slit(void)
+{
+	/* SLIT: System Locality Information Table */
+	acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);
+}
+
 int acpi_get_pxm(acpi_handle h)
 {
 	unsigned long long pxm;
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 4e3731b..92463b5 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -85,7 +85,8 @@ int early_acpi_boot_init(void);
 int acpi_boot_init (void);
 void acpi_boot_table_init (void);
 int acpi_mps_check (void);
-int acpi_numa_init (void);
+int acpi_numa_init_srat(void);
+void acpi_numa_init_slit(void);
 
 int acpi_table_init (void);
 int acpi_table_parse(char *id, acpi_tbl_table_handler handler);

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, mm, numa: Add early_initmem_init() stub
  2013-06-13 13:03 ` [Part1 PATCH v5 18/22] x86, mm, numa: Add early_initmem_init() stub Tang Chen
@ 2013-06-14 21:33   ` tip-bot for Yinghai Lu
  0 siblings, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, yinghai, penberg, jacob.shin, tangchen,
	tglx, hpa

Commit-ID:  9c80560654a3fb62ec3b3529ddcf85317537ff85
Gitweb:     http://git.kernel.org/tip/9c80560654a3fb62ec3b3529ddcf85317537ff85
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:03:05 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:05:11 -0700

x86, mm, numa: Add early_initmem_init() stub

Introduce early_initmem_init() to call early_x86_numa_init(),
which will be used to parse numa info earlier.

Later will call init_mem_mapping for all the nodes.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-19-git-send-email-tangchen@cn.fujitsu.com
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Jacob Shin <jacob.shin@amd.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/include/asm/page_types.h | 1 +
 arch/x86/kernel/setup.c           | 1 +
 arch/x86/mm/init.c                | 6 ++++++
 arch/x86/mm/numa.c                | 7 +++++--
 4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index b012b82..d04dd8c 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -55,6 +55,7 @@ bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
 extern unsigned long init_memory_mapping(unsigned long start,
 					 unsigned long end);
 
+void early_initmem_init(void);
 extern void initmem_init(void);
 
 #endif	/* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d11b1b7..301165e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1162,6 +1162,7 @@ void __init setup_arch(char **cmdline_p)
 
 	early_acpi_boot_init();
 
+	early_initmem_init();
 	initmem_init();
 	memblock_find_dma_reserve();
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 8554656..3c21f16 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -467,6 +467,12 @@ void __init init_mem_mapping(void)
 	early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
 }
 
+#ifndef CONFIG_NUMA
+void __init early_initmem_init(void)
+{
+}
+#endif
+
 /*
  * devmem_is_allowed() checks to see if /dev/mem access to a certain address
  * is valid. The argument is a physical page number.
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 630e09f..7d76936 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -665,13 +665,16 @@ static void __init early_x86_numa_init(void)
 	numa_init(dummy_numa_init);
 }
 
+void __init early_initmem_init(void)
+{
+	early_x86_numa_init();
+}
+
 void __init x86_numa_init(void)
 {
 	int i, nid;
 	struct numa_meminfo *mi = &numa_meminfo;
 
-	early_x86_numa_init();
-
 #ifdef CONFIG_ACPI_NUMA
 	if (srat_used)
 		x86_acpi_numa_init_slit();

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, mm: Parse numa info earlier
  2013-06-13 13:03 ` [Part1 PATCH v5 19/22] x86, mm: Parse numa info earlier Tang Chen
@ 2013-06-14 21:33   ` tip-bot for Yinghai Lu
  0 siblings, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, yinghai, penberg, jacob.shin, tangchen,
	tglx, hpa

Commit-ID:  ca099f2813b5dccf2383784dbcfb9589110bd846
Gitweb:     http://git.kernel.org/tip/ca099f2813b5dccf2383784dbcfb9589110bd846
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:03:06 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:05:13 -0700

x86, mm: Parse numa info earlier

Parsing numa info has been separated into two steps now.

early_initmem_info() only parses info in numa_meminfo and
nodes_parsed. still keep numaq, acpi_numa, amd_numa, dummy
fall back sequence working.

SLIT and numa emulation handling are still left in initmem_init().

Call early_initmem_init before init_mem_mapping() to prepare
to use numa_info with it.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-20-git-send-email-tangchen@cn.fujitsu.com
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Jacob Shin <jacob.shin@amd.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/kernel/setup.c | 24 ++++++++++--------------
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 301165e..fd0d5be 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1125,13 +1125,21 @@ void __init setup_arch(char **cmdline_p)
 	trim_platform_memory_ranges();
 	trim_low_memory_range();
 
+	/*
+	 * Parse the ACPI tables for possible boot-time SMP configuration.
+	 */
+	acpi_initrd_override_copy();
+	acpi_boot_table_init();
+	early_acpi_boot_init();
+	early_initmem_init();
 	init_mem_mapping();
-
+	memblock.current_limit = get_max_mapped();
 	early_trap_pf_init();
 
+	reserve_initrd();
+
 	setup_real_mode();
 
-	memblock.current_limit = get_max_mapped();
 	dma_contiguous_reserve(0);
 
 	/*
@@ -1145,24 +1153,12 @@ void __init setup_arch(char **cmdline_p)
 	/* Allocate bigger log buffer */
 	setup_log_buf(1);
 
-	acpi_initrd_override_copy();
-
-	reserve_initrd();
-
 	reserve_crashkernel();
 
 	vsmp_init();
 
 	io_delay_init();
 
-	/*
-	 * Parse the ACPI tables for possible boot-time SMP configuration.
-	 */
-	acpi_boot_table_init();
-
-	early_acpi_boot_init();
-
-	early_initmem_init();
 	initmem_init();
 	memblock_find_dma_reserve();
 

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, mm: Add comments for step_size shift
  2013-06-13 13:03 ` [Part1 PATCH v5 20/22] x86, mm: Add comments for step_size shift Tang Chen
@ 2013-06-14 21:33   ` tip-bot for Yinghai Lu
  0 siblings, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:33 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, yinghai, tangchen, tglx, hpa

Commit-ID:  7d5a256fc953dd80a4eb9a1870607ec991d23ec2
Gitweb:     http://git.kernel.org/tip/7d5a256fc953dd80a4eb9a1870607ec991d23ec2
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:03:07 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:05:32 -0700

x86, mm: Add comments for step_size shift

As requested by hpa, add comments for why we choose 5 to be
the step size shift.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-21-git-send-email-tangchen@cn.fujitsu.com
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/mm/init.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 3c21f16..5f38e72 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -395,8 +395,23 @@ static unsigned long __init init_range_memory_mapping(
 	return mapped_ram_size;
 }
 
-/* (PUD_SHIFT-PMD_SHIFT)/2 */
-#define STEP_SIZE_SHIFT 5
+static unsigned long __init get_new_step_size(unsigned long step_size)
+{
+	/*
+	 * initial mapped size is PMD_SIZE, aka 2M.
+	 * We can not set step_size to be PUD_SIZE aka 1G yet.
+	 * In worse case, when 1G is cross the 1G boundary, and
+	 * PG_LEVEL_2M is not set, we will need 1+1+512 pages (aka 2M + 8k)
+	 * to map 1G range with PTE. Use 5 as shift for now.
+	 */
+	unsigned long new_step_size = step_size << 5;
+
+	if (new_step_size > step_size)
+		step_size = new_step_size;
+
+	return  step_size;
+}
+
 void __init init_mem_mapping(void)
 {
 	unsigned long end, real_end, start, last_start;
@@ -445,7 +460,7 @@ void __init init_mem_mapping(void)
 		min_pfn_mapped = last_start >> PAGE_SHIFT;
 		/* only increase step_size after big range get mapped */
 		if (new_mapped_ram_size > mapped_ram_size)
-			step_size <<= STEP_SIZE_SHIFT;
+			step_size = get_new_step_size(step_size);
 		mapped_ram_size += new_mapped_ram_size;
 	}
 

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, mm: Make init_mem_mapping be able to be called several times
  2013-06-13 13:03 ` [Part1 PATCH v5 21/22] x86, mm: Make init_mem_mapping be able to be called several times Tang Chen
  2013-06-13 18:35   ` Konrad Rzeszutek Wilk
@ 2013-06-14 21:33   ` tip-bot for Yinghai Lu
  1 sibling, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, konrad.wilk, yinghai, penberg,
	tangchen, jacob.shin, tglx, hpa

Commit-ID:  ae4ffbb606770c7918e627e36c84b627250b1dbb
Gitweb:     http://git.kernel.org/tip/ae4ffbb606770c7918e627e36c84b627250b1dbb
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:03:08 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:05:43 -0700

x86, mm: Make init_mem_mapping be able to be called several times

Prepare to put page table on local nodes.

Move calling of init_mem_mapping() to early_initmem_init().

Rework alloc_low_pages to allocate page table in following order:
	BRK, local node, low range

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-22-git-send-email-tangchen@cn.fujitsu.com
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Jacob Shin <jacob.shin@amd.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h |   2 +-
 arch/x86/kernel/setup.c        |   1 -
 arch/x86/mm/init.c             | 100 ++++++++++++++++++++++++++---------------
 arch/x86/mm/numa.c             |  24 ++++++++++
 4 files changed, 88 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1e67223..868687c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -621,7 +621,7 @@ static inline int pgd_none(pgd_t pgd)
 #ifndef __ASSEMBLY__
 
 extern int direct_gbpages;
-void init_mem_mapping(void);
+void init_mem_mapping(unsigned long begin, unsigned long end);
 void early_alloc_pgt_buf(void);
 
 /* local pte updates need not use xchg for locking */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fd0d5be..9ccbd60 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1132,7 +1132,6 @@ void __init setup_arch(char **cmdline_p)
 	acpi_boot_table_init();
 	early_acpi_boot_init();
 	early_initmem_init();
-	init_mem_mapping();
 	memblock.current_limit = get_max_mapped();
 	early_trap_pf_init();
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 5f38e72..9ff71ff 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -24,7 +24,10 @@ static unsigned long __initdata pgt_buf_start;
 static unsigned long __initdata pgt_buf_end;
 static unsigned long __initdata pgt_buf_top;
 
-static unsigned long min_pfn_mapped;
+static unsigned long low_min_pfn_mapped;
+static unsigned long low_max_pfn_mapped;
+static unsigned long local_min_pfn_mapped;
+static unsigned long local_max_pfn_mapped;
 
 static bool __initdata can_use_brk_pgt = true;
 
@@ -52,10 +55,17 @@ __ref void *alloc_low_pages(unsigned int num)
 
 	if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) {
 		unsigned long ret;
-		if (min_pfn_mapped >= max_pfn_mapped)
-			panic("alloc_low_page: ran out of memory");
-		ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
-					max_pfn_mapped << PAGE_SHIFT,
+		if (local_min_pfn_mapped >= local_max_pfn_mapped) {
+			if (low_min_pfn_mapped >= low_max_pfn_mapped)
+				panic("alloc_low_page: ran out of memory");
+			ret = memblock_find_in_range(
+					low_min_pfn_mapped << PAGE_SHIFT,
+					low_max_pfn_mapped << PAGE_SHIFT,
+					PAGE_SIZE * num , PAGE_SIZE);
+		} else
+			ret = memblock_find_in_range(
+					local_min_pfn_mapped << PAGE_SHIFT,
+					local_max_pfn_mapped << PAGE_SHIFT,
 					PAGE_SIZE * num , PAGE_SIZE);
 		if (!ret)
 			panic("alloc_low_page: can not alloc memory");
@@ -412,67 +422,88 @@ static unsigned long __init get_new_step_size(unsigned long step_size)
 	return  step_size;
 }
 
-void __init init_mem_mapping(void)
+void __init init_mem_mapping(unsigned long begin, unsigned long end)
 {
-	unsigned long end, real_end, start, last_start;
+	unsigned long real_end, start, last_start;
 	unsigned long step_size;
 	unsigned long addr;
 	unsigned long mapped_ram_size = 0;
 	unsigned long new_mapped_ram_size;
+	bool is_low = false;
+
+	if (!begin) {
+		probe_page_size_mask();
+		/* the ISA range is always mapped regardless of memory holes */
+		init_memory_mapping(0, ISA_END_ADDRESS);
+		begin = ISA_END_ADDRESS;
+		is_low = true;
+	}
 
-	probe_page_size_mask();
-
-#ifdef CONFIG_X86_64
-	end = max_pfn << PAGE_SHIFT;
-#else
-	end = max_low_pfn << PAGE_SHIFT;
-#endif
-
-	/* the ISA range is always mapped regardless of memory holes */
-	init_memory_mapping(0, ISA_END_ADDRESS);
+	if (begin >= end)
+		return;
 
 	/* xen has big range in reserved near end of ram, skip it at first.*/
-	addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
+	addr = memblock_find_in_range(begin, end, PMD_SIZE, PMD_SIZE);
 	real_end = addr + PMD_SIZE;
 
 	/* step_size need to be small so pgt_buf from BRK could cover it */
 	step_size = PMD_SIZE;
-	max_pfn_mapped = 0; /* will get exact value next */
-	min_pfn_mapped = real_end >> PAGE_SHIFT;
+	local_max_pfn_mapped = begin >> PAGE_SHIFT;
+	local_min_pfn_mapped = real_end >> PAGE_SHIFT;
 	last_start = start = real_end;
 
 	/*
-	 * We start from the top (end of memory) and go to the bottom.
-	 * The memblock_find_in_range() gets us a block of RAM from the
-	 * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
-	 * for page table.
+	 * alloc_low_pages() will allocate pagetable pages in the following
+	 * order:
+	 *	BRK, local node, low range
+	 *
+	 * That means it will first use up all the BRK memory, then try to get
+	 * us a block of RAM from [local_min_pfn_mapped, local_max_pfn_mapped)
+	 * used as new pagetable pages. If no memory on the local node has
+	 * been mapped, it will allocate memory from
+	 * [low_min_pfn_mapped, low_max_pfn_mapped).
 	 */
-	while (last_start > ISA_END_ADDRESS) {
+	while (last_start > begin) {
 		if (last_start > step_size) {
 			start = round_down(last_start - 1, step_size);
-			if (start < ISA_END_ADDRESS)
-				start = ISA_END_ADDRESS;
+			if (start < begin)
+				start = begin;
 		} else
-			start = ISA_END_ADDRESS;
+			start = begin;
 		new_mapped_ram_size = init_range_memory_mapping(start,
 							last_start);
+		if ((last_start >> PAGE_SHIFT) > local_max_pfn_mapped)
+			local_max_pfn_mapped = last_start >> PAGE_SHIFT;
+		local_min_pfn_mapped = start >> PAGE_SHIFT;
 		last_start = start;
-		min_pfn_mapped = last_start >> PAGE_SHIFT;
 		/* only increase step_size after big range get mapped */
 		if (new_mapped_ram_size > mapped_ram_size)
 			step_size = get_new_step_size(step_size);
 		mapped_ram_size += new_mapped_ram_size;
 	}
 
-	if (real_end < end)
+	if (real_end < end) {
 		init_range_memory_mapping(real_end, end);
+		if ((end >> PAGE_SHIFT) > local_max_pfn_mapped)
+			local_max_pfn_mapped = end >> PAGE_SHIFT;
+	}
 
+	if (is_low) {
+		low_min_pfn_mapped = local_min_pfn_mapped;
+		low_max_pfn_mapped = local_max_pfn_mapped;
+	}
+}
+
+#ifndef CONFIG_NUMA
+void __init early_initmem_init(void)
+{
 #ifdef CONFIG_X86_64
-	if (max_pfn > max_low_pfn) {
-		/* can we preseve max_low_pfn ?*/
+	init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+	if (max_pfn > max_low_pfn)
 		max_low_pfn = max_pfn;
 	}
 #else
+	init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
 	early_ioremap_page_table_range_init();
 #endif
 
@@ -481,11 +512,6 @@ void __init init_mem_mapping(void)
 
 	early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
 }
-
-#ifndef CONFIG_NUMA
-void __init early_initmem_init(void)
-{
-}
 #endif
 
 /*
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 7d76936..9b18ee8 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -17,8 +17,10 @@
 #include <asm/dma.h>
 #include <asm/acpi.h>
 #include <asm/amd_nb.h>
+#include <asm/tlbflush.h>
 
 #include "numa_internal.h"
+#include "mm_internal.h"
 
 int __initdata numa_off;
 nodemask_t numa_nodes_parsed __initdata;
@@ -665,9 +667,31 @@ static void __init early_x86_numa_init(void)
 	numa_init(dummy_numa_init);
 }
 
+#ifdef CONFIG_X86_64
+static void __init early_x86_numa_init_mapping(void)
+{
+	init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+	if (max_pfn > max_low_pfn)
+		max_low_pfn = max_pfn;
+}
+#else
+static void __init early_x86_numa_init_mapping(void)
+{
+	init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
+	early_ioremap_page_table_range_init();
+}
+#endif
+
 void __init early_initmem_init(void)
 {
 	early_x86_numa_init();
+
+	early_x86_numa_init_mapping();
+
+	load_cr3(swapper_pg_dir);
+	__flush_tlb_all();
+
+	early_memtest(0, max_pfn_mapped<<PAGE_SHIFT);
 }
 
 void __init x86_numa_init(void)

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [tip:x86/mm] x86, mm, numa: Put pagetable on local node ram for 64bit
  2013-06-13 13:03 ` [Part1 PATCH v5 22/22] x86, mm, numa: Put pagetable on local node ram for 64bit Tang Chen
@ 2013-06-14 21:34   ` tip-bot for Yinghai Lu
  0 siblings, 0 replies; 87+ messages in thread
From: tip-bot for Yinghai Lu @ 2013-06-14 21:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, konrad.wilk, yinghai, penberg,
	tangchen, jacob.shin, tglx, hpa

Commit-ID:  5f02a5e6ca366be44064463f25b6f4cc4468a197
Gitweb:     http://git.kernel.org/tip/5f02a5e6ca366be44064463f25b6f4cc4468a197
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Thu, 13 Jun 2013 21:03:09 +0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 14 Jun 2013 14:05:49 -0700

x86, mm, numa: Put pagetable on local node ram for 64bit

If node with ram is hotplugable, memory for local node page
table and vmemmap should be on the local node ram.

This patch is some kind of refreshment of
| commit 1411e0ec3123ae4c4ead6bfc9fe3ee5a3ae5c327
| Date:   Mon Dec 27 16:48:17 2010 -0800
|
|    x86-64, numa: Put pgtable to local node memory
That was reverted before.

We have reason to reintroduce it to improve performance when
using memory hotplug.

Calling init_mem_mapping() in early_initmem_init() for each
node. alloc_low_pages() will allocate page table in following
order:
	BRK, local node, low range

So page table will be on low range or local nodes.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371128589-8953-23-git-send-email-tangchen@cn.fujitsu.com
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Jacob Shin <jacob.shin@amd.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/mm/numa.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 9b18ee8..5adf803 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -670,7 +670,39 @@ static void __init early_x86_numa_init(void)
 #ifdef CONFIG_X86_64
 static void __init early_x86_numa_init_mapping(void)
 {
-	init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+	unsigned long last_start = 0, last_end = 0;
+	struct numa_meminfo *mi = &numa_meminfo;
+	unsigned long start, end;
+	int last_nid = -1;
+	int i, nid;
+
+	for (i = 0; i < mi->nr_blks; i++) {
+		nid   = mi->blk[i].nid;
+		start = mi->blk[i].start;
+		end   = mi->blk[i].end;
+
+		if (last_nid == nid) {
+			last_end = end;
+			continue;
+		}
+
+		/* other nid now */
+		if (last_nid >= 0) {
+			printk(KERN_DEBUG "Node %d: [mem %#016lx-%#016lx]\n",
+					last_nid, last_start, last_end - 1);
+			init_mem_mapping(last_start, last_end);
+		}
+
+		/* for next nid */
+		last_nid   = nid;
+		last_start = start;
+		last_end   = end;
+	}
+	/* last one */
+	printk(KERN_DEBUG "Node %d: [mem %#016lx-%#016lx]\n",
+			last_nid, last_start, last_end - 1);
+	init_mem_mapping(last_start, last_end);
+
 	if (max_pfn > max_low_pfn)
 		max_low_pfn = max_pfn;
 }

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 03/22] x86, ACPI, mm: Kill max_low_pfn_mapped
  2013-06-13 13:02 ` [Part1 PATCH v5 03/22] x86, ACPI, mm: Kill max_low_pfn_mapped Tang Chen
  2013-06-14 21:31   ` [tip:x86/mm] " tip-bot for Yinghai Lu
@ 2013-06-17 21:04   ` Tejun Heo
  2013-06-17 21:13     ` Yinghai Lu
  1 sibling, 1 reply; 87+ messages in thread
From: Tejun Heo @ 2013-06-17 21:04 UTC (permalink / raw)
  To: Tang Chen
  Cc: tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm, Rafael J. Wysocki, Jacob Shin,
	Pekka Enberg, linux-acpi

Hello,

On Thu, Jun 13, 2013 at 09:02:50PM +0800, Tang Chen wrote:
> From: Yinghai Lu <yinghai@kernel.org>
> 
> Now we have pfn_mapped[] array, and max_low_pfn_mapped should not
> be used anymore. Users should use pfn_mapped[] or just
> 1UL<<(32-PAGE_SHIFT) instead.
> 
> The only user of max_low_pfn_mapped is ACPI_INITRD_TABLE_OVERRIDE.
> We could change to use 1U<<(32_PAGE_SHIFT) with it, aka under 4G.

                                ^ typo

...
> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> index e721863..93e3194 100644
> --- a/drivers/acpi/osl.c
> +++ b/drivers/acpi/osl.c
> @@ -624,9 +624,9 @@ void __init acpi_initrd_override(void *data, size_t size)
>  	if (table_nr == 0)
>  		return;
>  
> -	acpi_tables_addr =
> -		memblock_find_in_range(0, max_low_pfn_mapped << PAGE_SHIFT,
> -				       all_tables_size, PAGE_SIZE);
> +	/* under 4G at first, then above 4G */
> +	acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
> +					all_tables_size, PAGE_SIZE);

No bigge, but why (1ULL << 32) - 1?  Shouldn't it be just 1ULL << 32?
memblock deals with [@start, @end) areas, right?

Other than that,

 Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 04/22] x86, ACPI: Search buffer above 4GB in a second try for acpi initrd table override
  2013-06-13 13:02 ` [Part1 PATCH v5 04/22] x86, ACPI: Search buffer above 4GB in a second try for acpi initrd table override Tang Chen
  2013-06-14 21:31   ` [tip:x86/mm] " tip-bot for Yinghai Lu
@ 2013-06-17 21:06   ` Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2013-06-17 21:06 UTC (permalink / raw)
  To: Tang Chen
  Cc: tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm, Rafael J. Wysocki, linux-acpi

On Thu, Jun 13, 2013 at 09:02:51PM +0800, Tang Chen wrote:
> From: Yinghai Lu <yinghai@kernel.org>
> 
> Now we only search buffer for new acpi tables in initrd under
> 4GB. In some case, like user use memmap to exclude all low ram,
> we may not find range for it under 4GB. So do second try to
> search for buffer above 4GB.
> 
> Since later accessing to the tables is using early_ioremap(),

Maybe "later accesses to the tables" would read better?

> using memory above 4GB is OK.
> 
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
> Cc: linux-acpi@vger.kernel.org
> Tested-by: Thomas Renninger <trenn@suse.de>
> Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
> Tested-by: Tang Chen <tangchen@cn.fujitsu.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 03/22] x86, ACPI, mm: Kill max_low_pfn_mapped
  2013-06-17 21:04   ` [Part1 PATCH v5 03/22] " Tejun Heo
@ 2013-06-17 21:13     ` Yinghai Lu
  2013-06-17 23:08       ` Tejun Heo
  0 siblings, 1 reply; 87+ messages in thread
From: Yinghai Lu @ 2013-06-17 21:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Mel Gorman, Minchan Kim,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel,
	jweiner, Prarit Bhargava, the arch/x86 maintainers, linux-doc,
	Linux Kernel Mailing List, Linux MM, Rafael J. Wysocki,
	Jacob Shin, Pekka Enberg, ACPI Devel Maling List

On Mon, Jun 17, 2013 at 2:04 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Thu, Jun 13, 2013 at 09:02:50PM +0800, Tang Chen wrote:
>> From: Yinghai Lu <yinghai@kernel.org>
>>
>> Now we have pfn_mapped[] array, and max_low_pfn_mapped should not
>> be used anymore. Users should use pfn_mapped[] or just
>> 1UL<<(32-PAGE_SHIFT) instead.
>>
>> The only user of max_low_pfn_mapped is ACPI_INITRD_TABLE_OVERRIDE.
>> We could change to use 1U<<(32_PAGE_SHIFT) with it, aka under 4G.
>
>                                 ^ typo

ok.

>
> ...
>> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
>> index e721863..93e3194 100644
>> --- a/drivers/acpi/osl.c
>> +++ b/drivers/acpi/osl.c
>> @@ -624,9 +624,9 @@ void __init acpi_initrd_override(void *data, size_t size)
>>       if (table_nr == 0)
>>               return;
>>
>> -     acpi_tables_addr =
>> -             memblock_find_in_range(0, max_low_pfn_mapped << PAGE_SHIFT,
>> -                                    all_tables_size, PAGE_SIZE);
>> +     /* under 4G at first, then above 4G */
>> +     acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
>> +                                     all_tables_size, PAGE_SIZE);
>
> No bigge, but why (1ULL << 32) - 1?  Shouldn't it be just 1ULL << 32?
> memblock deals with [@start, @end) areas, right?

that is for 32bit, when phys_addr_t is 32bit, in that case
(1ULL<<32) cast to 32bit would be 0.

>
> Other than that,
>
>  Acked-by: Tejun Heo <tj@kernel.org>

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 03/22] x86, ACPI, mm: Kill max_low_pfn_mapped
  2013-06-17 21:13     ` Yinghai Lu
@ 2013-06-17 23:08       ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2013-06-17 23:08 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Tang Chen, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Mel Gorman, Minchan Kim,
	mina86, Chen Gong, vasilis.liaskovitis, lwoodman, Rik van Riel,
	jweiner, Prarit Bhargava, the arch/x86 maintainers, linux-doc,
	Linux Kernel Mailing List, Linux MM, Rafael J. Wysocki,
	Jacob Shin, Pekka Enberg, ACPI Devel Maling List

On Mon, Jun 17, 2013 at 2:13 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>> No bigge, but why (1ULL << 32) - 1?  Shouldn't it be just 1ULL << 32?
>> memblock deals with [@start, @end) areas, right?
>
> that is for 32bit, when phys_addr_t is 32bit, in that case
> (1ULL<<32) cast to 32bit would be 0.

Right, it'd work the same even after overflowing but yeah, it can be confusing.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 07/22] x86, ACPI: Store override acpi tables phys addr in cpio files info array
  2013-06-13 13:02 ` [Part1 PATCH v5 07/22] x86, ACPI: Store override acpi tables phys addr in cpio files info array Tang Chen
  2013-06-14 21:31   ` [tip:x86/mm] " tip-bot for Yinghai Lu
@ 2013-06-17 23:38   ` Tejun Heo
  2013-06-17 23:40     ` Yinghai Lu
  2013-06-17 23:52   ` Tejun Heo
  2 siblings, 1 reply; 87+ messages in thread
From: Tejun Heo @ 2013-06-17 23:38 UTC (permalink / raw)
  To: Tang Chen
  Cc: tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm, Rafael J. Wysocki, linux-acpi

On Thu, Jun 13, 2013 at 09:02:54PM +0800, Tang Chen wrote:
> -static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
> +struct file_pos {
> +	phys_addr_t data;
> +	phys_addr_t size;
> +};

Isn't file_pos too generic as name?  Would acpi_initrd_file_pos too
long?  Maybe just struct acpi_initrd_file?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 07/22] x86, ACPI: Store override acpi tables phys addr in cpio files info array
  2013-06-17 23:38   ` [Part1 PATCH v5 07/22] " Tejun Heo
@ 2013-06-17 23:40     ` Yinghai Lu
  0 siblings, 0 replies; 87+ messages in thread
From: Yinghai Lu @ 2013-06-17 23:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Mel Gorman, Minchan Kim,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel,
	jweiner, Prarit Bhargava, the arch/x86 maintainers, linux-doc,
	Linux Kernel Mailing List, Linux MM, Rafael J. Wysocki,
	ACPI Devel Maling List

On Mon, Jun 17, 2013 at 4:38 PM, Tejun Heo <tj@kernel.org> wrote:
> On Thu, Jun 13, 2013 at 09:02:54PM +0800, Tang Chen wrote:
>> -static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
>> +struct file_pos {
>> +     phys_addr_t data;
>> +     phys_addr_t size;
>> +};
>
> Isn't file_pos too generic as name?  Would acpi_initrd_file_pos too
> long?  Maybe just struct acpi_initrd_file?

ok, will change to acpi_initrd_file.

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 07/22] x86, ACPI: Store override acpi tables phys addr in cpio files info array
  2013-06-13 13:02 ` [Part1 PATCH v5 07/22] x86, ACPI: Store override acpi tables phys addr in cpio files info array Tang Chen
  2013-06-14 21:31   ` [tip:x86/mm] " tip-bot for Yinghai Lu
  2013-06-17 23:38   ` [Part1 PATCH v5 07/22] " Tejun Heo
@ 2013-06-17 23:52   ` Tejun Heo
  2 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2013-06-17 23:52 UTC (permalink / raw)
  To: Tang Chen
  Cc: tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm, Rafael J. Wysocki, linux-acpi

On Thu, Jun 13, 2013 at 09:02:54PM +0800, Tang Chen wrote:
> From: Yinghai Lu <yinghai@kernel.org>
> 
> This patch introduces a file_pos struct to store physaddr. And then changes
> acpi_initrd_files[] to file_pos type. Then store physaddr of ACPI tables
> in acpi_initrd_files[].
> 
> For finding, we will find ACPI tables with physaddr during 32bit flat mode
> in head_32.S, because at that time we don't need to setup page table to
> access initrd.
> 
> For copying, we could use early_ioremap() with physaddr directly before
> memory mapping is set.
> 
> To keep 32bit and 64bit platforms consistent, use phys_addr for all.

Also, how about something like the following?

Subject: x86, ACPI: introduce a new struct to store phys_addr of acpi override tables

ACPI initrd override table handling has been recently broken into two
functions - acpi_initrd_override_find() and
acpi_initrd_override_copy().  The former function currently stores the
virtual addresses and sizes of the found override tables in an array
of struct cpio_data for the latter function.

To make NUMA information available earlier during boot,
acpi_initrd_override_find() will be used much earlier - on 32bit, from
head_32.S before linear address translation is set up, which will make
it impossible to use the virtual addresses of the tables.

This patch introduces a new struct - file_pos - which records
phys_addr and size of a memory area, and replaces the cpio_data array
with it so that acpi_initrd_override_find() can record the phys_addrs
of the override tables instead of virtual addresses.  This will allow
using the function before the linear address is set up.

acpi_initrd_override_copy() now accesses the override tables using
early_ioremap() on the stored phys_addrs.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 08/22] x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode
  2013-06-13 13:02 ` [Part1 PATCH v5 08/22] x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode Tang Chen
  2013-06-14 21:31   ` [tip:x86/mm] " tip-bot for Yinghai Lu
@ 2013-06-18  0:07   ` Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2013-06-18  0:07 UTC (permalink / raw)
  To: Tang Chen
  Cc: tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm, Pekka Enberg, Jacob Shin,
	Rafael J. Wysocki, linux-acpi

On Thu, Jun 13, 2013 at 09:02:55PM +0800, Tang Chen wrote:
> From: Yinghai Lu <yinghai@kernel.org>
> 
> For finding procedure, it would be easy to access initrd in 32bit flat
> mode, as we don't need to setup page table. That is from head_32.S, and
> microcode updating already use this trick.

It'd be really great if you can give a brief explanation of why this
is happening at the beginning of the commit description so that when
someone lands on this commit later on, [s]he can orient oneself.  It
doesn't have to be long.  Open with something like,

 To make NUMA info available early during boot for memory hotplug
 support, acpi_initrd_override_find() needs to be used very early
 during boot.

and then continue to describe what's happening.  It'll make the commit
a lot more approachable to people who just encountered it.

> This patch does the following:
> 
> 1. Change acpi_initrd_override_find to use phys to access global variables.
> 
> 2. Pass a bool parameter "is_phys" to acpi_initrd_override_find() because
>    we cannot tell if it is a pa or a va through the address itself with
>    32bit. Boot loader could load initrd above max_low_pfn.

Do you mean "from 32bit address boundary"?  Maybe "from 4G boundary"
is clearer?

> 
> 3. Put table_sigs[] on stack, otherwise it is too messy to change string
>    array to physaddr and still keep offset calculating correct. The size is
>    about 36x4 bytes, and it is small to settle in stack.
> 
> 4. Also rewrite the MACRO INVALID_TABLE to be in a do {...} while(0) loop
>    so that it is more readable.

The important part is taking "continue" out of it, right?

> +/*
> + * acpi_initrd_override_find() is called from head_32.S and head64.c.
> + * head_32.S calling path is with 32bit flat mode, so we can access

When called from head_32.S, the CPU is in 32bit flat mode and the
kernel virtual address space isn't available yet.

> + * initrd early without setting pagetable or relocating initrd. For
> + * global variables accessing, we need to use phys address instead of

As initrd is in phys_addr, it can be accessed directly; however,
global variables must be accessed by explicitly obtaining their
physical addresses.

> + * kernel virtual address, try to put table_sigs string array in stack,
> + * so avoid switching for it.

Note that table_sigs array is built on stack to avoid such address
translations while accessing its members.

> + * Also don't call printk as it uses global variables.
> + */
> +void __init acpi_initrd_override_find(void *data, size_t size, bool is_phys)

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 09/22] x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c
  2013-06-13 13:02 ` [Part1 PATCH v5 09/22] x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c Tang Chen
  2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
@ 2013-06-18  0:33   ` Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2013-06-18  0:33 UTC (permalink / raw)
  To: Tang Chen
  Cc: tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm, Pekka Enberg, Jacob Shin,
	Rafael J. Wysocki, linux-acpi

On Thu, Jun 13, 2013 at 09:02:56PM +0800, Tang Chen wrote:
> From: Yinghai Lu <yinghai@kernel.org>

Ditto for the opening.  Probably not a must, I suppose, but would be
very nice.

> head64.c could use #PF handler setup page table to access initrd before
> init mem mapping and initrd relocating.
> 
> head_32.S could use 32bit flat mode to access initrd before init mem
> mapping initrd relocating.
> 
> This patch introduces x86_acpi_override_find(), which is called from
> head_32.S/head64.c, to replace acpi_initrd_override_find(). So that we
> can makes 32bit and 64 bit more consistent.
> 
> -v2: use inline function in header file instead according to tj.
>      also still need to keep #idef head_32.S to avoid compiling error.
> -v3: need to move down reserve_initrd() after acpi_initrd_override_copy(),
>      to make sure we are using right address.
> 
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: Jacob Shin <jacob.shin@amd.com>
> Cc: Rafael J. Wysocki <rjw@sisk.pl>
> Cc: linux-acpi@vger.kernel.org
> Tested-by: Thomas Renninger <trenn@suse.de>
> Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
> Tested-by: Tang Chen <tangchen@cn.fujitsu.com>

Other than that,

 Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 10/22] x86, mm, numa: Move two functions calling on successful path later
  2013-06-13 13:02 ` [Part1 PATCH v5 10/22] x86, mm, numa: Move two functions calling on successful path later Tang Chen
  2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
@ 2013-06-18  0:53   ` Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2013-06-18  0:53 UTC (permalink / raw)
  To: Tang Chen
  Cc: tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm

Hello,

Does the subject match the patch content?  What two functions?  The
patch is separating out the actual registration part so that the
discovery part can happen earlier, right?

> Currently, parsing numa info needs to allocate some buffer and need to be
> called after init_mem_mapping. So try to split parsing numa info procedure
> into two steps:
> 	- The first step will be called before init_mem_mapping, and it
> 	  should not need allocate buffers.

Document the requirement somewhere in the source code?

> 	- The second step will cantain all the buffer related code and be
> 	  executed later.
> 
> At last we will have early_initmem_init() and initmem_init().

Do you mean "eventually" or "in the end" by "at last"?

> This patch implements only the first step.
> 
> setup_node_data() and numa_init_array() are only called for successful
> path, so we can move these two callings to x86_numa_init(). That will also
> make numa_init() smaller and more readable.

I find the description somewhat difficult to follow.  :(

> -v2: remove online_node_map clear in numa_init(), as it is only
>      set in setup_node_data() at last in successful path.

I don't get this.  What prevents specific numa init functions (numaq,
x86_acpi, amd...) from updating node_online_map?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 11/22] x86, mm, numa: Call numa_meminfo_cover_memory() checking early
  2013-06-13 13:02 ` [Part1 PATCH v5 11/22] x86, mm, numa: Call numa_meminfo_cover_memory() checking early Tang Chen
  2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
@ 2013-06-18  1:05   ` Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2013-06-18  1:05 UTC (permalink / raw)
  To: Tang Chen
  Cc: tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm

On Thu, Jun 13, 2013 at 09:02:58PM +0800, Tang Chen wrote:
> From: Yinghai Lu <yinghai@kernel.org>
> 
> In order to seperate parsing numa info procedure into two steps,
> we need to set memblock nid later, as it could change memblock
> array, and possible doube memblock.memory array which will need
> to allocate buffer.
> 
> We do not need to use nid in memblock to find out absent pages.

 because...

And please also explain it in the source code with comment including
why the check has to be done early.

> So we can move that numa_meminfo_cover_memory() early.

Maybe "So, we can use the NUMA-unaware absent_pages_in_range() in
numa_meminfo_cover_memory() and call the function before setting nid's
to memblock."

> Also we could change __absent_pages_in_range() to static and use
> absent_pages_in_range() directly.

"As this removes the last user of __absent_pages_in_range(), this
patch also makes the function static."

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 12/22] x86, mm, numa: Move node_map_pfn_alignment() to x86
  2013-06-13 13:02 ` [Part1 PATCH v5 12/22] x86, mm, numa: Move node_map_pfn_alignment() to x86 Tang Chen
  2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
@ 2013-06-18  1:08   ` Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2013-06-18  1:08 UTC (permalink / raw)
  To: Tang Chen
  Cc: tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm

On Thu, Jun 13, 2013 at 09:02:59PM +0800, Tang Chen wrote:
> From: Yinghai Lu <yinghai@kernel.org>
> 
> Move node_map_pfn_alignment() to arch/x86/mm as there is no
> other user for it.
> 
> Will update it to use numa_meminfo instead of memblock.
> 
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
> Tested-by: Tang Chen <tangchen@cn.fujitsu.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 13/22] x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment
  2013-06-13 13:03 ` [Part1 PATCH v5 13/22] x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment Tang Chen
  2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
@ 2013-06-18  1:40   ` Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2013-06-18  1:40 UTC (permalink / raw)
  To: Tang Chen
  Cc: tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm

On Thu, Jun 13, 2013 at 09:03:00PM +0800, Tang Chen wrote:
> From: Yinghai Lu <yinghai@kernel.org>
> 
> We could use numa_meminfo directly instead of memblock nid in
> node_map_pfn_alignment().
> 
> So we could do setting memblock nid later and only do it once
> for successful path.
> 
> -v2: according to tj, separate moving to another patch.

How about something like,

  Subject: x86, mm, NUMA: Use numa_meminfo instead of memblock in node_map_pfn_alignment()

  When sparsemem is used and page->flags doesn't have enough space to
  carry both the sparsemem section and node ID, NODE_NOT_IN_PAGE_FLAGS
  is set and the node is determined from section.  This requires that
  the NUMA nodes aren't more granular than sparsemem sections.
  node_map_pfn_alignment() is used to determine the maximum NUMA
  inter-node alignment which can distinguish all nodes to verify the
  above condition.

  The function currently assumes the NUMA node maps are populated and
  sorted and uses for_each_mem_pfn_range() to iterate memory regions.
  We want this to happen way earlier to support memory hotplug (maybe
  elaborate a bit more here).

  This patch updates node_map_pfn_alignment() so that it iterates over
  numa_meminfo instead and moves its invocation before memory regions
  are registered to memblock and node maps in numa_register_memblks().
  This will help memory hotplug (how...) and as a bonus we register
  memory regions only if the alignment check succeeds rather than
  registering and then failing.

Also, the comment on top of node_map_pfn_alignment() needs to be
updated, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 14/22] x86, mm, numa: Set memblock nid later
  2013-06-13 13:03 ` [Part1 PATCH v5 14/22] x86, mm, numa: Set memblock nid later Tang Chen
  2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
@ 2013-06-18  1:45   ` Tejun Heo
  1 sibling, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2013-06-18  1:45 UTC (permalink / raw)
  To: Tang Chen
  Cc: tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm

On Thu, Jun 13, 2013 at 09:03:01PM +0800, Tang Chen wrote:
> From: Yinghai Lu <yinghai@kernel.org>
> 
> In order to seperate parsing numa info procedure into two steps,

Short "why" would be nice.

> we need to set memblock nid later because it could change memblock
                                   ^
				   in where?

> array, and possible doube memblock.memory array which will allocate
             ^
	     possibly double

> buffer.

 which is bad why?

> Only set memblock nid once for successful path.
> 
> Also rename numa_register_memblks to numa_check_memblks() after
> moving out code of setting memblock nid.
> @@ -676,6 +669,11 @@ void __init x86_numa_init(void)
>  
>  	early_x86_numa_init();
>  
> +	for (i = 0; i < mi->nr_blks; i++) {
> +		struct numa_memblk *mb = &mi->blk[i];
> +		memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
> +	}
> +

Can we please have some comments explaining the new ordering
requirements?  When reading code, how is one supposed to know that the
ordering of operations is all deliberate and fragile?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 16/22] x86, mm, numa: Move numa emulation handling down.
  2013-06-13 13:03 ` [Part1 PATCH v5 16/22] x86, mm, numa: Move numa emulation handling down Tang Chen
  2013-06-14 21:33   ` [tip:x86/mm] " tip-bot for Yinghai Lu
@ 2013-06-18  1:58   ` Tejun Heo
  2013-06-18  6:22     ` Yinghai Lu
  1 sibling, 1 reply; 87+ messages in thread
From: Tejun Heo @ 2013-06-18  1:58 UTC (permalink / raw)
  To: Tang Chen
  Cc: tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm, David Rientjes

On Thu, Jun 13, 2013 at 09:03:03PM +0800, Tang Chen wrote:
> From: Yinghai Lu <yinghai@kernel.org>
> 
> numa_emulation() needs to allocate buffer for new numa_meminfo
> and distance matrix, so execute it later in x86_numa_init().
> 
> Also we change the behavoir:
> 	- before this patch, if user input wrong data in command
> 	  line, it will fall back to next numa probing or disabling
> 	  numa.
> 	- after this patch, if user input wrong data in command line,
> 	  it will stay with numa info probed from previous probing,
> 	  like ACPI SRAT or amd_numa.
> 
> We need to call numa_check_memblks to reject wrong user inputs early
> so that we can keep the original numa_meminfo not changed.

So, this is another very subtle ordering you're adding without any
comment and I'm not sure it even makes sense because the function can
fail after that point.

I'm getting really doubtful about this whole approach of carefully
splitting discovery and registration.  It's inherently fragile like
hell and the poor documentation makes it a lot worse.  I'm gonna reply
to the head message.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (21 preceding siblings ...)
  2013-06-13 13:03 ` [Part1 PATCH v5 22/22] x86, mm, numa: Put pagetable on local node ram for 64bit Tang Chen
@ 2013-06-18  2:03 ` Tejun Heo
  2013-06-18  5:47   ` Tang Chen
  2013-06-18 17:10 ` Vasilis Liaskovitis
  2013-06-21  5:19 ` H. Peter Anvin
  24 siblings, 1 reply; 87+ messages in thread
From: Tejun Heo @ 2013-06-18  2:03 UTC (permalink / raw)
  To: Tang Chen
  Cc: tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm

Hello,

On Thu, Jun 13, 2013 at 09:02:47PM +0800, Tang Chen wrote:
> One commit that tried to parse SRAT early get reverted before v3.9-rc1.
> 
> | commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
> | Author: Tang Chen <tangchen@cn.fujitsu.com>
> | Date:   Fri Feb 22 16:33:44 2013 -0800
> |
> |    acpi, memory-hotplug: parse SRAT before memblock is ready
> 
> It broke several things, like acpi override and fall back path etc.
> 
> This patchset is clean implementation that will parse numa info early.
> 1. keep the acpi table initrd override working by split finding with copying.
>    finding is done at head_32.S and head64.c stage,
>         in head_32.S, initrd is accessed in 32bit flat mode with phys addr.
>         in head64.c, initrd is accessed via kernel low mapping address
>         with help of #PF set page table.
>    copying is done with early_ioremap just after memblock is setup.
> 2. keep fallback path working. numaq and ACPI and amd_nmua and dummy.
>    seperate initmem_init to two stages.
>    early_initmem_init will only extract numa info early into numa_meminfo.
>    initmem_init will keep slit and emulation handling.
> 3. keep other old code flow untouched like relocate_initrd and initmem_init.
>    early_initmem_init will take old init_mem_mapping position.
>    it call early_x86_numa_init and init_mem_mapping for every nodes.
>    For 64bit, we avoid having size limit on initrd, as relocate_initrd
>    is still after init_mem_mapping for all memory.
> 4. last patch will try to put page table on local node, so that memory
>    hotplug will be happy.
> 
> In short, early_initmem_init will parse numa info early and call
> init_mem_mapping to set page table for every nodes's mem.

So, can you please explain why you're doing the above?  What are you
trying to achieve in the end and why is this the best approach?  This
is all for memory hotplug, right?

I can understand the part where you're move NUMA discovery before
initializations which will get allocated permanent addresses in the
wrong nodes, but trying to do the same with memblock itself is making
the code extremely fragile.  It's nasty because there's nothing
apparent which seems to necessitate such ordering.  The ordering looks
rather arbitrary but changing the orders will subtly break memory
hotplug support, which is a really bad way to structure the code.

Can't you just move memblock arrays after NUMA init is complete?
That'd be a lot simpler and way more robust than the proposed changes,
no?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-18  2:03 ` [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tejun Heo
@ 2013-06-18  5:47   ` Tang Chen
  2013-06-18 17:21     ` Tejun Heo
  0 siblings, 1 reply; 87+ messages in thread
From: Tang Chen @ 2013-06-18  5:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm

Hi tj,

On 06/18/2013 10:03 AM, Tejun Heo wrote:
......
>
> So, can you please explain why you're doing the above?  What are you
> trying to achieve in the end and why is this the best approach?  This
> is all for memory hotplug, right?

Yes, this is all for memory hotplug.

[why]
At early boot time (before parsing SRAT), memblock will allocate memory
for kernel to use. But the memory could be hotpluggable memory because
at such an early time, we don't know which memory is hotpluggable. This
will cause hotpluggable memory un-hotpluggable. What we are trying to
do is to prevent memblock from allocating hotpluggable memory.

[approach]
Parse SRAT earlier before memblock starts to work, because there is a
bit in SRAT specifying which memory is hotpluggable.

I'm not saying this is the best approach. I can also see that this
patch-set touches a lot of boot code. But i think parsing SRAT earlier
is reasonable because this is the only way for now to know which memory
is hotpluggable from firmware.

>
> I can understand the part where you're move NUMA discovery before
> initializations which will get allocated permanent addresses in the
> wrong nodes, but trying to do the same with memblock itself is making
> the code extremely fragile.  It's nasty because there's nothing
> apparent which seems to necessitate such ordering.  The ordering looks
> rather arbitrary but changing the orders will subtly break memory
> hotplug support, which is a really bad way to structure the code.
>
> Can't you just move memblock arrays after NUMA init is complete?
> That'd be a lot simpler and way more robust than the proposed changes,
> no?

Sorry, I don't quite understand the approach you are suggesting. If we
move memblock arrays, we need to update all the pointers pointing to
the moved memory. How can we do this ?

Thanks. :)

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 16/22] x86, mm, numa: Move numa emulation handling down.
  2013-06-18  1:58   ` [Part1 PATCH v5 16/22] " Tejun Heo
@ 2013-06-18  6:22     ` Yinghai Lu
  2013-06-18  7:13       ` Yinghai Lu
  2013-06-19 21:25       ` Yinghai Lu
  0 siblings, 2 replies; 87+ messages in thread
From: Yinghai Lu @ 2013-06-18  6:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Mel Gorman, Minchan Kim,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel,
	jweiner, Prarit Bhargava, the arch/x86 maintainers, linux-doc,
	Linux Kernel Mailing List, Linux MM, David Rientjes

On Mon, Jun 17, 2013 at 6:58 PM, Tejun Heo <tj@kernel.org> wrote:
> On Thu, Jun 13, 2013 at 09:03:03PM +0800, Tang Chen wrote:
>> From: Yinghai Lu <yinghai@kernel.org>
>>
>> numa_emulation() needs to allocate buffer for new numa_meminfo
>> and distance matrix, so execute it later in x86_numa_init().
>>
>> Also we change the behavoir:
>>       - before this patch, if user input wrong data in command
>>         line, it will fall back to next numa probing or disabling
>>         numa.
>>       - after this patch, if user input wrong data in command line,
>>         it will stay with numa info probed from previous probing,
>>         like ACPI SRAT or amd_numa.
>>
>> We need to call numa_check_memblks to reject wrong user inputs early
>> so that we can keep the original numa_meminfo not changed.
>
> So, this is another very subtle ordering you're adding without any
> comment and I'm not sure it even makes sense because the function can
> fail after that point.

Yes, if it fail, we will stay with current numa info from firmware.
That looks like right behavior.

Before this patch, it will fail to next numa way like if acpi srat + user
input fail, it will try to go with amd_numa then try apply user info.

>
> I'm getting really doubtful about this whole approach of carefully
> splitting discovery and registration.  It's inherently fragile like
> hell and the poor documentation makes it a lot worse.  I'm gonna reply
> to the head message.

Maybe look at the patch is not clear enough, but if looks at the final changed
code it would be more clear.

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 16/22] x86, mm, numa: Move numa emulation handling down.
  2013-06-18  6:22     ` Yinghai Lu
@ 2013-06-18  7:13       ` Yinghai Lu
  2013-06-19 21:25       ` Yinghai Lu
  1 sibling, 0 replies; 87+ messages in thread
From: Yinghai Lu @ 2013-06-18  7:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Mel Gorman, Minchan Kim,
	mina86, gong.chen, vasilis.liaskovitis, lwoodman, Rik van Riel,
	jweiner, Prarit Bhargava, the arch/x86 maintainers, linux-doc,
	Linux Kernel Mailing List, Linux MM, David Rientjes

On Mon, Jun 17, 2013 at 11:22 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> On Mon, Jun 17, 2013 at 6:58 PM, Tejun Heo <tj@kernel.org> wrote:
>> On Thu, Jun 13, 2013 at 09:03:03PM +0800, Tang Chen wrote:
>>> From: Yinghai Lu <yinghai@kernel.org>
>>>
>>> numa_emulation() needs to allocate buffer for new numa_meminfo
>>> and distance matrix, so execute it later in x86_numa_init().
>>>
>>> Also we change the behavoir:
>>>       - before this patch, if user input wrong data in command
>>>         line, it will fall back to next numa probing or disabling
>>>         numa.
>>>       - after this patch, if user input wrong data in command line,
>>>         it will stay with numa info probed from previous probing,
>>>         like ACPI SRAT or amd_numa.
>>>
>>> We need to call numa_check_memblks to reject wrong user inputs early
>>> so that we can keep the original numa_meminfo not changed.
>>
>> So, this is another very subtle ordering you're adding without any
>> comment and I'm not sure it even makes sense because the function can
>> fail after that point.
>
> Yes, if it fail, we will stay with current numa info from firmware.
> That looks like right behavior.
>
> Before this patch, it will fail to next numa way like if acpi srat + user
> input fail, it will try to go with amd_numa then try apply user info.
>
>>
>> I'm getting really doubtful about this whole approach of carefully
>> splitting discovery and registration.  It's inherently fragile like
>> hell and the poor documentation makes it a lot worse.  I'm gonna reply
>> to the head message.
>
> Maybe look at the patch is not clear enough, but if looks at the final changed
> code it would be more clear.

update the patches from 1-15 with your review.

git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-x86-mm

https://git.kernel.org/cgit/linux/kernel/git/yinghai/linux-yinghai.git/log/?h=for-x86-mm

Yinghai

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (22 preceding siblings ...)
  2013-06-18  2:03 ` [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tejun Heo
@ 2013-06-18 17:10 ` Vasilis Liaskovitis
  2013-06-18 20:19   ` Yinghai Lu
  2013-06-24  9:40   ` Gu Zheng
  2013-06-21  5:19 ` H. Peter Anvin
  24 siblings, 2 replies; 87+ messages in thread
From: Vasilis Liaskovitis @ 2013-06-18 17:10 UTC (permalink / raw)
  To: Tang Chen
  Cc: tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu, wency,
	laijs, isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	lwoodman, riel, jweiner, prarit, x86, linux-doc, linux-kernel,
	linux-mm

Hi,

On Thu, Jun 13, 2013 at 09:02:47PM +0800, Tang Chen wrote:
> From: Yinghai Lu <yinghai@kernel.org>
> 
> No offence, just rebase and resend the patches from Yinghai to help
> to push this functionality faster.
> Also improve the comments in the patches' log.
> 
> 
> One commit that tried to parse SRAT early get reverted before v3.9-rc1.
> 
> | commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
> | Author: Tang Chen <tangchen@cn.fujitsu.com>
> | Date:   Fri Feb 22 16:33:44 2013 -0800
> |
> |    acpi, memory-hotplug: parse SRAT before memblock is ready
> 
> It broke several things, like acpi override and fall back path etc.
> 
> This patchset is clean implementation that will parse numa info early.
> 1. keep the acpi table initrd override working by split finding with copying.
>    finding is done at head_32.S and head64.c stage,
>         in head_32.S, initrd is accessed in 32bit flat mode with phys addr.
>         in head64.c, initrd is accessed via kernel low mapping address
>         with help of #PF set page table.
>    copying is done with early_ioremap just after memblock is setup.
> 2. keep fallback path working. numaq and ACPI and amd_nmua and dummy.
>    seperate initmem_init to two stages.
>    early_initmem_init will only extract numa info early into numa_meminfo.
>    initmem_init will keep slit and emulation handling.
> 3. keep other old code flow untouched like relocate_initrd and initmem_init.
>    early_initmem_init will take old init_mem_mapping position.
>    it call early_x86_numa_init and init_mem_mapping for every nodes.
>    For 64bit, we avoid having size limit on initrd, as relocate_initrd
>    is still after init_mem_mapping for all memory.
> 4. last patch will try to put page table on local node, so that memory
>    hotplug will be happy.
> 
> In short, early_initmem_init will parse numa info early and call
> init_mem_mapping to set page table for every nodes's mem.
> 
> could be found at:
>         git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm
> 
> and it is based on today's Linus tree.
>

Has this patchset been tested on various numa configs?
I am using linux-next next-20130607 + part1 with qemu/kvm/seabios VMs. The kernel
boots successfully in many numa configs but while trying different memory sizes
for a 2 numa node VM, I noticed that booting does not complete in all cases
(bootup screen appears to hang but there is no output indicating an early panic)

node0   node1	 boots
1G 	1G	 yes
1G 	2G	 yes
1G 	0.5G	 yes
3G 	2.5G	 yes
3G 	3G 	 yes
4G 	0G	 yes
4G 	4G	 yes
1.5G	1G	 no
2G 	1G	 no
2G 	2G	 no
2.5G 	2G	 no
2.5G 	2.5G	 no

linux-next next-20130607 boots al of these configs fine.

Looks odd, perhaps I have something wrong in my setup or maybe there is a
seabios/qemu interaction with this patchset. I will update if I find something.

thanks,

- Vasilis



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-18  5:47   ` Tang Chen
@ 2013-06-18 17:21     ` Tejun Heo
  2013-06-20  5:52       ` Tang Chen
  0 siblings, 1 reply; 87+ messages in thread
From: Tejun Heo @ 2013-06-18 17:21 UTC (permalink / raw)
  To: Tang Chen
  Cc: tglx, mingo, hpa, akpm, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm

Hey, Tang.

On Tue, Jun 18, 2013 at 01:47:16PM +0800, Tang Chen wrote:
> [approach]
> Parse SRAT earlier before memblock starts to work, because there is a
> bit in SRAT specifying which memory is hotpluggable.
> 
> I'm not saying this is the best approach. I can also see that this
> patch-set touches a lot of boot code. But i think parsing SRAT earlier
> is reasonable because this is the only way for now to know which memory
> is hotpluggable from firmware.

Touching a lot of code is not a problem but it feels like it's trying
to boot strap itself while walking and achieves that by carefully
sequencing all operations which may allocate from memblock before NUMA
info is available without any way to enforce or verify that.

> >Can't you just move memblock arrays after NUMA init is complete?
> >That'd be a lot simpler and way more robust than the proposed changes,
> >no?
> 
> Sorry, I don't quite understand the approach you are suggesting. If we
> move memblock arrays, we need to update all the pointers pointing to
> the moved memory. How can we do this ?

So, there are two things involved here - memblock itself and consumers
of memblock, right?  I get that the latter shouldn't allocate memory
from memblock before NUMA info is entered into memblock, so please
reorder as necessary *and* make sure memblock complains if something
violates that.  Temporary memory areas which are return are fine.
Just complain if there are memory regions remaining which are
allocated before NUMA info is available after boot is complete.  No
need to make booting more painful than it currently is.

As for memblock itself, there's no need to walk carefully around it.
Just let it do its thing and implement
memblock_relocate_to_numa_node_0() or whatever after NUMA information
is available.  memblock already does relocate itself whenever it's
expanding the arrays anyway, so implementation should be trivial.

Maybe I'm missing something but having a working memory allocator as
soon as possible is *way* less painful than trying to bootstrap around
it.  Allow boot path to allocate memory areas from memblock as soon as
possible but just ensure that none of the ones which may violate the
hotplug requirements is remaining once boot is complete.  Temporaray
regions won't matter then and the few which need persistent areas can
either be reordered to happen after NUMA init or they can allocate a
new area and move to there after NUMA info is available.  Let's please
minimize this walking-and-trying-to-tie-shoestrings-at-the-same-time
thing.  It's painful and extremely fragile.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-18 17:10 ` Vasilis Liaskovitis
@ 2013-06-18 20:19   ` Yinghai Lu
  2013-06-19 10:05     ` Vasilis Liaskovitis
  2013-06-24  9:40   ` Gu Zheng
  1 sibling, 1 reply; 87+ messages in thread
From: Yinghai Lu @ 2013-06-18 20:19 UTC (permalink / raw)
  To: Vasilis Liaskovitis
  Cc: Tang Chen, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Morton, Tejun Heo, Thomas Renninger, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Mel Gorman,
	Minchan Kim, mina86, gong.chen, lwoodman, Rik van Riel, jweiner,
	Prarit Bhargava, the arch/x86 maintainers, linux-doc,
	Linux Kernel Mailing List, Linux MM

On Tue, Jun 18, 2013 at 10:10 AM, Vasilis Liaskovitis
<vasilis.liaskovitis@profitbricks.com> wrote:
>> could be found at:
>>         git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm
>>
>> and it is based on today's Linus tree.
>>
>
> Has this patchset been tested on various numa configs?
> I am using linux-next next-20130607 + part1 with qemu/kvm/seabios VMs. The kernel
> boots successfully in many numa configs but while trying different memory sizes
> for a 2 numa node VM, I noticed that booting does not complete in all cases
> (bootup screen appears to hang but there is no output indicating an early panic)
>
> node0   node1    boots
> 1G      1G       yes
> 1G      2G       yes
> 1G      0.5G     yes
> 3G      2.5G     yes
> 3G      3G       yes
> 4G      0G       yes
> 4G      4G       yes
> 1.5G    1G       no
> 2G      1G       no
> 2G      2G       no
> 2.5G    2G       no
> 2.5G    2.5G     no
>
> linux-next next-20130607 boots al of these configs fine.
>
> Looks odd, perhaps I have something wrong in my setup or maybe there is a
> seabios/qemu interaction with this patchset. I will update if I find something.

just tried 2g/2g, and it works on qemu-kvm:

early console in setup code
Probing EDD (edd=off to disable)... ok
early console in decompress_kernel
decompress_kernel:
  input: [0x2a8e2c2-0x3393991], output: 0x1000000, heap: [0x339b200-0x33a31ff]

Decompressing Linux... xz... Parsing ELF... done.
Booting the kernel.
[    0.000000] bootconsole [uart0] enabled
[    0.000000]    real_mode_data :      phys 0000000000014490
[    0.000000]    real_mode_data :      virt ffff880000014490
[    0.000000]       boot_params : init virt ffffffff82f869a0
[    0.000000]       boot_params :      phys 0000000002f869a0
[    0.000000]       boot_params :      virt ffff880002f869a0
[    0.000000] boot_command_line : init virt ffffffff82e53020
[    0.000000] boot_command_line :      phys 0000000002e53020
[    0.000000] boot_command_line :      virt ffff880002e53020
[    0.000000] Kernel Layout:
[    0.000000]   .text: [0x01000000-0x020b8840]
[    0.000000] .rodata: [0x02200000-0x029d3fff]
[    0.000000]   .data: [0x02a00000-0x02bd4d7f]
[    0.000000]   .init: [0x02bd6000-0x02f71fff]
[    0.000000]    .bss: [0x02f80000-0x03c20fff]
[    0.000000]    .brk: [0x03c21000-0x03c45fff]
[    0.000000] memblock_reserve: [0x0009f000-0x000fffff] * BIOS reserved
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.10.0-rc6-yh-01398-ga6660aa-dirty
(yhlu@linux-siqj.site) (gcc version 4.7.2 20130108 [gcc-4_7-branch
revision 195012] (SUSE Linux) ) #1754 SMP Tue Jun 18 13:10:47 PDT 2013
[    0.000000] memblock_reserve: [0x01000000-0x03c20fff] TEXT DATA BSS
[    0.000000] memblock_reserve: [0x7dcef000-0x7fffefff] RAMDISK
[    0.000000] Command line: BOOT_IMAGE=linux debug ignore_loglevel
initcall_debug pci=routeirq ramdisk_size=262144 root=/dev/ram0 rw
ip=dhcp console=uart8250,io,0x3f8,115200 initrd=initrd.img
[    0.000000] KERNEL supported cpus:
[    0.000000]   Intel GenuineIntel
[    0.000000]   AMD AuthenticAMD
[    0.000000]   Centaur CentaurHauls
[    0.000000] Physical RAM map:
[    0.000000] raw: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] raw: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] raw: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] raw: [mem 0x0000000000100000-0x00000000dfffdfff] usable
[    0.000000] raw: [mem 0x00000000dfffe000-0x00000000dfffffff] reserved
[    0.000000] raw: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] raw: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] raw: [mem 0x0000000100000000-0x000000011fffffff] usable
[    0.000000] e820: BIOS-provided physical RAM map (sanitized by setup):
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000dfffdfff] usable
[    0.000000] BIOS-e820: [mem 0x00000000dfffe000-0x00000000dfffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000011fffffff] usable
[    0.000000] debug: ignoring loglevel setting.
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] SMBIOS 2.4 present.
[    0.000000] DMI: Bochs Bochs, BIOS Bochs 01/01/2011
[    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000000] No AGP bridge found
[    0.000000] e820: last_pfn = 0x120000 max_arch_pfn = 0x400000000
[    0.000000] MTRR default type: write-back
[    0.000000] MTRR fixed ranges enabled:
[    0.000000]   00000-9FFFF write-back
[    0.000000]   A0000-BFFFF uncachable
[    0.000000]   C0000-FFFFF write-protect
[    0.000000] MTRR variable ranges enabled:
[    0.000000]   0 [00E0000000-00FFFFFFFF] mask FFE0000000 uncachable
[    0.000000]   1 disabled
[    0.000000]   2 disabled
[    0.000000]   3 disabled
[    0.000000]   4 disabled
[    0.000000]   5 disabled
[    0.000000]   6 disabled
[    0.000000]   7 disabled
[    0.000000] PAT not supported by CPU.
[    0.000000] e820: last_pfn = 0xdfffe max_arch_pfn = 0x400000000
[    0.000000] found SMP MP-table at [mem 0x000fdae0-0x000fdaef]
mapped at [ffff8800000fdae0]
[    0.000000] memblock_reserve: [0x000fdae0-0x000fdaef] * MP-table mpf
[    0.000000] memblock_reserve: [0x000fdaf0-0x000fdbe3] * MP-table mpc
[    0.000000] memblock_reserve: [0x03c21000-0x03c26fff] BRK
[    0.000000] MEMBLOCK configuration:
[    0.000000]  memory size = 0xfff9cc00 reserved size = 0x4f98000
[    0.000000]  memory.cnt  = 0x3
[    0.000000]  memory[0x0]    [0x00001000-0x0009efff], 0x9e000 bytes
[    0.000000]  memory[0x1]    [0x00100000-0xdfffdfff], 0xdfefe000 bytes
[    0.000000]  memory[0x2]    [0x100000000-0x11fffffff], 0x20000000 bytes
[    0.000000]  reserved.cnt  = 0x3
[    0.000000]  reserved[0x0]    [0x0009f000-0x000fffff], 0x61000 bytes
[    0.000000]  reserved[0x1]    [0x01000000-0x03c26fff], 0x2c27000 bytes
[    0.000000]  reserved[0x2]    [0x7dcef000-0x7fffefff], 0x2310000 bytes
[    0.000000] memblock_reserve: [0x00099000-0x0009efff] TRAMPOLINE
[    0.000000] Base memory trampoline at [ffff880000099000] 99000 size 24576
[    0.000000] memblock_reserve: [0x00000000-0x0000ffff] RESERVELOW
[    0.000000] ACPI: RSDP 00000000000fd8d0 00014 (v00 BOCHS )
[    0.000000] ACPI: RSDT 00000000dfffe270 00038 (v01 BOCHS  BXPCRSDT
00000001 BXPC 00000001)
[    0.000000] ACPI: FACP 00000000dfffff80 00074 (v01 BOCHS  BXPCFACP
00000001 BXPC 00000001)
[    0.000000] ACPI: DSDT 00000000dfffe2b0 011A9 (v01   BXPC   BXDSDT
00000001 INTL 20100528)
[    0.000000] ACPI: FACS 00000000dfffff40 00040
[    0.000000] ACPI: SSDT 00000000dffff6e0 00858 (v01 BOCHS  BXPCSSDT
00000001 BXPC 00000001)
[    0.000000] ACPI: APIC 00000000dffff5b0 00090 (v01 BOCHS  BXPCAPIC
00000001 BXPC 00000001)
[    0.000000] ACPI: HPET 00000000dffff570 00038 (v01 BOCHS  BXPCHPET
00000001 BXPC 00000001)
[    0.000000] ACPI: SRAT 00000000dffff460 00110 (v01 BOCHS  BXPCSRAT
00000001 BXPC 00000001)
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] SRAT: PXM 0 -> APIC 0x00 -> Node 0
[    0.000000] SRAT: PXM 0 -> APIC 0x01 -> Node 0
[    0.000000] SRAT: PXM 1 -> APIC 0x02 -> Node 1
[    0.000000] SRAT: PXM 1 -> APIC 0x03 -> Node 1
[    0.000000] SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[    0.000000] SRAT: Node 0 PXM 0 [mem 0x00100000-0x7fffffff]
[    0.000000] SRAT: Node 1 PXM 1 [mem 0x80000000-0xdfffffff]
[    0.000000] SRAT: Node 1 PXM 1 [mem 0x100000000-0x11fffffff]
[    0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem
0x00100000-0x7fffffff] -> [mem 0x00000000-0x7fffffff]
[    0.000000] NUMA: Node 1 [mem 0x80000000-0xdfffffff] + [mem
0x100000000-0x11fffffff] -> [mem 0x80000000-0x11fffffff]
[    0.000000] Node 0: [mem 0x00000000000000-0x0000007fffffff]
[    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[    0.000000]  [mem 0x00000000-0x000fffff] page 4k
[    0.000000] BRK [0x03c22000, 0x03c22fff] PGTABLE
[    0.000000] BRK [0x03c23000, 0x03c23fff] PGTABLE
[    0.000000] BRK [0x03c24000, 0x03c24fff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x7da00000-0x7dbfffff]
[    0.000000]  [mem 0x7da00000-0x7dbfffff] page 2M
[    0.000000] BRK [0x03c25000, 0x03c25fff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x7c000000-0x7d9fffff]
[    0.000000]  [mem 0x7c000000-0x7d9fffff] page 2M
[    0.000000] init_memory_mapping: [mem 0x00100000-0x7bffffff]
[    0.000000]  [mem 0x00100000-0x001fffff] page 4k
[    0.000000]  [mem 0x00200000-0x7bffffff] page 2M
[    0.000000] init_memory_mapping: [mem 0x7dc00000-0x7fffffff]
[    0.000000]  [mem 0x7dc00000-0x7fffffff] page 2M
[    0.000000] Node 1: [mem 0x00000080000000-0x0000011fffffff]
[    0.000000] init_memory_mapping: [mem 0x11fe00000-0x11fffffff]
[    0.000000]  [mem 0x11fe00000-0x11fffffff] page 2M
[    0.000000] BRK [0x03c26000, 0x03c26fff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x11c000000-0x11fdfffff]
[    0.000000]  [mem 0x11c000000-0x11fdfffff] page 2M
[    0.000000] init_memory_mapping: [mem 0x100000000-0x11bffffff]
[    0.000000]  [mem 0x100000000-0x11bffffff] page 2M
[    0.000000] init_memory_mapping: [mem 0x80000000-0xdfffdfff]
[    0.000000]  [mem 0x80000000-0xdfdfffff] page 2M
[    0.000000]  [mem 0xdfe00000-0xdfffdfff] page 4k
[    0.000000] memblock_reserve: [0x11ffff000-0x11fffffff] PGTABLE
[    0.000000] memblock_reserve: [0x11fffe000-0x11fffefff] PGTABLE
[    0.000000] memblock_reserve: [0x11fffd000-0x11fffdfff] PGTABLE
[    0.000000] RAMDISK: [mem 0x7dcef000-0x7fffefff]
[    0.000000] Initmem setup node 0 [mem 0x00000000-0x7fffffff]
[    0.000000] memblock_reserve: [0x7dcc8000-0x7dceefff]
[    0.000000]   NODE_DATA [mem 0x7dcc8000-0x7dceefff]
[    0.000000] Initmem setup node 1 [mem 0x80000000-0x11fffffff]
[    0.000000] memblock_reserve: [0x11ffd6000-0x11fffcfff]
[    0.000000]   NODE_DATA [mem 0x11ffd6000-0x11fffcfff]
[    0.000000] MEMBLOCK configuration:
[    0.000000]  memory size = 0xfff9cc00 reserved size = 0x4fff000
[    0.000000]  memory.cnt  = 0x4
[    0.000000]  memory[0x0]    [0x00001000-0x0009efff], 0x9e000 bytes on node 0
[    0.000000]  memory[0x1]    [0x00100000-0x7fffffff], 0x7ff00000
bytes on node 0
[    0.000000]  memory[0x2]    [0x80000000-0xdfffdfff], 0x5fffe000
bytes on node 1
[    0.000000]  memory[0x3]    [0x100000000-0x11fffffff], 0x20000000
bytes on node 1
[    0.000000]  reserved.cnt  = 0x5
[    0.000000]  reserved[0x0]    [0x00000000-0x0000ffff], 0x10000 bytes
[    0.000000]  reserved[0x1]    [0x00099000-0x000fffff], 0x67000 bytes
[    0.000000]  reserved[0x2]    [0x01000000-0x03c26fff], 0x2c27000 bytes
[    0.000000]  reserved[0x3]    [0x7dcc8000-0x7fffefff], 0x2337000 bytes
[    0.000000]  reserved[0x4]    [0x11ffd6000-0x11fffffff], 0x2a000 bytes
[    0.000000] memblock_reserve: [0x7ffff000-0x7fffffff] sparse section
[    0.000000] memblock_reserve: [0x11fbd6000-0x11ffd5fff] usemap_map
[    0.000000] memblock_reserve: [0x7dcc7e00-0x7dcc7fff] usemap section
[    0.000000] memblock_reserve: [0x11fbd5e00-0x11fbd5fff] usemap section
[    0.000000] memblock_reserve: [0x11f7d5e00-0x11fbd5dff] map_map
[    0.000000] memblock_reserve: [0x7bc00000-0x7dbfffff] vmemmap buf
[    0.000000] memblock_reserve: [0x7dcc6000-0x7dcc6fff] vmemmap block
[    0.000000]  [ffffea0000000000-ffffea7fffffffff] PGD @
ffff88007dcc6000 on node 0
[    0.000000] memblock_reserve: [0x7dcc5000-0x7dcc5fff] vmemmap block
[    0.000000]  [ffffea0000000000-ffffea003fffffff] PUD @
ffff88007dcc5000 on node 0
[    0.000000]    memblock_free: [0x7dc00000-0x7dbfffff]
[    0.000000] memblock_reserve: [0x11d600000-0x11f5fffff] vmemmap buf
[    0.000000]  [ffffea0000000000-ffffea0001ffffff] PMD ->
[ffff88007bc00000-ffff88007dbfffff] on node 0
[    0.000000]    memblock_free: [0x11f600000-0x11f5fffff]
[    0.000000]  [ffffea0002000000-ffffea00047fffff] PMD ->
[ffff88011d600000-ffff88011f5fffff] on node 1
[    0.000000]    memblock_free: [0x11f7d5e00-0x11fbd5dff]
[    0.000000]    memblock_free: [0x11fbd6000-0x11ffd5fff]Zone ranges:
[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
[    0.000000]   Normal   [mem 0x100000000-0x11fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009efff]
[    0.000000]   node   0: [mem 0x00100000-0x7fffffff]
[    0.000000]   node   1: [mem 0x80000000-0xdfffdfff]
[    0.000000]   node   1: [mem 0x100000000-0x11fffffff]
[    0.000000] start - node_states[2]:
[    0.000000] On node 0 totalpages: 524190
[    0.000000]   DMA zone: 64 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3998 pages, LIFO batch:0
[    0.000000] memblock_reserve: [0x7dc6d000-0x7dcc4fff] pgdat
[    0.000000]   DMA32 zone: 8128 pages used for memmap
[    0.000000]   DMA32 zone: 520192 pages, LIFO batch:31
[    0.000000] memblock_reserve: [0x7dc15000-0x7dc6cfff] pgdat
[    0.000000] On node 1 totalpages: 524286
[    0.000000]   DMA32 zone: 6144 pages used for memmap
[    0.000000]   DMA32 zone: 393214 pages, LIFO batch:31
[    0.000000] memblock_reserve: [0x11ff7e000-0x11ffd5fff] pgdat
[    0.000000]   Normal zone: 2048 pages used for memmap
[    0.000000]   Normal zone: 131072 pages, LIFO batch:31
[    0.000000] memblock_reserve: [0x11ff26000-0x11ff7dfff] pgdat
[    0.000000] after - node_states[2]: 0-1
[    0.000000] memblock_reserve: [0x11ff25000-0x11ff25fff] pgtable

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-18 20:19   ` Yinghai Lu
@ 2013-06-19 10:05     ` Vasilis Liaskovitis
  2013-06-20 18:42       ` Yinghai Lu
  0 siblings, 1 reply; 87+ messages in thread
From: Vasilis Liaskovitis @ 2013-06-19 10:05 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Tang Chen, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Morton, Tejun Heo, Thomas Renninger, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Mel Gorman,
	Minchan Kim, mina86, gong.chen, lwoodman, Rik van Riel, jweiner,
	Prarit Bhargava, the arch/x86 maintainers, linux-doc,
	Linux Kernel Mailing List, Linux MM

On Tue, Jun 18, 2013 at 01:19:12PM -0700, Yinghai Lu wrote:
> On Tue, Jun 18, 2013 at 10:10 AM, Vasilis Liaskovitis
> <vasilis.liaskovitis@profitbricks.com> wrote:
> >> could be found at:
> >>         git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm
> >>
> >> and it is based on today's Linus tree.
> >>
> >
> > Has this patchset been tested on various numa configs?
> > I am using linux-next next-20130607 + part1 with qemu/kvm/seabios VMs. The kernel
> > boots successfully in many numa configs but while trying different memory sizes
> > for a 2 numa node VM, I noticed that booting does not complete in all cases
> > (bootup screen appears to hang but there is no output indicating an early panic)
> >
> > node0   node1    boots
> > 1G      1G       yes
> > 1G      2G       yes
> > 1G      0.5G     yes
> > 3G      2.5G     yes
> > 3G      3G       yes
> > 4G      0G       yes
> > 4G      4G       yes
> > 1.5G    1G       no
> > 2G      1G       no
> > 2G      2G       no
> > 2.5G    2G       no
> > 2.5G    2.5G     no
> >
> > linux-next next-20130607 boots al of these configs fine.
> >
> > Looks odd, perhaps I have something wrong in my setup or maybe there is a
> > seabios/qemu interaction with this patchset. I will update if I find something.
> 
> just tried 2g/2g, and it works on qemu-kvm:

thanks for testing. If you can also share qemu/seabios versions you use (release
or git commits), that would be helpful.

this is most likely some error on my setup, I 'll let you know if I conclude
otherwise.

thanks,

- Vasilis

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 16/22] x86, mm, numa: Move numa emulation handling down.
  2013-06-18  6:22     ` Yinghai Lu
  2013-06-18  7:13       ` Yinghai Lu
@ 2013-06-19 21:25       ` Yinghai Lu
  1 sibling, 0 replies; 87+ messages in thread
From: Yinghai Lu @ 2013-06-19 21:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tang Chen, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Morton, Thomas Renninger, Jiang Liu, Wen Congyang,
	Lai Jiangshan, Yasuaki Ishimatsu, Mel Gorman, Minchan Kim,
	mina86, gong.chen, Vasilis Liaskovitis, lwoodman, Rik van Riel,
	jweiner, Prarit Bhargava, the arch/x86 maintainers, linux-doc,
	Linux Kernel Mailing List, Linux MM, David Rientjes

On Mon, Jun 17, 2013 at 11:22 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> On Mon, Jun 17, 2013 at 6:58 PM, Tejun Heo <tj@kernel.org> wrote:
>> On Thu, Jun 13, 2013 at 09:03:03PM +0800, Tang Chen wrote:
>>> From: Yinghai Lu <yinghai@kernel.org>
>>>
>>> numa_emulation() needs to allocate buffer for new numa_meminfo
>>> and distance matrix, so execute it later in x86_numa_init().
>>>
>>> Also we change the behavoir:
>>>       - before this patch, if user input wrong data in command
>>>         line, it will fall back to next numa probing or disabling
>>>         numa.
>>>       - after this patch, if user input wrong data in command line,
>>>         it will stay with numa info probed from previous probing,
>>>         like ACPI SRAT or amd_numa.
>>>
>>> We need to call numa_check_memblks to reject wrong user inputs early
>>> so that we can keep the original numa_meminfo not changed.
>>
>> So, this is another very subtle ordering you're adding without any
>> comment and I'm not sure it even makes sense because the function can
>> fail after that point.

the new numa_emulation will call numa_check_memblks at first before
touch numa_meminfo.
if it fails, numa_meminfo is not touched, so that should not a problem.

>
> Yes, if it fail, we will stay with current numa info from firmware.
> That looks like right behavior.
>
> Before this patch, it will fail to next numa way like if acpi srat + user
> input fail, it will try to go with amd_numa then try apply user info.

For numa emulation fail sequence, want to double check what should be
right seuence:

on and before 2.6.38:
   emulation ==> acpi ==> amd ==> dummy
so if emulation with wrong input, will fall back to acpi numa.

from 2.6.39
   acpi (emulation) ==> amd (emulation) ==> dummy (emulation)
if emulation with wrong input, it will fall back to next numa discovery.

after my patchset
will be acpi ==> amd ==> dummy
          emulation.
the new emulation will call numa_check_memblks at first before touch
numa_meminfo.
anyway if emulation fails, numa_meminfo is not touched.

so this change looks like right change.

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-18 17:21     ` Tejun Heo
@ 2013-06-20  5:52       ` Tang Chen
  2013-06-20  6:17         ` Tejun Heo
  0 siblings, 1 reply; 87+ messages in thread
From: Tang Chen @ 2013-06-20  5:52 UTC (permalink / raw)
  To: Tejun Heo, yinghai
  Cc: tglx, mingo, hpa, akpm, trenn, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm

Hi tj, Yinghai,

On 06/19/2013 01:21 AM, Tejun Heo wrote:
> Hey, Tang.
>
> On Tue, Jun 18, 2013 at 01:47:16PM +0800, Tang Chen wrote:
>> [approach]
>> Parse SRAT earlier before memblock starts to work, because there is a
>> bit in SRAT specifying which memory is hotpluggable.
>>
>> I'm not saying this is the best approach. I can also see that this
>> patch-set touches a lot of boot code. But i think parsing SRAT earlier
>> is reasonable because this is the only way for now to know which memory
>> is hotpluggable from firmware.
>
> Touching a lot of code is not a problem but it feels like it's trying
> to boot strap itself while walking and achieves that by carefully
> sequencing all operations which may allocate from memblock before NUMA
> info is available without any way to enforce or verify that.

Yes, the current implementation has no way to verify if there is
anything violating the hotplug requirement. This is weak and should
be improved.

>
>>> Can't you just move memblock arrays after NUMA init is complete?
>>> That'd be a lot simpler and way more robust than the proposed changes,
>>> no?
>>
>> Sorry, I don't quite understand the approach you are suggesting. If we
>> move memblock arrays, we need to update all the pointers pointing to
>> the moved memory. How can we do this ?
>
> So, there are two things involved here - memblock itself and consumers
> of memblock, right?

Yes.

>I get that the latter shouldn't allocate memory
> from memblock before NUMA info is entered into memblock, so please
> reorder as necessary *and* make sure memblock complains if something
> violates that.  Temporary memory areas which are return are fine.
> Just complain if there are memory regions remaining which are
> allocated before NUMA info is available after boot is complete.  No
> need to make booting more painful than it currently is.

I think there are two difficulties to do this job in your way.

1. It is difficult to tell which memory allocation is temporary and
    which one is permanent when memblock is allocating memory. So, we
    can only wait till boot is complete, and see which remains.
    But, we have the second difficulty.

2. In memblock.reserve[], we cannot tell why we allocated this memory
    just from the array item, right?  So it is difficult to do the
    relocation. If in the future, we have to allocate permanent memory
    for other new purposes, we have to do the relocation again and again.
    (Not sure if I understand the point correctly. I think there isn't
     a generic way to relocate memory used for different purposes.)

If you also had a look at the Part2 patches, you will see that I
introduced a flags member into memblock to specify different types
of memory, which will help to recognize hotpluggable memory. My
thinking is that ensure memblock will not allocate hotpluggable
memory. I think this is the most safe and easy way to satisfy hotplug
requirement.

(not finished, please see below)

>
> As for memblock itself, there's no need to walk carefully around it.
> Just let it do its thing and implement
> memblock_relocate_to_numa_node_0() or whatever after NUMA information
> is available.  memblock already does relocate itself whenever it's
> expanding the arrays anyway, so implementation should be trivial.

Yes, this is easy.

>
> Maybe I'm missing something but having a working memory allocator as
> soon as possible is *way* less painful than trying to bootstrap around
> it.  Allow boot path to allocate memory areas from memblock as soon as
> possible but just ensure that none of the ones which may violate the
> hotplug requirements is remaining once boot is complete.  Temporaray
> regions won't matter then and the few which need persistent areas can
> either be reordered to happen after NUMA init or they can allocate a
> new area and move to there after NUMA info is available.  Let's please
> minimize this walking-and-trying-to-tie-shoestrings-at-the-same-time
> thing.  It's painful and extremely fragile.

IIUC, I know what you are worrying about:

1. No way to ensure parsing numa info is early enough in the future.
    Someone could have a chance to use memblock before parsing SRAT
    in the future.

2. memblock won't complain if anything violates the hotplug requirement.
    This is not safe.

So you don't agree to serialize the operations at boot time.

But I think ensuring memblock won't allocate hotpluggable memory to
the kernel (which is the current way in Part2 patches) is the safest
way to satisfy memory hotplug requirement. And this is right a working
memory allocator at boot time. Not checking or relocating after system
boots.

About this patch-set from Yinghai, actually he is doing a job that I
failed to do. And he also included a lot of other things in the
patch-set, such as extend max number of overridable acpi tables, local
node pagetable, and so on.

Maybe all these things are done at the same time looks a little messy.
So, how about we do it this way:

1. Improvements for ACPI_TABLE_OVERRIDE, such as increase the number
    of overridable tables.

2. Move forward parsing SRAT.

3. local device pagetable (not local node), I mentioned in Part3
    patch-set discussion. I'm now also working on it.

I'm not trying to do thing half way. I just think maybe smaller patch-set
will be easy to understand and review.


PS:
More info about local device pagetable:

There could be more than on memory device in a numa node. If we allocate
local node pagetable, the pagetable pages of one memory device could be
in another memory device. So the memory device containing pagetable have
to be hot-removed in the last place. This is hard to handle in hot-remove
path. So maybe local device pagetable is more reasonable.

Thanks. :)








^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-20  5:52       ` Tang Chen
@ 2013-06-20  6:17         ` Tejun Heo
  2013-06-21  9:19           ` Tang Chen
  0 siblings, 1 reply; 87+ messages in thread
From: Tejun Heo @ 2013-06-20  6:17 UTC (permalink / raw)
  To: Tang Chen
  Cc: yinghai, tglx, mingo, hpa, akpm, trenn, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm

Hello, Tang.

On Thu, Jun 20, 2013 at 01:52:50PM +0800, Tang Chen wrote:
> 1. It is difficult to tell which memory allocation is temporary and
>    which one is permanent when memblock is allocating memory. So, we
>    can only wait till boot is complete, and see which remains.
>    But, we have the second difficulty.
> 
> 2. In memblock.reserve[], we cannot tell why we allocated this memory
>    just from the array item, right?  So it is difficult to do the
>    relocation. If in the future, we have to allocate permanent memory
>    for other new purposes, we have to do the relocation again and again.
>    (Not sure if I understand the point correctly. I think there isn't
>     a generic way to relocate memory used for different purposes.)

I was suggesting two separate things.

* As memblock allocator can relocate itself.  There's no point in
  avoiding setting NUMA node while parsing and registering NUMA
  topology.  Just parse and register NUMA info and later tell it to
  relocate itself out of hot-pluggable node.  A number of patches in
  the series is doing this dancing - carefully reordering NUMA
  probing.  No need to do that.  It's really fragile thing to do.

* Once you get the above out of the way, I don't think there are a lot
  of permanent allocations in the way before NUMA is initialized.
  Re-order the remaining ones if that's cleaner to do.  If that gets
  overly messy / fragile, copying them around or freeing and reloading
  afterwards could be an option too.  There isn't much point in being
  super-efficient about ACPI override table.  Being cleaner and more
  robust is far more important.

As for distinguishing temporary / permanent, it shouldn't be difficult
to make memblock track all allocations before NUMA info becomes online
and then verify that those areas are free by the time boot is
complete.  Just mark the reserved areas allocated before NUMA info is
fully available.

> If you also had a look at the Part2 patches, you will see that I
> introduced a flags member into memblock to specify different types
> of memory, which will help to recognize hotpluggable memory. My
> thinking is that ensure memblock will not allocate hotpluggable
> memory. I think this is the most safe and easy way to satisfy hotplug
> requirement.

And you can use exactly the same mechanism to track memory areas which
were allocated before NUMA info was fully available, right?

> So you don't agree to serialize the operations at boot time.

No, I'm not disagreeing that some ordering is necessary.  My point is
that things seem to be going that way too far.  Sure, some reordering
is necessary but it doesn't have to be this fragile.  Careful
reordering isn't the only way to achieve it.

> About this patch-set from Yinghai, actually he is doing a job that I
> failed to do. And he also included a lot of other things in the
> patch-set, such as extend max number of overridable acpi tables, local
> node pagetable, and so on.

Doing multiple things to achieve a goal in a patchset might not be
optimal but is usually okay if properly explained.  What's not okay is
not explaining the overall goal, approach and design in the head
message, poor quality of patch description and code documentation.

This part of code is almost inherently fragile and difficult to debug
and patchset like this would degrade the maintainability and I really
don't want to spend hours trying to decipher what the overall approach
is by trying to navigate maze of poorly documented patches only to
find out that some of the basic approaches are not very agreeable.  We
could have had this exact discussion way earlier if the head message
properly described what was going on and the review process would have
been much more pleasant for all involved parties.

I don't think it matters whose patches go in how as long as they are
attributed correctly.  The end result - what goes in the git tree as
log and code changes - matters, and it needs to be whole lot better.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-19 10:05     ` Vasilis Liaskovitis
@ 2013-06-20 18:42       ` Yinghai Lu
  0 siblings, 0 replies; 87+ messages in thread
From: Yinghai Lu @ 2013-06-20 18:42 UTC (permalink / raw)
  To: Vasilis Liaskovitis
  Cc: Tang Chen, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andrew Morton, Tejun Heo, Thomas Renninger, Jiang Liu,
	Wen Congyang, Lai Jiangshan, Yasuaki Ishimatsu, Mel Gorman,
	Minchan Kim, mina86, gong.chen, lwoodman, Rik van Riel, jweiner,
	Prarit Bhargava, the arch/x86 maintainers, linux-doc,
	Linux Kernel Mailing List, Linux MM

On Wed, Jun 19, 2013 at 3:05 AM, Vasilis Liaskovitis
<vasilis.liaskovitis@profitbricks.com> wrote:
> On Tue, Jun 18, 2013 at 01:19:12PM -0700, Yinghai Lu wrote:
>> On Tue, Jun 18, 2013 at 10:10 AM, Vasilis Liaskovitis
>> <vasilis.liaskovitis@profitbricks.com> wrote:
>> >> could be found at:
>> >>         git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm
>> >>
>> >> and it is based on today's Linus tree.
>> >>
>> >
>> > Has this patchset been tested on various numa configs?
>> > I am using linux-next next-20130607 + part1 with qemu/kvm/seabios VMs. The kernel
>> > boots successfully in many numa configs but while trying different memory sizes
>> > for a 2 numa node VM, I noticed that booting does not complete in all cases
>> > (bootup screen appears to hang but there is no output indicating an early panic)
>> >
>> > node0   node1    boots
>> > 1G      1G       yes
>> > 1G      2G       yes
>> > 1G      0.5G     yes
>> > 3G      2.5G     yes
>> > 3G      3G       yes
>> > 4G      0G       yes
>> > 4G      4G       yes
>> > 1.5G    1G       no
>> > 2G      1G       no
>> > 2G      2G       no
>> > 2.5G    2G       no
>> > 2.5G    2.5G     no
>> >
>> > linux-next next-20130607 boots al of these configs fine.
>> >
>> > Looks odd, perhaps I have something wrong in my setup or maybe there is a
>> > seabios/qemu interaction with this patchset. I will update if I find something.
>>
>> just tried 2g/2g, and it works on qemu-kvm:
>
> thanks for testing. If you can also share qemu/seabios versions you use (release
> or git commits), that would be helpful.

QEMU emulator version 1.5.50, Copyright (c) 2003-2008 Fabrice Bellard

it is at:
commit 7387de16d0e4d2988df350926537cd12a8e34206
Merge: b8a75b6 e73fe2b
Author: Anthony Liguori <aliguori@us.ibm.com>
Date:   Fri Jun 7 08:40:52 2013 -0500

    Merge remote-tracking branch 'stefanha/block' into staging

start command:

#for 64bit numa
/usr/local/kvm/bin/qemu-system-x86_64 -L /usr/local/kvm/share/qemu
-enable-kvm -numa node,nodeid=0,cpus=0-1,mem=2048 -numa
node,nodeid=1,cpus=2-3,mem=2048 -smp sockets=2,cores=2,threads=1 -m
4096 -net nic,model=e1000,macaddr=00:1c:25:1c:13:e9 -net user -hda
/home/yhlu/data.dsk -cdrom
/home/yhlu/xx/xx/kernel/tip/linux-2.6/arch/x86/boot/image.iso -boot d
-serial telnet:127.0.0.1:4444,server -monitor stdio


>
> this is most likely some error on my setup, I 'll let you know if I conclude
> otherwise.
>
> thanks,
>
> - Vasilis

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
                   ` (23 preceding siblings ...)
  2013-06-18 17:10 ` Vasilis Liaskovitis
@ 2013-06-21  5:19 ` H. Peter Anvin
  2013-06-21  6:06   ` Tang Chen
  2013-06-21 20:18   ` Yinghai Lu
  24 siblings, 2 replies; 87+ messages in thread
From: H. Peter Anvin @ 2013-06-21  5:19 UTC (permalink / raw)
  To: Tang Chen
  Cc: tglx, mingo, akpm, tj, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm

On 06/13/2013 06:02 AM, Tang Chen wrote:
> From: Yinghai Lu <yinghai@kernel.org>
> 
> No offence, just rebase and resend the patches from Yinghai to help
> to push this functionality faster.
> Also improve the comments in the patches' log.
> 

So we need a new version of this which addresses the build problems and
the feedback from Tejun... and it would be good to get that soon, or
we'll be looking at 3.12.

Since the merge window is approaching quickly, is there a meaningful
subset that is ready now?

	-hpa



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-21  5:19 ` H. Peter Anvin
@ 2013-06-21  6:06   ` Tang Chen
  2013-06-21  6:10     ` H. Peter Anvin
  2013-06-21 20:18   ` Yinghai Lu
  1 sibling, 1 reply; 87+ messages in thread
From: Tang Chen @ 2013-06-21  6:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: tglx, mingo, akpm, tj, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm, Toshi Kani

On 06/21/2013 01:19 PM, H. Peter Anvin wrote:
> On 06/13/2013 06:02 AM, Tang Chen wrote:
>> From: Yinghai Lu<yinghai@kernel.org>
>>
>> No offence, just rebase and resend the patches from Yinghai to help
>> to push this functionality faster.
>> Also improve the comments in the patches' log.
>>
>
> So we need a new version of this which addresses the build problems and
> the feedback from Tejun... and it would be good to get that soon, or
> we'll be looking at 3.12.

Hi hpa,

The build problem has been fixed by Yinghai.

>
> Since the merge window is approaching quickly, is there a meaningful
> subset that is ready now?

I think memory hotplug needs at least part1 and part2 patches. But
local node pagetable (patch 21 and 22 in part1) will break memory
hot-remove path. My part3 intends to fix it, but it seems we need
local device pagetable to enable single device hotplug, but not just
local node pagetable.

So, my plan is

1. Implement arranging hotpluggable memory with SRAT first, within tj's
    comments, without local node pagetable.
    (The main work in part2. And of course, need some patches in part1.)
2. Do the local device pagetable work, not local node.
3. Improve memory hotplug to support local device pagetable.

I'll send a new version patch-set of step1, wishing we can catch up with
the merge window. And I think step2 and 3 should be done later.


Thanks. :)



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-21  6:06   ` Tang Chen
@ 2013-06-21  6:10     ` H. Peter Anvin
  2013-06-21  6:20       ` Tang Chen
  0 siblings, 1 reply; 87+ messages in thread
From: H. Peter Anvin @ 2013-06-21  6:10 UTC (permalink / raw)
  To: Tang Chen
  Cc: tglx, mingo, akpm, tj, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm, Toshi Kani

On 06/20/2013 11:06 PM, Tang Chen wrote:
> 
> Hi hpa,
> 
> The build problem has been fixed by Yinghai.
> 

Where?  I don't see anything that is obviously a fix in my inbox.

What about Tejun's feedback?

	-hpa


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-21  6:10     ` H. Peter Anvin
@ 2013-06-21  6:20       ` Tang Chen
  2013-06-21  6:26         ` Tejun Heo
  0 siblings, 1 reply; 87+ messages in thread
From: Tang Chen @ 2013-06-21  6:20 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: tglx, mingo, akpm, tj, trenn, yinghai, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm, Toshi Kani

On 06/21/2013 02:10 PM, H. Peter Anvin wrote:
> On 06/20/2013 11:06 PM, Tang Chen wrote:
>>
>> Hi hpa,
>>
>> The build problem has been fixed by Yinghai.
>>
>
> Where?  I don't see anything that is obviously a fix in my inbox.
>

Yinghai resent a new version to fix the problem.
https://lkml.org/lkml/2013/6/14/561

> What about Tejun's feedback?

tj's comments were after the latest version. So we need to
restructure the patch-set.

>
> 	-hpa
>
>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-21  6:20       ` Tang Chen
@ 2013-06-21  6:26         ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2013-06-21  6:26 UTC (permalink / raw)
  To: Tang Chen
  Cc: H. Peter Anvin, tglx, mingo, akpm, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, mgorman, minchan, mina86,
	gong.chen, vasilis.liaskovitis, lwoodman, riel, jweiner, prarit,
	x86, linux-doc, linux-kernel, linux-mm, Toshi Kani

Hello, guys.

On Fri, Jun 21, 2013 at 02:20:28PM +0800, Tang Chen wrote:
> >What about Tejun's feedback?
> 
> tj's comments were after the latest version. So we need to
> restructure the patch-set.

Given that it's unlikely to reach actual functionality in this cycle,
it's probably a better idea to aim the next cycle.  I don't think we
wanna rush it.  As for my suggestions, I'm not sure how much of it'd
work out and how much better it's gonna make things but it definitely
seems worth investigating to me.  Let's please see how it goes.

Thanks a lot!

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-20  6:17         ` Tejun Heo
@ 2013-06-21  9:19           ` Tang Chen
  2013-06-21 18:25             ` Tejun Heo
  0 siblings, 1 reply; 87+ messages in thread
From: Tang Chen @ 2013-06-21  9:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: yinghai, tglx, mingo, hpa, akpm, trenn, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm

Hi tj,

On 06/20/2013 02:17 PM, Tejun Heo wrote:
......
>
> I was suggesting two separate things.
>
> * As memblock allocator can relocate itself.  There's no point in
>    avoiding setting NUMA node while parsing and registering NUMA
>    topology.  Just parse and register NUMA info and later tell it to
>    relocate itself out of hot-pluggable node.  A number of patches in
>    the series is doing this dancing - carefully reordering NUMA
>    probing.  No need to do that.  It's really fragile thing to do.
>
> * Once you get the above out of the way, I don't think there are a lot
>    of permanent allocations in the way before NUMA is initialized.
>    Re-order the remaining ones if that's cleaner to do.  If that gets
>    overly messy / fragile, copying them around or freeing and reloading
>    afterwards could be an option too.

memblock allocator can relocate itself, but it cannot relocate the memory
it allocated for users. There could be some pointers pointing to these
memory ranges. If we do the relocation, how to update these pointers ?

Or, do you mean modify the pagetable ?  I don't think so.

So would you please tell me more about how to do the relocation ?

Thanks. :)



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-21  9:19           ` Tang Chen
@ 2013-06-21 18:25             ` Tejun Heo
  2013-06-24  3:51               ` Tang Chen
  0 siblings, 1 reply; 87+ messages in thread
From: Tejun Heo @ 2013-06-21 18:25 UTC (permalink / raw)
  To: Tang Chen
  Cc: yinghai, tglx, mingo, hpa, akpm, trenn, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm

Hey,

On Fri, Jun 21, 2013 at 05:19:48PM +0800, Tang Chen wrote:
> >* As memblock allocator can relocate itself.  There's no point in
> >   avoiding setting NUMA node while parsing and registering NUMA
> >   topology.  Just parse and register NUMA info and later tell it to
> >   relocate itself out of hot-pluggable node.  A number of patches in
> >   the series is doing this dancing - carefully reordering NUMA
> >   probing.  No need to do that.  It's really fragile thing to do.
> >
> >* Once you get the above out of the way, I don't think there are a lot
> >   of permanent allocations in the way before NUMA is initialized.
> >   Re-order the remaining ones if that's cleaner to do.  If that gets
> >   overly messy / fragile, copying them around or freeing and reloading
> >   afterwards could be an option too.
> 
> memblock allocator can relocate itself, but it cannot relocate the memory

Hmmm... maybe I wasn't clear but that's the first bullet point above.

> it allocated for users. There could be some pointers pointing to these
> memory ranges. If we do the relocation, how to update these pointers ?

And the second.  Can you please list what persistent areas are
allocated before numa info is configured into memblock?  There
shouldn't be whole lot.  And, again, this type of information should
have been available in the head message so that high-level discussion
could take place right away.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-21  5:19 ` H. Peter Anvin
  2013-06-21  6:06   ` Tang Chen
@ 2013-06-21 20:18   ` Yinghai Lu
  1 sibling, 0 replies; 87+ messages in thread
From: Yinghai Lu @ 2013-06-21 20:18 UTC (permalink / raw)
  To: H. Peter Anvin, Tejun Heo
  Cc: Tang Chen, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Thomas Renninger, Jiang Liu, Wen Congyang, Lai Jiangshan,
	Yasuaki Ishimatsu, Mel Gorman, Minchan Kim, mina86, gong.chen,
	Vasilis Liaskovitis, lwoodman, Rik van Riel, jweiner,
	Prarit Bhargava, the arch/x86 maintainers, linux-doc,
	Linux Kernel Mailing List, Linux MM

On Thu, Jun 20, 2013 at 10:19 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 06/13/2013 06:02 AM, Tang Chen wrote:
>> From: Yinghai Lu <yinghai@kernel.org>
>>
>> No offence, just rebase and resend the patches from Yinghai to help
>> to push this functionality faster.
>> Also improve the comments in the patches' log.
>>
>
> So we need a new version of this which addresses the build problems and
> the feedback from Tejun... and it would be good to get that soon, or
> we'll be looking at 3.12.
>
> Since the merge window is approaching quickly, is there a meaningful
> subset that is ready now?

patch 1-9, and 20 in updated patchset, could goto 3.11.
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-x86-mm
https://git.kernel.org/cgit/linux/kernel/git/yinghai/linux-yinghai.git/log/?h=for-x86-mm

they are about acpi_override move early and some enhancement.
they got enough tested-by and Acked-by include ones from tj.

If you are ok with that, I could resend those 10 patches today.

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-21 18:25             ` Tejun Heo
@ 2013-06-24  3:51               ` Tang Chen
  2013-06-24  7:26                 ` Tang Chen
  0 siblings, 1 reply; 87+ messages in thread
From: Tang Chen @ 2013-06-24  3:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: yinghai, tglx, mingo, hpa, akpm, trenn, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm

On 06/22/2013 02:25 AM, Tejun Heo wrote:
> Hey,
>
> On Fri, Jun 21, 2013 at 05:19:48PM +0800, Tang Chen wrote:
>>> * As memblock allocator can relocate itself.  There's no point in
>>>    avoiding setting NUMA node while parsing and registering NUMA
>>>    topology.  Just parse and register NUMA info and later tell it to
>>>    relocate itself out of hot-pluggable node.  A number of patches in
>>>    the series is doing this dancing - carefully reordering NUMA
>>>    probing.  No need to do that.  It's really fragile thing to do.
>>>
>>> * Once you get the above out of the way, I don't think there are a lot
>>>    of permanent allocations in the way before NUMA is initialized.
>>>    Re-order the remaining ones if that's cleaner to do.  If that gets
>>>    overly messy / fragile, copying them around or freeing and reloading
>>>    afterwards could be an option too.
>>
>> memblock allocator can relocate itself, but it cannot relocate the memory
>
> Hmmm... maybe I wasn't clear but that's the first bullet point above.
>
>> it allocated for users. There could be some pointers pointing to these
>> memory ranges. If we do the relocation, how to update these pointers ?
>
> And the second.  Can you please list what persistent areas are
> allocated before numa info is configured into memblock?  There

Hi tj,

My box is x86_64, and the memory layout is:
[    0.000000] SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
[    0.000000] SRAT: Node 0 PXM 0 [mem 0x100000000-0x307ffffff]
[    0.000000] SRAT: Node 1 PXM 2 [mem 0x308000000-0x587ffffff] Hot 
Pluggable
[    0.000000] SRAT: Node 2 PXM 3 [mem 0x588000000-0x7ffffffff] Hot 
Pluggable


I marked ranges reserved by memblock before we parse SRAT with flag 0x4.
There are about 14 ranges which is persistent after boot.

[    0.000000]  reserved[0x0]   [0x00000000000000-0x0000000000ffff], 
0x10000 bytes flags: 0x4
[    0.000000]  reserved[0x1]   [0x00000000093000-0x000000000fffff], 
0x6d000 bytes flags: 0x4
[    0.000000]  reserved[0x2]   [0x00000001000000-0x00000002a9afff], 
0x1a9b000 bytes flags: 0x4
[    0.000000]  reserved[0x3]   [0x00000030000000-0x00000037ffffff], 
0x8000000 bytes flags: 0x4
...
[    0.000000]  reserved[0x5]   [0x0000006da81000-0x0000006e46afff], 
0x9ea000 bytes flags: 0x4
[    0.000000]  reserved[0x6]   [0x0000006ed6a000-0x0000006f246fff], 
0x4dd000 bytes flags: 0x4
[    0.000000]  reserved[0x7]   [0x0000006f28a000-0x0000006f299fff], 
0x10000 bytes flags: 0x4
[    0.000000]  reserved[0x8]   [0x0000006f29c000-0x0000006fe91fff], 
0xbf6000 bytes flags: 0x4
[    0.000000]  reserved[0x9]   [0x00000070e92000-0x00000071d54fff], 
0xec3000 bytes flags: 0x4
[    0.000000]  reserved[0xa]   [0x00000071d5e000-0x00000072204fff], 
0x4a7000 bytes flags: 0x4
[    0.000000]  reserved[0xb]   [0x00000072220000-0x0000007222074f], 
0x750 bytes flags: 0x4
...
[    0.000000]  reserved[0xd]   [0x000000722bc000-0x000000722bc1cf], 
0x1d0 bytes flags: 0x4
[    0.000000]  reserved[0xe]   [0x00000072bd3000-0x00000076c8ffff], 
0x40bd000 bytes flags: 0x4
......
[    0.000000]  reserved[0x134] [0x000007fffdf000-0x000007ffffffff], 
0x21000 bytes flags: 0x4


Just for the readability:
                                 [0x00000308000000-0x00000587ffffff] 
    Hot Pluggable
                                 [0x00000588000000-0x000007ffffffff] 
    Hot Pluggable

Seeing from the dmesg, only the last one is in hotpluggable area. I need 
to go
through the code to find out what it is, and find a way to relocate it.

But I'm not sure if a box with a different SRAT will have different result.

I will send more info later.

Thanks. :)


> shouldn't be whole lot.  And, again, this type of information should
> have been available in the head message so that high-level discussion
> could take place right away.
>
> Thanks.
>


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-24  3:51               ` Tang Chen
@ 2013-06-24  7:26                 ` Tang Chen
  2013-06-24 19:59                   ` Tejun Heo
  0 siblings, 1 reply; 87+ messages in thread
From: Tang Chen @ 2013-06-24  7:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: yinghai, tglx, mingo, hpa, akpm, trenn, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm

On 06/24/2013 11:51 AM, Tang Chen wrote:
> On 06/22/2013 02:25 AM, Tejun Heo wrote:
>> Hey,
>>
>> On Fri, Jun 21, 2013 at 05:19:48PM +0800, Tang Chen wrote:
>>>> * As memblock allocator can relocate itself. There's no point in
>>>> avoiding setting NUMA node while parsing and registering NUMA
>>>> topology. Just parse and register NUMA info and later tell it to
>>>> relocate itself out of hot-pluggable node. A number of patches in
>>>> the series is doing this dancing - carefully reordering NUMA
>>>> probing. No need to do that. It's really fragile thing to do.
>>>>
>>>> * Once you get the above out of the way, I don't think there are a lot
>>>> of permanent allocations in the way before NUMA is initialized.
>>>> Re-order the remaining ones if that's cleaner to do. If that gets
>>>> overly messy / fragile, copying them around or freeing and reloading
>>>> afterwards could be an option too.
>>>
>>> memblock allocator can relocate itself, but it cannot relocate the
>>> memory
>>
>> Hmmm... maybe I wasn't clear but that's the first bullet point above.
>>
>>> it allocated for users. There could be some pointers pointing to these
>>> memory ranges. If we do the relocation, how to update these pointers ?
>>
>> And the second. Can you please list what persistent areas are
>> allocated before numa info is configured into memblock? There
>
> Hi tj,
>
> My box is x86_64, and the memory layout is:
> [ 0.000000] SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
> [ 0.000000] SRAT: Node 0 PXM 0 [mem 0x100000000-0x307ffffff]
> [ 0.000000] SRAT: Node 1 PXM 2 [mem 0x308000000-0x587ffffff] Hot Pluggable
> [ 0.000000] SRAT: Node 2 PXM 3 [mem 0x588000000-0x7ffffffff] Hot Pluggable
>
>
> I marked ranges reserved by memblock before we parse SRAT with flag 0x4.
> There are about 14 ranges which is persistent after boot.
>
> [ 0.000000] reserved[0x0] [0x00000000000000-0x0000000000ffff], 0x10000
> bytes flags: 0x4
> [ 0.000000] reserved[0x1] [0x00000000093000-0x000000000fffff], 0x6d000
> bytes flags: 0x4
> [ 0.000000] reserved[0x2] [0x00000001000000-0x00000002a9afff], 0x1a9b000
> bytes flags: 0x4
> [ 0.000000] reserved[0x3] [0x00000030000000-0x00000037ffffff], 0x8000000
> bytes flags: 0x4
> ...
> [ 0.000000] reserved[0x5] [0x0000006da81000-0x0000006e46afff], 0x9ea000
> bytes flags: 0x4
> [ 0.000000] reserved[0x6] [0x0000006ed6a000-0x0000006f246fff], 0x4dd000
> bytes flags: 0x4
> [ 0.000000] reserved[0x7] [0x0000006f28a000-0x0000006f299fff], 0x10000
> bytes flags: 0x4
> [ 0.000000] reserved[0x8] [0x0000006f29c000-0x0000006fe91fff], 0xbf6000
> bytes flags: 0x4
> [ 0.000000] reserved[0x9] [0x00000070e92000-0x00000071d54fff], 0xec3000
> bytes flags: 0x4
> [ 0.000000] reserved[0xa] [0x00000071d5e000-0x00000072204fff], 0x4a7000
> bytes flags: 0x4
> [ 0.000000] reserved[0xb] [0x00000072220000-0x0000007222074f], 0x750
> bytes flags: 0x4
> ...
> [ 0.000000] reserved[0xd] [0x000000722bc000-0x000000722bc1cf], 0x1d0
> bytes flags: 0x4
> [ 0.000000] reserved[0xe] [0x00000072bd3000-0x00000076c8ffff], 0x40bd000
> bytes flags: 0x4
> ......
> [ 0.000000] reserved[0x134] [0x000007fffdf000-0x000007ffffffff], 0x21000
> bytes flags: 0x4

This range is allocated by init_mem_mapping() in setup_arch(), it calls
alloc_low_pages() to allocate pagetable pages.

I think if we do the local device pagetable, we can solve this problem
without any relocation.

I will make a patch trying to do this. But I'm not sure if there are any
other relocation problems on other architectures.

But even if not, I still think this could be dangerous if someone modifies
the boot path and allocates some persistent memory before SRAT parsed in
the future. He has to be aware of memory hotplug things and do the 
necessary
relocation himself.

I'll try to make the patch to acheve this with comment as full as possible.

Thanks. :)

>
>
> Just for the readability:
> [0x00000308000000-0x00000587ffffff] Hot Pluggable
> [0x00000588000000-0x000007ffffffff] Hot Pluggable
>
> Seeing from the dmesg, only the last one is in hotpluggable area. I need
> to go
> through the code to find out what it is, and find a way to relocate it.
>
> But I'm not sure if a box with a different SRAT will have different result.
>
> I will send more info later.
>
> Thanks. :)
>
>
>> shouldn't be whole lot. And, again, this type of information should
>> have been available in the head message so that high-level discussion
>> could take place right away.
>>
>> Thanks.
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-18 17:10 ` Vasilis Liaskovitis
  2013-06-18 20:19   ` Yinghai Lu
@ 2013-06-24  9:40   ` Gu Zheng
  1 sibling, 0 replies; 87+ messages in thread
From: Gu Zheng @ 2013-06-24  9:40 UTC (permalink / raw)
  To: Vasilis Liaskovitis
  Cc: Tang Chen, tglx, mingo, hpa, akpm, tj, trenn, yinghai, jiang.liu,
	wency, laijs, isimatu.yasuaki, mgorman, minchan, mina86,
	gong.chen, lwoodman, riel, jweiner, prarit, x86, linux-doc,
	linux-kernel, linux-mm

On 06/19/2013 01:10 AM, Vasilis Liaskovitis wrote:

> Hi,
> 
> On Thu, Jun 13, 2013 at 09:02:47PM +0800, Tang Chen wrote:
>> From: Yinghai Lu <yinghai@kernel.org>
>>
>> No offence, just rebase and resend the patches from Yinghai to help
>> to push this functionality faster.
>> Also improve the comments in the patches' log.
>>
>>
>> One commit that tried to parse SRAT early get reverted before v3.9-rc1.
>>
>> | commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
>> | Author: Tang Chen <tangchen@cn.fujitsu.com>
>> | Date:   Fri Feb 22 16:33:44 2013 -0800
>> |
>> |    acpi, memory-hotplug: parse SRAT before memblock is ready
>>
>> It broke several things, like acpi override and fall back path etc.
>>
>> This patchset is clean implementation that will parse numa info early.
>> 1. keep the acpi table initrd override working by split finding with copying.
>>    finding is done at head_32.S and head64.c stage,
>>         in head_32.S, initrd is accessed in 32bit flat mode with phys addr.
>>         in head64.c, initrd is accessed via kernel low mapping address
>>         with help of #PF set page table.
>>    copying is done with early_ioremap just after memblock is setup.
>> 2. keep fallback path working. numaq and ACPI and amd_nmua and dummy.
>>    seperate initmem_init to two stages.
>>    early_initmem_init will only extract numa info early into numa_meminfo.
>>    initmem_init will keep slit and emulation handling.
>> 3. keep other old code flow untouched like relocate_initrd and initmem_init.
>>    early_initmem_init will take old init_mem_mapping position.
>>    it call early_x86_numa_init and init_mem_mapping for every nodes.
>>    For 64bit, we avoid having size limit on initrd, as relocate_initrd
>>    is still after init_mem_mapping for all memory.
>> 4. last patch will try to put page table on local node, so that memory
>>    hotplug will be happy.
>>
>> In short, early_initmem_init will parse numa info early and call
>> init_mem_mapping to set page table for every nodes's mem.
>>
>> could be found at:
>>         git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm
>>
>> and it is based on today's Linus tree.
>>
> 
> Has this patchset been tested on various numa configs?
> I am using linux-next next-20130607 + part1 with qemu/kvm/seabios VMs. The kernel
> boots successfully in many numa configs but while trying different memory sizes
> for a 2 numa node VM, I noticed that booting does not complete in all cases
> (bootup screen appears to hang but there is no output indicating an early panic)
> 
> node0   node1	 boots
> 1G 	1G	 yes
> 1G 	2G	 yes
> 1G 	0.5G	 yes
> 3G 	2.5G	 yes
> 3G 	3G 	 yes
> 4G 	0G	 yes
> 4G 	4G	 yes
> 1.5G	1G	 no
> 2G 	1G	 no
> 2G 	2G	 no
> 2.5G 	2G	 no
> 2.5G 	2.5G	 no
> 
> linux-next next-20130607 boots al of these configs fine.
> 
> Looks odd, perhaps I have something wrong in my setup or maybe there is a
> seabios/qemu interaction with this patchset. I will update if I find something.

Hi Vasilis,
   This patchset can work well with all the numa config cases you mentioned in latest kernel tree (3.10-rc7) in our box.

Host OS: RHEL 6.4 Beta
qemu-kvm: 0.12.1.2 (Released with RHEL 6.4 Beta)
Guest OS: RHEL 6.3 
Guest kernel:3.10-rc7 + [Part1 PATCH v5 ] x86, ACPI, numa: Parse numa info earlier
Cmd:

/usr/libexec/qemu-kvm -name rhel_6.3 -S -M rhel6.4.0 -enable-kvm 
-m 5120 -smp 4,sockets=4,cores=1,threads=1 
-numa node,nodeid=0,cpus=0-1,mem=2560 
-numa node,nodeid=1,cpus=2-3,mem=2560 
-uuid fa11164c-1a09-280b-eae4-e2c40c631767 -nodefconfig -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/rhel_6.3.monitor,server,nowait 
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown 
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/home/hut-rhel6.3.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:28:6e:29,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc 127.0.0.1:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5


Result:
node0   node1	 boots
1G 	1G	 yes
1G 	2G	 yes
1G 	0.5G	 yes
3G 	2.5G	 yes
3G 	3G 	 yes
4G 	0G	 yes
4G 	4G	 yes
1.5G	1G	 yes
2G 	1G	 yes
2G 	2G	 yes
2.5G 	2G	 yes
2.5G 	2.5G	 yes

Thanks,

Gu


> 
> thanks,
> 
> - Vasilis
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier
  2013-06-24  7:26                 ` Tang Chen
@ 2013-06-24 19:59                   ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2013-06-24 19:59 UTC (permalink / raw)
  To: Tang Chen
  Cc: yinghai, tglx, mingo, hpa, akpm, trenn, jiang.liu, wency, laijs,
	isimatu.yasuaki, mgorman, minchan, mina86, gong.chen,
	vasilis.liaskovitis, lwoodman, riel, jweiner, prarit, x86,
	linux-doc, linux-kernel, linux-mm

Hello, Tang.

On Mon, Jun 24, 2013 at 03:26:27PM +0800, Tang Chen wrote:
> >My box is x86_64, and the memory layout is:
> >[ 0.000000] SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
> >[ 0.000000] SRAT: Node 0 PXM 0 [mem 0x100000000-0x307ffffff]
> >[ 0.000000] SRAT: Node 1 PXM 2 [mem 0x308000000-0x587ffffff] Hot Pluggable
> >[ 0.000000] SRAT: Node 2 PXM 3 [mem 0x588000000-0x7ffffffff] Hot Pluggable
> >
> >
> >I marked ranges reserved by memblock before we parse SRAT with flag 0x4.
> >There are about 14 ranges which is persistent after boot.

You can also record the caller address or short backtrace with each
allocation (maybe controlled by some debug parameter).  It'd be a nice
capability to keep around anyway.

> This range is allocated by init_mem_mapping() in setup_arch(), it calls
> alloc_low_pages() to allocate pagetable pages.
> 
> I think if we do the local device pagetable, we can solve this problem
> without any relocation.

Yeah, I really can't think of many places which would allocate
permanent piece of memory before memblock is fully initialized.  Just
in case I wasn't clear, I don't have anything fundamentally against
reordering operations if that's cleaner, but we really should at least
find out what needs to be reordered and have a mechanism to verify and
track them down, and of course if relocating / reloading / whatever is
cleaner and/or more robust, that's what we should do.

> I will make a patch trying to do this. But I'm not sure if there are any
> other relocation problems on other architectures.
> 
> But even if not, I still think this could be dangerous if someone modifies
> the boot path and allocates some persistent memory before SRAT parsed in
> the future. He has to be aware of memory hotplug things and do the
> necessary relocation himself.

As I wrote above, I think it'd be nice to have a way to track memblock
allocations.  It can be a debug thing but we can just do it by
default, e.g., for allocations before memblock is fully initialized.
It's not like there are a ton of them.  Those extra allocations can be
freed on boot completion anyway, so they won't affect NUMA hotplug
either and we'll be able to continuously watch, and thus properly
maintain, the early boot hotplug issue on most configurations whether
they actually support and perform hotplug or not, which will be
multiple times more robust than trying to tweak boot sequence once and
hoping that it doesn't deteriorate over time.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

end of thread, other threads:[~2013-06-24 20:00 UTC | newest]

Thread overview: 87+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-13 13:02 [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tang Chen
2013-06-13 13:02 ` [Part1 PATCH v5 01/22] x86: Change get_ramdisk_{image|size}() to global Tang Chen
2013-06-14 21:30   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-13 13:02 ` [Part1 PATCH v5 02/22] x86, microcode: Use common get_ramdisk_{image|size}() Tang Chen
2013-06-14 21:31   ` [tip:x86/mm] x86, microcode: Use common get_ramdisk_{image|size}( ) tip-bot for Yinghai Lu
2013-06-13 13:02 ` [Part1 PATCH v5 03/22] x86, ACPI, mm: Kill max_low_pfn_mapped Tang Chen
2013-06-14 21:31   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-17 21:04   ` [Part1 PATCH v5 03/22] " Tejun Heo
2013-06-17 21:13     ` Yinghai Lu
2013-06-17 23:08       ` Tejun Heo
2013-06-13 13:02 ` [Part1 PATCH v5 04/22] x86, ACPI: Search buffer above 4GB in a second try for acpi initrd table override Tang Chen
2013-06-14 21:31   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-17 21:06   ` [Part1 PATCH v5 04/22] " Tejun Heo
2013-06-13 13:02 ` [Part1 PATCH v5 05/22] x86, ACPI: Increase acpi initrd override tables number limit Tang Chen
2013-06-14 21:31   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-13 13:02 ` [Part1 PATCH v5 06/22] x86, ACPI: Split acpi_initrd_override() into find/copy two steps Tang Chen
2013-06-14 21:31   ` [tip:x86/mm] x86, ACPI: Split acpi_initrd_override() into find/ copy " tip-bot for Yinghai Lu
2013-06-13 13:02 ` [Part1 PATCH v5 07/22] x86, ACPI: Store override acpi tables phys addr in cpio files info array Tang Chen
2013-06-14 21:31   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-17 23:38   ` [Part1 PATCH v5 07/22] " Tejun Heo
2013-06-17 23:40     ` Yinghai Lu
2013-06-17 23:52   ` Tejun Heo
2013-06-13 13:02 ` [Part1 PATCH v5 08/22] x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode Tang Chen
2013-06-14 21:31   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-18  0:07   ` [Part1 PATCH v5 08/22] " Tejun Heo
2013-06-13 13:02 ` [Part1 PATCH v5 09/22] x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c Tang Chen
2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-18  0:33   ` [Part1 PATCH v5 09/22] " Tejun Heo
2013-06-13 13:02 ` [Part1 PATCH v5 10/22] x86, mm, numa: Move two functions calling on successful path later Tang Chen
2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-18  0:53   ` [Part1 PATCH v5 10/22] " Tejun Heo
2013-06-13 13:02 ` [Part1 PATCH v5 11/22] x86, mm, numa: Call numa_meminfo_cover_memory() checking early Tang Chen
2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-18  1:05   ` [Part1 PATCH v5 11/22] " Tejun Heo
2013-06-13 13:02 ` [Part1 PATCH v5 12/22] x86, mm, numa: Move node_map_pfn_alignment() to x86 Tang Chen
2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-18  1:08   ` [Part1 PATCH v5 12/22] " Tejun Heo
2013-06-13 13:03 ` [Part1 PATCH v5 13/22] x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment Tang Chen
2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-18  1:40   ` [Part1 PATCH v5 13/22] " Tejun Heo
2013-06-13 13:03 ` [Part1 PATCH v5 14/22] x86, mm, numa: Set memblock nid later Tang Chen
2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-18  1:45   ` [Part1 PATCH v5 14/22] " Tejun Heo
2013-06-13 13:03 ` [Part1 PATCH v5 15/22] x86, mm, numa: Move node_possible_map setting later Tang Chen
2013-06-14 21:32   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-13 13:03 ` [Part1 PATCH v5 16/22] x86, mm, numa: Move numa emulation handling down Tang Chen
2013-06-14 21:33   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-18  1:58   ` [Part1 PATCH v5 16/22] " Tejun Heo
2013-06-18  6:22     ` Yinghai Lu
2013-06-18  7:13       ` Yinghai Lu
2013-06-19 21:25       ` Yinghai Lu
2013-06-13 13:03 ` [Part1 PATCH v5 17/22] x86, ACPI, numa, ia64: split SLIT handling out Tang Chen
2013-06-14 21:33   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-13 13:03 ` [Part1 PATCH v5 18/22] x86, mm, numa: Add early_initmem_init() stub Tang Chen
2013-06-14 21:33   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-13 13:03 ` [Part1 PATCH v5 19/22] x86, mm: Parse numa info earlier Tang Chen
2013-06-14 21:33   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-13 13:03 ` [Part1 PATCH v5 20/22] x86, mm: Add comments for step_size shift Tang Chen
2013-06-14 21:33   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-13 13:03 ` [Part1 PATCH v5 21/22] x86, mm: Make init_mem_mapping be able to be called several times Tang Chen
2013-06-13 18:35   ` Konrad Rzeszutek Wilk
2013-06-13 22:47     ` Yinghai Lu
2013-06-14  5:08       ` Tang Chen
2013-06-14 21:33   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-13 13:03 ` [Part1 PATCH v5 22/22] x86, mm, numa: Put pagetable on local node ram for 64bit Tang Chen
2013-06-14 21:34   ` [tip:x86/mm] " tip-bot for Yinghai Lu
2013-06-18  2:03 ` [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier Tejun Heo
2013-06-18  5:47   ` Tang Chen
2013-06-18 17:21     ` Tejun Heo
2013-06-20  5:52       ` Tang Chen
2013-06-20  6:17         ` Tejun Heo
2013-06-21  9:19           ` Tang Chen
2013-06-21 18:25             ` Tejun Heo
2013-06-24  3:51               ` Tang Chen
2013-06-24  7:26                 ` Tang Chen
2013-06-24 19:59                   ` Tejun Heo
2013-06-18 17:10 ` Vasilis Liaskovitis
2013-06-18 20:19   ` Yinghai Lu
2013-06-19 10:05     ` Vasilis Liaskovitis
2013-06-20 18:42       ` Yinghai Lu
2013-06-24  9:40   ` Gu Zheng
2013-06-21  5:19 ` H. Peter Anvin
2013-06-21  6:06   ` Tang Chen
2013-06-21  6:10     ` H. Peter Anvin
2013-06-21  6:20       ` Tang Chen
2013-06-21  6:26         ` Tejun Heo
2013-06-21 20:18   ` Yinghai Lu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).