* [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info
@ 2019-01-07  8:24 Pingfan Liu
  2019-01-07  8:24 ` [RFC PATCH 1/4] acpi: change the topo of acpi_table_upgrade() Pingfan Liu
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Pingfan Liu @ 2019-01-07  8:24 UTC (permalink / raw)
  To: x86, linux-acpi
  Cc: Pingfan Liu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Rafael J. Wysocki, Len Brown, linux-kernel

Background on the defect of the current bottom-up allocation style; take the
following scenario:
  |      unmovable node      |            movable node            |
        | kaslr-kernel | subtree of pgtable for phys<->virt |

Here the KASLR kernel sits near the end of the unmovable node, and the
page-table subtree allocated bottom-up right after it crosses into the
movable node.

Although the KASLR kernel can avoid staining the movable node, the page
tables can still stain it. This is a probability problem: the probability is
low, but it exists. This series tries to eliminate that probability. With the
first patches in this series, by the time init_mem_mapping() runs, the
memblock allocator already knows the ACPI memory hotplug info and can avoid
staining the movable node. As a result, memory_map_bottom_up() is not needed
any more.
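
For readers new to the mechanism this series relies on: once the SRAT has been
parsed and hotpluggable ranges have been marked with memblock_mark_hotplug(),
the memblock core (see __next_mem_range() in mm/memblock.c) skips those ranges
for ordinary allocations when movable_node is in effect. The snippet below is
only a toy userspace illustration of that behaviour, not kernel code; the
region layout and helper names are invented for the example.

	/* Toy model of hotplug-aware allocation (illustration only). */
	#include <stdio.h>
	#include <stdbool.h>

	struct region {
		unsigned long base, size;
		bool hotpluggable;	/* set by SRAT parsing in the real kernel */
	};

	/* Return the base of the first region that can hold @size bytes,
	 * skipping hotpluggable (movable) regions; 0 on failure. */
	static unsigned long find_in_range(struct region *regs, int nr,
					   unsigned long size)
	{
		for (int i = 0; i < nr; i++) {
			if (regs[i].hotpluggable)	/* keep movable nodes clean */
				continue;
			if (regs[i].size >= size)
				return regs[i].base;
		}
		return 0;
	}

	int main(void)
	{
		struct region map[] = {
			{ 0x00100000UL, 0x7ff00000UL, false },	/* unmovable node */
			{ 0x80000000UL, 0x80000000UL, true  },	/* movable node   */
		};

		/* A page-table page is carved out of the unmovable node only. */
		printf("pgtable page at 0x%lx\n", find_in_range(map, 2, 0x1000UL));
		return 0;
	}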


Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: linux-kernel@vger.kernel.org

Pingfan Liu (4):
  acpi: change the topo of acpi_table_upgrade()
  x86/setup: parse acpi to get hotplug info before init_mem_mapping()
  x86/mm: set allowed range for memblock allocator
  x86/mm: remove bottom-up allocation style for x86_64

 arch/arm64/kernel/setup.c |   2 +-
 arch/x86/kernel/setup.c   |  17 ++++-
 arch/x86/mm/init.c        | 154 +++++++---------------------------------------
 arch/x86/mm/init_32.c     | 123 ++++++++++++++++++++++++++++++++++++
 arch/x86/mm/mm_internal.h |   7 +++
 drivers/acpi/tables.c     |   4 +-
 include/linux/acpi.h      |   5 +-
 7 files changed, 172 insertions(+), 140 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH 1/4] acpi: change the topo of acpi_table_upgrade()
  2019-01-07  8:24 [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info Pingfan Liu
@ 2019-01-07  8:24 ` Pingfan Liu
  2019-01-07 10:55   ` Rafael J. Wysocki
  2019-01-07  8:24 ` [RFC PATCH 2/4] x86/setup: parse acpi to get hotplug info before init_mem_mapping() Pingfan Liu
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Pingfan Liu @ 2019-01-07  8:24 UTC (permalink / raw)
  To: x86, linux-acpi
  Cc: Pingfan Liu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Rafael J. Wysocki, Len Brown, linux-kernel

The current acpi_table_upgrade() relies on initrd_start, but this variable is
only valid after relocate_initrd(). There is a requirement to extract the ACPI
info from the initrd before the memblock allocator can work (see [2/4]), hence
acpi_table_upgrade() needs to accept the input parameters directly.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: linux-kernel@vger.kernel.org
---
 arch/arm64/kernel/setup.c | 2 +-
 arch/x86/kernel/setup.c   | 2 +-
 drivers/acpi/tables.c     | 4 +---
 include/linux/acpi.h      | 4 ++--
 4 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 4b0e123..48cb98c 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -315,7 +315,7 @@ void __init setup_arch(char **cmdline_p)
 	paging_init();
 	efi_apply_persistent_mem_reservations();
 
-	acpi_table_upgrade();
+	acpi_table_upgrade((void *)initrd_start, initrd_end - initrd_start);
 
 	/* Parse the ACPI tables for possible boot-time configuration */
 	acpi_boot_table_init();
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3d872a5..acbcd62 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1175,8 +1175,8 @@ void __init setup_arch(char **cmdline_p)
 
 	reserve_initrd();
 
-	acpi_table_upgrade();
 
+	acpi_table_upgrade((void *)initrd_start, initrd_end - initrd_start);
 	vsmp_init();
 
 	io_delay_init();
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 48eabb6..d29b05c 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -471,10 +471,8 @@ static DECLARE_BITMAP(acpi_initrd_installed, NR_ACPI_INITRD_TABLES);
 
 #define MAP_CHUNK_SIZE   (NR_FIX_BTMAPS << PAGE_SHIFT)
 
-void __init acpi_table_upgrade(void)
+void __init acpi_table_upgrade(void *data, size_t size)
 {
-	void *data = (void *)initrd_start;
-	size_t size = initrd_end - initrd_start;
 	int sig, no, table_nr = 0, total_offset = 0;
 	long offset = 0;
 	struct acpi_table_header *table;
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 87715f2..44dcbba 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -1272,9 +1272,9 @@ acpi_graph_get_remote_endpoint(const struct fwnode_handle *fwnode,
 #endif
 
 #ifdef CONFIG_ACPI_TABLE_UPGRADE
-void acpi_table_upgrade(void);
+void acpi_table_upgrade(void *data, size_t size);
 #else
-static inline void acpi_table_upgrade(void) { }
+static inline void acpi_table_upgrade(void *data, size_t size) { }
 #endif
 
 #if defined(CONFIG_ACPI) && defined(CONFIG_ACPI_WATCHDOG)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 2/4] x86/setup: parse acpi to get hotplug info before init_mem_mapping()
  2019-01-07  8:24 [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info Pingfan Liu
  2019-01-07  8:24 ` [RFC PATCH 1/4] acpi: change the topo of acpi_table_upgrade() Pingfan Liu
@ 2019-01-07  8:24 ` Pingfan Liu
  2019-01-07 12:52   ` Pingfan Liu
  2019-01-07 17:11   ` Dave Hansen
  2019-01-07  8:24 ` [RFC PATCH 3/4] x86/mm: set allowed range for memblock allocator Pingfan Liu
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 19+ messages in thread
From: Pingfan Liu @ 2019-01-07  8:24 UTC (permalink / raw)
  To: x86, linux-acpi
  Cc: Pingfan Liu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Rafael J. Wysocki, Len Brown, linux-kernel

At present, memblock bottom-up allocation helps us avoid stamping over the
movable node with very high probability. But if the hotplug info has already
been parsed, the memblock allocator can step around the movable node by
itself. This patch pushes the parsing step forward, to just before the point
where the memblock allocator starts to work. Later in this series, the
bottom-up allocation style can be removed on x86_64.
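
For clarity, the ordering in setup_arch() that this patch relies on is roughly
the following (simplified; only the calls relevant here are shown):

	e820__memblock_setup();		/* memblock knows all RAM ranges */
	...
	early_detect_acpi_memhotplug();	/* SRAT parsed early; hotpluggable
					   ranges marked via
					   memblock_mark_hotplug() */
	init_mem_mapping();		/* page-table allocations can now
					   steer clear of movable nodes */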

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/setup.c | 15 +++++++++++++++
 include/linux/acpi.h    |  1 +
 2 files changed, 16 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index acbcd62..df4132c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -805,6 +805,20 @@ dump_kernel_offset(struct notifier_block *self, unsigned long v, void *p)
 	return 0;
 }
 
+/* only need the effect of acpi_numa_memory_affinity_init()
+ * ->memblock_mark_hotplug()
+ */
+static int early_detect_acpi_memhotplug(void)
+{
+#ifdef CONFIG_ACPI_NUMA
+	acpi_table_upgrade(__va(get_ramdisk_image()), get_ramdisk_size());
+	acpi_table_init();
+	acpi_numa_init();
+	acpi_tb_terminate();
+#endif
+	return 0;
+}
+
 /*
  * Determine if we were loaded by an EFI loader.  If so, then we have also been
  * passed the efi memmap, systab, etc., so we should use these data structures
@@ -1131,6 +1145,7 @@ void __init setup_arch(char **cmdline_p)
 	trim_platform_memory_ranges();
 	trim_low_memory_range();
 
+	early_detect_acpi_memhotplug();
 	init_mem_mapping();
 
 	idt_setup_early_pf();
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 44dcbba..1b69044 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -235,6 +235,7 @@ int acpi_mps_check (void);
 int acpi_numa_init (void);
 
 int acpi_table_init (void);
+void acpi_tb_terminate(void);
 int acpi_table_parse(char *id, acpi_tbl_table_handler handler);
 int __init acpi_table_parse_entries(char *id, unsigned long table_size,
 			      int entry_id,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 3/4] x86/mm: set allowed range for memblock allocator
  2019-01-07  8:24 [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info Pingfan Liu
  2019-01-07  8:24 ` [RFC PATCH 1/4] acpi: change the topo of acpi_table_upgrade() Pingfan Liu
  2019-01-07  8:24 ` [RFC PATCH 2/4] x86/setup: parse acpi to get hotplug info before init_mem_mapping() Pingfan Liu
@ 2019-01-07  8:24 ` Pingfan Liu
  2019-01-07  8:24 ` [RFC PATCH 4/4] x86/mm: remove bottom-up allocation style for x86_64 Pingfan Liu
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Pingfan Liu @ 2019-01-07  8:24 UTC (permalink / raw)
  To: x86, linux-acpi
  Cc: Pingfan Liu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Rafael J. Wysocki, Len Brown, linux-kernel

Due to the incoming divergence of x86_32 and x86_64, there is a requirement
to set the allowed allocation range at the early boot stage.
This patch also includes a minor change to remove a redundant condition
check: referring to memblock_find_in_range_node(), memblock_find_in_range()
already protects itself from the case start > end.
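
The intended calling pattern, as used by the callers updated below and in the
rest of this series, is roughly:

	mapped_ram_size += init_range_memory_mapping(start, last_start);
	/* subsequent alloc_low_pages() calls may only pull pages from the
	 * window that has just been direct-mapped */
	set_alloc_range(min_pfn_mapped, max_pfn_mapped);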

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/mm/init.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index f905a23..84baa66 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -76,6 +76,14 @@ static unsigned long min_pfn_mapped;
 
 static bool __initdata can_use_brk_pgt = true;
 
+static unsigned long min_pfn_allowed;
+static unsigned long max_pfn_allowed;
+void set_alloc_range(unsigned long low, unsigned long high)
+{
+	min_pfn_allowed = low;
+	max_pfn_allowed = high;
+}
+
 /*
  * Pages returned are already directly mapped.
  *
@@ -100,12 +108,10 @@ __ref void *alloc_low_pages(unsigned int num)
 	if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) {
 		unsigned long ret = 0;
 
-		if (min_pfn_mapped < max_pfn_mapped) {
-			ret = memblock_find_in_range(
-					min_pfn_mapped << PAGE_SHIFT,
-					max_pfn_mapped << PAGE_SHIFT,
-					PAGE_SIZE * num , PAGE_SIZE);
-		}
+		ret = memblock_find_in_range(
+			min_pfn_allowed << PAGE_SHIFT,
+			max_pfn_allowed << PAGE_SHIFT,
+			PAGE_SIZE * num, PAGE_SIZE);
 		if (ret)
 			memblock_reserve(ret, PAGE_SIZE * num);
 		else if (can_use_brk_pgt)
@@ -588,14 +594,17 @@ static void __init memory_map_top_down(unsigned long map_start,
 			start = map_start;
 		mapped_ram_size += init_range_memory_mapping(start,
 							last_start);
+		set_alloc_range(min_pfn_mapped, max_pfn_mapped);
 		last_start = start;
 		min_pfn_mapped = last_start >> PAGE_SHIFT;
 		if (mapped_ram_size >= step_size)
 			step_size = get_new_step_size(step_size);
 	}
 
-	if (real_end < map_end)
+	if (real_end < map_end) {
 		init_range_memory_mapping(real_end, map_end);
+		set_alloc_range(min_pfn_mapped, max_pfn_mapped);
+	}
 }
 
 /**
@@ -636,6 +645,7 @@ static void __init memory_map_bottom_up(unsigned long map_start,
 		}
 
 		mapped_ram_size += init_range_memory_mapping(start, next);
+		set_alloc_range(min_pfn_mapped, max_pfn_mapped);
 		start = next;
 
 		if (mapped_ram_size >= step_size)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 4/4] x86/mm: remove bottom-up allocation style for x86_64
  2019-01-07  8:24 [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info Pingfan Liu
                   ` (2 preceding siblings ...)
  2019-01-07  8:24 ` [RFC PATCH 3/4] x86/mm: set allowed range for memblock allocator Pingfan Liu
@ 2019-01-07  8:24 ` Pingfan Liu
  2019-01-07 17:42   ` Dave Hansen
  2019-01-07 17:03 ` [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info Dave Hansen
  2019-01-08 10:05   ` Chao Fan
  5 siblings, 1 reply; 19+ messages in thread
From: Pingfan Liu @ 2019-01-07  8:24 UTC (permalink / raw)
  To: x86, linux-acpi
  Cc: Pingfan Liu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Rafael J. Wysocki, Len Brown, linux-kernel

This patch achieves two things.
-1st. Keep the subtree of pgtable away from the movable node.
Background on the defect of the current bottom-up allocation style; take the
following scenario:
  |      unmovable node      |            movable node            |
        | kaslr-kernel | subtree of pgtable for phys<->virt |

Although the KASLR kernel can avoid staining the movable node [1], the page
tables can still stain it. This is a probability problem: the probability is
low, but it exists. This patch tries to eliminate that probability. With the
previous patches, by the time init_mem_mapping() runs, the memblock allocator
already knows the ACPI memory hotplug info and can avoid staining the movable
node. As a result, memory_map_bottom_up() is not needed any more.

-2nd. Simplify the logic of memory_map_top_down().
Thanks to the help of early_make_pgtable(), x86_64 can set up the subtree of
pgtable directly at any place, hence the careful iteration in
memory_map_top_down() can be discarded.

[1]: https://lore.kernel.org/patchwork/patch/1029376/
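
Pieced together from the hunks below, the resulting x86_64 path of
init_mem_mapping() looks roughly like this (simplified, context omitted):

	end = max_pfn << PAGE_SHIFT;
	/* pgtable pages may come from anywhere above 1MB; memblock already
	 * knows the hotplug info thanks to patch 2/4 */
	set_alloc_range(0x100000, end);
	...
	init_range_memory_mapping(ISA_END_ADDRESS, end);
	if (max_pfn > max_low_pfn)
		max_low_pfn = max_pfn;
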
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: linux-kernel@vger.kernel.org

---
 arch/x86/mm/init.c        | 140 +++-------------------------------------------
 arch/x86/mm/init_32.c     | 123 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/mm/mm_internal.h |   7 +++
 3 files changed, 139 insertions(+), 131 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 84baa66..4e0286b 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -72,8 +72,6 @@ static unsigned long __initdata pgt_buf_start;
 static unsigned long __initdata pgt_buf_end;
 static unsigned long __initdata pgt_buf_top;
 
-static unsigned long min_pfn_mapped;
-
 static bool __initdata can_use_brk_pgt = true;
 
 static unsigned long min_pfn_allowed;
@@ -504,7 +502,7 @@ unsigned long __ref init_memory_mapping(unsigned long start,
  * That range would have hole in the middle or ends, and only ram parts
  * will be mapped in init_range_memory_mapping().
  */
-static unsigned long __init init_range_memory_mapping(
+unsigned long __init init_range_memory_mapping(
 					   unsigned long r_start,
 					   unsigned long r_end)
 {
@@ -532,127 +530,6 @@ static unsigned long __init init_range_memory_mapping(
 	return mapped_ram_size;
 }
 
-static unsigned long __init get_new_step_size(unsigned long step_size)
-{
-	/*
-	 * Initial mapped size is PMD_SIZE (2M).
-	 * We can not set step_size to be PUD_SIZE (1G) yet.
-	 * In worse case, when we cross the 1G boundary, and
-	 * PG_LEVEL_2M is not set, we will need 1+1+512 pages (2M + 8k)
-	 * to map 1G range with PTE. Hence we use one less than the
-	 * difference of page table level shifts.
-	 *
-	 * Don't need to worry about overflow in the top-down case, on 32bit,
-	 * when step_size is 0, round_down() returns 0 for start, and that
-	 * turns it into 0x100000000ULL.
-	 * In the bottom-up case, round_up(x, 0) returns 0 though too, which
-	 * needs to be taken into consideration by the code below.
-	 */
-	return step_size << (PMD_SHIFT - PAGE_SHIFT - 1);
-}
-
-/**
- * memory_map_top_down - Map [map_start, map_end) top down
- * @map_start: start address of the target memory range
- * @map_end: end address of the target memory range
- *
- * This function will setup direct mapping for memory range
- * [map_start, map_end) in top-down. That said, the page tables
- * will be allocated at the end of the memory, and we map the
- * memory in top-down.
- */
-static void __init memory_map_top_down(unsigned long map_start,
-				       unsigned long map_end)
-{
-	unsigned long real_end, start, last_start;
-	unsigned long step_size;
-	unsigned long addr;
-	unsigned long mapped_ram_size = 0;
-
-	/* xen has big range in reserved near end of ram, skip it at first.*/
-	addr = memblock_find_in_range(map_start, map_end, PMD_SIZE, PMD_SIZE);
-	real_end = addr + PMD_SIZE;
-
-	/* step_size need to be small so pgt_buf from BRK could cover it */
-	step_size = PMD_SIZE;
-	max_pfn_mapped = 0; /* will get exact value next */
-	min_pfn_mapped = real_end >> PAGE_SHIFT;
-	last_start = start = real_end;
-
-	/*
-	 * We start from the top (end of memory) and go to the bottom.
-	 * The memblock_find_in_range() gets us a block of RAM from the
-	 * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
-	 * for page table.
-	 */
-	while (last_start > map_start) {
-		if (last_start > step_size) {
-			start = round_down(last_start - 1, step_size);
-			if (start < map_start)
-				start = map_start;
-		} else
-			start = map_start;
-		mapped_ram_size += init_range_memory_mapping(start,
-							last_start);
-		set_alloc_range(min_pfn_mapped, max_pfn_mapped);
-		last_start = start;
-		min_pfn_mapped = last_start >> PAGE_SHIFT;
-		if (mapped_ram_size >= step_size)
-			step_size = get_new_step_size(step_size);
-	}
-
-	if (real_end < map_end) {
-		init_range_memory_mapping(real_end, map_end);
-		set_alloc_range(min_pfn_mapped, max_pfn_mapped);
-	}
-}
-
-/**
- * memory_map_bottom_up - Map [map_start, map_end) bottom up
- * @map_start: start address of the target memory range
- * @map_end: end address of the target memory range
- *
- * This function will setup direct mapping for memory range
- * [map_start, map_end) in bottom-up. Since we have limited the
- * bottom-up allocation above the kernel, the page tables will
- * be allocated just above the kernel and we map the memory
- * in [map_start, map_end) in bottom-up.
- */
-static void __init memory_map_bottom_up(unsigned long map_start,
-					unsigned long map_end)
-{
-	unsigned long next, start;
-	unsigned long mapped_ram_size = 0;
-	/* step_size need to be small so pgt_buf from BRK could cover it */
-	unsigned long step_size = PMD_SIZE;
-
-	start = map_start;
-	min_pfn_mapped = start >> PAGE_SHIFT;
-
-	/*
-	 * We start from the bottom (@map_start) and go to the top (@map_end).
-	 * The memblock_find_in_range() gets us a block of RAM from the
-	 * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
-	 * for page table.
-	 */
-	while (start < map_end) {
-		if (step_size && map_end - start > step_size) {
-			next = round_up(start + 1, step_size);
-			if (next > map_end)
-				next = map_end;
-		} else {
-			next = map_end;
-		}
-
-		mapped_ram_size += init_range_memory_mapping(start, next);
-		set_alloc_range(min_pfn_mapped, max_pfn_mapped);
-		start = next;
-
-		if (mapped_ram_size >= step_size)
-			step_size = get_new_step_size(step_size);
-	}
-}
-
 void __init init_mem_mapping(void)
 {
 	unsigned long end;
@@ -663,6 +540,7 @@ void __init init_mem_mapping(void)
 
 #ifdef CONFIG_X86_64
 	end = max_pfn << PAGE_SHIFT;
+	set_alloc_range(0x100000, end);
 #else
 	end = max_low_pfn << PAGE_SHIFT;
 #endif
@@ -673,6 +551,13 @@ void __init init_mem_mapping(void)
 	/* Init the trampoline, possibly with KASLR memory offset */
 	init_trampoline();
 
+#ifdef CONFIG_X86_64
+	init_range_memory_mapping(ISA_END_ADDRESS, end);
+	if (max_pfn > max_low_pfn) {
+		/* can we preseve max_low_pfn ?*/
+		max_low_pfn = max_pfn;
+	}
+#else
 	/*
 	 * If the allocation is in bottom-up direction, we setup direct mapping
 	 * in bottom-up, otherwise we setup direct mapping in top-down.
@@ -692,13 +577,6 @@ void __init init_mem_mapping(void)
 	} else {
 		memory_map_top_down(ISA_END_ADDRESS, end);
 	}
-
-#ifdef CONFIG_X86_64
-	if (max_pfn > max_low_pfn) {
-		/* can we preseve max_low_pfn ?*/
-		max_low_pfn = max_pfn;
-	}
-#else
 	early_ioremap_page_table_range_init();
 #endif
 
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 85c94f9..ecf7243 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -58,6 +58,8 @@ unsigned long highstart_pfn, highend_pfn;
 
 bool __read_mostly __vmalloc_start_set = false;
 
+static unsigned long min_pfn_mapped;
+
 /*
  * Creates a middle page table and puts a pointer to it in the
  * given global directory entry. This only returns the gd entry
@@ -516,6 +518,127 @@ void __init native_pagetable_init(void)
 	paging_init();
 }
 
+static unsigned long __init get_new_step_size(unsigned long step_size)
+{
+	/*
+	 * Initial mapped size is PMD_SIZE (2M).
+	 * We can not set step_size to be PUD_SIZE (1G) yet.
+	 * In worse case, when we cross the 1G boundary, and
+	 * PG_LEVEL_2M is not set, we will need 1+1+512 pages (2M + 8k)
+	 * to map 1G range with PTE. Hence we use one less than the
+	 * difference of page table level shifts.
+	 *
+	 * Don't need to worry about overflow in the top-down case, on 32bit,
+	 * when step_size is 0, round_down() returns 0 for start, and that
+	 * turns it into 0x100000000ULL.
+	 * In the bottom-up case, round_up(x, 0) returns 0 though too, which
+	 * needs to be taken into consideration by the code below.
+	 */
+	return step_size << (PMD_SHIFT - PAGE_SHIFT - 1);
+}
+
+/**
+ * memory_map_top_down - Map [map_start, map_end) top down
+ * @map_start: start address of the target memory range
+ * @map_end: end address of the target memory range
+ *
+ * This function will setup direct mapping for memory range
+ * [map_start, map_end) in top-down. That said, the page tables
+ * will be allocated at the end of the memory, and we map the
+ * memory in top-down.
+ */
+void __init memory_map_top_down(unsigned long map_start,
+				       unsigned long map_end)
+{
+	unsigned long real_end, start, last_start;
+	unsigned long step_size;
+	unsigned long addr;
+	unsigned long mapped_ram_size = 0;
+
+	/* xen has big range in reserved near end of ram, skip it at first.*/
+	addr = memblock_find_in_range(map_start, map_end, PMD_SIZE, PMD_SIZE);
+	real_end = addr + PMD_SIZE;
+
+	/* step_size need to be small so pgt_buf from BRK could cover it */
+	step_size = PMD_SIZE;
+	max_pfn_mapped = 0; /* will get exact value next */
+	min_pfn_mapped = real_end >> PAGE_SHIFT;
+	last_start = start = real_end;
+
+	/*
+	 * We start from the top (end of memory) and go to the bottom.
+	 * The memblock_find_in_range() gets us a block of RAM from the
+	 * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
+	 * for page table.
+	 */
+	while (last_start > map_start) {
+		if (last_start > step_size) {
+			start = round_down(last_start - 1, step_size);
+			if (start < map_start)
+				start = map_start;
+		} else
+			start = map_start;
+		mapped_ram_size += init_range_memory_mapping(start,
+							last_start);
+		set_alloc_range(min_pfn_mapped, max_pfn_mapped);
+		last_start = start;
+		min_pfn_mapped = last_start >> PAGE_SHIFT;
+		if (mapped_ram_size >= step_size)
+			step_size = get_new_step_size(step_size);
+	}
+
+	if (real_end < map_end) {
+		init_range_memory_mapping(real_end, map_end);
+		set_alloc_range(min_pfn_mapped, max_pfn_mapped);
+	}
+}
+
+/**
+ * memory_map_bottom_up - Map [map_start, map_end) bottom up
+ * @map_start: start address of the target memory range
+ * @map_end: end address of the target memory range
+ *
+ * This function will setup direct mapping for memory range
+ * [map_start, map_end) in bottom-up. Since we have limited the
+ * bottom-up allocation above the kernel, the page tables will
+ * be allocated just above the kernel and we map the memory
+ * in [map_start, map_end) in bottom-up.
+ */
+void __init memory_map_bottom_up(unsigned long map_start,
+					unsigned long map_end)
+{
+	unsigned long next, start;
+	unsigned long mapped_ram_size = 0;
+	/* step_size need to be small so pgt_buf from BRK could cover it */
+	unsigned long step_size = PMD_SIZE;
+
+	start = map_start;
+	min_pfn_mapped = start >> PAGE_SHIFT;
+
+	/*
+	 * We start from the bottom (@map_start) and go to the top (@map_end).
+	 * The memblock_find_in_range() gets us a block of RAM from the
+	 * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
+	 * for page table.
+	 */
+	while (start < map_end) {
+		if (step_size && map_end - start > step_size) {
+			next = round_up(start + 1, step_size);
+			if (next > map_end)
+				next = map_end;
+		} else {
+			next = map_end;
+		}
+
+		mapped_ram_size += init_range_memory_mapping(start, next);
+		set_alloc_range(min_pfn_mapped, max_pfn_mapped);
+		start = next;
+
+		if (mapped_ram_size >= step_size)
+			step_size = get_new_step_size(step_size);
+	}
+}
+
 /*
  * Build a proper pagetable for the kernel mappings.  Up until this
  * point, we've been running on some set of pagetables constructed by
diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h
index 319bde3..28006de 100644
--- a/arch/x86/mm/mm_internal.h
+++ b/arch/x86/mm/mm_internal.h
@@ -8,6 +8,13 @@ static inline void *alloc_low_page(void)
 	return alloc_low_pages(1);
 }
 
+unsigned long __init init_range_memory_mapping(unsigned long r_start,
+	unsigned long r_end);
+void set_alloc_range(unsigned long low, unsigned long high);
+void __init memory_map_top_down(unsigned long map_start,
+				       unsigned long map_end);
+void __init memory_map_bottom_up(unsigned long map_start,
+					unsigned long map_end);
 void early_ioremap_page_table_range_init(void);
 
 unsigned long kernel_physical_mapping_init(unsigned long start,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/4] acpi: change the topo of acpi_table_upgrade()
  2019-01-07  8:24 ` [RFC PATCH 1/4] acpi: change the topo of acpi_table_upgrade() Pingfan Liu
@ 2019-01-07 10:55   ` Rafael J. Wysocki
  0 siblings, 0 replies; 19+ messages in thread
From: Rafael J. Wysocki @ 2019-01-07 10:55 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: the arch/x86 maintainers, ACPI Devel Maling List,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Rafael J. Wysocki,
	Len Brown, Linux Kernel Mailing List

On Mon, Jan 7, 2019 at 9:25 AM Pingfan Liu <kernelfans@gmail.com> wrote:
>
> The current acpi_table_upgrade() relies on initrd_start, but this var is
> only valid after relocate_initrd(). There is requirement to extract the
> acpi info from initrd before memblock-allocator can work(see [2/4]), hence
> acpi_table_upgrade() need to accept the input param directly.
>
> Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Len Brown <lenb@kernel.org>
> Cc: linux-kernel@vger.kernel.org

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  arch/arm64/kernel/setup.c | 2 +-
>  arch/x86/kernel/setup.c   | 2 +-
>  drivers/acpi/tables.c     | 4 +---
>  include/linux/acpi.h      | 4 ++--
>  4 files changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> index 4b0e123..48cb98c 100644
> --- a/arch/arm64/kernel/setup.c
> +++ b/arch/arm64/kernel/setup.c
> @@ -315,7 +315,7 @@ void __init setup_arch(char **cmdline_p)
>         paging_init();
>         efi_apply_persistent_mem_reservations();
>
> -       acpi_table_upgrade();
> +       acpi_table_upgrade((void *)initrd_start, initrd_end - initrd_start);
>
>         /* Parse the ACPI tables for possible boot-time configuration */
>         acpi_boot_table_init();
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 3d872a5..acbcd62 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1175,8 +1175,8 @@ void __init setup_arch(char **cmdline_p)
>
>         reserve_initrd();
>
> -       acpi_table_upgrade();
>
> +       acpi_table_upgrade((void *)initrd_start, initrd_end - initrd_start);
>         vsmp_init();
>
>         io_delay_init();
> diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
> index 48eabb6..d29b05c 100644
> --- a/drivers/acpi/tables.c
> +++ b/drivers/acpi/tables.c
> @@ -471,10 +471,8 @@ static DECLARE_BITMAP(acpi_initrd_installed, NR_ACPI_INITRD_TABLES);
>
>  #define MAP_CHUNK_SIZE   (NR_FIX_BTMAPS << PAGE_SHIFT)
>
> -void __init acpi_table_upgrade(void)
> +void __init acpi_table_upgrade(void *data, size_t size)
>  {
> -       void *data = (void *)initrd_start;
> -       size_t size = initrd_end - initrd_start;
>         int sig, no, table_nr = 0, total_offset = 0;
>         long offset = 0;
>         struct acpi_table_header *table;
> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> index 87715f2..44dcbba 100644
> --- a/include/linux/acpi.h
> +++ b/include/linux/acpi.h
> @@ -1272,9 +1272,9 @@ acpi_graph_get_remote_endpoint(const struct fwnode_handle *fwnode,
>  #endif
>
>  #ifdef CONFIG_ACPI_TABLE_UPGRADE
> -void acpi_table_upgrade(void);
> +void acpi_table_upgrade(void *data, size_t size);
>  #else
> -static inline void acpi_table_upgrade(void) { }
> +static inline void acpi_table_upgrade(void *data, size_t size) { }
>  #endif
>
>  #if defined(CONFIG_ACPI) && defined(CONFIG_ACPI_WATCHDOG)
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 2/4] x86/setup: parse acpi to get hotplug info before init_mem_mapping()
  2019-01-07  8:24 ` [RFC PATCH 2/4] x86/setup: parse acpi to get hotplug info before init_mem_mapping() Pingfan Liu
@ 2019-01-07 12:52   ` Pingfan Liu
  2019-01-07 17:11   ` Dave Hansen
  1 sibling, 0 replies; 19+ messages in thread
From: Pingfan Liu @ 2019-01-07 12:52 UTC (permalink / raw)
  To: x86, linux-acpi
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Rafael J. Wysocki,
	Len Brown, linux-kernel

On Mon, Jan 7, 2019 at 4:25 PM Pingfan Liu <kernelfans@gmail.com> wrote:
>
> At present, memblock bottom-up allocation can help us against stamping over
> movable node in very high probability. But if the hotplug info has already
> been parsed, the memblock allocator can step around the movable node by
> itself. This patch pushes the parsing step forward, just ahead of where,
> the memblock allocator can work. Later in this series, the bottom-up
> allocation style can be removed on x86_64.
>
> Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Len Brown <lenb@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> ---
>  arch/x86/kernel/setup.c | 15 +++++++++++++++
>  include/linux/acpi.h    |  1 +
>  2 files changed, 16 insertions(+)
>
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index acbcd62..df4132c 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -805,6 +805,20 @@ dump_kernel_offset(struct notifier_block *self, unsigned long v, void *p)
>         return 0;
>  }
>
> +/* only need the effect of acpi_numa_memory_affinity_init()
> + * ->memblock_mark_hotplug()
> + */
> +static int early_detect_acpi_memhotplug(void)
> +{
> +#ifdef CONFIG_ACPI_NUMA
> +       acpi_table_upgrade(__va(get_ramdisk_image()), get_ramdisk_size());
> +       acpi_table_init();
> +       acpi_numa_init();

As this is the RFC version, I have not suppressed the extra printk output
from these calls yet. I will do it in the next version.
> +       acpi_tb_terminate();
> +#endif
> +       return 0;
> +}
> +
>  /*
>   * Determine if we were loaded by an EFI loader.  If so, then we have also been
>   * passed the efi memmap, systab, etc., so we should use these data structures
> @@ -1131,6 +1145,7 @@ void __init setup_arch(char **cmdline_p)
>         trim_platform_memory_ranges();
>         trim_low_memory_range();
>
> +       early_detect_acpi_memhotplug();
>         init_mem_mapping();
>
>         idt_setup_early_pf();
> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> index 44dcbba..1b69044 100644
> --- a/include/linux/acpi.h
> +++ b/include/linux/acpi.h
> @@ -235,6 +235,7 @@ int acpi_mps_check (void);
>  int acpi_numa_init (void);
>
>  int acpi_table_init (void);
> +void acpi_tb_terminate(void);
>  int acpi_table_parse(char *id, acpi_tbl_table_handler handler);
>  int __init acpi_table_parse_entries(char *id, unsigned long table_size,
>                               int entry_id,
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info
  2019-01-07  8:24 [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info Pingfan Liu
                   ` (3 preceding siblings ...)
  2019-01-07  8:24 ` [RFC PATCH 4/4] x86/mm: remove bottom-up allocation style for x86_64 Pingfan Liu
@ 2019-01-07 17:03 ` Dave Hansen
  2019-01-08  5:49   ` Pingfan Liu
  2019-01-08 10:05   ` Chao Fan
  5 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2019-01-07 17:03 UTC (permalink / raw)
  To: Pingfan Liu, x86, linux-acpi
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Rafael J. Wysocki,
	Len Brown, linux-kernel

On 1/7/19 12:24 AM, Pingfan Liu wrote:
> Background about the defect of the current bottom-up allocation style, take
> the following scenario:
>   |  unmovable node |     movable node                           |
>      | kaslr-kernel |subtree of pgtable for phy<->virt |
> 
> Although kaslr-kernel can avoid to stain the movable node. But the
> pgtable can still stain the movable node. That is a probability problem,
> with low probability, but still exist. This patch tries to eliminate the
> probability. With the previous patch, at the point of init_mem_mapping(),
> memblock allocator can work with the knowledge of acpi memory hotmovable
> info, and avoid to stain the movable node. As a result,
> memory_map_bottom_up() is not needed any more.

I'm really missing the basic problem statement.  What's the problem this
is fixing?  What is the end-user-visible impact of this problem?

To make memory hot-remove work, we want as much memory as possible to be
hot-removable, which is basically what movable nodes are used for.  But,
it sounds like, maybe, that KASLR can place the kernel image inside the
movable node.  This is somehow related to the bottom-up allocation style
currently in use.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 2/4] x86/setup: parse acpi to get hotplug info before init_mem_mapping()
  2019-01-07  8:24 ` [RFC PATCH 2/4] x86/setup: parse acpi to get hotplug info before init_mem_mapping() Pingfan Liu
  2019-01-07 12:52   ` Pingfan Liu
@ 2019-01-07 17:11   ` Dave Hansen
  2019-01-08  6:30     ` Pingfan Liu
  1 sibling, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2019-01-07 17:11 UTC (permalink / raw)
  To: Pingfan Liu, x86, linux-acpi
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Rafael J. Wysocki,
	Len Brown, linux-kernel


On 1/7/19 12:24 AM, Pingfan Liu wrote:
> At present, memblock bottom-up allocation can help us against stamping over
> movable node in very high probability.

Is this what you are fixing?  Making a "high probability", a certainty?
 Is this the problem?

> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index acbcd62..df4132c 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -805,6 +805,20 @@ dump_kernel_offset(struct notifier_block *self, unsigned long v, void *p)
>  	return 0;
>  }
>  
> +/* only need the effect of acpi_numa_memory_affinity_init()
> + * ->memblock_mark_hotplug()
> + */

CodingStyle, please.

> +static int early_detect_acpi_memhotplug(void)
> +{
> +#ifdef CONFIG_ACPI_NUMA
> +	acpi_table_upgrade(__va(get_ramdisk_image()), get_ramdisk_size());

This adds a new, early, call to acpi_table_upgrade(), and presumably all
the following functions.  However, it does not remove any of the later
calls.  How do they interact with each other now that they are
presumably called twice?

> +	acpi_table_init();
> +	acpi_numa_init();
> +	acpi_tb_terminate();
> +#endif
> +	return 0;
> +}

Why does this return an 'int' that is unconsumed by its lone caller?

There seems to be a lack of comments on this newly-added code.

>  /*
>   * Determine if we were loaded by an EFI loader.  If so, then we have also been
>   * passed the efi memmap, systab, etc., so we should use these data structures
> @@ -1131,6 +1145,7 @@ void __init setup_arch(char **cmdline_p)
>  	trim_platform_memory_ranges();
>  	trim_low_memory_range();
>  
> +	early_detect_acpi_memhotplug();

Comments, please.  Why is this call here, specifically?  What is it doing?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 4/4] x86/mm: remove bottom-up allocation style for x86_64
  2019-01-07  8:24 ` [RFC PATCH 4/4] x86/mm: remove bottom-up allocation style for x86_64 Pingfan Liu
@ 2019-01-07 17:42   ` Dave Hansen
  2019-01-08  6:13     ` Pingfan Liu
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2019-01-07 17:42 UTC (permalink / raw)
  To: Pingfan Liu, x86, linux-acpi
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Rafael J. Wysocki,
	Len Brown, linux-kernel

On 1/7/19 12:24 AM, Pingfan Liu wrote:
> There are two acheivements by this patch.
> -1st. keep the subtree of pgtable away from movable node.
> Background about the defect of the current bottom-up allocation style, take
> the following scenario:
>   |  unmovable node |     movable node                           |
>      | kaslr-kernel |subtree of pgtable for phy<->virt |



> Although kaslr-kernel can avoid to stain the movable node. [1] But the
> pgtable can still stain the movable node. That is a probability problem,
> with low probability, but still exist. This patch tries to eliminate the
> probability. With the previous patch, at the point of init_mem_mapping(),
> memblock allocator can work with the knowledge of acpi memory hotmovable
> info, and avoid to stain the movable node. As a result,
> memory_map_bottom_up() is not needed any more.
> 
> -2nd. simplify the logic of memory_map_top_down()
> Thanks to the help of early_make_pgtable(), x86_64 can directly set up the
> subtree of pgtable at any place, hence the careful iteration in
> memory_map_top_down() can be discard.

>  void __init init_mem_mapping(void)
>  {
>  	unsigned long end;
> @@ -663,6 +540,7 @@ void __init init_mem_mapping(void)
>  
>  #ifdef CONFIG_X86_64
>  	end = max_pfn << PAGE_SHIFT;
> +	set_alloc_range(0x100000, end);
>  #else

Why is this 0x100000 open-coded?  Why is this needed *now*?


>  	/*
>  	 * If the allocation is in bottom-up direction, we setup direct mapping
>  	 * in bottom-up, otherwise we setup direct mapping in top-down.
> @@ -692,13 +577,6 @@ void __init init_mem_mapping(void)
>  	} else {
>  		memory_map_top_down(ISA_END_ADDRESS, end);
>  	}
> -
> -#ifdef CONFIG_X86_64
> -	if (max_pfn > max_low_pfn) {
> -		/* can we preseve max_low_pfn ?*/
> -		max_low_pfn = max_pfn;
> -	}
> -#else
>  	early_ioremap_page_table_range_init();
>  #endif
>  
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index 85c94f9..ecf7243 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -58,6 +58,8 @@ unsigned long highstart_pfn, highend_pfn;
>  
>  bool __read_mostly __vmalloc_start_set = false;
>  
> +static unsigned long min_pfn_mapped;
> +
>  /*
>   * Creates a middle page table and puts a pointer to it in the
>   * given global directory entry. This only returns the gd entry
> @@ -516,6 +518,127 @@ void __init native_pagetable_init(void)
>  	paging_init();
>  }
>  
> +static unsigned long __init get_new_step_size(unsigned long step_size)
> +{
> +	/*
> +	 * Initial mapped size is PMD_SIZE (2M).
> +	 * We can not set step_size to be PUD_SIZE (1G) yet.
> +	 * In worse case, when we cross the 1G boundary, and
> +	 * PG_LEVEL_2M is not set, we will need 1+1+512 pages (2M + 8k)
> +	 * to map 1G range with PTE. Hence we use one less than the
> +	 * difference of page table level shifts.
> +	 *
> +	 * Don't need to worry about overflow in the top-down case, on 32bit,
> +	 * when step_size is 0, round_down() returns 0 for start, and that
> +	 * turns it into 0x100000000ULL.
> +	 * In the bottom-up case, round_up(x, 0) returns 0 though too, which
> +	 * needs to be taken into consideration by the code below.
> +	 */
> +	return step_size << (PMD_SHIFT - PAGE_SHIFT - 1);
> +}
> +
> +/**
> + * memory_map_top_down - Map [map_start, map_end) top down
> + * @map_start: start address of the target memory range
> + * @map_end: end address of the target memory range
> + *
> + * This function will setup direct mapping for memory range
> + * [map_start, map_end) in top-down. That said, the page tables
> + * will be allocated at the end of the memory, and we map the
> + * memory in top-down.
> + */
> +void __init memory_map_top_down(unsigned long map_start,
> +				       unsigned long map_end)
> +{
> +	unsigned long real_end, start, last_start;
> +	unsigned long step_size;
> +	unsigned long addr;
> +	unsigned long mapped_ram_size = 0;
> +
> +	/* xen has big range in reserved near end of ram, skip it at first.*/
> +	addr = memblock_find_in_range(map_start, map_end, PMD_SIZE, PMD_SIZE);
> +	real_end = addr + PMD_SIZE;
> +
> +	/* step_size need to be small so pgt_buf from BRK could cover it */
> +	step_size = PMD_SIZE;
> +	max_pfn_mapped = 0; /* will get exact value next */
> +	min_pfn_mapped = real_end >> PAGE_SHIFT;
> +	last_start = start = real_end;
> +
> +	/*
> +	 * We start from the top (end of memory) and go to the bottom.
> +	 * The memblock_find_in_range() gets us a block of RAM from the
> +	 * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
> +	 * for page table.
> +	 */
> +	while (last_start > map_start) {
> +		if (last_start > step_size) {
> +			start = round_down(last_start - 1, step_size);
> +			if (start < map_start)
> +				start = map_start;
> +		} else
> +			start = map_start;
> +		mapped_ram_size += init_range_memory_mapping(start,
> +							last_start);
> +		set_alloc_range(min_pfn_mapped, max_pfn_mapped);
> +		last_start = start;
> +		min_pfn_mapped = last_start >> PAGE_SHIFT;
> +		if (mapped_ram_size >= step_size)
> +			step_size = get_new_step_size(step_size);
> +	}
> +
> +	if (real_end < map_end) {
> +		init_range_memory_mapping(real_end, map_end);
> +		set_alloc_range(min_pfn_mapped, max_pfn_mapped);
> +	}
> +}
> +
> +/**
> + * memory_map_bottom_up - Map [map_start, map_end) bottom up
> + * @map_start: start address of the target memory range
> + * @map_end: end address of the target memory range
> + *
> + * This function will setup direct mapping for memory range
> + * [map_start, map_end) in bottom-up. Since we have limited the
> + * bottom-up allocation above the kernel, the page tables will
> + * be allocated just above the kernel and we map the memory
> + * in [map_start, map_end) in bottom-up.
> + */
> +void __init memory_map_bottom_up(unsigned long map_start,
> +					unsigned long map_end)
> +{
> +	unsigned long next, start;
> +	unsigned long mapped_ram_size = 0;
> +	/* step_size need to be small so pgt_buf from BRK could cover it */
> +	unsigned long step_size = PMD_SIZE;
> +
> +	start = map_start;
> +	min_pfn_mapped = start >> PAGE_SHIFT;
> +
> +	/*
> +	 * We start from the bottom (@map_start) and go to the top (@map_end).
> +	 * The memblock_find_in_range() gets us a block of RAM from the
> +	 * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
> +	 * for page table.
> +	 */
> +	while (start < map_end) {
> +		if (step_size && map_end - start > step_size) {
> +			next = round_up(start + 1, step_size);
> +			if (next > map_end)
> +				next = map_end;
> +		} else {
> +			next = map_end;
> +		}
> +
> +		mapped_ram_size += init_range_memory_mapping(start, next);
> +		set_alloc_range(min_pfn_mapped, max_pfn_mapped);
> +		start = next;
> +
> +		if (mapped_ram_size >= step_size)
> +			step_size = get_new_step_size(step_size);
> +	}
> +}

One more suggestion:  Can you *move* the code in a separate patch?
Un-use it in this patch, but wait for one more patch to actually move it.

>  /*
>   * Build a proper pagetable for the kernel mappings.  Up until this
>   * point, we've been running on some set of pagetables constructed by
> diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h
> index 319bde3..28006de 100644
> --- a/arch/x86/mm/mm_internal.h
> +++ b/arch/x86/mm/mm_internal.h
> @@ -8,6 +8,13 @@ static inline void *alloc_low_page(void)
>  	return alloc_low_pages(1);
>  }
>  
> +unsigned long __init init_range_memory_mapping(unsigned long r_start,
> +	unsigned long r_end);
> +void set_alloc_range(unsigned long low, unsigned long high);
> +void __init memory_map_top_down(unsigned long map_start,
> +				       unsigned long map_end);
> +void __init memory_map_bottom_up(unsigned long map_start,
> +					unsigned long map_end);

Is there a reason we can't just move all these calls into init_32.c?

Seems like we probably just want one, new function, like:

	init_mem_mapping_x86_32(end);

And then we just export *that* instead of exporting all of these helpers
that only get used on x86_32.  It also makes init_mem_mapping() more
readable since the #ifdef's are shorter.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info
  2019-01-07 17:03 ` [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info Dave Hansen
@ 2019-01-08  5:49   ` Pingfan Liu
  0 siblings, 0 replies; 19+ messages in thread
From: Pingfan Liu @ 2019-01-08  5:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-acpi, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Rafael J. Wysocki, Len Brown, linux-kernel

On Tue, Jan 8, 2019 at 1:04 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/7/19 12:24 AM, Pingfan Liu wrote:
> > Background about the defect of the current bottom-up allocation style, take
> > the following scenario:
> >   |  unmovable node |     movable node                           |
> >      | kaslr-kernel |subtree of pgtable for phy<->virt |
> >
> > Although kaslr-kernel can avoid to stain the movable node. But the
> > pgtable can still stain the movable node. That is a probability problem,
> > with low probability, but still exist. This patch tries to eliminate the
> > probability. With the previous patch, at the point of init_mem_mapping(),
> > memblock allocator can work with the knowledge of acpi memory hotmovable
> > info, and avoid to stain the movable node. As a result,
> > memory_map_bottom_up() is not needed any more.
>
> I'm really missing the basic problem statement.  What's the problem this
> is fixing?  What is the end-user-visible impact of this problem?
>
Sorry for the misaligned figure. It should be
   |  kaslr-kernel    | subtree of pgtable for phys<->virt    |
                              |--- boundary between unmovable node and movable node
where the KASLR kernel can be guaranteed to sit inside the unmovable node
after the patch https://lore.kernel.org/patchwork/patch/1029376/. But if the
KASLR kernel is located near the end of the unmovable node, then the
bottom-up allocator may create a pagetable which crosses the boundary between
the unmovable node and the movable node.  It is a probability issue; the
factors include -1. how big the gap is between the kernel end and the
unmovable node's end, and -2. how much memory the system owns.
An alternative way to fix this issue is to increase the gap in
boot/compressed/kaslr*. But taking the scenario of PB-level memory, the
pagetable will take several MB even with 1GB pages, so it is hard to decide
how much the gap should increase.
In a word, this series replaces the probability with certainty, by allocating
the pagetable on the unmovable node instead of following the kernel end.
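
As a rough back-of-the-envelope figure (my own estimate, assuming 4 KiB
page-table pages with 512 entries of 8 bytes each): one PUD-level table page
maps 512 GiB with 1 GiB pages, so 1 PB of RAM needs 2^50 / 2^39 = 2048 such
pages, i.e. 8 MiB of PUD tables alone, before counting the levels above. A
fixed gap reserved by boot/compressed/kaslr* would therefore have to grow
with the amount of RAM, which is why it is hard to size.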

> To make memory hot-remove work, we want as much memory as possible to he
> hot-removable, which is basically what movable nodes are used for.  But,
> it sounds like, maybe, that KASLR can place the kernel image inside the
> movable node.  This is somehow related to the bottom-up allocation style
> currently in use.

Yes, currently the KASLR kernel can stain the movable node, but it will no
longer do so after the patch:
https://lore.kernel.org/patchwork/patch/1029376/

Thanks,
Pingfan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 4/4] x86/mm: remove bottom-up allocation style for x86_64
  2019-01-07 17:42   ` Dave Hansen
@ 2019-01-08  6:13     ` Pingfan Liu
  2019-01-08  6:37       ` Juergen Gross
  2019-01-08 17:32       ` Dave Hansen
  0 siblings, 2 replies; 19+ messages in thread
From: Pingfan Liu @ 2019-01-08  6:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-acpi, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Rafael J. Wysocki, Len Brown, linux-kernel

On Tue, Jan 8, 2019 at 1:42 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/7/19 12:24 AM, Pingfan Liu wrote:
> > There are two acheivements by this patch.
> > -1st. keep the subtree of pgtable away from movable node.
> > Background about the defect of the current bottom-up allocation style, take
> > the following scenario:
> >   |  unmovable node |     movable node                           |
> >      | kaslr-kernel |subtree of pgtable for phy<->virt |
>
>
>
> > Although kaslr-kernel can avoid to stain the movable node. [1] But the
> > pgtable can still stain the movable node. That is a probability problem,
> > with low probability, but still exist. This patch tries to eliminate the
> > probability. With the previous patch, at the point of init_mem_mapping(),
> > memblock allocator can work with the knowledge of acpi memory hotmovable
> > info, and avoid to stain the movable node. As a result,
> > memory_map_bottom_up() is not needed any more.
> >
> > -2nd. simplify the logic of memory_map_top_down()
> > Thanks to the help of early_make_pgtable(), x86_64 can directly set up the
> > subtree of pgtable at any place, hence the careful iteration in
> > memory_map_top_down() can be discard.
>
> >  void __init init_mem_mapping(void)
> >  {
> >       unsigned long end;
> > @@ -663,6 +540,7 @@ void __init init_mem_mapping(void)
> >
> >  #ifdef CONFIG_X86_64
> >       end = max_pfn << PAGE_SHIFT;
> > +     set_alloc_range(0x100000, end);
> >  #else
>
> Why is this 0x100000 open-coded?  Why is this needed *now*?
>

Memory under 1MB is reserved for BIOS use. For x86_64, after
e820__memblock_setup(), the memblock allocator is already ready to work. But
there are two reasons for set_alloc_range(0x100000, end). The major one is
compatibility with x86_32: alloc_low_pages()->memblock_find_in_range() uses
[min_pfn_mapped, max_pfn_mapped] to limit the range that is ready to be
allocated from. The minor one is to prevent unexpected allocations from the
memblock allocator through alloc_low_pages() at a very early stage.
>
> >       /*
> >        * If the allocation is in bottom-up direction, we setup direct mapping
> >        * in bottom-up, otherwise we setup direct mapping in top-down.
> > @@ -692,13 +577,6 @@ void __init init_mem_mapping(void)
> >       } else {
> >               memory_map_top_down(ISA_END_ADDRESS, end);
> >       }
> > -
> > -#ifdef CONFIG_X86_64
> > -     if (max_pfn > max_low_pfn) {
> > -             /* can we preseve max_low_pfn ?*/
> > -             max_low_pfn = max_pfn;
> > -     }
> > -#else
> >       early_ioremap_page_table_range_init();
> >  #endif
> >
> > diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> > index 85c94f9..ecf7243 100644
> > --- a/arch/x86/mm/init_32.c
> > +++ b/arch/x86/mm/init_32.c
> > @@ -58,6 +58,8 @@ unsigned long highstart_pfn, highend_pfn;
> >
> >  bool __read_mostly __vmalloc_start_set = false;
> >
> > +static unsigned long min_pfn_mapped;
> > +
> >  /*
> >   * Creates a middle page table and puts a pointer to it in the
> >   * given global directory entry. This only returns the gd entry
> > @@ -516,6 +518,127 @@ void __init native_pagetable_init(void)
> >       paging_init();
> >  }
> >
> > +static unsigned long __init get_new_step_size(unsigned long step_size)
> > +{
> > +     /*
> > +      * Initial mapped size is PMD_SIZE (2M).
> > +      * We can not set step_size to be PUD_SIZE (1G) yet.
> > +      * In worse case, when we cross the 1G boundary, and
> > +      * PG_LEVEL_2M is not set, we will need 1+1+512 pages (2M + 8k)
> > +      * to map 1G range with PTE. Hence we use one less than the
> > +      * difference of page table level shifts.
> > +      *
> > +      * Don't need to worry about overflow in the top-down case, on 32bit,
> > +      * when step_size is 0, round_down() returns 0 for start, and that
> > +      * turns it into 0x100000000ULL.
> > +      * In the bottom-up case, round_up(x, 0) returns 0 though too, which
> > +      * needs to be taken into consideration by the code below.
> > +      */
> > +     return step_size << (PMD_SHIFT - PAGE_SHIFT - 1);
> > +}
> > +
> > +/**
> > + * memory_map_top_down - Map [map_start, map_end) top down
> > + * @map_start: start address of the target memory range
> > + * @map_end: end address of the target memory range
> > + *
> > + * This function will setup direct mapping for memory range
> > + * [map_start, map_end) in top-down. That said, the page tables
> > + * will be allocated at the end of the memory, and we map the
> > + * memory in top-down.
> > + */
> > +void __init memory_map_top_down(unsigned long map_start,
> > +                                    unsigned long map_end)
> > +{
> > +     unsigned long real_end, start, last_start;
> > +     unsigned long step_size;
> > +     unsigned long addr;
> > +     unsigned long mapped_ram_size = 0;
> > +
> > +     /* xen has big range in reserved near end of ram, skip it at first.*/
> > +     addr = memblock_find_in_range(map_start, map_end, PMD_SIZE, PMD_SIZE);
> > +     real_end = addr + PMD_SIZE;
> > +
> > +     /* step_size need to be small so pgt_buf from BRK could cover it */
> > +     step_size = PMD_SIZE;
> > +     max_pfn_mapped = 0; /* will get exact value next */
> > +     min_pfn_mapped = real_end >> PAGE_SHIFT;
> > +     last_start = start = real_end;
> > +
> > +     /*
> > +      * We start from the top (end of memory) and go to the bottom.
> > +      * The memblock_find_in_range() gets us a block of RAM from the
> > +      * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
> > +      * for page table.
> > +      */
> > +     while (last_start > map_start) {
> > +             if (last_start > step_size) {
> > +                     start = round_down(last_start - 1, step_size);
> > +                     if (start < map_start)
> > +                             start = map_start;
> > +             } else
> > +                     start = map_start;
> > +             mapped_ram_size += init_range_memory_mapping(start,
> > +                                                     last_start);
> > +             set_alloc_range(min_pfn_mapped, max_pfn_mapped);
> > +             last_start = start;
> > +             min_pfn_mapped = last_start >> PAGE_SHIFT;
> > +             if (mapped_ram_size >= step_size)
> > +                     step_size = get_new_step_size(step_size);
> > +     }
> > +
> > +     if (real_end < map_end) {
> > +             init_range_memory_mapping(real_end, map_end);
> > +             set_alloc_range(min_pfn_mapped, max_pfn_mapped);
> > +     }
> > +}
> > +
> > +/**
> > + * memory_map_bottom_up - Map [map_start, map_end) bottom up
> > + * @map_start: start address of the target memory range
> > + * @map_end: end address of the target memory range
> > + *
> > + * This function will setup direct mapping for memory range
> > + * [map_start, map_end) in bottom-up. Since we have limited the
> > + * bottom-up allocation above the kernel, the page tables will
> > + * be allocated just above the kernel and we map the memory
> > + * in [map_start, map_end) in bottom-up.
> > + */
> > +void __init memory_map_bottom_up(unsigned long map_start,
> > +                                     unsigned long map_end)
> > +{
> > +     unsigned long next, start;
> > +     unsigned long mapped_ram_size = 0;
> > +     /* step_size need to be small so pgt_buf from BRK could cover it */
> > +     unsigned long step_size = PMD_SIZE;
> > +
> > +     start = map_start;
> > +     min_pfn_mapped = start >> PAGE_SHIFT;
> > +
> > +     /*
> > +      * We start from the bottom (@map_start) and go to the top (@map_end).
> > +      * The memblock_find_in_range() gets us a block of RAM from the
> > +      * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
> > +      * for page table.
> > +      */
> > +     while (start < map_end) {
> > +             if (step_size && map_end - start > step_size) {
> > +                     next = round_up(start + 1, step_size);
> > +                     if (next > map_end)
> > +                             next = map_end;
> > +             } else {
> > +                     next = map_end;
> > +             }
> > +
> > +             mapped_ram_size += init_range_memory_mapping(start, next);
> > +             set_alloc_range(min_pfn_mapped, max_pfn_mapped);
> > +             start = next;
> > +
> > +             if (mapped_ram_size >= step_size)
> > +                     step_size = get_new_step_size(step_size);
> > +     }
> > +}
>
> One more suggestion:  Can you *move* the code in a separate patch?
> Un-use it in this patch, but wait for one more patch to actually move it.
>

Good suggestion. It will make it easier to review. I will do it in the next version.
> >  /*
> >   * Build a proper pagetable for the kernel mappings.  Up until this
> >   * point, we've been running on some set of pagetables constructed by
> > diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h
> > index 319bde3..28006de 100644
> > --- a/arch/x86/mm/mm_internal.h
> > +++ b/arch/x86/mm/mm_internal.h
> > @@ -8,6 +8,13 @@ static inline void *alloc_low_page(void)
> >       return alloc_low_pages(1);
> >  }
> >
> > +unsigned long __init init_range_memory_mapping(unsigned long r_start,
> > +     unsigned long r_end);
> > +void set_alloc_range(unsigned long low, unsigned long high);
> > +void __init memory_map_top_down(unsigned long map_start,
> > +                                    unsigned long map_end);
> > +void __init memory_map_bottom_up(unsigned long map_start,
> > +                                     unsigned long map_end);
>
> Is there a reason we can't just move all these calls into init_32.c?
>
> Seems like we probably just want one, new function, like:
>
>         init_mem_mapping_x86_32(end);
>
> And then we just export *that* instead of exporting all of these helpers
> that only get used on x86_32.  It also makes init_mem_mapping() more
> readable since the #ifdef's are shorter.

Yes, I will do it like this.
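
A minimal sketch of what that could look like (hypothetical shape, only
to illustrate the suggestion, not the actual next version):

/* arch/x86/mm/init_32.c */
void __init init_mem_mapping_x86_32(unsigned long end)
{
	unsigned long kernel_end = __pa_symbol(_end);

	/* keep the bottom-up vs. top-down decision local to 32-bit */
	if (memblock_bottom_up())
		memory_map_bottom_up(kernel_end, end);
	else
		memory_map_top_down(ISA_END_ADDRESS, end);

	early_ioremap_page_table_range_init();
}

Then init_mem_mapping() only calls init_mem_mapping_x86_32(end) in the
CONFIG_X86_32 branch, and the helpers above can stay static in
init_32.c instead of being declared in mm_internal.h.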

Thanks for your kind review.

Regards,
Pingfan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 2/4] x86/setup: parse acpi to get hotplug info before init_mem_mapping()
  2019-01-07 17:11   ` Dave Hansen
@ 2019-01-08  6:30     ` Pingfan Liu
  0 siblings, 0 replies; 19+ messages in thread
From: Pingfan Liu @ 2019-01-08  6:30 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-acpi, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Rafael J. Wysocki, Len Brown, linux-kernel

On Tue, Jan 8, 2019 at 1:11 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
>
> On 1/7/19 12:24 AM, Pingfan Liu wrote:
> > At present, memblock bottom-up allocation can help us against stamping over
> > movable node in very high probability.
>
> Is this what you are fixing?  Making a "high probability", a certainty?
>  Is this the problem?
>

Yes, as explained in detail in my reply to another mail.
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index acbcd62..df4132c 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -805,6 +805,20 @@ dump_kernel_offset(struct notifier_block *self, unsigned long v, void *p)
> >       return 0;
> >  }
> >
> > +/* only need the effect of acpi_numa_memory_affinity_init()
> > + * ->memblock_mark_hotplug()
> > + */
>
> CodingStyle, please.
>

Will fix.
> > +static int early_detect_acpi_memhotplug(void)
> > +{
> > +#ifdef CONFIG_ACPI_NUMA
> > +     acpi_table_upgrade(__va(get_ramdisk_image()), get_ramdisk_size());
>
> This adds a new, early, call to acpi_table_upgrade(), and presumably all
> the following functions.  However, it does not remove any of the later
> calls.  How do they interact with each other now that they are
> presumably called twice?
>

ACPI is a big subsystem, and I have only hurried through these
functions. This group does not seem to allocate extra memory and works
on static data, so if called twice, the second call just overwrites the
effect of the first. The only issue is that some info is printed twice.
I will spend more time on this for the next version.
> > +     acpi_table_init();
> > +     acpi_numa_init();
> > +     acpi_tb_terminate();
> > +#endif
> > +     return 0;
> > +}
>
> Why does this return an 'int' that is unconsumed by its lone caller?
>

There is no special purpose for the return value; just a habit.
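
For the next version it can simply be declared void, e.g. (same body as
in the patch, only the return type and the bare "return 0" change):

static void __init early_detect_acpi_memhotplug(void)
{
#ifdef CONFIG_ACPI_NUMA
	/* parse SRAT early so memblock learns which ranges are hot-movable */
	acpi_table_upgrade(__va(get_ramdisk_image()), get_ramdisk_size());
	acpi_table_init();
	acpi_numa_init();
	acpi_tb_terminate();
#endif
}
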
> There seems to be a lack of comments on this newly-added code.
>
> >  /*
> >   * Determine if we were loaded by an EFI loader.  If so, then we have also been
> >   * passed the efi memmap, systab, etc., so we should use these data structures
> > @@ -1131,6 +1145,7 @@ void __init setup_arch(char **cmdline_p)
> >       trim_platform_memory_ranges();
> >       trim_low_memory_range();
> >
> > +     early_detect_acpi_memhotplug();
>
> Comments, please.  Why is this call here, specifically?  What is it doing?
>
It parses the ACPI SRAT to extract memory hot-movable info and feeds
that info to the memory allocator. The exact effect is:
acpi_numa_memory_affinity_init() -> memblock_mark_hotplug(). So later,
when the memblock allocator allocates a range in __next_mem_range(),
there is a condition check to skip movable nodes:
if (movable_node_is_enabled() && memblock_is_hotpluggable(m)) continue;
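
For reference, the relevant part of __next_mem_range() (trimmed and
paraphrased from mm/memblock.c of that time) looks roughly like:

	for_each_memblock_type(idx_a, type_a, m) {
		phys_addr_t m_start = m->base;
		phys_addr_t m_end = m->base + m->size;

		/* only memory regions are associated with nodes, check it */
		if (nid != NUMA_NO_NODE && nid != memblock_get_region_node(m))
			continue;

		/* skip hotpluggable memory regions if needed */
		if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
			continue;

		/* ... the rest intersects with type_b and yields a range ... */
	}

So once memblock_mark_hotplug() has run before init_mem_mapping(), the
page-table pages cannot be placed on a hot-movable node.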

Thanks,
Pingfan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 4/4] x86/mm: remove bottom-up allocation style for x86_64
  2019-01-08  6:13     ` Pingfan Liu
@ 2019-01-08  6:37       ` Juergen Gross
  2019-01-08 17:32       ` Dave Hansen
  1 sibling, 0 replies; 19+ messages in thread
From: Juergen Gross @ 2019-01-08  6:37 UTC (permalink / raw)
  To: Pingfan Liu, Dave Hansen
  Cc: x86, linux-acpi, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Rafael J. Wysocki, Len Brown, linux-kernel

On 08/01/2019 07:13, Pingfan Liu wrote:
> On Tue, Jan 8, 2019 at 1:42 AM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 1/7/19 12:24 AM, Pingfan Liu wrote:
>>> There are two acheivements by this patch.
>>> -1st. keep the subtree of pgtable away from movable node.
>>> Background about the defect of the current bottom-up allocation style, take
>>> the following scenario:
>>>   |  unmovable node |     movable node                           |
>>>      | kaslr-kernel |subtree of pgtable for phy<->virt |
>>
>>
>>
>>> Although kaslr-kernel can avoid to stain the movable node. [1] But the
>>> pgtable can still stain the movable node. That is a probability problem,
>>> with low probability, but still exist. This patch tries to eliminate the
>>> probability. With the previous patch, at the point of init_mem_mapping(),
>>> memblock allocator can work with the knowledge of acpi memory hotmovable
>>> info, and avoid to stain the movable node. As a result,
>>> memory_map_bottom_up() is not needed any more.
>>>
>>> -2nd. simplify the logic of memory_map_top_down()
>>> Thanks to the help of early_make_pgtable(), x86_64 can directly set up the
>>> subtree of pgtable at any place, hence the careful iteration in
>>> memory_map_top_down() can be discard.
>>
>>>  void __init init_mem_mapping(void)
>>>  {
>>>       unsigned long end;
>>> @@ -663,6 +540,7 @@ void __init init_mem_mapping(void)
>>>
>>>  #ifdef CONFIG_X86_64
>>>       end = max_pfn << PAGE_SHIFT;
>>> +     set_alloc_range(0x100000, end);
>>>  #else
>>
>> Why is this 0x100000 open-coded?  Why is this needed *now*?
>>
> 
> Memory under 1MB should be used by BIOS. For x86_64, after

Xen PV- and PVH-guests don't have that BIOS restriction.


Juergen

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info
  2019-01-07  8:24 [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info Pingfan Liu
@ 2019-01-08 10:05   ` Chao Fan
  2019-01-07  8:24 ` [RFC PATCH 2/4] x86/setup: parse acpi to get hotplug info before init_mem_mapping() Pingfan Liu
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Chao Fan @ 2019-01-08 10:05 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: x86, linux-acpi, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Rafael J. Wysocki, Len Brown, linux-kernel

On Mon, Jan 07, 2019 at 04:24:41PM +0800, Pingfan Liu wrote:
>Background about the defect of the current bottom-up allocation style, take
>the following scenario:
>  |  unmovable node |     movable node                           |
>     | kaslr-kernel |subtree of pgtable for phy<->virt |
>
>Although kaslr-kernel can avoid to stain the movable node. But the
>pgtable can still stain the movable node. That is a probability problem,
>with low probability, but still exist. This patch tries to eliminate the
>probability. With the previous patch, at the point of init_mem_mapping(),
>memblock allocator can work with the knowledge of acpi memory hotmovable
>info, and avoid to stain the movable node. As a result,
>memory_map_bottom_up() is not needed any more.
>

Hi Pingfan,

Tang Chen ever tried to do this before adding 'movable_node':
commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
Author: Tang Chen <tangchen@cn.fujitsu.com>
Date:   Fri Feb 22 16:33:44 2013 -0800

    acpi, memory-hotplug: parse SRAT before memblock is ready

Then, Lu Yinghai tried to do the similar job, you can see:
https://lwn.net/Articles/554854/
for more information. Hope that can help you.

Thanks,
Chao Fan

>
>Cc: Thomas Gleixner <tglx@linutronix.de>
>Cc: Ingo Molnar <mingo@redhat.com>
>Cc: Borislav Petkov <bp@alien8.de>
>Cc: "H. Peter Anvin" <hpa@zytor.com>
>Cc: Dave Hansen <dave.hansen@linux.intel.com>
>Cc: Andy Lutomirski <luto@kernel.org>
>Cc: Peter Zijlstra <peterz@infradead.org>
>Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
>Cc: Len Brown <lenb@kernel.org>
>Cc: linux-kernel@vger.kernel.org
>
>Pingfan Liu (4):
>  acpi: change the topo of acpi_table_upgrade()
>  x86/setup: parse acpi to get hotplug info before init_mem_mapping()
>  x86/mm: set allowed range for memblock allocator
>  x86/mm: remove bottom-up allocation style for x86_64
>
> arch/arm64/kernel/setup.c |   2 +-
> arch/x86/kernel/setup.c   |  17 ++++-
> arch/x86/mm/init.c        | 154 +++++++---------------------------------------
> arch/x86/mm/init_32.c     | 123 ++++++++++++++++++++++++++++++++++++
> arch/x86/mm/mm_internal.h |   7 +++
> drivers/acpi/tables.c     |   4 +-
> include/linux/acpi.h      |   5 +-
> 7 files changed, 172 insertions(+), 140 deletions(-)
>
>-- 
>2.7.4
>
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info
  2019-01-08 10:05   ` Chao Fan
  (?)
@ 2019-01-08 13:27   ` Pingfan Liu
  -1 siblings, 0 replies; 19+ messages in thread
From: Pingfan Liu @ 2019-01-08 13:27 UTC (permalink / raw)
  To: Chao Fan
  Cc: x86, linux-acpi, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Rafael J. Wysocki, Len Brown, linux-kernel, Tejun Heo, yinghai

On Tue, Jan 8, 2019 at 6:06 PM Chao Fan <fanc.fnst@cn.fujitsu.com> wrote:
>
> On Mon, Jan 07, 2019 at 04:24:41PM +0800, Pingfan Liu wrote:
> >Background about the defect of the current bottom-up allocation style, take
> >the following scenario:
> >  |  unmovable node |     movable node                           |
> >     | kaslr-kernel |subtree of pgtable for phy<->virt |
> >
> >Although kaslr-kernel can avoid to stain the movable node. But the
> >pgtable can still stain the movable node. That is a probability problem,
> >with low probability, but still exist. This patch tries to eliminate the
> >probability. With the previous patch, at the point of init_mem_mapping(),
> >memblock allocator can work with the knowledge of acpi memory hotmovable
> >info, and avoid to stain the movable node. As a result,
> >memory_map_bottom_up() is not needed any more.
> >
>
> Hi Pingfan,
>
> Tang Chen ever tried to do this before adding 'movable_node':
> commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
> Author: Tang Chen <tangchen@cn.fujitsu.com>
> Date:   Fri Feb 22 16:33:44 2013 -0800
>
>     acpi, memory-hotplug: parse SRAT before memblock is ready
>
> Then, Lu Yinghai tried to do the similar job, you can see:
> https://lwn.net/Articles/554854/
> for more information. Hope that can help you.
>
Thanks. It is a long thread; as I understand it, Tejun was concerned
that the early parsing of ACPI consumes memory from the memblock
allocator. If that is the concern, it should not happen in my series.
Cc Tejun and Yinghai.

Regards,
Pingfan
> Thanks,
> Chao Fan
>
> >
> >Cc: Thomas Gleixner <tglx@linutronix.de>
> >Cc: Ingo Molnar <mingo@redhat.com>
> >Cc: Borislav Petkov <bp@alien8.de>
> >Cc: "H. Peter Anvin" <hpa@zytor.com>
> >Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >Cc: Andy Lutomirski <luto@kernel.org>
> >Cc: Peter Zijlstra <peterz@infradead.org>
> >Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> >Cc: Len Brown <lenb@kernel.org>
> >Cc: linux-kernel@vger.kernel.org
> >
> >Pingfan Liu (4):
> >  acpi: change the topo of acpi_table_upgrade()
> >  x86/setup: parse acpi to get hotplug info before init_mem_mapping()
> >  x86/mm: set allowed range for memblock allocator
> >  x86/mm: remove bottom-up allocation style for x86_64
> >
> > arch/arm64/kernel/setup.c |   2 +-
> > arch/x86/kernel/setup.c   |  17 ++++-
> > arch/x86/mm/init.c        | 154 +++++++---------------------------------------
> > arch/x86/mm/init_32.c     | 123 ++++++++++++++++++++++++++++++++++++
> > arch/x86/mm/mm_internal.h |   7 +++
> > drivers/acpi/tables.c     |   4 +-
> > include/linux/acpi.h      |   5 +-
> > 7 files changed, 172 insertions(+), 140 deletions(-)
> >
> >--
> >2.7.4
> >
> >
> >
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 4/4] x86/mm: remove bottom-up allocation style for x86_64
  2019-01-08  6:13     ` Pingfan Liu
  2019-01-08  6:37       ` Juergen Gross
@ 2019-01-08 17:32       ` Dave Hansen
  2019-01-09  2:44         ` Pingfan Liu
  1 sibling, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2019-01-08 17:32 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: x86, linux-acpi, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Rafael J. Wysocki, Len Brown, linux-kernel

On 1/7/19 10:13 PM, Pingfan Liu wrote:
> On Tue, Jan 8, 2019 at 1:42 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> Why is this 0x100000 open-coded?  Why is this needed *now*?
>>
> 
> Memory under 1MB should be used by BIOS. For x86_64, after
> e820__memblock_setup(), the memblock allocator has already been ready
> to work. But there are two factors to in order to
> set_alloc_range(0x100000, end). The major one is to be compatible with
> x86_32, please refer to alloc_low_pages->memblock_find_in_range() uses
> [min_pfn_mapped, max_pfn_mapped] to limit the range, which is ready to
> be allocated from. The minor one is to prevent unexpected allocation
> from memblock allocator through allow_low_pages() at very early stage.

Wow, that's a ton of critical information which was neither commented
upon nor referenced in the changelog.  Can you fix this up in the next
version, please?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 4/4] x86/mm: remove bottom-up allocation style for x86_64
  2019-01-08 17:32       ` Dave Hansen
@ 2019-01-09  2:44         ` Pingfan Liu
  0 siblings, 0 replies; 19+ messages in thread
From: Pingfan Liu @ 2019-01-09  2:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-acpi, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Rafael J. Wysocki, Len Brown, linux-kernel

On Wed, Jan 9, 2019 at 1:33 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/7/19 10:13 PM, Pingfan Liu wrote:
> > On Tue, Jan 8, 2019 at 1:42 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >> Why is this 0x100000 open-coded?  Why is this needed *now*?
> >>
> >
> > Memory under 1MB should be used by BIOS. For x86_64, after
> > e820__memblock_setup(), the memblock allocator has already been ready
> > to work. But there are two factors to in order to
> > set_alloc_range(0x100000, end). The major one is to be compatible with
> > x86_32, please refer to alloc_low_pages->memblock_find_in_range() uses
> > [min_pfn_mapped, max_pfn_mapped] to limit the range, which is ready to
> > be allocated from. The minor one is to prevent unexpected allocation
> > from memblock allocator through allow_low_pages() at very early stage.
>
> Wow, that's a ton of critical information which was neither commented
> upon or referenced in the changelog.  Can you fix this up in the next
> version, please?

Sure.

Thanks,
Pingfan

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-01-09  2:44 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-07  8:24 [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info Pingfan Liu
2019-01-07  8:24 ` [RFC PATCH 1/4] acpi: change the topo of acpi_table_upgrade() Pingfan Liu
2019-01-07 10:55   ` Rafael J. Wysocki
2019-01-07  8:24 ` [RFC PATCH 2/4] x86/setup: parse acpi to get hotplug info before init_mem_mapping() Pingfan Liu
2019-01-07 12:52   ` Pingfan Liu
2019-01-07 17:11   ` Dave Hansen
2019-01-08  6:30     ` Pingfan Liu
2019-01-07  8:24 ` [RFC PATCH 3/4] x86/mm: set allowed range for memblock allocator Pingfan Liu
2019-01-07  8:24 ` [RFC PATCH 4/4] x86/mm: remove bottom-up allocation style for x86_64 Pingfan Liu
2019-01-07 17:42   ` Dave Hansen
2019-01-08  6:13     ` Pingfan Liu
2019-01-08  6:37       ` Juergen Gross
2019-01-08 17:32       ` Dave Hansen
2019-01-09  2:44         ` Pingfan Liu
2019-01-07 17:03 ` [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info Dave Hansen
2019-01-08  5:49   ` Pingfan Liu
2019-01-08 10:05 ` Chao Fan
2019-01-08 13:27   ` Pingfan Liu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.