linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHv3 0/2] mm/memblock: reuse memblock bottom-up allocation style
@ 2018-12-28  3:00 Pingfan Liu
  2018-12-28  3:00 ` [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr Pingfan Liu
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Pingfan Liu @ 2018-12-28  3:00 UTC (permalink / raw)
  To: linux-acpi, linux-mm, kexec
  Cc: Pingfan Liu, Tang Chen, Rafael J. Wysocki, Len Brown,
	Andrew Morton, Mike Rapoport, Michal Hocko, Jonathan Corbet,
	Yaowei Bai, Pavel Tatashin, Nicholas Piggin, Naoya Horiguchi,
	Daniel Vacek, Mathieu Malaterre, Stefan Agner, Dave Young,
	Baoquan He, yinghai, vgoyal, linux-kernel

The bottom-up allocation style was introduced to cope with movable_node,
where the lower limit of allocation starts from the kernel's end, because
memory hotplug info is not yet known at that early time.
Besides this original aim, 'kexec -c' prefers to reuse this style to allocate
memory at a lower address, since if the reserved region is beyond 4G, extra
memory (default is 16M) is required for swiotlb. But by the time crashkernel is
reserved, hotplug info has already been parsed, so the lower limit can be
extended to 0, which is what this series does.
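The limit change described above can be sketched as a small userspace model
(illustrative names and a simplified single-free-range memory map; these
helpers are assumptions for illustration, not the kernel API):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of memblock's two search directions. A single free
 * range [free_base, free_end) stands in for the memblock free list.
 */
static uint64_t find_bottom_up(uint64_t start, uint64_t end,
			       uint64_t free_base, uint64_t free_end,
			       uint64_t size)
{
	uint64_t lo = start > free_base ? start : free_base;
	uint64_t hi = end < free_end ? end : free_end;

	return (hi > lo && hi - lo >= size) ? lo : 0;
}

static uint64_t find_top_down(uint64_t start, uint64_t end,
			      uint64_t free_base, uint64_t free_end,
			      uint64_t size)
{
	uint64_t lo = start > free_base ? start : free_base;
	uint64_t hi = end < free_end ? end : free_end;

	return (hi > lo && hi - lo >= size) ? hi - size : 0;
}

/*
 * Before hotplug info is parsed, bottom-up never dips below kernel_end;
 * once it is parsed (the change this series makes), the floor drops to
 * @start. On bottom-up failure, fall back to top-down as the kernel does.
 */
static uint64_t model_find(int bottom_up, int hotplug_parsed,
			   uint64_t kernel_end,
			   uint64_t start, uint64_t end,
			   uint64_t free_base, uint64_t free_end,
			   uint64_t size)
{
	if (bottom_up) {
		uint64_t floor = start;
		uint64_t ret;

		if (!hotplug_parsed && floor < kernel_end)
			floor = kernel_end;
		ret = find_bottom_up(floor, end, free_base, free_end, size);
		if (ret)
			return ret;
	}
	return find_top_down(start, end, free_base, free_end, size);
}
```

With a free range [1M, 256M) and kernel_end at 64M, bottom-up returns 64M
before hotplug info is parsed and 1M afterwards, while top-down returns the
highest fit.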

Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Daniel Vacek <neelx@redhat.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: yinghai@kernel.org
Cc: vgoyal@redhat.com
Cc: linux-kernel@vger.kernel.org

Pingfan Liu (2):
  mm/memblock: extend the limit inferior of bottom-up after parsing
    hotplug attr
  x86/kdump: bugfix, make the behavior of crashkernel=X consistent with
    kaslr

 arch/x86/kernel/setup.c  |  9 +++++---
 drivers/acpi/numa.c      |  4 ++++
 include/linux/memblock.h |  1 +
 mm/memblock.c            | 58 +++++++++++++++++++++++++++++-------------------
 4 files changed, 46 insertions(+), 26 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr
  2018-12-28  3:00 [PATCHv3 0/2] mm/memblock: reuse memblock bottom-up allocation style Pingfan Liu
@ 2018-12-28  3:00 ` Pingfan Liu
  2018-12-31  8:40   ` Mike Rapoport
  2018-12-28  3:00 ` [PATCHv3 2/2] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr Pingfan Liu
  2018-12-28  3:39 ` [PATCHv3 0/2] mm/memblock: reuse memblock bottom-up allocation style Baoquan He
  2 siblings, 1 reply; 20+ messages in thread
From: Pingfan Liu @ 2018-12-28  3:00 UTC (permalink / raw)
  To: linux-acpi, linux-mm, kexec
  Cc: Pingfan Liu, Tang Chen, Rafael J. Wysocki, Len Brown,
	Andrew Morton, Mike Rapoport, Michal Hocko, Jonathan Corbet,
	Yaowei Bai, Pavel Tatashin, Nicholas Piggin, Naoya Horiguchi,
	Daniel Vacek, Mathieu Malaterre, Stefan Agner, Dave Young,
	Baoquan He, yinghai, vgoyal, linux-kernel

The bottom-up allocation style was introduced to cope with movable_node,
where the lower limit of allocation starts from the kernel's end, because
memory hotplug info is not yet known at that early time. But once hotplug
info has been parsed later, the lower limit can be extended to 0.
'kexec -c' prefers to reuse this style to allocate memory at a lower address,
since if the reserved region is beyond 4G, extra memory (default is 16M) is
required for swiotlb.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Daniel Vacek <neelx@redhat.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: yinghai@kernel.org
Cc: vgoyal@redhat.com
Cc: linux-kernel@vger.kernel.org
---
 drivers/acpi/numa.c      |  4 ++++
 include/linux/memblock.h |  1 +
 mm/memblock.c            | 58 +++++++++++++++++++++++++++++-------------------
 3 files changed, 40 insertions(+), 23 deletions(-)

diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 2746994..3eea4e4 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -462,6 +462,10 @@ int __init acpi_numa_init(void)
 
 		cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
 					    acpi_parse_memory_affinity, 0);
+
+#if defined(CONFIG_X86) || defined(CONFIG_ARM64)
+		mark_mem_hotplug_parsed();
+#endif
 	}
 
 	/* SLIT: System Locality Information Table */
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index aee299a..d89ed9e 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -125,6 +125,7 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
 bool memblock_overlaps_region(struct memblock_type *type,
 			      phys_addr_t base, phys_addr_t size);
+void mark_mem_hotplug_parsed(void);
 int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
diff --git a/mm/memblock.c b/mm/memblock.c
index 81ae63c..a3f5e46 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -231,6 +231,12 @@ __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
 	return 0;
 }
 
+static bool mem_hotmovable_parsed __initdata_memblock;
+void __init_memblock mark_mem_hotplug_parsed(void)
+{
+	mem_hotmovable_parsed = true;
+}
+
 /**
  * memblock_find_in_range_node - find free area in given range and node
  * @size: size of free area to find
@@ -259,7 +265,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
 					phys_addr_t end, int nid,
 					enum memblock_flags flags)
 {
-	phys_addr_t kernel_end, ret;
+	phys_addr_t kernel_end, ret = 0;
 
 	/* pump up @end */
 	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
@@ -270,34 +276,40 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
 	end = max(start, end);
 	kernel_end = __pa_symbol(_end);
 
-	/*
-	 * try bottom-up allocation only when bottom-up mode
-	 * is set and @end is above the kernel image.
-	 */
-	if (memblock_bottom_up() && end > kernel_end) {
-		phys_addr_t bottom_up_start;
+	if (memblock_bottom_up()) {
+		phys_addr_t bottom_up_start = start;
 
-		/* make sure we will allocate above the kernel */
-		bottom_up_start = max(start, kernel_end);
-
-		/* ok, try bottom-up allocation first */
-		ret = __memblock_find_range_bottom_up(bottom_up_start, end,
-						      size, align, nid, flags);
-		if (ret)
+		if (mem_hotmovable_parsed) {
+			ret = __memblock_find_range_bottom_up(
+				bottom_up_start, end, size, align, nid,
+				flags);
 			return ret;
 
 		/*
-		 * we always limit bottom-up allocation above the kernel,
-		 * but top-down allocation doesn't have the limit, so
-		 * retrying top-down allocation may succeed when bottom-up
-		 * allocation failed.
-		 *
-		 * bottom-up allocation is expected to be fail very rarely,
-		 * so we use WARN_ONCE() here to see the stack trace if
-		 * fail happens.
+		 * if mem hotplug info is not parsed yet, try bottom-up
+		 * allocation with @end above the kernel image.
 		 */
-		WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE),
+		} else if (!mem_hotmovable_parsed && end > kernel_end) {
+			/* make sure we will allocate above the kernel */
+			bottom_up_start = max(start, kernel_end);
+			ret = __memblock_find_range_bottom_up(
+				bottom_up_start, end, size, align, nid,
+				flags);
+			if (ret)
+				return ret;
+			/*
+			 * we always limit bottom-up allocation above the
+			 * kernel, but top-down allocation doesn't have
+			 * the limit, so retrying top-down allocation may
+			 * succeed when bottom-up allocation failed.
+			 *
+			 * bottom-up allocation is expected to be fail
+			 * very rarely, so we use WARN_ONCE() here to see
+			 * the stack trace if fail happens.
+			 */
+			WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE),
 			  "memblock: bottom-up allocation failed, memory hotremove may be affected\n");
+		}
 	}
 
 	return __memblock_find_range_top_down(start, end, size, align, nid,
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCHv3 2/2] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr
  2018-12-28  3:00 [PATCHv3 0/2] mm/memblock: reuse memblock bottom-up allocation style Pingfan Liu
  2018-12-28  3:00 ` [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr Pingfan Liu
@ 2018-12-28  3:00 ` Pingfan Liu
  2018-12-31  8:46   ` Mike Rapoport
  2018-12-28  3:39 ` [PATCHv3 0/2] mm/memblock: reuse memblock bottom-up allocation style Baoquan He
  2 siblings, 1 reply; 20+ messages in thread
From: Pingfan Liu @ 2018-12-28  3:00 UTC (permalink / raw)
  To: linux-acpi, linux-mm, kexec
  Cc: Pingfan Liu, Tang Chen, Rafael J. Wysocki, Len Brown,
	Andrew Morton, Mike Rapoport, Michal Hocko, Jonathan Corbet,
	Yaowei Bai, Pavel Tatashin, Nicholas Piggin, Naoya Horiguchi,
	Daniel Vacek, Mathieu Malaterre, Stefan Agner, Dave Young,
	Baoquan He, yinghai, vgoyal, linux-kernel

A customer reported a bug on a high-end server with many PCIe devices,
booting with crashkernel=384M and KASLR enabled. Even though plenty of
memory was still visible under 896 MB, the search for a region failed
intermittently. Currently we can only find a region under 896 MB if
',high' is not specified. KASLR then breaks the area under 896 MB into
several parts at random, and the crashkernel reservation needs to be
aligned to 128 MB, which is why the failure occurs. This confuses end
users: sometimes crashkernel=X works and sometimes it fails.
To make it succeed, the customer can change the kernel option to
"crashkernel=384M,high". But this gives "crashkernel=xx@yy" very limited
room to behave, even though its grammar looks more generic.
And we cannot confidently answer the questions the customer raises:
1) why reserving under 896 MB does not succeed;
2) what is wrong with the memory region under 4G;
3) why ',high' has to be added when only 384 MB is required, not 3840 MB.

This patch simplifies the method suggested in the mail [1]. It simply
searches bottom-up for a candidate crashkernel region. Bottom-up is more
compatible with the old reservation style, i.e. it still tries to get a
memory region below 896 MB first, then in [896 MB, 4G], and finally above 4G.

There is one trivial point about compatibility with old kexec-tools: if the
reserved region is above 896M, the old tool will fail to load bzImage. But
without this patch the old tool fails anyway, since no memory below 896M
could be reserved for crashkernel.

[1]: http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
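The intermittent failure can be reproduced with a small arithmetic model
(hypothetical best_fit_below() helper and illustrative sizes; the real code
walks the memblock map, this only shows the alignment math):

```c
#include <assert.h>
#include <stdint.h>

#define MB (1024ULL * 1024ULL)

/*
 * Largest `align`-aligned window of `want` bytes below `limit`, once
 * KASLR has placed the kernel at [kbase, kbase + ksize). The two
 * candidate gaps are the ranges below and above the kernel image.
 */
static uint64_t best_fit_below(uint64_t limit, uint64_t kbase, uint64_t ksize,
			       uint64_t want, uint64_t align)
{
	uint64_t gap[2][2] = {
		{ align, kbase },		/* below the kernel */
		{ kbase + ksize, limit },	/* above the kernel */
	};
	int i;

	for (i = 0; i < 2; i++) {
		uint64_t lo = (gap[i][0] + align - 1) & ~(align - 1);

		if (gap[i][1] > lo && gap[i][1] - lo >= want)
			return lo;
	}
	return 0;	/* no aligned window large enough */
}
```

A kernel placed at 640M leaves an aligned 384M window at 128M, but a
placement near the middle of the 896M area can leave no window at all,
matching the "sometimes works, sometimes fails" symptom.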
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Daniel Vacek <neelx@redhat.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: yinghai@kernel.org
Cc: vgoyal@redhat.com
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/setup.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d494b9b..165f9c3 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -541,15 +541,18 @@ static void __init reserve_crashkernel(void)
 
 	/* 0 means: find the address automatically */
 	if (crash_base <= 0) {
+		bool bottom_up = memblock_bottom_up();
+
+		memblock_set_bottom_up(true);
 		/*
 		 * Set CRASH_ADDR_LOW_MAX upper bound for crash memory,
 		 * as old kexec-tools loads bzImage below that, unless
 		 * "crashkernel=size[KMG],high" is specified.
 		 */
 		crash_base = memblock_find_in_range(CRASH_ALIGN,
-						    high ? CRASH_ADDR_HIGH_MAX
-							 : CRASH_ADDR_LOW_MAX,
-						    crash_size, CRASH_ALIGN);
+			(max_pfn * PAGE_SIZE), crash_size, CRASH_ALIGN);
+		memblock_set_bottom_up(bottom_up);
+
 		if (!crash_base) {
 			pr_info("crashkernel reservation failed - No suitable area found.\n");
 			return;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 0/2] mm/memblock: reuse memblock bottom-up allocation style
  2018-12-28  3:00 [PATCHv3 0/2] mm/memblock: reuse memblock bottom-up allocation style Pingfan Liu
  2018-12-28  3:00 ` [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr Pingfan Liu
  2018-12-28  3:00 ` [PATCHv3 2/2] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr Pingfan Liu
@ 2018-12-28  3:39 ` Baoquan He
  2 siblings, 0 replies; 20+ messages in thread
From: Baoquan He @ 2018-12-28  3:39 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: linux-acpi, linux-mm, kexec, Tang Chen, Rafael J. Wysocki,
	Len Brown, Andrew Morton, Mike Rapoport, Michal Hocko,
	Jonathan Corbet, Yaowei Bai, Pavel Tatashin, Nicholas Piggin,
	Naoya Horiguchi, Daniel Vacek, Mathieu Malaterre, Stefan Agner,
	Dave Young, yinghai, vgoyal, linux-kernel

On 12/28/18 at 11:00am, Pingfan Liu wrote:
> The bottom-up allocation style is introduced to cope with movable_node,
> where the limit inferior of allocation starts from kernel's end, due to
> lack of knowledge of memory hotplug info at this early time.
> Beside this original aim, 'kexec -c' prefers to reuse this style to alloc mem

Wondering what is 'kexec -c'.

> at lower address, since if the reserved region is beyond 4G, then it requires
> extra mem (default is 16M) for swiotlb. But at this time hotplug info has been

The default is 256M, not sure if we are talking about the same thing.

low_size = max(swiotlb_size_or_default() + (8UL << 20), 256UL << 20);

> got, the limit inferior can be extend to 0, which is done by this series
> 
> Cc: Tang Chen <tangchen@cn.fujitsu.com>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Len Brown <lenb@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Daniel Vacek <neelx@redhat.com>
> Cc: Mathieu Malaterre <malat@debian.org>
> Cc: Stefan Agner <stefan@agner.ch>
> Cc: Dave Young <dyoung@redhat.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: yinghai@kernel.org,
> Cc: vgoyal@redhat.com
> Cc: linux-kernel@vger.kernel.org
> 
> Pingfan Liu (2):
>   mm/memblock: extend the limit inferior of bottom-up after parsing
>     hotplug attr
>   x86/kdump: bugfix, make the behavior of crashkernel=X consistent with
>     kaslr
> 
>  arch/x86/kernel/setup.c  |  9 +++++---
>  drivers/acpi/numa.c      |  4 ++++
>  include/linux/memblock.h |  1 +
>  mm/memblock.c            | 58 +++++++++++++++++++++++++++++-------------------
>  4 files changed, 46 insertions(+), 26 deletions(-)
> 
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr
  2018-12-28  3:00 ` [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr Pingfan Liu
@ 2018-12-31  8:40   ` Mike Rapoport
  2019-01-02  6:47     ` Pingfan Liu
  0 siblings, 1 reply; 20+ messages in thread
From: Mike Rapoport @ 2018-12-31  8:40 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: linux-acpi, linux-mm, kexec, Tang Chen, Rafael J. Wysocki,
	Len Brown, Andrew Morton, Mike Rapoport, Michal Hocko,
	Jonathan Corbet, Yaowei Bai, Pavel Tatashin, Nicholas Piggin,
	Naoya Horiguchi, Daniel Vacek, Mathieu Malaterre, Stefan Agner,
	Dave Young, Baoquan He, yinghai, vgoyal, linux-kernel

On Fri, Dec 28, 2018 at 11:00:01AM +0800, Pingfan Liu wrote:
> The bottom-up allocation style is introduced to cope with movable_node,
> where the limit inferior of allocation starts from kernel's end, due to
> lack of knowledge of memory hotplug info at this early time. But if later,
> hotplug info has been got, the limit inferior can be extend to 0.
> 'kexec -c' prefers to reuse this style to alloc mem at lower address,
> since if the reserved region is beyond 4G, then it requires extra mem
> (default is 16M) for swiotlb.

I fail to understand why the availability of memory hotplug information
would allow to extend the lower limit of bottom-up memblock allocations
below the kernel. The memory in the physical range [0, kernel_start) can be
allocated as soon as the kernel memory is reserved.

The extents of the memory node hosting the kernel image can be used to
limit memblok allocations from that particular node, even in top-down mode.
 
> Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> Cc: Tang Chen <tangchen@cn.fujitsu.com>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Len Brown <lenb@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Daniel Vacek <neelx@redhat.com>
> Cc: Mathieu Malaterre <malat@debian.org>
> Cc: Stefan Agner <stefan@agner.ch>
> Cc: Dave Young <dyoung@redhat.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: yinghai@kernel.org,
> Cc: vgoyal@redhat.com
> Cc: linux-kernel@vger.kernel.org
> ---
>  drivers/acpi/numa.c      |  4 ++++
>  include/linux/memblock.h |  1 +
>  mm/memblock.c            | 58 +++++++++++++++++++++++++++++-------------------
>  3 files changed, 40 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> index 2746994..3eea4e4 100644
> --- a/drivers/acpi/numa.c
> +++ b/drivers/acpi/numa.c
> @@ -462,6 +462,10 @@ int __init acpi_numa_init(void)
> 
>  		cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
>  					    acpi_parse_memory_affinity, 0);
> +
> +#if defined(CONFIG_X86) || defined(CONFIG_ARM64)
> +		mark_mem_hotplug_parsed();
> +#endif
>  	}
> 
>  	/* SLIT: System Locality Information Table */
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index aee299a..d89ed9e 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -125,6 +125,7 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
>  void memblock_trim_memory(phys_addr_t align);
>  bool memblock_overlaps_region(struct memblock_type *type,
>  			      phys_addr_t base, phys_addr_t size);
> +void mark_mem_hotplug_parsed(void);
>  int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
>  int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
>  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 81ae63c..a3f5e46 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -231,6 +231,12 @@ __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
>  	return 0;
>  }
> 
> +static bool mem_hotmovable_parsed __initdata_memblock;
> +void __init_memblock mark_mem_hotplug_parsed(void)
> +{
> +	mem_hotmovable_parsed = true;
> +}
> +
>  /**
>   * memblock_find_in_range_node - find free area in given range and node
>   * @size: size of free area to find
> @@ -259,7 +265,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
>  					phys_addr_t end, int nid,
>  					enum memblock_flags flags)
>  {
> -	phys_addr_t kernel_end, ret;
> +	phys_addr_t kernel_end, ret = 0;
> 
>  	/* pump up @end */
>  	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
> @@ -270,34 +276,40 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
>  	end = max(start, end);
>  	kernel_end = __pa_symbol(_end);
> 
> -	/*
> -	 * try bottom-up allocation only when bottom-up mode
> -	 * is set and @end is above the kernel image.
> -	 */
> -	if (memblock_bottom_up() && end > kernel_end) {
> -		phys_addr_t bottom_up_start;
> +	if (memblock_bottom_up()) {
> +		phys_addr_t bottom_up_start = start;
> 
> -		/* make sure we will allocate above the kernel */
> -		bottom_up_start = max(start, kernel_end);
> -
> -		/* ok, try bottom-up allocation first */
> -		ret = __memblock_find_range_bottom_up(bottom_up_start, end,
> -						      size, align, nid, flags);
> -		if (ret)
> +		if (mem_hotmovable_parsed) {
> +			ret = __memblock_find_range_bottom_up(
> +				bottom_up_start, end, size, align, nid,
> +				flags);
>  			return ret;
> 
>  		/*
> -		 * we always limit bottom-up allocation above the kernel,
> -		 * but top-down allocation doesn't have the limit, so
> -		 * retrying top-down allocation may succeed when bottom-up
> -		 * allocation failed.
> -		 *
> -		 * bottom-up allocation is expected to be fail very rarely,
> -		 * so we use WARN_ONCE() here to see the stack trace if
> -		 * fail happens.
> +		 * if mem hotplug info is not parsed yet, try bottom-up
> +		 * allocation with @end above the kernel image.
>  		 */
> -		WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE),
> +		} else if (!mem_hotmovable_parsed && end > kernel_end) {
> +			/* make sure we will allocate above the kernel */
> +			bottom_up_start = max(start, kernel_end);
> +			ret = __memblock_find_range_bottom_up(
> +				bottom_up_start, end, size, align, nid,
> +				flags);
> +			if (ret)
> +				return ret;
> +			/*
> +			 * we always limit bottom-up allocation above the
> +			 * kernel, but top-down allocation doesn't have
> +			 * the limit, so retrying top-down allocation may
> +			 * succeed when bottom-up allocation failed.
> +			 *
> +			 * bottom-up allocation is expected to be fail
> +			 * very rarely, so we use WARN_ONCE() here to see
> +			 * the stack trace if fail happens.
> +			 */
> +			WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE),
>  			  "memblock: bottom-up allocation failed, memory hotremove may be affected\n");
> +		}
>  	}
> 
>  	return __memblock_find_range_top_down(start, end, size, align, nid,
> -- 
> 2.7.4
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 2/2] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr
  2018-12-28  3:00 ` [PATCHv3 2/2] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr Pingfan Liu
@ 2018-12-31  8:46   ` Mike Rapoport
  2019-01-02  6:47     ` Pingfan Liu
  0 siblings, 1 reply; 20+ messages in thread
From: Mike Rapoport @ 2018-12-31  8:46 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: linux-acpi, linux-mm, kexec, Tang Chen, Rafael J. Wysocki,
	Len Brown, Andrew Morton, Mike Rapoport, Michal Hocko,
	Jonathan Corbet, Yaowei Bai, Pavel Tatashin, Nicholas Piggin,
	Naoya Horiguchi, Daniel Vacek, Mathieu Malaterre, Stefan Agner,
	Dave Young, Baoquan He, yinghai, vgoyal, linux-kernel

On Fri, Dec 28, 2018 at 11:00:02AM +0800, Pingfan Liu wrote:
> Customer reported a bug on a high end server with many pcie devices, where
> kernel bootup with crashkernel=384M, and kaslr is enabled. Even
> though we still see much memory under 896 MB, the finding still failed
> intermittently. Because currently we can only find region under 896 MB,
> if w/0 ',high' specified. Then KASLR breaks 896 MB into several parts
> randomly, and crashkernel reservation need be aligned to 128 MB, that's
> why failure is found. It raises confusion to the end user that sometimes
> crashkernel=X works while sometimes fails.
> If want to make it succeed, customer can change kernel option to
> "crashkernel=384M, high". Just this give "crashkernel=xx@yy" a very
> limited space to behave even though its grammer looks more generic.
> And we can't answer questions raised from customer that confidently:
> 1) why it doesn't succeed to reserve 896 MB;
> 2) what's wrong with memory region under 4G;
> 3) why I have to add ',high', I only require 384 MB, not 3840 MB.
> 
> This patch simplifies the method suggested in the mail [1]. It just goes
> bottom-up to find a candidate region for crashkernel. The bottom-up may be
> better compatible with the old reservation style, i.e. still want to get
> memory region from 896 MB firstly, then [896 MB, 4G], finally above 4G.
> 
> There is one trivial thing about the compatibility with old kexec-tools:
> if the reserved region is above 896M, then old tool will fail to load
> bzImage. But without this patch, the old tool also fail since there is no
> memory below 896M can be reserved for crashkernel.
> 
> [1]: http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> Cc: Tang Chen <tangchen@cn.fujitsu.com>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Len Brown <lenb@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Daniel Vacek <neelx@redhat.com>
> Cc: Mathieu Malaterre <malat@debian.org>
> Cc: Stefan Agner <stefan@agner.ch>
> Cc: Dave Young <dyoung@redhat.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: yinghai@kernel.org,
> Cc: vgoyal@redhat.com
> Cc: linux-kernel@vger.kernel.org
> ---
>  arch/x86/kernel/setup.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index d494b9b..165f9c3 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -541,15 +541,18 @@ static void __init reserve_crashkernel(void)
> 
>  	/* 0 means: find the address automatically */
>  	if (crash_base <= 0) {
> +		bool bottom_up = memblock_bottom_up();
> +
> +		memblock_set_bottom_up(true);
>
>  		/*
>  		 * Set CRASH_ADDR_LOW_MAX upper bound for crash memory,
>  		 * as old kexec-tools loads bzImage below that, unless
>  		 * "crashkernel=size[KMG],high" is specified.
>  		 */
>  		crash_base = memblock_find_in_range(CRASH_ALIGN,
> -						    high ? CRASH_ADDR_HIGH_MAX
> -							 : CRASH_ADDR_LOW_MAX,
> -						    crash_size, CRASH_ALIGN);
> +			(max_pfn * PAGE_SIZE), crash_size, CRASH_ALIGN);
> +		memblock_set_bottom_up(bottom_up);

Using bottom-up does not guarantee that the allocation won't fall into a
removable memory, it only makes it highly probable.

I think that the 'max_pfn * PAGE_SIZE' limit should be replaced with the
end of the non-removable memory node.
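A self-contained sketch of that suggestion (hypothetical nonremovable_limit()
helper and demo memory map, not kernel code): walk the ordered map and stop
at the first removable range, so the returned limit covers only
non-removable memory.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define RGN_HOTPLUG 0x1

struct rgn {
	uint64_t base, size;
	unsigned int flags;
};

/* End of the leading run of non-removable ranges, 0 if there is none. */
static uint64_t nonremovable_limit(const struct rgn *r, size_t n)
{
	uint64_t limit = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (r[i].flags & RGN_HOTPLUG)
			break;		/* first removable range: stop */
		limit = r[i].base + r[i].size;
	}
	return limit;
}

static const struct rgn demo[] = {
	{ 0,             0x80000000ULL, 0 },		/* not removable */
	{ 0x80000000ULL, 0x80000000ULL, RGN_HOTPLUG },	/* hotpluggable */
};

static const struct rgn demo_all_removable[] = {
	{ 0, 0x1000, RGN_HOTPLUG },
};
```

Using such a limit instead of max_pfn * PAGE_SIZE would make the upper
bound itself exclude removable memory, rather than relying on the search
direction.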

> +
>  		if (!crash_base) {
>  			pr_info("crashkernel reservation failed - No suitable area found.\n");
>  			return;
> -- 
> 2.7.4
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr
  2018-12-31  8:40   ` Mike Rapoport
@ 2019-01-02  6:47     ` Pingfan Liu
  2019-01-02  9:27       ` Mike Rapoport
  0 siblings, 1 reply; 20+ messages in thread
From: Pingfan Liu @ 2019-01-02  6:47 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-acpi, linux-mm, kexec, Tang Chen, Rafael J. Wysocki,
	Len Brown, Andrew Morton, Mike Rapoport, Michal Hocko,
	Jonathan Corbet, Yaowei Bai, Pavel Tatashin, Nicholas Piggin,
	Naoya Horiguchi, Daniel Vacek, Mathieu Malaterre, Stefan Agner,
	Dave Young, Baoquan He, yinghai, vgoyal, linux-kernel

On Mon, Dec 31, 2018 at 4:40 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Fri, Dec 28, 2018 at 11:00:01AM +0800, Pingfan Liu wrote:
> > The bottom-up allocation style is introduced to cope with movable_node,
> > where the limit inferior of allocation starts from kernel's end, due to
> > lack of knowledge of memory hotplug info at this early time. But if later,
> > hotplug info has been got, the limit inferior can be extend to 0.
> > 'kexec -c' prefers to reuse this style to alloc mem at lower address,
> > since if the reserved region is beyond 4G, then it requires extra mem
> > (default is 16M) for swiotlb.
>
> I fail to understand why the availability of memory hotplug information
> would allow to extend the lower limit of bottom-up memblock allocations
> below the kernel. The memory in the physical range [0, kernel_start) can be
> allocated as soon as the kernel memory is reserved.
>
Yes, [0, kernel_start) can be allocated at this time by some functions,
e.g. memblock_reserve(). But there is a trick. For functions like
memblock_find_in_range(), there is hotplug attribute checking: it checks
the hotmovable attribute in __next_mem_range()
{
	if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
		continue;
}. So the movable memory can be safely skipped.
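That skip can be modeled in a few lines (illustrative struct and helper,
the real check lives in __next_mem_range()):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define MODEL_HOTPLUG 0x1

struct model_region {
	uint64_t base, size;
	unsigned int flags;
};

/*
 * When movable_node is enabled, region iteration passes over ranges
 * marked hotpluggable, so a bottom-up search starting at 0 still
 * cannot land in movable memory.
 */
static const struct model_region *
first_usable(const struct model_region *r, size_t n, int movable_node)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (movable_node && (r[i].flags & MODEL_HOTPLUG))
			continue;	/* mirrors the __next_mem_range() skip */
		return &r[i];
	}
	return NULL;
}

static const struct model_region model_regs[] = {
	{ 0x0,       0x4000000, MODEL_HOTPLUG },	/* hotpluggable */
	{ 0x4000000, 0x4000000, 0 },			/* regular */
};
```

With movable_node on, the first usable region starts at 64M; with it off,
the hotpluggable range at 0 is returned.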

Thanks for your kind review.

Regards,
Pingfan

> The extents of the memory node hosting the kernel image can be used to
> limit memblok allocations from that particular node, even in top-down mode.
>
> > Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> > Cc: Tang Chen <tangchen@cn.fujitsu.com>
> > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > Cc: Len Brown <lenb@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Jonathan Corbet <corbet@lwn.net>
> > Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> > Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
> > Cc: Nicholas Piggin <npiggin@gmail.com>
> > Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: Daniel Vacek <neelx@redhat.com>
> > Cc: Mathieu Malaterre <malat@debian.org>
> > Cc: Stefan Agner <stefan@agner.ch>
> > Cc: Dave Young <dyoung@redhat.com>
> > Cc: Baoquan He <bhe@redhat.com>
> > Cc: yinghai@kernel.org,
> > Cc: vgoyal@redhat.com
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >  drivers/acpi/numa.c      |  4 ++++
> >  include/linux/memblock.h |  1 +
> >  mm/memblock.c            | 58 +++++++++++++++++++++++++++++-------------------
> >  3 files changed, 40 insertions(+), 23 deletions(-)
> >
> > diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> > index 2746994..3eea4e4 100644
> > --- a/drivers/acpi/numa.c
> > +++ b/drivers/acpi/numa.c
> > @@ -462,6 +462,10 @@ int __init acpi_numa_init(void)
> >
> >               cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
> >                                           acpi_parse_memory_affinity, 0);
> > +
> > +#if defined(CONFIG_X86) || defined(CONFIG_ARM64)
> > +             mark_mem_hotplug_parsed();
> > +#endif
> >       }
> >
> >       /* SLIT: System Locality Information Table */
> > diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> > index aee299a..d89ed9e 100644
> > --- a/include/linux/memblock.h
> > +++ b/include/linux/memblock.h
> > @@ -125,6 +125,7 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
> >  void memblock_trim_memory(phys_addr_t align);
> >  bool memblock_overlaps_region(struct memblock_type *type,
> >                             phys_addr_t base, phys_addr_t size);
> > +void mark_mem_hotplug_parsed(void);
> >  int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
> >  int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
> >  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 81ae63c..a3f5e46 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -231,6 +231,12 @@ __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
> >       return 0;
> >  }
> >
> > +static bool mem_hotmovable_parsed __initdata_memblock;
> > +void __init_memblock mark_mem_hotplug_parsed(void)
> > +{
> > +     mem_hotmovable_parsed = true;
> > +}
> > +
> >  /**
> >   * memblock_find_in_range_node - find free area in given range and node
> >   * @size: size of free area to find
> > @@ -259,7 +265,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
> >                                       phys_addr_t end, int nid,
> >                                       enum memblock_flags flags)
> >  {
> > -     phys_addr_t kernel_end, ret;
> > +     phys_addr_t kernel_end, ret = 0;
> >
> >       /* pump up @end */
> >       if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
> > @@ -270,34 +276,40 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
> >       end = max(start, end);
> >       kernel_end = __pa_symbol(_end);
> >
> > -     /*
> > -      * try bottom-up allocation only when bottom-up mode
> > -      * is set and @end is above the kernel image.
> > -      */
> > -     if (memblock_bottom_up() && end > kernel_end) {
> > -             phys_addr_t bottom_up_start;
> > +     if (memblock_bottom_up()) {
> > +             phys_addr_t bottom_up_start = start;
> >
> > -             /* make sure we will allocate above the kernel */
> > -             bottom_up_start = max(start, kernel_end);
> > -
> > -             /* ok, try bottom-up allocation first */
> > -             ret = __memblock_find_range_bottom_up(bottom_up_start, end,
> > -                                                   size, align, nid, flags);
> > -             if (ret)
> > +             if (mem_hotmovable_parsed) {
> > +                     ret = __memblock_find_range_bottom_up(
> > +                             bottom_up_start, end, size, align, nid,
> > +                             flags);
> >                       return ret;
> >
> >               /*
> > -              * we always limit bottom-up allocation above the kernel,
> > -              * but top-down allocation doesn't have the limit, so
> > -              * retrying top-down allocation may succeed when bottom-up
> > -              * allocation failed.
> > -              *
> > -              * bottom-up allocation is expected to be fail very rarely,
> > -              * so we use WARN_ONCE() here to see the stack trace if
> > -              * fail happens.
> > +              * if mem hotplug info is not parsed yet, try bottom-up
> > +              * allocation with @end above the kernel image.
> >                */
> > -             WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE),
> > +             } else if (!mem_hotmovable_parsed && end > kernel_end) {
> > +                     /* make sure we will allocate above the kernel */
> > +                     bottom_up_start = max(start, kernel_end);
> > +                     ret = __memblock_find_range_bottom_up(
> > +                             bottom_up_start, end, size, align, nid,
> > +                             flags);
> > +                     if (ret)
> > +                             return ret;
> > +                     /*
> > +                      * we always limit bottom-up allocation above the
> > +                      * kernel, but top-down allocation doesn't have
> > +                      * the limit, so retrying top-down allocation may
> > +                      * succeed when bottom-up allocation failed.
> > +                      *
> > +                      * bottom-up allocation is expected to be fail
> > +                      * very rarely, so we use WARN_ONCE() here to see
> > +                      * the stack trace if fail happens.
> > +                      */
> > +                     WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE),
> >                         "memblock: bottom-up allocation failed, memory hotremove may be affected\n");
> > +             }
> >       }
> >
> >       return __memblock_find_range_top_down(start, end, size, align, nid,
> > --
> > 2.7.4
> >
>
> --
> Sincerely yours,
> Mike.
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 2/2] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr
  2018-12-31  8:46   ` Mike Rapoport
@ 2019-01-02  6:47     ` Pingfan Liu
  2019-01-02  9:28       ` Mike Rapoport
  0 siblings, 1 reply; 20+ messages in thread
From: Pingfan Liu @ 2019-01-02  6:47 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-acpi, linux-mm, kexec, Tang Chen, Rafael J. Wysocki,
	Len Brown, Andrew Morton, Mike Rapoport, Michal Hocko,
	Jonathan Corbet, Yaowei Bai, Pavel Tatashin, Nicholas Piggin,
	Naoya Horiguchi, Daniel Vacek, Mathieu Malaterre, Stefan Agner,
	Dave Young, Baoquan He, yinghai, vgoyal, linux-kernel

On Mon, Dec 31, 2018 at 4:46 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Fri, Dec 28, 2018 at 11:00:02AM +0800, Pingfan Liu wrote:
> > A customer reported a bug on a high-end server with many PCIe devices, where
> > the kernel boots with crashkernel=384M and KASLR enabled. Even
> > though there is still plenty of memory under 896 MB, the search still
> > fails intermittently, because currently we can only find a region under
> > 896 MB if ',high' is not specified. KASLR breaks the area under 896 MB
> > into several parts randomly, and the crashkernel reservation needs to be
> > aligned to 128 MB; that is why the failure occurs. It confuses the end
> > user that crashkernel=X sometimes works and sometimes fails.
> > To make it succeed, the customer can change the kernel option to
> > "crashkernel=384M,high". But this gives "crashkernel=xx@yy" a very
> > limited space to behave, even though its grammar looks more generic.
> > And we cannot confidently answer the questions the customer raises:
> > 1) why does it fail to reserve under 896 MB;
> > 2) what is wrong with the memory region under 4G;
> > 3) why do I have to add ',high' when I only require 384 MB, not 3840 MB.
> >
> > This patch simplifies the method suggested in the mail [1]. It simply goes
> > bottom-up to find a candidate region for the crashkernel. Bottom-up is
> > also more compatible with the old reservation style, i.e. it still tries
> > to get a memory region under 896 MB first, then in [896 MB, 4G], and
> > finally above 4G.
> >
> > There is one trivial point about compatibility with old kexec-tools:
> > if the reserved region is above 896M, the old tool will fail to load
> > bzImage. But without this patch the old tool fails anyway, since no
> > memory below 896M can be reserved for the crashkernel.
> >
> > [1]: http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> > Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> > Cc: Tang Chen <tangchen@cn.fujitsu.com>
> > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > Cc: Len Brown <lenb@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Jonathan Corbet <corbet@lwn.net>
> > Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> > Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
> > Cc: Nicholas Piggin <npiggin@gmail.com>
> > Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: Daniel Vacek <neelx@redhat.com>
> > Cc: Mathieu Malaterre <malat@debian.org>
> > Cc: Stefan Agner <stefan@agner.ch>
> > Cc: Dave Young <dyoung@redhat.com>
> > Cc: Baoquan He <bhe@redhat.com>
> > Cc: yinghai@kernel.org,
> > Cc: vgoyal@redhat.com
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >  arch/x86/kernel/setup.c | 9 ++++++---
> >  1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index d494b9b..165f9c3 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -541,15 +541,18 @@ static void __init reserve_crashkernel(void)
> >
> >       /* 0 means: find the address automatically */
> >       if (crash_base <= 0) {
> > +             bool bottom_up = memblock_bottom_up();
> > +
> > +             memblock_set_bottom_up(true);
> >
> >               /*
> >                * Set CRASH_ADDR_LOW_MAX upper bound for crash memory,
> >                * as old kexec-tools loads bzImage below that, unless
> >                * "crashkernel=size[KMG],high" is specified.
> >                */
> >               crash_base = memblock_find_in_range(CRASH_ALIGN,
> > -                                                 high ? CRASH_ADDR_HIGH_MAX
> > -                                                      : CRASH_ADDR_LOW_MAX,
> > -                                                 crash_size, CRASH_ALIGN);
> > +                     (max_pfn * PAGE_SIZE), crash_size, CRASH_ALIGN);
> > +             memblock_set_bottom_up(bottom_up);
>
> Using bottom-up does not guarantee that the allocation won't fall into a
> removable memory, it only makes it highly probable.
>
> I think that the 'max_pfn * PAGE_SIZE' limit should be replaced with the
> end of the non-removable memory node.
>
Since MEMBLOCK_NONE is passed, along the path memblock_find_in_range() ->...->
__next_mem_range() there is logic that guarantees hotmovable memory
will not be stamped over:

	if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
		continue;
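The quoted check can be modeled with a tiny self-contained sketch (the model_* names, the flag value, and the first-fit walk below are illustrative stand-ins, not memblock's real types):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for MEMBLOCK_HOTPLUG. */
#define MODEL_HOTPLUG 0x1

struct model_region {
	uint64_t base;
	uint64_t size;
	unsigned flags;
};

/* Assume movable_node is enabled, as in the scenario being discussed. */
static bool movable_node_enabled = true;

/* Return the base of the first region that can hold @size, or 0 on failure. */
static uint64_t model_find_range(const struct model_region *regs, int cnt,
				 uint64_t size)
{
	for (int i = 0; i < cnt; i++) {
		/* the logic quoted above: skip hotpluggable memory */
		if (movable_node_enabled && (regs[i].flags & MODEL_HOTPLUG))
			continue;
		if (regs[i].size >= size)
			return regs[i].base;
	}
	return 0;
}
```

With this model, a hotpluggable low region is passed over in favor of a higher non-hotpluggable one, which is the behavior the reply relies on.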

Thanks,
Pingfan

> > +
> >               if (!crash_base) {
> >                       pr_info("crashkernel reservation failed - No suitable area found.\n");
> >                       return;
> > --
> > 2.7.4
> >
>
> --
> Sincerely yours,
> Mike.
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr
  2019-01-02  6:47     ` Pingfan Liu
@ 2019-01-02  9:27       ` Mike Rapoport
  2019-01-02 10:18         ` Baoquan He
  0 siblings, 1 reply; 20+ messages in thread
From: Mike Rapoport @ 2019-01-02  9:27 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: linux-acpi, linux-mm, kexec, Tang Chen, Rafael J. Wysocki,
	Len Brown, Andrew Morton, Mike Rapoport, Michal Hocko,
	Jonathan Corbet, Yaowei Bai, Pavel Tatashin, Nicholas Piggin,
	Naoya Horiguchi, Daniel Vacek, Mathieu Malaterre, Stefan Agner,
	Dave Young, Baoquan He, yinghai, vgoyal, linux-kernel

On Wed, Jan 02, 2019 at 02:47:34PM +0800, Pingfan Liu wrote:
> On Mon, Dec 31, 2018 at 4:40 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> >
> > On Fri, Dec 28, 2018 at 11:00:01AM +0800, Pingfan Liu wrote:
> > > The bottom-up allocation style was introduced to cope with movable_node,
> > > where the lower limit of allocation starts from the kernel's end, due to
> > > lack of knowledge of memory hotplug info at this early time. But once
> > > the hotplug info has been obtained, the lower limit can be extended to 0.
> > > 'kexec -c' prefers to reuse this style to allocate memory at a lower
> > > address, since if the reserved region is beyond 4G, it requires extra
> > > memory (16M by default) for swiotlb.
> >
> > I fail to understand why the availability of memory hotplug information
> > would allow to extend the lower limit of bottom-up memblock allocations
> > below the kernel. The memory in the physical range [0, kernel_start) can be
> > allocated as soon as the kernel memory is reserved.
> >
> Yes, [0, kernel_start) can be allocated at this time by some functions,
> e.g. memblock_reserve(). But there is a trick: functions like
> memblock_find_in_range() do hotplug-attribute checking; they check
> the hotmovable attribute in __next_mem_range():
> 
> 	if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
> 		continue;
> 
> So the movable memory can be safely skipped.

I still don't see the connection between allocating memory below
kernel_start and the hotplug info.

The check for 'end > kernel_end' in

 	if (memblock_bottom_up() && end > kernel_end)

does not protect against allocation in a hotpluggable area.
If memblock_find_in_range() is called before the hotplug info is parsed, it can
return a range in a hotpluggable area.

The point I'd like to clarify is why allocating memory in the range [0,
kernel_start) cannot be done before hotplug info is available and why it is
safe to allocate that memory afterwards?

> Thanks for your kind review.
> 
> Regards,
> Pingfan
> 
> > The extents of the memory node hosting the kernel image can be used to
> > limit memblok allocations from that particular node, even in top-down mode.
> >
> > > Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> > > Cc: Tang Chen <tangchen@cn.fujitsu.com>
> > > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > > Cc: Len Brown <lenb@kernel.org>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > Cc: Jonathan Corbet <corbet@lwn.net>
> > > Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> > > Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
> > > Cc: Nicholas Piggin <npiggin@gmail.com>
> > > Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > > Cc: Daniel Vacek <neelx@redhat.com>
> > > Cc: Mathieu Malaterre <malat@debian.org>
> > > Cc: Stefan Agner <stefan@agner.ch>
> > > Cc: Dave Young <dyoung@redhat.com>
> > > Cc: Baoquan He <bhe@redhat.com>
> > > Cc: yinghai@kernel.org,
> > > Cc: vgoyal@redhat.com
> > > Cc: linux-kernel@vger.kernel.org
> > > ---
> > >  drivers/acpi/numa.c      |  4 ++++
> > >  include/linux/memblock.h |  1 +
> > >  mm/memblock.c            | 58 +++++++++++++++++++++++++++++-------------------
> > >  3 files changed, 40 insertions(+), 23 deletions(-)
> > >
> > > diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> > > index 2746994..3eea4e4 100644
> > > --- a/drivers/acpi/numa.c
> > > +++ b/drivers/acpi/numa.c
> > > @@ -462,6 +462,10 @@ int __init acpi_numa_init(void)
> > >
> > >               cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
> > >                                           acpi_parse_memory_affinity, 0);
> > > +
> > > +#if defined(CONFIG_X86) || defined(CONFIG_ARM64)
> > > +             mark_mem_hotplug_parsed();
> > > +#endif
> > >       }
> > >
> > >       /* SLIT: System Locality Information Table */
> > > diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> > > index aee299a..d89ed9e 100644
> > > --- a/include/linux/memblock.h
> > > +++ b/include/linux/memblock.h
> > > @@ -125,6 +125,7 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
> > >  void memblock_trim_memory(phys_addr_t align);
> > >  bool memblock_overlaps_region(struct memblock_type *type,
> > >                             phys_addr_t base, phys_addr_t size);
> > > +void mark_mem_hotplug_parsed(void);
> > >  int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
> > >  int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
> > >  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
> > > diff --git a/mm/memblock.c b/mm/memblock.c
> > > index 81ae63c..a3f5e46 100644
> > > --- a/mm/memblock.c
> > > +++ b/mm/memblock.c
> > > @@ -231,6 +231,12 @@ __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
> > >       return 0;
> > >  }
> > >
> > > +static bool mem_hotmovable_parsed __initdata_memblock;
> > > +void __init_memblock mark_mem_hotplug_parsed(void)
> > > +{
> > > +     mem_hotmovable_parsed = true;
> > > +}
> > > +
> > >  /**
> > >   * memblock_find_in_range_node - find free area in given range and node
> > >   * @size: size of free area to find
> > > @@ -259,7 +265,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
> > >                                       phys_addr_t end, int nid,
> > >                                       enum memblock_flags flags)
> > >  {
> > > -     phys_addr_t kernel_end, ret;
> > > +     phys_addr_t kernel_end, ret = 0;
> > >
> > >       /* pump up @end */
> > >       if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
> > > @@ -270,34 +276,40 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
> > >       end = max(start, end);
> > >       kernel_end = __pa_symbol(_end);
> > >
> > > -     /*
> > > -      * try bottom-up allocation only when bottom-up mode
> > > -      * is set and @end is above the kernel image.
> > > -      */
> > > -     if (memblock_bottom_up() && end > kernel_end) {
> > > -             phys_addr_t bottom_up_start;
> > > +     if (memblock_bottom_up()) {
> > > +             phys_addr_t bottom_up_start = start;
> > >
> > > -             /* make sure we will allocate above the kernel */
> > > -             bottom_up_start = max(start, kernel_end);
> > > -
> > > -             /* ok, try bottom-up allocation first */
> > > -             ret = __memblock_find_range_bottom_up(bottom_up_start, end,
> > > -                                                   size, align, nid, flags);
> > > -             if (ret)
> > > +             if (mem_hotmovable_parsed) {
> > > +                     ret = __memblock_find_range_bottom_up(
> > > +                             bottom_up_start, end, size, align, nid,
> > > +                             flags);
> > >                       return ret;
> > >
> > >               /*
> > > -              * we always limit bottom-up allocation above the kernel,
> > > -              * but top-down allocation doesn't have the limit, so
> > > -              * retrying top-down allocation may succeed when bottom-up
> > > -              * allocation failed.
> > > -              *
> > > -              * bottom-up allocation is expected to be fail very rarely,
> > > -              * so we use WARN_ONCE() here to see the stack trace if
> > > -              * fail happens.
> > > +              * if mem hotplug info is not parsed yet, try bottom-up
> > > +              * allocation with @end above the kernel image.
> > >                */
> > > -             WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE),
> > > +             } else if (!mem_hotmovable_parsed && end > kernel_end) {
> > > +                     /* make sure we will allocate above the kernel */
> > > +                     bottom_up_start = max(start, kernel_end);
> > > +                     ret = __memblock_find_range_bottom_up(
> > > +                             bottom_up_start, end, size, align, nid,
> > > +                             flags);
> > > +                     if (ret)
> > > +                             return ret;
> > > +                     /*
> > > +                      * we always limit bottom-up allocation above the
> > > +                      * kernel, but top-down allocation doesn't have
> > > +                      * the limit, so retrying top-down allocation may
> > > +                      * succeed when bottom-up allocation failed.
> > > +                      *
> > > +                      * bottom-up allocation is expected to be fail
> > > +                      * very rarely, so we use WARN_ONCE() here to see
> > > +                      * the stack trace if fail happens.
> > > +                      */
> > > +                     WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE),
> > >                         "memblock: bottom-up allocation failed, memory hotremove may be affected\n");
> > > +             }
> > >       }
> > >
> > >       return __memblock_find_range_top_down(start, end, size, align, nid,
> > > --
> > > 2.7.4
> > >
> >
> > --
> > Sincerely yours,
> > Mike.
> >
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 2/2] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr
  2019-01-02  6:47     ` Pingfan Liu
@ 2019-01-02  9:28       ` Mike Rapoport
  0 siblings, 0 replies; 20+ messages in thread
From: Mike Rapoport @ 2019-01-02  9:28 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: linux-acpi, linux-mm, kexec, Tang Chen, Rafael J. Wysocki,
	Len Brown, Andrew Morton, Mike Rapoport, Michal Hocko,
	Jonathan Corbet, Yaowei Bai, Pavel Tatashin, Nicholas Piggin,
	Naoya Horiguchi, Daniel Vacek, Mathieu Malaterre, Stefan Agner,
	Dave Young, Baoquan He, yinghai, vgoyal, linux-kernel

On Wed, Jan 02, 2019 at 02:47:54PM +0800, Pingfan Liu wrote:
> On Mon, Dec 31, 2018 at 4:46 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> >
> > On Fri, Dec 28, 2018 at 11:00:02AM +0800, Pingfan Liu wrote:
> > > A customer reported a bug on a high-end server with many PCIe devices, where
> > > the kernel boots with crashkernel=384M and KASLR enabled. Even
> > > though there is still plenty of memory under 896 MB, the search still
> > > fails intermittently, because currently we can only find a region under
> > > 896 MB if ',high' is not specified. KASLR breaks the area under 896 MB
> > > into several parts randomly, and the crashkernel reservation needs to be
> > > aligned to 128 MB; that is why the failure occurs. It confuses the end
> > > user that crashkernel=X sometimes works and sometimes fails.
> > > To make it succeed, the customer can change the kernel option to
> > > "crashkernel=384M,high". But this gives "crashkernel=xx@yy" a very
> > > limited space to behave, even though its grammar looks more generic.
> > > And we cannot confidently answer the questions the customer raises:
> > > 1) why does it fail to reserve under 896 MB;
> > > 2) what is wrong with the memory region under 4G;
> > > 3) why do I have to add ',high' when I only require 384 MB, not 3840 MB.
> > >
> > > This patch simplifies the method suggested in the mail [1]. It simply goes
> > > bottom-up to find a candidate region for the crashkernel. Bottom-up is
> > > also more compatible with the old reservation style, i.e. it still tries
> > > to get a memory region under 896 MB first, then in [896 MB, 4G], and
> > > finally above 4G.
> > >
> > > There is one trivial point about compatibility with old kexec-tools:
> > > if the reserved region is above 896M, the old tool will fail to load
> > > bzImage. But without this patch the old tool fails anyway, since no
> > > memory below 896M can be reserved for the crashkernel.
> > >
> > > [1]: http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> > > Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> > > Cc: Tang Chen <tangchen@cn.fujitsu.com>
> > > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > > Cc: Len Brown <lenb@kernel.org>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > Cc: Jonathan Corbet <corbet@lwn.net>
> > > Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> > > Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
> > > Cc: Nicholas Piggin <npiggin@gmail.com>
> > > Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > > Cc: Daniel Vacek <neelx@redhat.com>
> > > Cc: Mathieu Malaterre <malat@debian.org>
> > > Cc: Stefan Agner <stefan@agner.ch>
> > > Cc: Dave Young <dyoung@redhat.com>
> > > Cc: Baoquan He <bhe@redhat.com>
> > > Cc: yinghai@kernel.org,
> > > Cc: vgoyal@redhat.com
> > > Cc: linux-kernel@vger.kernel.org
> > > ---
> > >  arch/x86/kernel/setup.c | 9 ++++++---
> > >  1 file changed, 6 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > > index d494b9b..165f9c3 100644
> > > --- a/arch/x86/kernel/setup.c
> > > +++ b/arch/x86/kernel/setup.c
> > > @@ -541,15 +541,18 @@ static void __init reserve_crashkernel(void)
> > >
> > >       /* 0 means: find the address automatically */
> > >       if (crash_base <= 0) {
> > > +             bool bottom_up = memblock_bottom_up();
> > > +
> > > +             memblock_set_bottom_up(true);
> > >
> > >               /*
> > >                * Set CRASH_ADDR_LOW_MAX upper bound for crash memory,
> > >                * as old kexec-tools loads bzImage below that, unless
> > >                * "crashkernel=size[KMG],high" is specified.
> > >                */
> > >               crash_base = memblock_find_in_range(CRASH_ALIGN,
> > > -                                                 high ? CRASH_ADDR_HIGH_MAX
> > > -                                                      : CRASH_ADDR_LOW_MAX,
> > > -                                                 crash_size, CRASH_ALIGN);
> > > +                     (max_pfn * PAGE_SIZE), crash_size, CRASH_ALIGN);
> > > +             memblock_set_bottom_up(bottom_up);
> >
> > Using bottom-up does not guarantee that the allocation won't fall into a
> > removable memory, it only makes it highly probable.
> >
> > I think that the 'max_pfn * PAGE_SIZE' limit should be replaced with the
> > end of the non-removable memory node.
> >
> Since MEMBLOCK_NONE is passed, along the path memblock_find_in_range() ->...->
> __next_mem_range() there is logic that guarantees hotmovable memory
> will not be stamped over:
> 
> 	if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
> 		continue;

Thanks for the clarification, I'd missed that.
 
> Thanks,
> Pingfan
> 
> > > +
> > >               if (!crash_base) {
> > >                       pr_info("crashkernel reservation failed - No suitable area found.\n");
> > >                       return;
> > > --
> > > 2.7.4
> > >
> >
> > --
> > Sincerely yours,
> > Mike.
> >
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr
  2019-01-02  9:27       ` Mike Rapoport
@ 2019-01-02 10:18         ` Baoquan He
  2019-01-02 17:05           ` Mike Rapoport
  2019-01-04  5:59           ` Pingfan Liu
  0 siblings, 2 replies; 20+ messages in thread
From: Baoquan He @ 2019-01-02 10:18 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Pingfan Liu, linux-acpi, linux-mm, kexec, Tang Chen,
	Rafael J. Wysocki, Len Brown, Andrew Morton, Mike Rapoport,
	Michal Hocko, Jonathan Corbet, Yaowei Bai, Pavel Tatashin,
	Nicholas Piggin, Naoya Horiguchi, Daniel Vacek,
	Mathieu Malaterre, Stefan Agner, Dave Young, yinghai, vgoyal,
	linux-kernel

On 01/02/19 at 11:27am, Mike Rapoport wrote:
> On Wed, Jan 02, 2019 at 02:47:34PM +0800, Pingfan Liu wrote:
> > On Mon, Dec 31, 2018 at 4:40 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > >
> > > On Fri, Dec 28, 2018 at 11:00:01AM +0800, Pingfan Liu wrote:
> > > > The bottom-up allocation style was introduced to cope with movable_node,
> > > > where the lower limit of allocation starts from the kernel's end, due to
> > > > lack of knowledge of memory hotplug info at this early time. But once
> > > > the hotplug info has been obtained, the lower limit can be extended to 0.
> > > > 'kexec -c' prefers to reuse this style to allocate memory at a lower
> > > > address, since if the reserved region is beyond 4G, it requires extra
> > > > memory (16M by default) for swiotlb.
> > >
> > > I fail to understand why the availability of memory hotplug information
> > > would allow to extend the lower limit of bottom-up memblock allocations
> > > below the kernel. The memory in the physical range [0, kernel_start) can be
> > > allocated as soon as the kernel memory is reserved.
> > >
> > Yes, [0, kernel_start) can be allocated at this time by some functions,
> > e.g. memblock_reserve(). But there is a trick: functions like
> > memblock_find_in_range() do hotplug-attribute checking; they check
> > the hotmovable attribute in __next_mem_range():
> > 
> > 	if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
> > 		continue;
> > 
> > So the movable memory can be safely skipped.
> 
> I still don't see the connection between allocating memory below
> kernel_start and the hotplug info.
> 
> The check for 'end > kernel_end' in
> 
>  	if (memblock_bottom_up() && end > kernel_end)
> 
> does not protect against allocation in a hotpluggable area.
> If memblock_find_in_range() is called before the hotplug info is parsed, it can
> return a range in a hotpluggable area.
> 
> The point I'd like to clarify is why allocating memory in the range [0,
> kernel_start) cannot be done before hotplug info is available and why it is
> safe to allocate that memory afterwards?

Well, I think that's because we have KASLR. Before KASLR was introduced,
the kernel was put at a low, fixed physical address. Allocating memblock
bottom-up after the kernel makes sure that early data ends up in the same
node as the kernel text itself before SRAT is parsed. Meanwhile
[0, kernel_start) is a very small range, e.g. 16 MB on x86, which could
very easily be used up.

But now, with KASLR enabled by default, this bottom-up-after-kernel-text
allocation has a potential issue. E.g. if we have node0 (including a
normal zone) and node1 (including a movable zone), and KASLR puts the
kernel at the top of node0, the next memblock allocation before SRAT is
parsed will stamp into the movable zone of node1, and consequently
hotplug no longer works well. I had considered this issue previously,
but haven't thought of a way to fix it.
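A toy model of that scenario (the node layout, sizes, and helper names here are made up for illustration; this is not kernel code): with the pre-SRAT rule of allocating above the kernel image, a KASLR placement at the top of node0 sends the next allocation into node1, while a low placement keeps it in node0.

```c
#include <assert.h>
#include <stdint.h>

/* Made-up two-node layout: node0 [0, 1G) normal, node1 [1G, 2G) movable. */
#define NODE0_END 0x40000000ULL
#define NODE1_END 0x80000000ULL

/* Bottom-up first fit, clamped to start above the kernel (pre-SRAT rule). */
static uint64_t toy_alloc_above_kernel(uint64_t kernel_end, uint64_t size)
{
	uint64_t base = kernel_end; /* "make sure we allocate above the kernel" */
	return base + size <= NODE1_END ? base : 0;
}

/* Which node a physical address falls in under the toy layout. */
static int toy_nid(uint64_t pa)
{
	return pa < NODE0_END ? 0 : 1;
}
```

A kernel ending low in node0 keeps the next allocation in node0; a kernel ending at the top of node0 pushes it straight into node1's movable zone.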

That said, it's not related to this patch. About this patchset: I didn't
check it carefully in the v2 post, and acked it. In fact the current way
is not good; Pingfan should call __memblock_find_range_bottom_up()
directly for the crashkernel reservation. The reasons are:
1) once SRAT parsing is done, the system goes back to the top-down way
of memblock allocation;
2) we do need to find a range bottom-up if the user specifies
crashkernel=xxM (without an explicit base address).
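What calling the bottom-up finder directly could look like, sketched under stated assumptions (the free_range array stands in for memblock's free-region iterator, and sketch_find_bottom_up is a hypothetical name, not the kernel function):

```c
#include <assert.h>
#include <stdint.h>

struct free_range { uint64_t start, end; }; /* half-open: [start, end) */

/*
 * Walk candidate free ranges from low to high and return the lowest
 * @align-aligned base that can hold @size, or 0 on failure.  This mimics
 * the shape of a bottom-up range search without flipping the global
 * memblock allocation direction.
 */
static uint64_t sketch_find_bottom_up(const struct free_range *fr, int cnt,
				      uint64_t size, uint64_t align)
{
	for (int i = 0; i < cnt; i++) {
		/* round the range start up to the requested alignment */
		uint64_t cand = (fr[i].start + align - 1) & ~(align - 1);
		if (cand + size <= fr[i].end)
			return cand; /* lowest suitable address wins */
	}
	return 0;
}
```

For "crashkernel=384M" this would naturally prefer the lowest region that fits, e.g. under 896 MB when one exists, without any global mode switch.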

Thanks
Baoquan

> 
> > Thanks for your kind review.
> > 
> > Regards,
> > Pingfan
> > 
> > > The extents of the memory node hosting the kernel image can be used to
> > > limit memblok allocations from that particular node, even in top-down mode.
> > >
> > > > Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> > > > Cc: Tang Chen <tangchen@cn.fujitsu.com>
> > > > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > > > Cc: Len Brown <lenb@kernel.org>
> > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > Cc: Jonathan Corbet <corbet@lwn.net>
> > > > Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> > > > Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
> > > > Cc: Nicholas Piggin <npiggin@gmail.com>
> > > > Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > > > Cc: Daniel Vacek <neelx@redhat.com>
> > > > Cc: Mathieu Malaterre <malat@debian.org>
> > > > Cc: Stefan Agner <stefan@agner.ch>
> > > > Cc: Dave Young <dyoung@redhat.com>
> > > > Cc: Baoquan He <bhe@redhat.com>
> > > > Cc: yinghai@kernel.org,
> > > > Cc: vgoyal@redhat.com
> > > > Cc: linux-kernel@vger.kernel.org
> > > > ---
> > > >  drivers/acpi/numa.c      |  4 ++++
> > > >  include/linux/memblock.h |  1 +
> > > >  mm/memblock.c            | 58 +++++++++++++++++++++++++++++-------------------
> > > >  3 files changed, 40 insertions(+), 23 deletions(-)
> > > >
> > > > diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> > > > index 2746994..3eea4e4 100644
> > > > --- a/drivers/acpi/numa.c
> > > > +++ b/drivers/acpi/numa.c
> > > > @@ -462,6 +462,10 @@ int __init acpi_numa_init(void)
> > > >
> > > >               cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
> > > >                                           acpi_parse_memory_affinity, 0);
> > > > +
> > > > +#if defined(CONFIG_X86) || defined(CONFIG_ARM64)
> > > > +             mark_mem_hotplug_parsed();
> > > > +#endif
> > > >       }
> > > >
> > > >       /* SLIT: System Locality Information Table */
> > > > diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> > > > index aee299a..d89ed9e 100644
> > > > --- a/include/linux/memblock.h
> > > > +++ b/include/linux/memblock.h
> > > > @@ -125,6 +125,7 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
> > > >  void memblock_trim_memory(phys_addr_t align);
> > > >  bool memblock_overlaps_region(struct memblock_type *type,
> > > >                             phys_addr_t base, phys_addr_t size);
> > > > +void mark_mem_hotplug_parsed(void);
> > > >  int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
> > > >  int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
> > > >  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
> > > > diff --git a/mm/memblock.c b/mm/memblock.c
> > > > index 81ae63c..a3f5e46 100644
> > > > --- a/mm/memblock.c
> > > > +++ b/mm/memblock.c
> > > > @@ -231,6 +231,12 @@ __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
> > > >       return 0;
> > > >  }
> > > >
> > > > +static bool mem_hotmovable_parsed __initdata_memblock;
> > > > +void __init_memblock mark_mem_hotplug_parsed(void)
> > > > +{
> > > > +     mem_hotmovable_parsed = true;
> > > > +}
> > > > +
> > > >  /**
> > > >   * memblock_find_in_range_node - find free area in given range and node
> > > >   * @size: size of free area to find
> > > > @@ -259,7 +265,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
> > > >                                       phys_addr_t end, int nid,
> > > >                                       enum memblock_flags flags)
> > > >  {
> > > > -     phys_addr_t kernel_end, ret;
> > > > +     phys_addr_t kernel_end, ret = 0;
> > > >
> > > >       /* pump up @end */
> > > >       if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
> > > > @@ -270,34 +276,40 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
> > > >       end = max(start, end);
> > > >       kernel_end = __pa_symbol(_end);
> > > >
> > > > -     /*
> > > > -      * try bottom-up allocation only when bottom-up mode
> > > > -      * is set and @end is above the kernel image.
> > > > -      */
> > > > -     if (memblock_bottom_up() && end > kernel_end) {
> > > > -             phys_addr_t bottom_up_start;
> > > > +     if (memblock_bottom_up()) {
> > > > +             phys_addr_t bottom_up_start = start;
> > > >
> > > > -             /* make sure we will allocate above the kernel */
> > > > -             bottom_up_start = max(start, kernel_end);
> > > > -
> > > > -             /* ok, try bottom-up allocation first */
> > > > -             ret = __memblock_find_range_bottom_up(bottom_up_start, end,
> > > > -                                                   size, align, nid, flags);
> > > > -             if (ret)
> > > > +             if (mem_hotmovable_parsed) {
> > > > +                     ret = __memblock_find_range_bottom_up(
> > > > +                             bottom_up_start, end, size, align, nid,
> > > > +                             flags);
> > > >                       return ret;
> > > >
> > > >               /*
> > > > -              * we always limit bottom-up allocation above the kernel,
> > > > -              * but top-down allocation doesn't have the limit, so
> > > > -              * retrying top-down allocation may succeed when bottom-up
> > > > -              * allocation failed.
> > > > -              *
> > > > -              * bottom-up allocation is expected to be fail very rarely,
> > > > -              * so we use WARN_ONCE() here to see the stack trace if
> > > > -              * fail happens.
> > > > +              * if mem hotplug info is not parsed yet, try bottom-up
> > > > +              * allocation with @end above the kernel image.
> > > >                */
> > > > -             WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE),
> > > > +             } else if (!mem_hotmovable_parsed && end > kernel_end) {
> > > > +                     /* make sure we will allocate above the kernel */
> > > > +                     bottom_up_start = max(start, kernel_end);
> > > > +                     ret = __memblock_find_range_bottom_up(
> > > > +                             bottom_up_start, end, size, align, nid,
> > > > +                             flags);
> > > > +                     if (ret)
> > > > +                             return ret;
> > > > +                     /*
> > > > +                      * we always limit bottom-up allocation above the
> > > > +                      * kernel, but top-down allocation doesn't have
> > > > +                      * the limit, so retrying top-down allocation may
> > > > +                      * succeed when bottom-up allocation failed.
> > > > +                      *
> > > > +                      * bottom-up allocation is expected to be fail
> > > > +                      * very rarely, so we use WARN_ONCE() here to see
> > > > +                      * the stack trace if fail happens.
> > > > +                      */
> > > > +                     WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE),
> > > >                         "memblock: bottom-up allocation failed, memory hotremove may be affected\n");
> > > > +             }
> > > >       }
> > > >
> > > >       return __memblock_find_range_top_down(start, end, size, align, nid,
> > > > --
> > > > 2.7.4
> > > >
> > >
> > > --
> > > Sincerely yours,
> > > Mike.
> > >
> > 
> 
> -- 
> Sincerely yours,
> Mike.
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr
  2019-01-02 10:18         ` Baoquan He
@ 2019-01-02 17:05           ` Mike Rapoport
  2019-01-03 18:47             ` Tejun Heo
  2019-01-04  5:59           ` Pingfan Liu
  1 sibling, 1 reply; 20+ messages in thread
From: Mike Rapoport @ 2019-01-02 17:05 UTC (permalink / raw)
  To: Baoquan He
  Cc: Pingfan Liu, Tejun Heo, linux-acpi, linux-mm, kexec, Tang Chen,
	Rafael J. Wysocki, Len Brown, Andrew Morton, Mike Rapoport,
	Michal Hocko, Jonathan Corbet, Yaowei Bai, Pavel Tatashin,
	Nicholas Piggin, Naoya Horiguchi, Daniel Vacek,
	Mathieu Malaterre, Stefan Agner, Dave Young, yinghai, vgoyal,
	linux-kernel

(added Tejun)

On Wed, Jan 02, 2019 at 06:18:04PM +0800, Baoquan He wrote:
> On 01/02/19 at 11:27am, Mike Rapoport wrote:
> > On Wed, Jan 02, 2019 at 02:47:34PM +0800, Pingfan Liu wrote:
> > > On Mon, Dec 31, 2018 at 4:40 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > > >
> > > > On Fri, Dec 28, 2018 at 11:00:01AM +0800, Pingfan Liu wrote:
> > > > > The bottom-up allocation style was introduced to cope with movable_node,
> > > > > where the limit inferior of allocation starts from the kernel's end, due
> > > > > to lack of knowledge of memory hotplug info at this early time. But if,
> > > > > later, hotplug info has been obtained, the limit inferior can be extended
> > > > > to 0. 'kexec -c' prefers to reuse this style to allocate memory at a
> > > > > lower address, since if the reserved region is beyond 4G it requires
> > > > > extra memory (default is 16M) for swiotlb.
> > > >
> > > > I fail to understand why the availability of memory hotplug information
> > > > would allow extending the lower limit of bottom-up memblock allocations
> > > > below the kernel. The memory in the physical range [0, kernel_start) can
> > > > be allocated as soon as the kernel memory is reserved.
> > > >
> > > Yes, [0, kernel_start) can be allocated at this time by some functions,
> > > e.g. memblock_reserve(). But there is a trick. A function like
> > > memblock_find_in_range() does hotplug attribute checking; it will
> > > check the hot-movable attribute in __next_mem_range():
> > > {
> > > if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
> > > 	continue;
> > > }. So the movable memory can be safely skipped.
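The hotpluggable-region skip described above can be modelled with a small toy sketch (the region struct, flag value and helper names below are all invented for illustration; only the skip condition mirrors the quoted __next_mem_range() check):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_HOTPLUG 0x1UL	/* stand-in for MEMBLOCK_HOTPLUG */

struct toy_region {
	unsigned long base, size, flags;
};

/* Two invented regions: a hotpluggable one at 0 and a regular one above it. */
static const struct toy_region toy_regions[] = {
	{ 0x0UL,    0x1000UL, TOY_HOTPLUG },
	{ 0x1000UL, 0x1000UL, 0UL },
};

/* Mirrors the quoted check: when movable_node is enabled, hotpluggable
 * regions are skipped so early allocations never land in memory that
 * may later be offlined. */
static bool toy_usable(const struct toy_region *r, bool movable_node)
{
	return !(movable_node && (r->flags & TOY_HOTPLUG));
}

/* Index of the first region an allocation may use, or -1 if none. */
static int toy_first_usable(const struct toy_region *r, size_t n,
			    bool movable_node)
{
	for (size_t i = 0; i < n; i++)
		if (toy_usable(&r[i], movable_node))
			return (int)i;
	return -1;
}
```

With movable_node enabled, the hotpluggable region at index 0 is passed over and the regular region is chosen instead; with it disabled, the lowest region wins.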
> > 
> > I still don't see the connection between allocating memory below
> > kernel_start and the hotplug info.
> > 
> > The check for 'end > kernel_end' in
> > 
> >  	if (memblock_bottom_up() && end > kernel_end)
> > 
> > does not protect against allocation in a hotplugable area.
> > If memblock_find_in_range() is called before hotplug info is parsed it can
> > return a range in a hotplugable area.
> > 
> > The point I'd like to clarify is why allocating memory in the range [0,
> > kernel_start) cannot be done before hotplug info is available and why it is
> > safe to allocate that memory afterwards?
> 
> Well, I think that's because we have KASLR. Before KASLR was introduced,
> the kernel was put at a low and fixed physical address. Allocating memblock
> bottom-up after the kernel makes sure that kernel data lands in the same
> node as the kernel text itself before SRAT is parsed. Meanwhile
> [0, kernel_start) is a very small range, e.g. on x86 it was 16 MB, which
> could easily be used up.
> 
> But now, with KASLR enabled by default, this bottom-up after-kernel-text
> allocation has a potential issue. E.g. if we have node0 (including a
> normal zone) and node1 (including a movable zone), and KASLR puts the
> kernel at the top of node0, the next memblock allocation before SRAT is
> parsed will stomp into the movable zone of node1, and consequently hotplug
> no longer works well. I had considered this issue previously, but haven't
> thought of a way to fix it.
 
I agree that currently the bottom-up allocation after the kernel text has
issues with KASLR. But these issues are not necessarily related to memory
hotplug. Even with a single memory node, a bottom-up allocation will fail
if KASLR puts the kernel near the end of node0.
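To illustrate, here is a toy model of the bottom-up search under the two starting points (a sketch with invented ranges and helper names, not the memblock code; node0 is assumed to end at 64MB with the kernel near its top, and alignment is ignored):

```c
#include <assert.h>
#include <stddef.h>

/* One free physical range [base, base + size); alignment ignored. */
struct toy_range {
	unsigned long base, size;
};

/*
 * Invented layout of node0 = [0, 64MB): the kernel image sits near the
 * top of the node, leaving only a 64KB scrap of free memory above _end.
 */
#define TOY_KERNEL_END	0x3ff0000UL		/* _end, invented */
#define TOY_NODE0_END	0x4000000UL		/* 64MB */
static const struct toy_range toy_free[] = {
	{ 0x100000UL,  0x3e00000UL },	/* free memory below the kernel */
	{ 0x3ff0000UL, 0x10000UL },	/* 64KB leftover above the kernel */
};

/* Toy stand-in for __memblock_find_range_bottom_up(): first spot >= start
 * that fits @size inside [start, end), or 0 on failure. */
static unsigned long toy_bottom_up(const struct toy_range *free_r, size_t n,
				   unsigned long start, unsigned long end,
				   unsigned long size)
{
	for (size_t i = 0; i < n; i++) {
		unsigned long lo = free_r[i].base > start ? free_r[i].base : start;
		unsigned long top = free_r[i].base + free_r[i].size;
		unsigned long hi = top < end ? top : end;

		if (hi > lo && hi - lo >= size)
			return lo;
	}
	return 0;
}
```

Starting at kernel_end, a 2MB request fails because only 64KB remains above the kernel; starting at 0 it succeeds, precisely in the [0, kernel_start) range the current code refuses to use.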

What I am trying to understand is whether there is a fundamental reason to
prevent allocations from [0, kernel_start)?

Maybe Tejun can recall why he suggested to start bottom-up allocations from
kernel_end.

> While that's not related to this patch. About this patchset: I didn't
> check it carefully in the v2 post, and acked it. In fact the current way
> is not good; Pingfan should call __memblock_find_range_bottom_up() directly
> for crashkernel reserving. Reasons are:
> 1) Once SRAT parsing is done, the system goes back to the top-down way of
> doing memblock allocations.
> 2) We do need to find a range bottom-up if the user specifies
> crashkernel=xxM (without an explicit base address).
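As a toy sketch of that suggestion (all names and numbers below are invented, not the kernel API): a crashkernel=xxM reservation would search bottom-up with a 4G upper limit first, falling back to the full range only if that fails:

```c
#include <assert.h>

#define TOY_SZ_4G 0x100000000ULL

/* Invented free ranges: an 8MB hole low in memory and a 1GB hole above 4G. */
static const unsigned long long toy_bases[] = { 0x1000000ULL, 0x200000000ULL };
static const unsigned long long toy_sizes[] = { 0x800000ULL,  0x40000000ULL };

/*
 * Toy bottom-up search with an upper limit: lowest base whose range both
 * holds @size and fits entirely below @limit, or 0 if none.  The idea is
 * to try limit = 4G first, so the reserved region stays DMA32-reachable
 * and needs no extra swiotlb memory, and only then fall back to the
 * whole address space.
 */
static unsigned long long toy_find_low(const unsigned long long *bases,
				       const unsigned long long *sizes,
				       unsigned long n,
				       unsigned long long limit,
				       unsigned long long size)
{
	for (unsigned long i = 0; i < n; i++)
		if (sizes[i] >= size && bases[i] + size <= limit)
			return bases[i];
	return 0;
}
```

Here a 256MB request finds nothing below 4G (the low hole is only 8MB), so the fallback pass reserves above 4G, where the extra swiotlb memory mentioned in the cover letter would then be needed.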
> 
> Thanks
> Baoquan
> 
> > 
> > > Thanks for your kind review.
> > > 
> > > Regards,
> > > Pingfan
> > > 
> > > > The extents of the memory node hosting the kernel image can be used to
> > > > limit memblok allocations from that particular node, even in top-down mode.
> > > >
> > > > > Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> > > > > [...]
> > > >
> > > > --
> > > > Sincerely yours,
> > > > Mike.
> > > >
> > > 
> > 
> > -- 
> > Sincerely yours,
> > Mike.
> > 
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr
  2019-01-02 17:05           ` Mike Rapoport
@ 2019-01-03 18:47             ` Tejun Heo
  2019-01-04 15:09               ` Mike Rapoport
  2019-01-07  8:37               ` Pingfan Liu
  0 siblings, 2 replies; 20+ messages in thread
From: Tejun Heo @ 2019-01-03 18:47 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Baoquan He, Pingfan Liu, linux-acpi, linux-mm, kexec, Tang Chen,
	Rafael J. Wysocki, Len Brown, Andrew Morton, Mike Rapoport,
	Michal Hocko, Jonathan Corbet, Yaowei Bai, Pavel Tatashin,
	Nicholas Piggin, Naoya Horiguchi, Daniel Vacek,
	Mathieu Malaterre, Stefan Agner, Dave Young, yinghai, vgoyal,
	linux-kernel

Hello,

On Wed, Jan 02, 2019 at 07:05:38PM +0200, Mike Rapoport wrote:
> I agree that currently the bottom-up allocation after the kernel text has
> issues with KASLR. But these issues are not necessarily related to memory
> hotplug. Even with a single memory node, a bottom-up allocation will fail
> if KASLR puts the kernel near the end of node0.
> 
> What I am trying to understand is whether there is a fundamental reason to
> prevent allocations from [0, kernel_start)?
> 
> Maybe Tejun can recall why he suggested to start bottom-up allocations from
> kernel_end.

That's from 79442ed189ac ("mm/memblock.c: introduce bottom-up
allocation mode").  I wasn't involved in that patch, so no idea why
the restrictions were added, but FWIW it doesn't seem necessary to me.

Thanks.

-- 
tejun


* Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr
  2019-01-02 10:18         ` Baoquan He
  2019-01-02 17:05           ` Mike Rapoport
@ 2019-01-04  5:59           ` Pingfan Liu
  2019-01-04 16:20             ` Mike Rapoport
  1 sibling, 1 reply; 20+ messages in thread
From: Pingfan Liu @ 2019-01-04  5:59 UTC (permalink / raw)
  To: Baoquan He
  Cc: Mike Rapoport, linux-acpi, linux-mm, kexec, Rafael J. Wysocki,
	Len Brown, Andrew Morton, Mike Rapoport, Michal Hocko,
	Jonathan Corbet, Yaowei Bai, Nicholas Piggin, Naoya Horiguchi,
	Daniel Vacek, Mathieu Malaterre, Stefan Agner, Dave Young,
	yinghai, vgoyal, linux-kernel

On Wed, Jan 2, 2019 at 6:18 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 01/02/19 at 11:27am, Mike Rapoport wrote:
> > On Wed, Jan 02, 2019 at 02:47:34PM +0800, Pingfan Liu wrote:
> > > On Mon, Dec 31, 2018 at 4:40 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > > >
> > > > On Fri, Dec 28, 2018 at 11:00:01AM +0800, Pingfan Liu wrote:
> > > > > The bottom-up allocation style was introduced to cope with movable_node,
> > > > > where the limit inferior of allocation starts from the kernel's end, due
> > > > > to lack of knowledge of memory hotplug info at this early time. But if,
> > > > > later, hotplug info has been obtained, the limit inferior can be extended
> > > > > to 0. 'kexec -c' prefers to reuse this style to allocate memory at a
> > > > > lower address, since if the reserved region is beyond 4G it requires
> > > > > extra memory (default is 16M) for swiotlb.
> > > >
> > > > I fail to understand why the availability of memory hotplug information
> > > > would allow extending the lower limit of bottom-up memblock allocations
> > > > below the kernel. The memory in the physical range [0, kernel_start) can
> > > > be allocated as soon as the kernel memory is reserved.
> > > >
> > > Yes, [0, kernel_start) can be allocated at this time by some functions,
> > > e.g. memblock_reserve(). But there is a trick. A function like
> > > memblock_find_in_range() does hotplug attribute checking; it will
> > > check the hot-movable attribute in __next_mem_range():
> > > {
> > > if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
> > > 	continue;
> > > }. So the movable memory can be safely skipped.
> >
> > I still don't see the connection between allocating memory below
> > kernel_start and the hotplug info.
> >
> > The check for 'end > kernel_end' in
> >
> >       if (memblock_bottom_up() && end > kernel_end)
> >
> > does not protect against allocation in a hotplugable area.
> > If memblock_find_in_range() is called before hotplug info is parsed it can
> > return a range in a hotplugable area.
> >
> > The point I'd like to clarify is why allocating memory in the range [0,
> > kernel_start) cannot be done before hotplug info is available and why it is
> > safe to allocate that memory afterwards?
>
> Well, I think that's because we have KASLR. Before KASLR was introduced,
> the kernel was put at a low and fixed physical address. Allocating memblock
> bottom-up after the kernel makes sure that kernel data lands in the same
> node as the kernel text itself before SRAT is parsed. Meanwhile
> [0, kernel_start) is a very small range, e.g. on x86 it was 16 MB, which
> could easily be used up.
>
> But now, with KASLR enabled by default, this bottom-up after-kernel-text
> allocation has a potential issue. E.g. if we have node0 (including a
> normal zone) and node1 (including a movable zone), and KASLR puts the
> kernel at the top of node0, the next memblock allocation before SRAT is
> parsed will stomp into the movable zone of node1, and consequently hotplug
> no longer works well. I had considered this issue previously, but haven't
> thought of a way to fix it.
>
> While that's not related to this patch. About this patchset: I didn't
> check it carefully in the v2 post, and acked it. In fact the current way
> is not good; Pingfan should call __memblock_find_range_bottom_up() directly
> for crashkernel reserving. Reasons are:

Good suggestion, thanks. I will send out V4.

Regards,
Pingfan
> 1) Once SRAT parsing is done, the system goes back to the top-down way of
> doing memblock allocations.
> 2) We do need to find a range bottom-up if the user specifies
> crashkernel=xxM (without an explicit base address).
>
> Thanks
> Baoquan
>
> >
> > > Thanks for your kind review.
> > >
> > > Regards,
> > > Pingfan
> > >
> > > > The extents of the memory node hosting the kernel image can be used to
> > > > limit memblok allocations from that particular node, even in top-down mode.
> > > >
> > > > > Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> > > > > [...]
> > > >
> > > > --
> > > > Sincerely yours,
> > > > Mike.
> > > >
> > >
> >
> > --
> > Sincerely yours,
> > Mike.
> >


* Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr
  2019-01-03 18:47             ` Tejun Heo
@ 2019-01-04 15:09               ` Mike Rapoport
  2019-01-05  3:44                 ` Baoquan He
  2019-01-07  8:37               ` Pingfan Liu
  1 sibling, 1 reply; 20+ messages in thread
From: Mike Rapoport @ 2019-01-04 15:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Baoquan He, Pingfan Liu, linux-acpi, linux-mm, kexec, Tang Chen,
	Rafael J. Wysocki, Len Brown, Andrew Morton, Mike Rapoport,
	Michal Hocko, Jonathan Corbet, Yaowei Bai, Pavel Tatashin,
	Nicholas Piggin, Naoya Horiguchi, Daniel Vacek,
	Mathieu Malaterre, Stefan Agner, Dave Young, yinghai, vgoyal,
	linux-kernel

On Thu, Jan 03, 2019 at 10:47:06AM -0800, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jan 02, 2019 at 07:05:38PM +0200, Mike Rapoport wrote:
> > I agree that currently the bottom-up allocation after the kernel text has
> > issues with KASLR. But these issues are not necessarily related to
> > memory hotplug. Even with a single memory node, a bottom-up allocation will
> > fail if KASLR puts the kernel near the end of node0.
> > 
> > What I am trying to understand is whether there is a fundamental reason to
> > prevent allocations from [0, kernel_start)?
> > 
> > Maybe Tejun can recall why he suggested to start bottom-up allocations from
> > kernel_end.
> 
> That's from 79442ed189ac ("mm/memblock.c: introduce bottom-up
> allocation mode").  I wasn't involved in that patch, so no idea why
> the restrictions were added, but FWIW it doesn't seem necessary to me.

I should have added the reference [1] in the first place :)
Thanks!

[1] https://lore.kernel.org/lkml/20130904192215.GG26609@mtj.dyndns.org/
 
> Thanks.
> 
> -- 
> tejun
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr
  2019-01-04  5:59           ` Pingfan Liu
@ 2019-01-04 16:20             ` Mike Rapoport
  0 siblings, 0 replies; 20+ messages in thread
From: Mike Rapoport @ 2019-01-04 16:20 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: Baoquan He, linux-acpi, linux-mm, kexec, Rafael J. Wysocki,
	Len Brown, Andrew Morton, Mike Rapoport, Michal Hocko,
	Jonathan Corbet, Yaowei Bai, Nicholas Piggin, Naoya Horiguchi,
	Daniel Vacek, Mathieu Malaterre, Stefan Agner, Dave Young,
	yinghai, vgoyal, linux-kernel

On Fri, Jan 04, 2019 at 01:59:46PM +0800, Pingfan Liu wrote:
> On Wed, Jan 2, 2019 at 6:18 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 01/02/19 at 11:27am, Mike Rapoport wrote:
> > > On Wed, Jan 02, 2019 at 02:47:34PM +0800, Pingfan Liu wrote:
> > > > On Mon, Dec 31, 2018 at 4:40 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > > > >
> > > > > On Fri, Dec 28, 2018 at 11:00:01AM +0800, Pingfan Liu wrote:
> > > > > > The bottom-up allocation style was introduced to cope with movable_node,
> > > > > > where the lower limit of allocation starts from the kernel's end, because
> > > > > > memory hotplug info is not yet known at this early stage. But once the
> > > > > > hotplug info has been parsed, the lower limit can be extended to 0.
> > > > > > 'kexec -c' prefers to reuse this style to allocate memory at a lower
> > > > > > address, since a reserved region beyond 4G requires extra memory
> > > > > > (16M by default) for swiotlb.
> > > > >
> > > > > I fail to understand why the availability of memory hotplug information
> > > > > would allow to extend the lower limit of bottom-up memblock allocations
> > > > > below the kernel. The memory in the physical range [0, kernel_start) can be
> > > > > allocated as soon as the kernel memory is reserved.
> > > > >
> > > > Yes, [0, kernel_start) can be allocated at this time by some functions,
> > > > e.g. memblock_reserve(). But there is a trick: functions like
> > > > memblock_find_in_range() do hotplug attribute checking; __next_mem_range()
> > > > contains
> > > > {
> > > >         if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
> > > >                 continue;
> > > > }
> > > > so the movable memory can be safely skipped.
> > >
> > > I still don't see the connection between allocating memory below
> > > kernel_start and the hotplug info.
> > >
> > > The check for 'end > kernel_end' in
> > >
> > >       if (memblock_bottom_up() && end > kernel_end)
> > >
> > > does not protect against allocation in a hotpluggable area.
> > > If memblock_find_in_range() is called before hotplug info is parsed, it can
> > > return a range in a hotpluggable area.
> > >
> > > The point I'd like to clarify is why allocating memory in the range [0,
> > > kernel_start) cannot be done before hotplug info is available, and why it is
> > > safe to allocate that memory afterwards.
> >
> > Well, I think that's because we have KASLR. Before KASLR was introduced,
> > the kernel was put at a low, fixed physical address. Allocating memblock
> > bottom-up after the kernel makes sure kernel data ends up in the same node
> > as the kernel text before SRAT is parsed. Meanwhile [0, kernel_start) is a
> > very small range, e.g. on x86 it was 16 MB, which is very likely used up.
> >
> > But now, with KASLR enabled by default, this bottom-up-after-kernel-text
> > allocation has a potential issue. E.g. with node0 (containing a normal
> > zone) and node1 (containing a movable zone), if KASLR puts the kernel at
> > the top of node0, the next memblock allocation before SRAT is parsed will
> > stomp into the movable zone of node1, and consequently hotplug no longer
> > works well. I had considered this issue previously, but haven't thought
> > of a way to fix it.
> >
> > That said, it's not related to this patch. As for this patchset, I didn't
> > check it carefully in the v2 post, and acked it. In fact the current way
> > is not good; Pingfan should call __memblock_find_range_bottom_up()
> > directly for crashkernel reservation. The reasons are:
> 
> Good suggestion, thanks. I will send out V4.

I think we can simply remove the restriction of allocating only above the
kernel from memblock_find_in_range_node().
 
> Regards,
> Pingfan
> > 1) Once SRAT parsing is done, the system goes back to the top-down way of
> > doing memblock allocation.
> > 2) We do need to find the range bottom-up if the user specifies
> > crashkernel=xxM (without an explicit base address).
> >
> > Thanks
> > Baoquan
> >
> > >
> > > > Thanks for your kindly review.
> > > >
> > > > Regards,
> > > > Pingfan
> > > >
> > > > > The extents of the memory node hosting the kernel image can be used to
> > > > > limit memblok allocations from that particular node, even in top-down mode.
> > > > >
> > > > > > Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> > > > > > Cc: Tang Chen <tangchen@cn.fujitsu.com>
> > > > > > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > > > > > Cc: Len Brown <lenb@kernel.org>
> > > > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > > > Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> > > > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > > > Cc: Jonathan Corbet <corbet@lwn.net>
> > > > > > Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> > > > > > Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
> > > > > > Cc: Nicholas Piggin <npiggin@gmail.com>
> > > > > > Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > > > > > Cc: Daniel Vacek <neelx@redhat.com>
> > > > > > Cc: Mathieu Malaterre <malat@debian.org>
> > > > > > Cc: Stefan Agner <stefan@agner.ch>
> > > > > > Cc: Dave Young <dyoung@redhat.com>
> > > > > > Cc: Baoquan He <bhe@redhat.com>
> > > > > > Cc: yinghai@kernel.org,
> > > > > > Cc: vgoyal@redhat.com
> > > > > > Cc: linux-kernel@vger.kernel.org
> > > > > > ---
> > > > > >  drivers/acpi/numa.c      |  4 ++++
> > > > > >  include/linux/memblock.h |  1 +
> > > > > >  mm/memblock.c            | 58 +++++++++++++++++++++++++++++-------------------
> > > > > >  3 files changed, 40 insertions(+), 23 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> > > > > > index 2746994..3eea4e4 100644
> > > > > > --- a/drivers/acpi/numa.c
> > > > > > +++ b/drivers/acpi/numa.c
> > > > > > @@ -462,6 +462,10 @@ int __init acpi_numa_init(void)
> > > > > >
> > > > > >               cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
> > > > > >                                           acpi_parse_memory_affinity, 0);
> > > > > > +
> > > > > > +#if defined(CONFIG_X86) || defined(CONFIG_ARM64)
> > > > > > +             mark_mem_hotplug_parsed();
> > > > > > +#endif
> > > > > >       }
> > > > > >
> > > > > >       /* SLIT: System Locality Information Table */
> > > > > > diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> > > > > > index aee299a..d89ed9e 100644
> > > > > > --- a/include/linux/memblock.h
> > > > > > +++ b/include/linux/memblock.h
> > > > > > @@ -125,6 +125,7 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
> > > > > >  void memblock_trim_memory(phys_addr_t align);
> > > > > >  bool memblock_overlaps_region(struct memblock_type *type,
> > > > > >                             phys_addr_t base, phys_addr_t size);
> > > > > > +void mark_mem_hotplug_parsed(void);
> > > > > >  int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
> > > > > >  int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
> > > > > >  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
> > > > > > diff --git a/mm/memblock.c b/mm/memblock.c
> > > > > > index 81ae63c..a3f5e46 100644
> > > > > > --- a/mm/memblock.c
> > > > > > +++ b/mm/memblock.c
> > > > > > @@ -231,6 +231,12 @@ __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
> > > > > >       return 0;
> > > > > >  }
> > > > > >
> > > > > > +static bool mem_hotmovable_parsed __initdata_memblock;
> > > > > > +void __init_memblock mark_mem_hotplug_parsed(void)
> > > > > > +{
> > > > > > +     mem_hotmovable_parsed = true;
> > > > > > +}
> > > > > > +
> > > > > >  /**
> > > > > >   * memblock_find_in_range_node - find free area in given range and node
> > > > > >   * @size: size of free area to find
> > > > > > @@ -259,7 +265,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
> > > > > >                                       phys_addr_t end, int nid,
> > > > > >                                       enum memblock_flags flags)
> > > > > >  {
> > > > > > -     phys_addr_t kernel_end, ret;
> > > > > > +     phys_addr_t kernel_end, ret = 0;
> > > > > >
> > > > > >       /* pump up @end */
> > > > > >       if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
> > > > > > @@ -270,34 +276,40 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
> > > > > >       end = max(start, end);
> > > > > >       kernel_end = __pa_symbol(_end);
> > > > > >
> > > > > > -     /*
> > > > > > -      * try bottom-up allocation only when bottom-up mode
> > > > > > -      * is set and @end is above the kernel image.
> > > > > > -      */
> > > > > > -     if (memblock_bottom_up() && end > kernel_end) {
> > > > > > -             phys_addr_t bottom_up_start;
> > > > > > +     if (memblock_bottom_up()) {
> > > > > > +             phys_addr_t bottom_up_start = start;
> > > > > >
> > > > > > -             /* make sure we will allocate above the kernel */
> > > > > > -             bottom_up_start = max(start, kernel_end);
> > > > > > -
> > > > > > -             /* ok, try bottom-up allocation first */
> > > > > > -             ret = __memblock_find_range_bottom_up(bottom_up_start, end,
> > > > > > -                                                   size, align, nid, flags);
> > > > > > -             if (ret)
> > > > > > +             if (mem_hotmovable_parsed) {
> > > > > > +                     ret = __memblock_find_range_bottom_up(
> > > > > > +                             bottom_up_start, end, size, align, nid,
> > > > > > +                             flags);
> > > > > >                       return ret;
> > > > > >
> > > > > >               /*
> > > > > > -              * we always limit bottom-up allocation above the kernel,
> > > > > > -              * but top-down allocation doesn't have the limit, so
> > > > > > -              * retrying top-down allocation may succeed when bottom-up
> > > > > > -              * allocation failed.
> > > > > > -              *
> > > > > > -              * bottom-up allocation is expected to be fail very rarely,
> > > > > > -              * so we use WARN_ONCE() here to see the stack trace if
> > > > > > -              * fail happens.
> > > > > > +              * if mem hotplug info is not parsed yet, try bottom-up
> > > > > > +              * allocation with @end above the kernel image.
> > > > > >                */
> > > > > > -             WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE),
> > > > > > +             } else if (!mem_hotmovable_parsed && end > kernel_end) {
> > > > > > +                     /* make sure we will allocate above the kernel */
> > > > > > +                     bottom_up_start = max(start, kernel_end);
> > > > > > +                     ret = __memblock_find_range_bottom_up(
> > > > > > +                             bottom_up_start, end, size, align, nid,
> > > > > > +                             flags);
> > > > > > +                     if (ret)
> > > > > > +                             return ret;
> > > > > > +                     /*
> > > > > > +                      * we always limit bottom-up allocation above the
> > > > > > +                      * kernel, but top-down allocation doesn't have
> > > > > > +                      * the limit, so retrying top-down allocation may
> > > > > > +                      * succeed when bottom-up allocation failed.
> > > > > > +                      *
> > > > > > +                      * bottom-up allocation is expected to be fail
> > > > > > +                      * very rarely, so we use WARN_ONCE() here to see
> > > > > > +                      * the stack trace if fail happens.
> > > > > > +                      */
> > > > > > +                     WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE),
> > > > > >                         "memblock: bottom-up allocation failed, memory hotremove may be affected\n");
> > > > > > +             }
> > > > > >       }
> > > > > >
> > > > > >       return __memblock_find_range_top_down(start, end, size, align, nid,
> > > > > > --
> > > > > > 2.7.4
> > > > > >
> > > > >
> > > > > --
> > > > > Sincerely yours,
> > > > > Mike.
> > > > >
> > > >
> > >
> > > --
> > > Sincerely yours,
> > > Mike.
> > >
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr
  2019-01-04 15:09               ` Mike Rapoport
@ 2019-01-05  3:44                 ` Baoquan He
  2019-01-06  6:27                   ` Mike Rapoport
  0 siblings, 1 reply; 20+ messages in thread
From: Baoquan He @ 2019-01-05  3:44 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Tejun Heo, Pingfan Liu, linux-acpi, linux-mm, kexec, Tang Chen,
	Rafael J. Wysocki, Len Brown, Andrew Morton, Mike Rapoport,
	Michal Hocko, Jonathan Corbet, Yaowei Bai, Pavel Tatashin,
	Nicholas Piggin, Naoya Horiguchi, Daniel Vacek,
	Mathieu Malaterre, Stefan Agner, Dave Young, yinghai, vgoyal,
	linux-kernel

On 01/04/19 at 05:09pm, Mike Rapoport wrote:
> On Thu, Jan 03, 2019 at 10:47:06AM -0800, Tejun Heo wrote:
> > Hello,
> > 
> > On Wed, Jan 02, 2019 at 07:05:38PM +0200, Mike Rapoport wrote:
> > > I agree that currently the bottom-up allocation after the kernel text has
> > > issues with KASLR. But these issues are not necessarily related to
> > > memory hotplug. Even with a single memory node, a bottom-up allocation will
> > > fail if KASLR puts the kernel near the end of node0.
> > > 
> > > What I am trying to understand is whether there is a fundamental reason to
> > > prevent allocations from [0, kernel_start)?
> > > 
> > > Maybe Tejun can recall why he suggested to start bottom-up allocations from
> > > kernel_end.
> > 
> > That's from 79442ed189ac ("mm/memblock.c: introduce bottom-up
> > allocation mode").  I wasn't involved in that patch, so no idea why
> > the restrictions were added, but FWIW it doesn't seem necessary to me.
> 
> I should have added the reference [1] in the first place :)
> Thanks!
> 
> [1] https://lore.kernel.org/lkml/20130904192215.GG26609@mtj.dyndns.org/

In my understanding, we may not be able to discard the bottom-up
method in the current kernel. It's tied to the hotplug feature when the
'movable_node' kernel parameter is specified. With 'movable_node', the
system relies on reading hotplug information from firmware; on x86 that's
the ACPI SRAT table. The current system allocates memblock regions
top-down by default. However, several memblock allocations happen before
that hotplug information is retrieved, and top-down allocation there must
break the hotplug feature, since it will place kernel data in a movable
zone, which is usually on the last node of a bare metal system.

This bottom-up way is taken on many ARCHes and works well on systems
where KASLR is not enabled. Below is the search result in the current
Linux kernel; we can see that all ARCHes have this mechanism except
arm/arm64. But for now only arm64/mips/x86 have KASLR.

W/o KASLR, allocating memblock regions above the kernel end when hotplug
info is not yet parsed looks very reasonable, since the kernel is usually
put at a low address, e.g. 16M on x86. My thought is that we need to do
memblock allocation around the kernel before the hotplug info is parsed.
That is, for systems w/o KASLR we keep the current bottom-up way; for
systems with KASLR we should allocate memblock regions top-down just
below the kernel start.

This issue must break hotplug; it is only covered up because bare metal
systems currently need to add 'nokaslr' to disable KASLR while another
bug fix, below, is under discussion.

 [PATCH v14 0/5] x86/boot/KASLR: Parse ACPI table and limit KASLR to choosing immovable memory
lkml.kernel.org/r/20181214093013.13370-1-fanc.fnst@cn.fujitsu.com

[~ ]$ git grep memblock_set_bottom_up
arch/alpha/kernel/setup.c:      memblock_set_bottom_up(true);
arch/m68k/mm/motorola.c:        memblock_set_bottom_up(true);
arch/mips/kernel/setup.c:       memblock_set_bottom_up(true);
arch/mips/kernel/traps.c:       memblock_set_bottom_up(false);
arch/nds32/kernel/setup.c:      memblock_set_bottom_up(true);
arch/powerpc/kernel/paca.c:             memblock_set_bottom_up(true);
arch/powerpc/kernel/paca.c:             memblock_set_bottom_up(false);
arch/s390/kernel/setup.c:       memblock_set_bottom_up(true);
arch/s390/kernel/setup.c:       memblock_set_bottom_up(false);
arch/sparc/mm/init_32.c:        memblock_set_bottom_up(true);
arch/x86/kernel/setup.c:                memblock_set_bottom_up(true);
arch/x86/mm/numa.c:     memblock_set_bottom_up(false);
include/linux/memblock.h:static inline void __init memblock_set_bottom_up(bool enable)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr
  2019-01-05  3:44                 ` Baoquan He
@ 2019-01-06  6:27                   ` Mike Rapoport
  2019-01-08  8:50                     ` Baoquan He
  0 siblings, 1 reply; 20+ messages in thread
From: Mike Rapoport @ 2019-01-06  6:27 UTC (permalink / raw)
  To: Baoquan He
  Cc: Tejun Heo, Pingfan Liu, linux-acpi, linux-mm, kexec, Tang Chen,
	Rafael J. Wysocki, Len Brown, Andrew Morton, Mike Rapoport,
	Michal Hocko, Jonathan Corbet, Yaowei Bai, Pavel Tatashin,
	Nicholas Piggin, Naoya Horiguchi, Daniel Vacek,
	Mathieu Malaterre, Stefan Agner, Dave Young, yinghai, vgoyal,
	linux-kernel

On Sat, Jan 05, 2019 at 11:44:50AM +0800, Baoquan He wrote:
> On 01/04/19 at 05:09pm, Mike Rapoport wrote:
> > On Thu, Jan 03, 2019 at 10:47:06AM -0800, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Wed, Jan 02, 2019 at 07:05:38PM +0200, Mike Rapoport wrote:
> > > > I agree that currently the bottom-up allocation after the kernel text has
> > > > issues with KASLR. But these issues are not necessarily related to
> > > > memory hotplug. Even with a single memory node, a bottom-up allocation will
> > > > fail if KASLR puts the kernel near the end of node0.
> > > > 
> > > > What I am trying to understand is whether there is a fundamental reason to
> > > > prevent allocations from [0, kernel_start)?
> > > > 
> > > > Maybe Tejun can recall why he suggested to start bottom-up allocations from
> > > > kernel_end.
> > > 
> > > That's from 79442ed189ac ("mm/memblock.c: introduce bottom-up
> > > allocation mode").  I wasn't involved in that patch, so no idea why
> > > the restrictions were added, but FWIW it doesn't seem necessary to me.
> > 
> > I should have added the reference [1] in the first place :)
> > Thanks!
> > 
> > [1] https://lore.kernel.org/lkml/20130904192215.GG26609@mtj.dyndns.org/
> 
> In my understanding, we may not be able to discard the bottom-up
> method in the current kernel. It's tied to the hotplug feature when the
> 'movable_node' kernel parameter is specified. With 'movable_node', the
> system relies on reading hotplug information from firmware; on x86 that's
> the ACPI SRAT table. The current system allocates memblock regions
> top-down by default. However, several memblock allocations happen before
> that hotplug information is retrieved, and top-down allocation there must
> break the hotplug feature, since it will place kernel data in a movable
> zone, which is usually on the last node of a bare metal system.

I do not suggest discarding the bottom-up method; I merely suggest allowing
it to use [0, kernel_start).
 
> > This bottom-up way is taken on many ARCHes and works well on systems
> > where KASLR is not enabled. Below is the search result in the current
> > Linux kernel; we can see that all ARCHes have this mechanism except
> > arm/arm64. But for now only arm64/mips/x86 have KASLR.
> > 
> > W/o KASLR, allocating memblock regions above the kernel end when hotplug
> > info is not yet parsed looks very reasonable, since the kernel is usually
> > put at a low address, e.g. 16M on x86. My thought is that we need to do
> > memblock allocation around the kernel before the hotplug info is parsed.
> > That is, for systems w/o KASLR we keep the current bottom-up way; for
> > systems with KASLR we should allocate memblock regions top-down just
> > below the kernel start.

I completely agree. I was thinking about making
memblock_find_in_range_node() to do something like

if (memblock_bottom_up()) {
	bottom_up_start = max(start, kernel_end);

	ret = __memblock_find_range_bottom_up(bottom_up_start, end,
					      size, align, nid, flags);
	if (ret)
		return ret;

	bottom_up_start = max(start, 0);
	end = kernel_start;

	ret = __memblock_find_range_top_down(bottom_up_start, end,
					     size, align, nid, flags);
	if (ret)
		return ret;
}

 
> This issue must break hotplug; it is only covered up because bare metal
> systems currently need to add 'nokaslr' to disable KASLR while another
> bug fix, below, is under discussion.
> 
>  [PATCH v14 0/5] x86/boot/KASLR: Parse ACPI table and limit KASLR to choosing immovable memory
> lkml.kernel.org/r/20181214093013.13370-1-fanc.fnst@cn.fujitsu.com
> 
> [~ ]$ git grep memblock_set_bottom_up
> arch/alpha/kernel/setup.c:      memblock_set_bottom_up(true);
> arch/m68k/mm/motorola.c:        memblock_set_bottom_up(true);
> arch/mips/kernel/setup.c:       memblock_set_bottom_up(true);
> arch/mips/kernel/traps.c:       memblock_set_bottom_up(false);
> arch/nds32/kernel/setup.c:      memblock_set_bottom_up(true);
> arch/powerpc/kernel/paca.c:             memblock_set_bottom_up(true);
> arch/powerpc/kernel/paca.c:             memblock_set_bottom_up(false);
> arch/s390/kernel/setup.c:       memblock_set_bottom_up(true);
> arch/s390/kernel/setup.c:       memblock_set_bottom_up(false);
> arch/sparc/mm/init_32.c:        memblock_set_bottom_up(true);
> arch/x86/kernel/setup.c:                memblock_set_bottom_up(true);
> arch/x86/mm/numa.c:     memblock_set_bottom_up(false);
> include/linux/memblock.h:static inline void __init memblock_set_bottom_up(bool enable)
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr
  2019-01-03 18:47             ` Tejun Heo
  2019-01-04 15:09               ` Mike Rapoport
@ 2019-01-07  8:37               ` Pingfan Liu
  1 sibling, 0 replies; 20+ messages in thread
From: Pingfan Liu @ 2019-01-07  8:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Rapoport, Baoquan He, linux-acpi, linux-mm, kexec,
	Rafael J. Wysocki, Len Brown, Andrew Morton, Mike Rapoport,
	Michal Hocko, Jonathan Corbet, Yaowei Bai, Nicholas Piggin,
	Naoya Horiguchi, Daniel Vacek, Mathieu Malaterre, Stefan Agner,
	Dave Young, yinghai, vgoyal, linux-kernel

I sent out a series, [RFC PATCH 0/4] x86_64/mm: remove bottom-up
allocation style by pushing forward the parsing of mem hotplug info
(https://lore.kernel.org/lkml/1546849485-27933-1-git-send-email-kernelfans@gmail.com/T/#t).
Please comment if you are interested.

Thanks,
Pingfan

On Fri, Jan 4, 2019 at 2:47 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Jan 02, 2019 at 07:05:38PM +0200, Mike Rapoport wrote:
> > I agree that currently the bottom-up allocation after the kernel text has
> > issues with KASLR. But these issues are not necessarily related to
> > memory hotplug. Even with a single memory node, a bottom-up allocation will
> > fail if KASLR puts the kernel near the end of node0.
> >
> > What I am trying to understand is whether there is a fundamental reason to
> > prevent allocations from [0, kernel_start)?
> >
> > Maybe Tejun can recall why he suggested to start bottom-up allocations from
> > kernel_end.
>
> That's from 79442ed189ac ("mm/memblock.c: introduce bottom-up
> allocation mode").  I wasn't involved in that patch, so no idea why
> the restrictions were added, but FWIW it doesn't seem necessary to me.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr
  2019-01-06  6:27                   ` Mike Rapoport
@ 2019-01-08  8:50                     ` Baoquan He
  0 siblings, 0 replies; 20+ messages in thread
From: Baoquan He @ 2019-01-08  8:50 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Tejun Heo, Pingfan Liu, linux-acpi, linux-mm, kexec, Tang Chen,
	Rafael J. Wysocki, Len Brown, Andrew Morton, Mike Rapoport,
	Michal Hocko, Jonathan Corbet, Yaowei Bai, Pavel Tatashin,
	Nicholas Piggin, Naoya Horiguchi, Daniel Vacek,
	Mathieu Malaterre, Stefan Agner, Dave Young, yinghai, vgoyal,
	linux-kernel

On 01/06/19 at 08:27am, Mike Rapoport wrote:
> I do not suggest discarding the bottom-up method; I merely suggest allowing
> it to use [0, kernel_start).

Sorry for the late reply.

I misunderstood it, sorry.

> > This bottom-up way is taken on many ARCHes and works well on systems
> > where KASLR is not enabled. Below is the search result in the current
> > Linux kernel; we can see that all ARCHes have this mechanism except
> > arm/arm64. But for now only arm64/mips/x86 have KASLR.
> > 
> > W/o KASLR, allocating memblock regions above the kernel end when hotplug
> > info is not yet parsed looks very reasonable, since the kernel is usually
> > put at a low address, e.g. 16M on x86. My thought is that we need to do
> > memblock allocation around the kernel before the hotplug info is parsed.
> > That is, for systems w/o KASLR we keep the current bottom-up way; for
> > systems with KASLR we should allocate memblock regions top-down just
> > below the kernel start.
> 
> I completely agree. I was thinking about making
> memblock_find_in_range_node() to do something like
> 
> if (memblock_bottom_up()) {
> 	bottom_up_start = max(start, kernel_end);

In this way, if start < kernel_end, it will still succeed in finding a
region bottom-up above the kernel end.

I am still reading the code. I just noticed Pingfan sent an RFC patchset
to move SRAT parsing earlier; I'm not sure whether he has tested it on a
NUMA system with ACPI. I doubt that really works.

Thanks
Baoquan

> 	ret = __memblock_find_range_bottom_up(bottom_up_start, end,
> 					      size, align, nid, flags);



> 	if (ret)
> 		return ret;
> 
> 	bottom_up_start = max(start, 0);
> 	end = kernel_start;
> 
> 	ret = __memblock_find_range_top_down(bottom_up_start, end,
> 					     size, align, nid, flags);
> 	if (ret)
> 		return ret;
> }

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-01-08  8:50 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-28  3:00 [PATCHv3 0/2] mm/memblock: reuse memblock bottom-up allocation style Pingfan Liu
2018-12-28  3:00 ` [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr Pingfan Liu
2018-12-31  8:40   ` Mike Rapoport
2019-01-02  6:47     ` Pingfan Liu
2019-01-02  9:27       ` Mike Rapoport
2019-01-02 10:18         ` Baoquan He
2019-01-02 17:05           ` Mike Rapoport
2019-01-03 18:47             ` Tejun Heo
2019-01-04 15:09               ` Mike Rapoport
2019-01-05  3:44                 ` Baoquan He
2019-01-06  6:27                   ` Mike Rapoport
2019-01-08  8:50                     ` Baoquan He
2019-01-07  8:37               ` Pingfan Liu
2019-01-04  5:59           ` Pingfan Liu
2019-01-04 16:20             ` Mike Rapoport
2018-12-28  3:00 ` [PATCHv3 2/2] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr Pingfan Liu
2018-12-31  8:46   ` Mike Rapoport
2019-01-02  6:47     ` Pingfan Liu
2019-01-02  9:28       ` Mike Rapoport
2018-12-28  3:39 ` [PATCHv3 0/2] mm/memblock: reuse memblock bottom-up allocation style Baoquan He

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).