kexec.lists.infradead.org archive mirror
* [PATCH 0/5] arm64: kdump: Function supplement and performance optimization
@ 2022-06-13  8:09 Zhen Lei
  2022-06-13  8:09 ` [PATCH 1/5] arm64: kdump: Provide default size when crashkernel=Y,low is not specified Zhen Lei
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Zhen Lei @ 2022-06-13  8:09 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Baoquan He, Vivek Goyal, kexec,
	linux-kernel, Catalin Marinas, Will Deacon, linux-arm-kernel,
	Jonathan Corbet, linux-doc
  Cc: Zhen Lei, Randy Dunlap, Feng Zhou, Kefeng Wang, Chen Zhou,
	John Donnelly, Dave Kleikamp

Now that the basic functionality of "support reserving crashkernel above 4G
on arm64 kdump" (see https://lkml.org/lkml/2022/5/6/428) has been
implemented, three improvements remain (example command lines follow the
list below):
1. When crashkernel=X,high is specified but crashkernel=Y,low is not, provide
   a default crash low memory size.
2. For crashkernel=X without '@offset', if the low memory allocation fails,
   fall back to reserving a region in high memory (above the DMA zones).
3. If crashkernel=X,high is used, perform page-level mapping only for the
   crash high memory, and keep block mapping for the rest of the linear
   address space.
   Compared to the previous version:
   (1) For crashkernel=X[@offset], the memory above 4G is not changed to
       block mapping; that is left for a later series.
   (2) The implementation method is modified. It is now simpler and clearer.
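
From the user's point of view, a hedged illustration of the resulting
command-line behavior (sizes are arbitrary examples, not recommendations):

  crashkernel=2G,high
      (high region; the low size defaults automatically per feature 1)
  crashkernel=2G,high crashkernel=256M,low
      (both sizes given explicitly)
  crashkernel=512M
      (tries the DMA zones first, then falls back above them per feature 2)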

Zhen Lei (5):
  arm64: kdump: Provide default size when crashkernel=Y,low is not
    specified
  arm64: kdump: Support crashkernel=X fall back to reserve region above
    DMA zones
  arm64: kdump: Remove some redundant checks in map_mem()
  arm64: kdump: Decide when to reserve crash memory in
    reserve_crashkernel()
  arm64: kdump: Don't defer the reservation of crash high memory

 .../admin-guide/kernel-parameters.txt         |  10 +-
 arch/arm64/mm/init.c                          | 109 ++++++++++++++++--
 arch/arm64/mm/mmu.c                           |  25 ++--
 3 files changed, 112 insertions(+), 32 deletions(-)

-- 
2.25.1



* [PATCH 1/5] arm64: kdump: Provide default size when crashkernel=Y,low is not specified
  2022-06-13  8:09 [PATCH 0/5] arm64: kdump: Function supplement and performance optimization Zhen Lei
@ 2022-06-13  8:09 ` Zhen Lei
  2022-06-17  2:40   ` Baoquan He
  2022-06-17  8:26   ` Baoquan He
  2022-06-13  8:09 ` [PATCH 2/5] arm64: kdump: Support crashkernel=X fall back to reserve region above DMA zones Zhen Lei
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 26+ messages in thread
From: Zhen Lei @ 2022-06-13  8:09 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Baoquan He, Vivek Goyal, kexec,
	linux-kernel, Catalin Marinas, Will Deacon, linux-arm-kernel,
	Jonathan Corbet, linux-doc
  Cc: Zhen Lei, Randy Dunlap, Feng Zhou, Kefeng Wang, Chen Zhou,
	John Donnelly, Dave Kleikamp

To be consistent with the x86 implementation and improve the cross-platform
user experience, try to allocate at least 256 MiB of low memory automatically
when crashkernel=Y,low is not specified.
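
A hedged worked example of the DEFAULT_CRASH_KERNEL_LOW_SIZE arithmetic below
(assuming the default 64 MiB swiotlb; the result changes if "swiotlb=" is
passed on the command line):

	swiotlb_size_or_default() = 64 MiB
	64 MiB + (8UL << 20)      = 72 MiB
	max(72 MiB, 256UL << 20)  = max(72 MiB, 256 MiB) = 256 MiB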

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  8 +-------
 arch/arm64/mm/init.c                            | 12 +++++++++++-
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 8090130b544b070..61b179232b68001 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -843,7 +843,7 @@
 			available.
 			It will be ignored if crashkernel=X is specified.
 	crashkernel=size[KMG],low
-			[KNL, X86-64] range under 4G. When crashkernel=X,high
+			[KNL, X86-64, ARM64] range under 4G. When crashkernel=X,high
 			is passed, kernel could allocate physical memory region
 			above 4G, that cause second kernel crash on system
 			that require some amount of low memory, e.g. swiotlb
@@ -857,12 +857,6 @@
 			It will be ignored when crashkernel=X,high is not used
 			or memory reserved is below 4G.
 
-			[KNL, ARM64] range in low memory.
-			This one lets the user specify a low range in the
-			DMA zone for the crash dump kernel.
-			It will be ignored when crashkernel=X,high is not used
-			or memory reserved is located in the DMA zones.
-
 	cryptomgr.notests
 			[KNL] Disable crypto self-tests
 
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 339ee84e5a61a0b..5390f361208ccf7 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -96,6 +96,14 @@ phys_addr_t __ro_after_init arm64_dma_phys_limit = PHYS_MASK + 1;
 #define CRASH_ADDR_LOW_MAX		arm64_dma_phys_limit
 #define CRASH_ADDR_HIGH_MAX		(PHYS_MASK + 1)
 
+/*
+ * This is an empirical value in x86_64 and taken here directly. Please
+ * refer to the code comment in reserve_crashkernel_low() of x86_64 for more
+ * details.
+ */
+#define DEFAULT_CRASH_KERNEL_LOW_SIZE	\
+	max(swiotlb_size_or_default() + (8UL << 20), 256UL << 20)
+
 static int __init reserve_crashkernel_low(unsigned long long low_size)
 {
 	unsigned long long low_base;
@@ -147,7 +155,9 @@ static void __init reserve_crashkernel(void)
 		 * is not allowed.
 		 */
 		ret = parse_crashkernel_low(cmdline, 0, &crash_low_size, &crash_base);
-		if (ret && (ret != -ENOENT))
+		if (ret == -ENOENT)
+			crash_low_size = DEFAULT_CRASH_KERNEL_LOW_SIZE;
+		else if (ret)
 			return;
 
 		crash_max = CRASH_ADDR_HIGH_MAX;
-- 
2.25.1



* [PATCH 2/5] arm64: kdump: Support crashkernel=X fall back to reserve region above DMA zones
  2022-06-13  8:09 [PATCH 0/5] arm64: kdump: Function supplement and performance optimization Zhen Lei
  2022-06-13  8:09 ` [PATCH 1/5] arm64: kdump: Provide default size when crashkernel=Y,low is not specified Zhen Lei
@ 2022-06-13  8:09 ` Zhen Lei
  2022-06-17  4:16   ` Baoquan He
  2022-06-13  8:09 ` [PATCH 3/5] arm64: kdump: Remove some redundant checks in map_mem() Zhen Lei
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Zhen Lei @ 2022-06-13  8:09 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Baoquan He, Vivek Goyal, kexec,
	linux-kernel, Catalin Marinas, Will Deacon, linux-arm-kernel,
	Jonathan Corbet, linux-doc
  Cc: Zhen Lei, Randy Dunlap, Feng Zhou, Kefeng Wang, Chen Zhou,
	John Donnelly, Dave Kleikamp

For crashkernel=X without '@offset', select a region within the DMA zones
first, and fall back to reserving a region above the DMA zones. This allows
users to use the same configuration on multiple platforms.
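
A hedged sketch of the resulting control flow (pseudocode distilled from the
diff below, not the literal kernel code):

	crash_base = memblock_phys_alloc_range(..., crash_base, CRASH_ADDR_LOW_MAX);
	if (!crash_base && !fixed_base) {
		/* Low allocation failed: retry anywhere below the high
		 * limit, and remember that a minimum low region must
		 * still be reserved later. */
		crash_max = CRASH_ADDR_HIGH_MAX;
		crash_low_size = DEFAULT_CRASH_KERNEL_LOW_SIZE;
		/* ...retry memblock_phys_alloc_range() with the new crash_max */
	}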

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  2 +-
 arch/arm64/mm/init.c                            | 16 +++++++++++++++-
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 61b179232b68001..fdac18beba5624e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -823,7 +823,7 @@
 			memory region [offset, offset + size] for that kernel
 			image. If '@offset' is omitted, then a suitable offset
 			is selected automatically.
-			[KNL, X86-64] Select a region under 4G first, and
+			[KNL, X86-64, ARM64] Select a region under 4G first, and
 			fall back to reserve region above 4G when '@offset'
 			hasn't been specified.
 			See Documentation/admin-guide/kdump/kdump.rst for further details.
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 5390f361208ccf7..8539598f9e58b4d 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -138,6 +138,7 @@ static void __init reserve_crashkernel(void)
 	unsigned long long crash_max = CRASH_ADDR_LOW_MAX;
 	char *cmdline = boot_command_line;
 	int ret;
+	bool fixed_base;
 
 	if (!IS_ENABLED(CONFIG_KEXEC_CORE))
 		return;
@@ -166,15 +167,28 @@ static void __init reserve_crashkernel(void)
 		return;
 	}
 
+	fixed_base = !!crash_base;
 	crash_size = PAGE_ALIGN(crash_size);
 
 	/* User specifies base address explicitly. */
-	if (crash_base)
+	if (fixed_base)
 		crash_max = crash_base + crash_size;
 
+retry:
 	crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
 					       crash_base, crash_max);
 	if (!crash_base) {
+		/*
+		 * Attempt to fully allocate low memory failed, fall back
+		 * to high memory, the minimum required low memory will be
+		 * reserved later.
+		 */
+		if (!fixed_base && (crash_max == CRASH_ADDR_LOW_MAX)) {
+			crash_max = CRASH_ADDR_HIGH_MAX;
+			crash_low_size = DEFAULT_CRASH_KERNEL_LOW_SIZE;
+			goto retry;
+		}
+
 		pr_warn("cannot allocate crashkernel (size:0x%llx)\n",
 			crash_size);
 		return;
-- 
2.25.1



* [PATCH 3/5] arm64: kdump: Remove some redundant checks in map_mem()
  2022-06-13  8:09 [PATCH 0/5] arm64: kdump: Function supplement and performance optimization Zhen Lei
  2022-06-13  8:09 ` [PATCH 1/5] arm64: kdump: Provide default size when crashkernel=Y,low is not specified Zhen Lei
  2022-06-13  8:09 ` [PATCH 2/5] arm64: kdump: Support crashkernel=X fall back to reserve region above DMA zones Zhen Lei
@ 2022-06-13  8:09 ` Zhen Lei
  2022-06-20  7:42   ` Baoquan He
  2022-06-13  8:09 ` [PATCH 4/5] arm64: kdump: Decide when to reserve crash memory in reserve_crashkernel() Zhen Lei
  2022-06-13  8:09 ` [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory Zhen Lei
  4 siblings, 1 reply; 26+ messages in thread
From: Zhen Lei @ 2022-06-13  8:09 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Baoquan He, Vivek Goyal, kexec,
	linux-kernel, Catalin Marinas, Will Deacon, linux-arm-kernel,
	Jonathan Corbet, linux-doc
  Cc: Zhen Lei, Randy Dunlap, Feng Zhou, Kefeng Wang, Chen Zhou,
	John Donnelly, Dave Kleikamp

arm64_memblock_init()
	if (!IS_ENABLED(CONFIG_ZONE_DMA/DMA32))
		reserve_crashkernel()
			//initialize crashk_res when
			//"crashkernel=" is correctly specified
paging_init()
	map_mem()

As shown in the pseudocode above, crashk_res.end can only be initialized to
a non-zero value when both "!IS_ENABLED(CONFIG_ZONE_DMA/DMA32)" and
crash_mem_map are true. So some checks in map_mem() can be adjusted or
optimized.
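
For reference, a hedged summary of when each branch fires after this patch
(derived from the diff below):

	crashk_res.end != 0  ->  region was reserved early: mark it nomap
	                         now, remap it page-by-page at the end of
	                         map_mem()
	crashk_res.end == 0  ->  reservation is deferred (ZONE_DMA/DMA32
	                         enabled): force page-level mapping for
	                         all of RAM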

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
 arch/arm64/mm/mmu.c | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 626ec32873c6c36..6028a5757e4eae2 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -529,12 +529,12 @@ static void __init map_mem(pgd_t *pgdp)
 
 #ifdef CONFIG_KEXEC_CORE
 	if (crash_mem_map) {
-		if (IS_ENABLED(CONFIG_ZONE_DMA) ||
-		    IS_ENABLED(CONFIG_ZONE_DMA32))
-			flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
-		else if (crashk_res.end)
+		if (crashk_res.end)
 			memblock_mark_nomap(crashk_res.start,
 			    resource_size(&crashk_res));
+		else if (IS_ENABLED(CONFIG_ZONE_DMA) ||
+			 IS_ENABLED(CONFIG_ZONE_DMA32))
+			flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 	}
 #endif
 
@@ -571,16 +571,13 @@ static void __init map_mem(pgd_t *pgdp)
 	 * through /sys/kernel/kexec_crash_size interface.
 	 */
 #ifdef CONFIG_KEXEC_CORE
-	if (crash_mem_map &&
-	    !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) {
-		if (crashk_res.end) {
-			__map_memblock(pgdp, crashk_res.start,
-				       crashk_res.end + 1,
-				       PAGE_KERNEL,
-				       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
-			memblock_clear_nomap(crashk_res.start,
-					     resource_size(&crashk_res));
-		}
+	if (crashk_res.end) {
+		__map_memblock(pgdp, crashk_res.start,
+			       crashk_res.end + 1,
+			       PAGE_KERNEL,
+			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
+		memblock_clear_nomap(crashk_res.start,
+				     resource_size(&crashk_res));
 	}
 #endif
 }
-- 
2.25.1



* [PATCH 4/5] arm64: kdump: Decide when to reserve crash memory in reserve_crashkernel()
  2022-06-13  8:09 [PATCH 0/5] arm64: kdump: Function supplement and performance optimization Zhen Lei
                   ` (2 preceding siblings ...)
  2022-06-13  8:09 ` [PATCH 3/5] arm64: kdump: Remove some redundant checks in map_mem() Zhen Lei
@ 2022-06-13  8:09 ` Zhen Lei
  2022-06-13  8:09 ` [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory Zhen Lei
  4 siblings, 0 replies; 26+ messages in thread
From: Zhen Lei @ 2022-06-13  8:09 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Baoquan He, Vivek Goyal, kexec,
	linux-kernel, Catalin Marinas, Will Deacon, linux-arm-kernel,
	Jonathan Corbet, linux-doc
  Cc: Zhen Lei, Randy Dunlap, Feng Zhou, Kefeng Wang, Chen Zhou,
	John Donnelly, Dave Kleikamp

After kexec finishes loading data, the crash memory must be made
inaccessible, to prevent the current kernel from damaging the data of the
crash kernel. But on some platforms the DMA zone boundaries are not known
until the dtb or ACPI tables are parsed, and by then the linear mapping has
already been created, so all of it is forced to use page-level mappings.

To optimize system performance (reduce the TLB miss rate) when
crashkernel=X,high is used, the reservation of crash memory is divided into
two phases: reserve the crash high memory before paging_init() is called,
and the crash low memory after it. Page mapping is then performed only for
the crash high memory.

Commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for
platforms with no DMA memory zones") caused reserve_crashkernel() to be
called in two places: before or after paging_init(), controlled by whether
CONFIG_ZONE_DMA/DMA32 is enabled. Move that control into
reserve_crashkernel() itself, preparing for the optimizations mentioned
above.
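
A hedged sketch of the resulting call sequence (pseudocode in the style of
patch 3's commit message, derived from the hunks below):

	arm64_memblock_init()
		reserve_crashkernel(DMA_PHYS_LIMIT_UNKNOWN)
			// reserves only if CONFIG_ZONE_DMA/DMA32 is disabled
	paging_init()
		map_mem()
	bootmem_init()
		reserve_crashkernel(DMA_PHYS_LIMIT_KNOWN)
			// reserves only if CONFIG_ZONE_DMA/DMA32 is enabled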

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
 arch/arm64/mm/init.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 8539598f9e58b4d..fb24efbc46f5ef4 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -90,6 +90,9 @@ phys_addr_t __ro_after_init arm64_dma_phys_limit;
 phys_addr_t __ro_after_init arm64_dma_phys_limit = PHYS_MASK + 1;
 #endif
 
+#define DMA_PHYS_LIMIT_UNKNOWN		0
+#define DMA_PHYS_LIMIT_KNOWN		1
+
 /* Current arm64 boot protocol requires 2MB alignment */
 #define CRASH_ALIGN			SZ_2M
 
@@ -131,18 +134,23 @@ static int __init reserve_crashkernel_low(unsigned long long low_size)
  * line parameter. The memory reserved is used by dump capture kernel when
  * primary kernel is crashing.
  */
-static void __init reserve_crashkernel(void)
+static void __init reserve_crashkernel(int dma_state)
 {
 	unsigned long long crash_base, crash_size;
 	unsigned long long crash_low_size = 0;
 	unsigned long long crash_max = CRASH_ADDR_LOW_MAX;
 	char *cmdline = boot_command_line;
+	int dma_enabled = IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32);
 	int ret;
 	bool fixed_base;
 
 	if (!IS_ENABLED(CONFIG_KEXEC_CORE))
 		return;
 
+	if ((!dma_enabled && (dma_state != DMA_PHYS_LIMIT_UNKNOWN)) ||
+	     (dma_enabled && (dma_state != DMA_PHYS_LIMIT_KNOWN)))
+		return;
+
 	/* crashkernel=X[@offset] */
 	ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
 				&crash_size, &crash_base);
@@ -413,8 +421,7 @@ void __init arm64_memblock_init(void)
 
 	early_init_fdt_scan_reserved_mem();
 
-	if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32))
-		reserve_crashkernel();
+	reserve_crashkernel(DMA_PHYS_LIMIT_UNKNOWN);
 
 	high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
 }
@@ -462,8 +469,7 @@ void __init bootmem_init(void)
 	 * request_standard_resources() depends on crashkernel's memory being
 	 * reserved, so do it here.
 	 */
-	if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32))
-		reserve_crashkernel();
+	reserve_crashkernel(DMA_PHYS_LIMIT_KNOWN);
 
 	memblock_dump_all();
 }
-- 
2.25.1



* [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory
  2022-06-13  8:09 [PATCH 0/5] arm64: kdump: Function supplement and performance optimization Zhen Lei
                   ` (3 preceding siblings ...)
  2022-06-13  8:09 ` [PATCH 4/5] arm64: kdump: Decide when to reserve crash memory in reserve_crashkernel() Zhen Lei
@ 2022-06-13  8:09 ` Zhen Lei
  2022-06-21  5:33   ` Baoquan He
  4 siblings, 1 reply; 26+ messages in thread
From: Zhen Lei @ 2022-06-13  8:09 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Baoquan He, Vivek Goyal, kexec,
	linux-kernel, Catalin Marinas, Will Deacon, linux-arm-kernel,
	Jonathan Corbet, linux-doc
  Cc: Zhen Lei, Randy Dunlap, Feng Zhou, Kefeng Wang, Chen Zhou,
	John Donnelly, Dave Kleikamp

If the crashkernel has both high memory above the DMA zones and low memory
in the DMA zones, kexec always loads content such as the Image and dtb into
the high memory, not the low memory. This means that only the high memory
requires write protection based on page-level mapping. The allocation of
high memory does not depend on the DMA boundary, so we can reserve the high
memory first even when the crashkernel reservation is deferred.

Block mapping can then still be used for the rest of the kernel linear
address space, reducing the TLB miss rate and improving system performance.
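
A hedged sketch of the resulting linear-map treatment with crashkernel=X,high
(region order illustrative):

	[crash low region, in DMA zones]  block mapped; no write protection
	                                  needed, kexec loads no segments here
	[crash high region]               page mapped; write-protected after
	                                  the kexec load
	[rest of RAM]                     block mapped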

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
 arch/arm64/mm/init.c | 71 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 65 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index fb24efbc46f5ef4..ae0bae2cafe6ab0 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -141,15 +141,44 @@ static void __init reserve_crashkernel(int dma_state)
 	unsigned long long crash_max = CRASH_ADDR_LOW_MAX;
 	char *cmdline = boot_command_line;
 	int dma_enabled = IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32);
-	int ret;
+	int ret, skip_res = 0, skip_low_res = 0;
 	bool fixed_base;
 
 	if (!IS_ENABLED(CONFIG_KEXEC_CORE))
 		return;
 
-	if ((!dma_enabled && (dma_state != DMA_PHYS_LIMIT_UNKNOWN)) ||
-	     (dma_enabled && (dma_state != DMA_PHYS_LIMIT_KNOWN)))
-		return;
+	/*
+	 * In the following table:
+	 * X,high  means crashkernel=X,high
+	 * unknown means dma_state = DMA_PHYS_LIMIT_UNKNOWN
+	 * known   means dma_state = DMA_PHYS_LIMIT_KNOWN
+	 *
+	 * The first two columns indicate the status, and the last two
+	 * columns indicate the phase in which crash high or low memory
+	 * needs to be reserved.
+	 *  ---------------------------------------------------
+	 * | DMA enabled | X,high used |  unknown  |   known   |
+	 *  ---------------------------------------------------
+	 * |      N            N       |    low    |    NOP    |
+	 * |      Y            N       |    NOP    |    low    |
+	 * |      N            Y       |  high/low |    NOP    |
+	 * |      Y            Y       |    high   |    low    |
+	 *  ---------------------------------------------------
+	 *
+	 * But in this function, the crash high memory allocation of
+	 * crashkernel=Y,high and the crash low memory allocation of
+	 * crashkernel=X[@offset] for crashk_res are mixed at one place.
+	 * So the table above need to be adjusted as below:
+	 *  ---------------------------------------------------
+	 * | DMA enabled | X,high used |  unknown  |   known   |
+	 *  ---------------------------------------------------
+	 * |      N            N       |    res    |    NOP    |
+	 * |      Y            N       |    NOP    |    res    |
+	 * |      N            Y       |res/low_res|    NOP    |
+	 * |      Y            Y       |    res    |  low_res  |
+	 *  ---------------------------------------------------
+	 *
+	 */
 
 	/* crashkernel=X[@offset] */
 	ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
@@ -169,10 +198,33 @@ static void __init reserve_crashkernel(int dma_state)
 		else if (ret)
 			return;
 
+		/* See the third row of the second table above, NOP */
+		if (!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN))
+			return;
+
+		/* See the fourth row of the second table above */
+		if (dma_enabled) {
+			if (dma_state == DMA_PHYS_LIMIT_UNKNOWN)
+				skip_low_res = 1;
+			else
+				skip_res = 1;
+		}
+
 		crash_max = CRASH_ADDR_HIGH_MAX;
 	} else if (ret || !crash_size) {
 		/* The specified value is invalid */
 		return;
+	} else {
+		/* See the 1-2 rows of the second table above, NOP */
+		if ((!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN)) ||
+		     (dma_enabled && (dma_state == DMA_PHYS_LIMIT_UNKNOWN)))
+			return;
+	}
+
+	if (skip_res) {
+		crash_base = crashk_res.start;
+		crash_size = crashk_res.end - crashk_res.start + 1;
+		goto check_low;
 	}
 
 	fixed_base = !!crash_base;
@@ -202,9 +254,18 @@ static void __init reserve_crashkernel(int dma_state)
 		return;
 	}
 
+	crashk_res.start = crash_base;
+	crashk_res.end = crash_base + crash_size - 1;
+
+check_low:
+	if (skip_low_res)
+		return;
+
 	if ((crash_base >= CRASH_ADDR_LOW_MAX) &&
 	     crash_low_size && reserve_crashkernel_low(crash_low_size)) {
 		memblock_phys_free(crash_base, crash_size);
+		crashk_res.start = 0;
+		crashk_res.end = 0;
 		return;
 	}
 
@@ -219,8 +280,6 @@ static void __init reserve_crashkernel(int dma_state)
 	if (crashk_low_res.end)
 		kmemleak_ignore_phys(crashk_low_res.start);
 
-	crashk_res.start = crash_base;
-	crashk_res.end = crash_base + crash_size - 1;
 	insert_resource(&iomem_resource, &crashk_res);
 }
 
-- 
2.25.1



* Re: [PATCH 1/5] arm64: kdump: Provide default size when crashkernel=Y,low is not specified
  2022-06-13  8:09 ` [PATCH 1/5] arm64: kdump: Provide default size when crashkernel=Y,low is not specified Zhen Lei
@ 2022-06-17  2:40   ` Baoquan He
  2022-06-17  7:39     ` Leizhen (ThunderTown)
  2022-06-17  8:26   ` Baoquan He
  1 sibling, 1 reply; 26+ messages in thread
From: Baoquan He @ 2022-06-17  2:40 UTC (permalink / raw)
  To: Zhen Lei
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Catalin Marinas, Will Deacon, linux-arm-kernel, Jonathan Corbet,
	linux-doc, Randy Dunlap, Feng Zhou, Kefeng Wang, Chen Zhou,
	John Donnelly, Dave Kleikamp

On 06/13/22 at 04:09pm, Zhen Lei wrote:
> To be consistent with the x86 implementation and improve the cross-platform
> user experience, try to allocate at least 256 MiB of low memory automatically
> when crashkernel=Y,low is not specified.

This should correspond to the case where crashkernel=,high is explicitly
specified while crashkernel=,low is omitted. It would be better to mention
that explicitly.

Otherwise, this looks good to me.

> 
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  8 +-------
>  arch/arm64/mm/init.c                            | 12 +++++++++++-
>  2 files changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 8090130b544b070..61b179232b68001 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -843,7 +843,7 @@
>  			available.
>  			It will be ignored if crashkernel=X is specified.
>  	crashkernel=size[KMG],low
> -			[KNL, X86-64] range under 4G. When crashkernel=X,high
> +			[KNL, X86-64, ARM64] range under 4G. When crashkernel=X,high
                        ~~~~ exceeds 80 characters, but it should be OK.

>  			is passed, kernel could allocate physical memory region
>  			above 4G, that cause second kernel crash on system
>  			that require some amount of low memory, e.g. swiotlb
> @@ -857,12 +857,6 @@
>  			It will be ignored when crashkernel=X,high is not used
>  			or memory reserved is below 4G.
>  
> -			[KNL, ARM64] range in low memory.
> -			This one lets the user specify a low range in the
> -			DMA zone for the crash dump kernel.
> -			It will be ignored when crashkernel=X,high is not used
> -			or memory reserved is located in the DMA zones.
> -
>  	cryptomgr.notests
>  			[KNL] Disable crypto self-tests
>  
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 339ee84e5a61a0b..5390f361208ccf7 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -96,6 +96,14 @@ phys_addr_t __ro_after_init arm64_dma_phys_limit = PHYS_MASK + 1;
>  #define CRASH_ADDR_LOW_MAX		arm64_dma_phys_limit
>  #define CRASH_ADDR_HIGH_MAX		(PHYS_MASK + 1)
>  
> +/*
> + * This is an empirical value in x86_64 and taken here directly. Please
> + * refer to the code comment in reserve_crashkernel_low() of x86_64 for more
> + * details.
> + */
> +#define DEFAULT_CRASH_KERNEL_LOW_SIZE	\
> +	max(swiotlb_size_or_default() + (8UL << 20), 256UL << 20)
> +
>  static int __init reserve_crashkernel_low(unsigned long long low_size)
>  {
>  	unsigned long long low_base;
> @@ -147,7 +155,9 @@ static void __init reserve_crashkernel(void)
>  		 * is not allowed.
>  		 */
>  		ret = parse_crashkernel_low(cmdline, 0, &crash_low_size, &crash_base);
> -		if (ret && (ret != -ENOENT))
> +		if (ret == -ENOENT)
> +			crash_low_size = DEFAULT_CRASH_KERNEL_LOW_SIZE;
> +		else if (ret)
>  			return;
>  
>  		crash_max = CRASH_ADDR_HIGH_MAX;
> -- 
> 2.25.1
> 



* Re: [PATCH 2/5] arm64: kdump: Support crashkernel=X fall back to reserve region above DMA zones
  2022-06-13  8:09 ` [PATCH 2/5] arm64: kdump: Support crashkernel=X fall back to reserve region above DMA zones Zhen Lei
@ 2022-06-17  4:16   ` Baoquan He
  0 siblings, 0 replies; 26+ messages in thread
From: Baoquan He @ 2022-06-17  4:16 UTC (permalink / raw)
  To: Zhen Lei
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Catalin Marinas, Will Deacon, linux-arm-kernel, Jonathan Corbet,
	linux-doc, Randy Dunlap, Feng Zhou, Kefeng Wang, Chen Zhou,
	John Donnelly, Dave Kleikamp

On 06/13/22 at 04:09pm, Zhen Lei wrote:
> For crashkernel=X without '@offset', select a region within the DMA zones
> first, and fall back to reserving a region above the DMA zones. This allows
> users to use the same configuration on multiple platforms.

LGTM,

Acked-by: Baoquan He <bhe@redhat.com>

> 
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  2 +-
>  arch/arm64/mm/init.c                            | 16 +++++++++++++++-
>  2 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 61b179232b68001..fdac18beba5624e 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -823,7 +823,7 @@
>  			memory region [offset, offset + size] for that kernel
>  			image. If '@offset' is omitted, then a suitable offset
>  			is selected automatically.
> -			[KNL, X86-64] Select a region under 4G first, and
> +			[KNL, X86-64, ARM64] Select a region under 4G first, and
>  			fall back to reserve region above 4G when '@offset'
>  			hasn't been specified.
>  			See Documentation/admin-guide/kdump/kdump.rst for further details.
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 5390f361208ccf7..8539598f9e58b4d 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -138,6 +138,7 @@ static void __init reserve_crashkernel(void)
>  	unsigned long long crash_max = CRASH_ADDR_LOW_MAX;
>  	char *cmdline = boot_command_line;
>  	int ret;
> +	bool fixed_base;
>  
>  	if (!IS_ENABLED(CONFIG_KEXEC_CORE))
>  		return;
> @@ -166,15 +167,28 @@ static void __init reserve_crashkernel(void)
>  		return;
>  	}
>  
> +	fixed_base = !!crash_base;
>  	crash_size = PAGE_ALIGN(crash_size);
>  
>  	/* User specifies base address explicitly. */
> -	if (crash_base)
> +	if (fixed_base)
>  		crash_max = crash_base + crash_size;
>  
> +retry:
>  	crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
>  					       crash_base, crash_max);
>  	if (!crash_base) {
> +		/*
> +		 * Attempt to fully allocate low memory failed, fall back
> +		 * to high memory, the minimum required low memory will be
> +		 * reserved later.
> +		 */
> +		if (!fixed_base && (crash_max == CRASH_ADDR_LOW_MAX)) {
> +			crash_max = CRASH_ADDR_HIGH_MAX;
> +			crash_low_size = DEFAULT_CRASH_KERNEL_LOW_SIZE;
> +			goto retry;
> +		}
> +
>  		pr_warn("cannot allocate crashkernel (size:0x%llx)\n",
>  			crash_size);
>  		return;
> -- 
> 2.25.1
> 



* Re: [PATCH 1/5] arm64: kdump: Provide default size when crashkernel=Y,low is not specified
  2022-06-17  2:40   ` Baoquan He
@ 2022-06-17  7:39     ` Leizhen (ThunderTown)
  0 siblings, 0 replies; 26+ messages in thread
From: Leizhen (ThunderTown) @ 2022-06-17  7:39 UTC (permalink / raw)
  To: Baoquan He
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Catalin Marinas, Will Deacon, linux-arm-kernel, Jonathan Corbet,
	linux-doc, Randy Dunlap, Feng Zhou, Kefeng Wang, Chen Zhou,
	John Donnelly, Dave Kleikamp



On 2022/6/17 10:40, Baoquan He wrote:
> On 06/13/22 at 04:09pm, Zhen Lei wrote:
>> To be consistent with the x86 implementation and improve the cross-platform
>> user experience, try to allocate at least 256 MiB of low memory automatically
>> when crashkernel=Y,low is not specified.
> 
> This should correspond to the case where crashkernel=,high is explicitly
> specified while crashkernel=,low is omitted. It would be better to mention
> that explicitly.

Okay, I'll update the description in the next version.

> 
> Otherwise, this looks good to me.
> 
>>
>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>> ---
>>  Documentation/admin-guide/kernel-parameters.txt |  8 +-------
>>  arch/arm64/mm/init.c                            | 12 +++++++++++-
>>  2 files changed, 12 insertions(+), 8 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index 8090130b544b070..61b179232b68001 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -843,7 +843,7 @@
>>  			available.
>>  			It will be ignored if crashkernel=X is specified.
>>  	crashkernel=size[KMG],low
>> -			[KNL, X86-64] range under 4G. When crashkernel=X,high
>> +			[KNL, X86-64, ARM64] range under 4G. When crashkernel=X,high
>                         ~~~~ exceeds 80 characters, it should be OK.
> 
>>  			is passed, kernel could allocate physical memory region
>>  			above 4G, that cause second kernel crash on system
>>  			that require some amount of low memory, e.g. swiotlb
>> @@ -857,12 +857,6 @@
>>  			It will be ignored when crashkernel=X,high is not used
>>  			or memory reserved is below 4G.
>>  
>> -			[KNL, ARM64] range in low memory.
>> -			This one lets the user specify a low range in the
>> -			DMA zone for the crash dump kernel.
>> -			It will be ignored when crashkernel=X,high is not used
>> -			or memory reserved is located in the DMA zones.
>> -
>>  	cryptomgr.notests
>>  			[KNL] Disable crypto self-tests
>>  
>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
>> index 339ee84e5a61a0b..5390f361208ccf7 100644
>> --- a/arch/arm64/mm/init.c
>> +++ b/arch/arm64/mm/init.c
>> @@ -96,6 +96,14 @@ phys_addr_t __ro_after_init arm64_dma_phys_limit = PHYS_MASK + 1;
>>  #define CRASH_ADDR_LOW_MAX		arm64_dma_phys_limit
>>  #define CRASH_ADDR_HIGH_MAX		(PHYS_MASK + 1)
>>  
>> +/*
>> + * This is an empirical value in x86_64 and taken here directly. Please
>> + * refer to the code comment in reserve_crashkernel_low() of x86_64 for more
>> + * details.
>> + */
>> +#define DEFAULT_CRASH_KERNEL_LOW_SIZE	\
>> +	max(swiotlb_size_or_default() + (8UL << 20), 256UL << 20)
>> +
>>  static int __init reserve_crashkernel_low(unsigned long long low_size)
>>  {
>>  	unsigned long long low_base;
>> @@ -147,7 +155,9 @@ static void __init reserve_crashkernel(void)
>>  		 * is not allowed.
>>  		 */
>>  		ret = parse_crashkernel_low(cmdline, 0, &crash_low_size, &crash_base);
>> -		if (ret && (ret != -ENOENT))
>> +		if (ret == -ENOENT)
>> +			crash_low_size = DEFAULT_CRASH_KERNEL_LOW_SIZE;
>> +		else if (ret)
>>  			return;
>>  
>>  		crash_max = CRASH_ADDR_HIGH_MAX;
>> -- 
>> 2.25.1
>>
> 
> .
> 

-- 
Regards,
  Zhen Lei


* Re: [PATCH 1/5] arm64: kdump: Provide default size when crashkernel=Y,low is not specified
  2022-06-13  8:09 ` [PATCH 1/5] arm64: kdump: Provide default size when crashkernel=Y,low is not specified Zhen Lei
  2022-06-17  2:40   ` Baoquan He
@ 2022-06-17  8:26   ` Baoquan He
  1 sibling, 0 replies; 26+ messages in thread
From: Baoquan He @ 2022-06-17  8:26 UTC (permalink / raw)
  To: Zhen Lei, msalter, ctatman
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Catalin Marinas, Will Deacon, linux-arm-kernel, Jonathan Corbet,
	linux-doc, Randy Dunlap, Feng Zhou, Kefeng Wang, Chen Zhou,
	John Donnelly, Dave Kleikamp

On 06/13/22 at 04:09pm, Zhen Lei wrote:
> To be consistent with the x86 implementation and improve the cross-platform
> user experience, try to allocate at least 256 MiB of low memory automatically
> when crashkernel=Y,low is not specified.
> 
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  8 +-------
>  arch/arm64/mm/init.c                            | 12 +++++++++++-
>  2 files changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 8090130b544b070..61b179232b68001 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -843,7 +843,7 @@
>  			available.
>  			It will be ignored if crashkernel=X is specified.
>  	crashkernel=size[KMG],low
> -			[KNL, X86-64] range under 4G. When crashkernel=X,high
> +			[KNL, X86-64, ARM64] range under 4G. When crashkernel=X,high
>  			is passed, kernel could allocate physical memory region
>  			above 4G, that cause second kernel crash on system
>  			that require some amount of low memory, e.g. swiotlb
> @@ -857,12 +857,6 @@
>  			It will be ignored when crashkernel=X,high is not used
>  			or memory reserved is below 4G.
>  
> -			[KNL, ARM64] range in low memory.
> -			This one lets the user specify a low range in the
> -			DMA zone for the crash dump kernel.
> -			It will be ignored when crashkernel=X,high is not used
> -			or memory reserved is located in the DMA zones.
> -
>  	cryptomgr.notests
>  			[KNL] Disable crypto self-tests
>  
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 339ee84e5a61a0b..5390f361208ccf7 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -96,6 +96,14 @@ phys_addr_t __ro_after_init arm64_dma_phys_limit = PHYS_MASK + 1;
>  #define CRASH_ADDR_LOW_MAX		arm64_dma_phys_limit
>  #define CRASH_ADDR_HIGH_MAX		(PHYS_MASK + 1)
>  
> +/*
> + * This is an empirical value in x86_64 and taken here directly. Please
> + * refer to the code comment in reserve_crashkernel_low() of x86_64 for more
> + * details.
> + */
> +#define DEFAULT_CRASH_KERNEL_LOW_SIZE	\
> +	max(swiotlb_size_or_default() + (8UL << 20), 256UL << 20)

About this default low value, 256M, I am not sure whether it should be
lowered. We have Ampere Mt-Jade systems at Red Hat whose biggest contiguous
region under low 4G is less than 256M when the firmware's 32bit option is
disabled. Obviously that will fail the default crashkernel,low reservation.

I am not sure how common it is for the 32bit option to be disabled. If it's
an important feature and widely used, it needs to be taken into consideration
when deciding this default crashkernel,low value. Otherwise, crashkernel=xM
won't work on an Ampere Mt-Jade system with the 32bit option disabled, and
people will need to specify crashkernel=xM,high and crashkernel=yM,low
explicitly. Omitting crashkernel,low is also not allowed in that case.

Hi Mark and Christopher,

Adding you to CC. If you happen to know a contact person from Ampere, please
also feel free to add them to this thread.

Thanks
Baoquan

>  static int __init reserve_crashkernel_low(unsigned long long low_size)
>  {
>  	unsigned long long low_base;
> @@ -147,7 +155,9 @@ static void __init reserve_crashkernel(void)
>  		 * is not allowed.
>  		 */
>  		ret = parse_crashkernel_low(cmdline, 0, &crash_low_size, &crash_base);
> -		if (ret && (ret != -ENOENT))
> +		if (ret == -ENOENT)
> +			crash_low_size = DEFAULT_CRASH_KERNEL_LOW_SIZE;
> +		else if (ret)
>  			return;
>  
>  		crash_max = CRASH_ADDR_HIGH_MAX;
> -- 
> 2.25.1
> 



* Re: [PATCH 3/5] arm64: kdump: Remove some redundant checks in map_mem()
  2022-06-13  8:09 ` [PATCH 3/5] arm64: kdump: Remove some redundant checks in map_mem() Zhen Lei
@ 2022-06-20  7:42   ` Baoquan He
  0 siblings, 0 replies; 26+ messages in thread
From: Baoquan He @ 2022-06-20  7:42 UTC (permalink / raw)
  To: Zhen Lei
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Catalin Marinas, Will Deacon, linux-arm-kernel, Jonathan Corbet,
	linux-doc, Randy Dunlap, Feng Zhou, Kefeng Wang, Chen Zhou,
	John Donnelly, Dave Kleikamp

On 06/13/22 at 04:09pm, Zhen Lei wrote:
> arm64_memblock_init()
> 	if (!IS_ENABLED(CONFIG_ZONE_DMA/DMA32))
> 		reserve_crashkernel()
> 			//initialize crashk_res when
> 			//"crashkernel=" is correctly specified
> paging_init()
> 	map_mem()
> 
> As shown in the pseudocode above, crashk_res.end can only be initialized to
> a non-zero value when both "!IS_ENABLED(CONFIG_ZONE_DMA/DMA32)" and
> crash_mem_map are true. So some checks in map_mem() can be adjusted or
> optimized.

LGTM,

Acked-by: Baoquan He <bhe@redhat.com>

> 
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  arch/arm64/mm/mmu.c | 25 +++++++++++--------------
>  1 file changed, 11 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 626ec32873c6c36..6028a5757e4eae2 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -529,12 +529,12 @@ static void __init map_mem(pgd_t *pgdp)
>  
>  #ifdef CONFIG_KEXEC_CORE
>  	if (crash_mem_map) {
> -		if (IS_ENABLED(CONFIG_ZONE_DMA) ||
> -		    IS_ENABLED(CONFIG_ZONE_DMA32))
> -			flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> -		else if (crashk_res.end)
> +		if (crashk_res.end)
>  			memblock_mark_nomap(crashk_res.start,
>  			    resource_size(&crashk_res));
> +		else if (IS_ENABLED(CONFIG_ZONE_DMA) ||
> +			 IS_ENABLED(CONFIG_ZONE_DMA32))
> +			flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>  	}
>  #endif
>  
> @@ -571,16 +571,13 @@ static void __init map_mem(pgd_t *pgdp)
>  	 * through /sys/kernel/kexec_crash_size interface.
>  	 */
>  #ifdef CONFIG_KEXEC_CORE
> -	if (crash_mem_map &&
> -	    !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) {
> -		if (crashk_res.end) {
> -			__map_memblock(pgdp, crashk_res.start,
> -				       crashk_res.end + 1,
> -				       PAGE_KERNEL,
> -				       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> -			memblock_clear_nomap(crashk_res.start,
> -					     resource_size(&crashk_res));
> -		}
> +	if (crashk_res.end) {
> +		__map_memblock(pgdp, crashk_res.start,
> +			       crashk_res.end + 1,
> +			       PAGE_KERNEL,
> +			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> +		memblock_clear_nomap(crashk_res.start,
> +				     resource_size(&crashk_res));
>  	}
>  #endif
>  }
> -- 
> 2.25.1
> 



* Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory
  2022-06-13  8:09 ` [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory Zhen Lei
@ 2022-06-21  5:33   ` Baoquan He
  2022-06-21  6:24     ` Kefeng Wang
  2022-06-21  7:56     ` Leizhen (ThunderTown)
  0 siblings, 2 replies; 26+ messages in thread
From: Baoquan He @ 2022-06-21  5:33 UTC (permalink / raw)
  To: Zhen Lei, Catalin Marinas, Ard Biesheuvel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Will Deacon, linux-arm-kernel, Jonathan Corbet, linux-doc,
	Randy Dunlap, Feng Zhou, Kefeng Wang, Chen Zhou, John Donnelly,
	Dave Kleikamp

Hi,

On 06/13/22 at 04:09pm, Zhen Lei wrote:
> If the crashkernel has both high memory above the DMA zones and low memory
> in the DMA zones, kexec always loads content such as the Image and dtb into
> the high memory, not the low memory. This means that only the high memory
> requires write protection based on page-level mapping. The allocation of
> high memory does not depend on the DMA boundary, so we can reserve the high
> memory first even when the crashkernel reservation is deferred.
> 
> Block mapping can then still be used for the rest of the kernel linear
> address space, reducing the TLB miss rate and improving system performance.

Ugh, this looks a little ugly, honestly.

If it's certain that arm64 can't split large page mappings of the linear
region, this patch is one way to optimize the linear mapping. Given that
kdump setup is necessary on arm64 servers, the boot speed is truly
impacted heavily.

However, I would suggest leaving it as is, for the reasons below:

1) The code will complicate the crashkernel reservation code, which is
already difficult to understand.
2) It can only optimize two cases: the first is CONFIG_ZONE_DMA|DMA32
  disabled, the other is crashkernel=,high being specified. Both are
  corner cases; most systems have CONFIG_ZONE_DMA|DMA32 enabled, and most
  systems use crashkernel=xM, which is enough. Optimizing these cases
  won't bring a benefit to most systems.
3) Besides, crashkernel=,high can be handled earlier because arm64 always
  has memblock.bottom_up == false currently, thus we don't need to worry
  about the lower limit of the crashkernel,high reservation for now. If
  memblock.bottom_up is set to true in the future, this patch won't work
  any more.


...
        crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
                                               crash_base, crash_max);

So, in my opinion, we can leave the current NON_BLOCK|SECT mapping caused by
crashkernel reserving as is, since it brings no regression. Meanwhile, we
can check whether there's any way to make the contiguous linear mapping and
later splitting work. Patches 4 and 5 in this patchset don't make much sense
to me, frankly speaking.

Thanks
Baoquan

> 
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  arch/arm64/mm/init.c | 71 ++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 65 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index fb24efbc46f5ef4..ae0bae2cafe6ab0 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -141,15 +141,44 @@ static void __init reserve_crashkernel(int dma_state)
>  	unsigned long long crash_max = CRASH_ADDR_LOW_MAX;
>  	char *cmdline = boot_command_line;
>  	int dma_enabled = IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32);
> -	int ret;
> +	int ret, skip_res = 0, skip_low_res = 0;
>  	bool fixed_base;
>  
>  	if (!IS_ENABLED(CONFIG_KEXEC_CORE))
>  		return;
>  
> -	if ((!dma_enabled && (dma_state != DMA_PHYS_LIMIT_UNKNOWN)) ||
> -	     (dma_enabled && (dma_state != DMA_PHYS_LIMIT_KNOWN)))
> -		return;
> +	/*
> +	 * In the following table:
> +	 * X,high  means crashkernel=X,high
> +	 * unknown means dma_state = DMA_PHYS_LIMIT_UNKNOWN
> +	 * known   means dma_state = DMA_PHYS_LIMIT_KNOWN
> +	 *
> +	 * The first two columns indicate the status, and the last two
> +	 * columns indicate the phase in which crash high or low memory
> +	 * needs to be reserved.
> +	 *  ---------------------------------------------------
> +	 * | DMA enabled | X,high used |  unknown  |   known   |
> +	 *  ---------------------------------------------------
> +	 * |      N            N       |    low    |    NOP    |
> +	 * |      Y            N       |    NOP    |    low    |
> +	 * |      N            Y       |  high/low |    NOP    |
> +	 * |      Y            Y       |    high   |    low    |
> +	 *  ---------------------------------------------------
> +	 *
> +	 * But in this function, the crash high memory allocation of
> +	 * crashkernel=Y,high and the crash low memory allocation of
> +	 * crashkernel=X[@offset] for crashk_res are mixed at one place.
> +	 * So the table above need to be adjusted as below:
> +	 *  ---------------------------------------------------
> +	 * | DMA enabled | X,high used |  unknown  |   known   |
> +	 *  ---------------------------------------------------
> +	 * |      N            N       |    res    |    NOP    |
> +	 * |      Y            N       |    NOP    |    res    |
> +	 * |      N            Y       |res/low_res|    NOP    |
> +	 * |      Y            Y       |    res    |  low_res  |
> +	 *  ---------------------------------------------------
> +	 *
> +	 */
>  
>  	/* crashkernel=X[@offset] */
>  	ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
> @@ -169,10 +198,33 @@ static void __init reserve_crashkernel(int dma_state)
>  		else if (ret)
>  			return;
>  
> +		/* See the third row of the second table above, NOP */
> +		if (!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN))
> +			return;
> +
> +		/* See the fourth row of the second table above */
> +		if (dma_enabled) {
> +			if (dma_state == DMA_PHYS_LIMIT_UNKNOWN)
> +				skip_low_res = 1;
> +			else
> +				skip_res = 1;
> +		}
> +
>  		crash_max = CRASH_ADDR_HIGH_MAX;
>  	} else if (ret || !crash_size) {
>  		/* The specified value is invalid */
>  		return;
> +	} else {
> +		/* See the 1-2 rows of the second table above, NOP */
> +		if ((!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN)) ||
> +		     (dma_enabled && (dma_state == DMA_PHYS_LIMIT_UNKNOWN)))
> +			return;
> +	}
> +
> +	if (skip_res) {
> +		crash_base = crashk_res.start;
> +		crash_size = crashk_res.end - crashk_res.start + 1;
> +		goto check_low;
>  	}
>  
>  	fixed_base = !!crash_base;
> @@ -202,9 +254,18 @@ static void __init reserve_crashkernel(int dma_state)
>  		return;
>  	}
>  
> +	crashk_res.start = crash_base;
> +	crashk_res.end = crash_base + crash_size - 1;
> +
> +check_low:
> +	if (skip_low_res)
> +		return;
> +
>  	if ((crash_base >= CRASH_ADDR_LOW_MAX) &&
>  	     crash_low_size && reserve_crashkernel_low(crash_low_size)) {
>  		memblock_phys_free(crash_base, crash_size);
> +		crashk_res.start = 0;
> +		crashk_res.end = 0;
>  		return;
>  	}
>  
> @@ -219,8 +280,6 @@ static void __init reserve_crashkernel(int dma_state)
>  	if (crashk_low_res.end)
>  		kmemleak_ignore_phys(crashk_low_res.start);
>  
> -	crashk_res.start = crash_base;
> -	crashk_res.end = crash_base + crash_size - 1;
>  	insert_resource(&iomem_resource, &crashk_res);
>  }
>  
> -- 
> 2.25.1
> 



* Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory
  2022-06-21  5:33   ` Baoquan He
@ 2022-06-21  6:24     ` Kefeng Wang
  2022-06-21  9:27       ` Baoquan He
  2022-06-21 18:04       ` Catalin Marinas
  2022-06-21  7:56     ` Leizhen (ThunderTown)
  1 sibling, 2 replies; 26+ messages in thread
From: Kefeng Wang @ 2022-06-21  6:24 UTC (permalink / raw)
  To: Baoquan He, Zhen Lei, Catalin Marinas, Ard Biesheuvel, Mark Rutland
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Will Deacon, linux-arm-kernel, Jonathan Corbet, linux-doc,
	Randy Dunlap, Feng Zhou, Chen Zhou, John Donnelly, Dave Kleikamp,
	liushixin


On 2022/6/21 13:33, Baoquan He wrote:
> Hi,
>
> On 06/13/22 at 04:09pm, Zhen Lei wrote:
>> If the crashkernel has both high memory above the DMA zones and low memory
>> in the DMA zones, kexec always loads content such as the Image and dtb into
>> the high memory, not the low memory. This means that only the high memory
>> requires write protection based on page-level mapping. The allocation of
>> high memory does not depend on the DMA boundary, so we can reserve the high
>> memory first even when the crashkernel reservation is deferred.
>>
>> Block mapping can then still be used for the rest of the kernel linear
>> address space, reducing the TLB miss rate and improving system performance.
> Ugh, this looks a little ugly, honestly.
>
> If it's certain that arm64 can't split large page mappings of the linear
> region, this patch is one way to optimize the linear mapping. Given that
> kdump setup is necessary on arm64 servers, the boot speed is truly
> impacted heavily.

Is there some conclusion or discussion showing that arm64 can't split large
page mappings?

Could the crashkernel reservation (and the KFENCE pool) be split dynamically?

I found Mark's reply in "arm64: remove page granularity limitation from
KFENCE"[1]:

   "We also avoid live changes from block<->table mappings, since the
   archtitecture gives us very weak guarantees there and generally requires
   a Break-Before-Make sequence (though IIRC this was tightened up
   somewhat, so maybe going one way is supposed to work). Unless it's
   really necessary, I'd rather not split these block mappings while
   they're live."

Hi Mark and Catalin, could you give some comments? Many thanks.

[1] https://lore.kernel.org/lkml/20210920101938.GA13863@C02TD0UTHF1T.local/T/#m1a7f974593f5545cbcfc0d21560df4e7926b1381


>
> However, I would suggest leaving it as is, for the reasons below:
>
> 1) The code will complicate the crashkernel reservation code, which is
> already difficult to understand.
> 2) It can only optimize two cases: the first is CONFIG_ZONE_DMA|DMA32
>    disabled, the other is crashkernel=,high being specified. Both are
>    corner cases; most systems have CONFIG_ZONE_DMA|DMA32 enabled, and most
>    systems use crashkernel=xM, which is enough. Optimizing these cases
>    won't bring a benefit to most systems.
> 3) Besides, crashkernel=,high can be handled earlier because arm64 always
>    has memblock.bottom_up == false currently, thus we don't need to worry
>    about the lower limit of the crashkernel,high reservation for now. If
>    memblock.bottom_up is set to true in the future, this patch won't work
>    any more.
>
>
> ...
>          crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
>                                                 crash_base, crash_max);
>
> So, in my opinion, we can leave the current NON_BLOCK|SECT mapping caused
> by crashkernel reserving as is, since it brings no regression. Meanwhile,
> we can check whether there's any way to make the contiguous linear mapping
> and later splitting work. Patches 4 and 5 in this patchset don't make much
> sense to me, frankly speaking.
>
> Thanks
> Baoquan


* Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory
  2022-06-21  5:33   ` Baoquan He
  2022-06-21  6:24     ` Kefeng Wang
@ 2022-06-21  7:56     ` Leizhen (ThunderTown)
  2022-06-21  9:35       ` Baoquan He
  1 sibling, 1 reply; 26+ messages in thread
From: Leizhen (ThunderTown) @ 2022-06-21  7:56 UTC (permalink / raw)
  To: Baoquan He, Catalin Marinas, Ard Biesheuvel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Will Deacon, linux-arm-kernel, Jonathan Corbet, linux-doc,
	Randy Dunlap, Feng Zhou, Kefeng Wang, Chen Zhou, John Donnelly,
	Dave Kleikamp



On 2022/6/21 13:33, Baoquan He wrote:
> Hi,
> 
> On 06/13/22 at 04:09pm, Zhen Lei wrote:
>> If the crashkernel has both high memory above the DMA zones and low memory
>> in the DMA zones, kexec always loads content such as the Image and dtb into
>> the high memory, not the low memory. This means that only the high memory
>> requires write protection based on page-level mapping. The allocation of
>> high memory does not depend on the DMA boundary, so we can reserve the high
>> memory first even when the crashkernel reservation is deferred.
>>
>> Block mapping can then still be used for the rest of the kernel linear
>> address space, reducing the TLB miss rate and improving system performance.
> 
> Ugh, this looks a little ugly, honestly.
> 
> If it's certain that arm64 can't split large page mappings of the linear
> region, this patch is one way to optimize the linear mapping. Given that
> kdump setup is necessary on arm64 servers, the boot speed is truly
> impacted heavily.

There is also a performance impact at runtime.

> 
> However, I would suggest leaving it as is, for the reasons below:
> 
> 1) The code will complicate the crashkernel reservation code, which is
> already difficult to understand.

Yeah, I feel it, too.

> 2) It can only optimize two cases: the first is CONFIG_ZONE_DMA|DMA32
>   disabled, the other is crashkernel=,high being specified. Both are
>   corner cases; most systems have CONFIG_ZONE_DMA|DMA32 enabled, and most
>   systems use crashkernel=xM, which is enough. Optimizing these cases
>   won't bring a benefit to most systems.

The case of CONFIG_ZONE_DMA|DMA32 disabled has already been resolved by
commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for platforms
with no DMA memory zones"). The performance problem still to be optimized is
the case where the DMA zones are enabled.


> 3) Besides, crashkernel=,high can be handled earlier because 
>   arm64 always has memblock.bottom_up == false currently, thus we
>   don't need to worry about the lower limit of the crashkernel,high
>   reservation for now. If memblock.bottom_up is set to true in the future,
>   this patch won't work any more.
> 
> 
> ...
>         crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
>                                                crash_base, crash_max);
> 
> So, in my opinion, we can leave the current NON_BLOCK|SECT mapping
> caused by crashkernel reserving as is, since no regression is brought.
> And meanwhile, turn to checking if there's any way to make the contiguous
> linear mapping and later splitting work. Patches 4 and 5 in this patchset
> don't make much sense to me, frankly speaking.

OK. As discussed earlier, I can rethink whether there is a better way for patches 4-5,
and this time focus on patches 1-2. In this way, all the functions are complete,
and only optimization is left.

> 
> Thanks
> Baoquan
> 
>>
>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>> ---
>>  arch/arm64/mm/init.c | 71 ++++++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 65 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
>> index fb24efbc46f5ef4..ae0bae2cafe6ab0 100644
>> --- a/arch/arm64/mm/init.c
>> +++ b/arch/arm64/mm/init.c
>> @@ -141,15 +141,44 @@ static void __init reserve_crashkernel(int dma_state)
>>  	unsigned long long crash_max = CRASH_ADDR_LOW_MAX;
>>  	char *cmdline = boot_command_line;
>>  	int dma_enabled = IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32);
>> -	int ret;
>> +	int ret, skip_res = 0, skip_low_res = 0;
>>  	bool fixed_base;
>>  
>>  	if (!IS_ENABLED(CONFIG_KEXEC_CORE))
>>  		return;
>>  
>> -	if ((!dma_enabled && (dma_state != DMA_PHYS_LIMIT_UNKNOWN)) ||
>> -	     (dma_enabled && (dma_state != DMA_PHYS_LIMIT_KNOWN)))
>> -		return;
>> +	/*
>> +	 * In the following table:
>> +	 * X,high  means crashkernel=X,high
>> +	 * unknown means dma_state = DMA_PHYS_LIMIT_UNKNOWN
>> +	 * known   means dma_state = DMA_PHYS_LIMIT_KNOWN
>> +	 *
>> +	 * The first two columns indicate the status, and the last two
>> +	 * columns indicate the phase in which crash high or low memory
>> +	 * needs to be reserved.
>> +	 *  ---------------------------------------------------
>> +	 * | DMA enabled | X,high used |  unknown  |   known   |
>> +	 *  ---------------------------------------------------
>> +	 * |      N            N       |    low    |    NOP    |
>> +	 * |      Y            N       |    NOP    |    low    |
>> +	 * |      N            Y       |  high/low |    NOP    |
>> +	 * |      Y            Y       |    high   |    low    |
>> +	 *  ---------------------------------------------------
>> +	 *
>> +	 * But in this function, the crash high memory allocation of
>> +	 * crashkernel=Y,high and the crash low memory allocation of
>> +	 * crashkernel=X[@offset] for crashk_res are mixed at one place.
>> +	 * So the table above need to be adjusted as below:
>> +	 *  ---------------------------------------------------
>> +	 * | DMA enabled | X,high used |  unknown  |   known   |
>> +	 *  ---------------------------------------------------
>> +	 * |      N            N       |    res    |    NOP    |
>> +	 * |      Y            N       |    NOP    |    res    |
>> +	 * |      N            Y       |res/low_res|    NOP    |
>> +	 * |      Y            Y       |    res    |  low_res  |
>> +	 *  ---------------------------------------------------
>> +	 *
>> +	 */
>>  
>>  	/* crashkernel=X[@offset] */
>>  	ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
>> @@ -169,10 +198,33 @@ static void __init reserve_crashkernel(int dma_state)
>>  		else if (ret)
>>  			return;
>>  
>> +		/* See the third row of the second table above, NOP */
>> +		if (!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN))
>> +			return;
>> +
>> +		/* See the fourth row of the second table above */
>> +		if (dma_enabled) {
>> +			if (dma_state == DMA_PHYS_LIMIT_UNKNOWN)
>> +				skip_low_res = 1;
>> +			else
>> +				skip_res = 1;
>> +		}
>> +
>>  		crash_max = CRASH_ADDR_HIGH_MAX;
>>  	} else if (ret || !crash_size) {
>>  		/* The specified value is invalid */
>>  		return;
>> +	} else {
>> +		/* See the 1-2 rows of the second table above, NOP */
>> +		if ((!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN)) ||
>> +		     (dma_enabled && (dma_state == DMA_PHYS_LIMIT_UNKNOWN)))
>> +			return;
>> +	}
>> +
>> +	if (skip_res) {
>> +		crash_base = crashk_res.start;
>> +		crash_size = crashk_res.end - crashk_res.start + 1;
>> +		goto check_low;
>>  	}
>>  
>>  	fixed_base = !!crash_base;
>> @@ -202,9 +254,18 @@ static void __init reserve_crashkernel(int dma_state)
>>  		return;
>>  	}
>>  
>> +	crashk_res.start = crash_base;
>> +	crashk_res.end = crash_base + crash_size - 1;
>> +
>> +check_low:
>> +	if (skip_low_res)
>> +		return;
>> +
>>  	if ((crash_base >= CRASH_ADDR_LOW_MAX) &&
>>  	     crash_low_size && reserve_crashkernel_low(crash_low_size)) {
>>  		memblock_phys_free(crash_base, crash_size);
>> +		crashk_res.start = 0;
>> +		crashk_res.end = 0;
>>  		return;
>>  	}
>>  
>> @@ -219,8 +280,6 @@ static void __init reserve_crashkernel(int dma_state)
>>  	if (crashk_low_res.end)
>>  		kmemleak_ignore_phys(crashk_low_res.start);
>>  
>> -	crashk_res.start = crash_base;
>> -	crashk_res.end = crash_base + crash_size - 1;
>>  	insert_resource(&iomem_resource, &crashk_res);
>>  }
>>  
>> -- 
>> 2.25.1
>>
> 
> .
> 

-- 
Regards,
  Zhen Lei
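
The tables in the patch above boil down to a two-phase call flow. A minimal
sketch follows; the call sites are assumptions for illustration, and only
reserve_crashkernel() and the dma_state values come from the patch itself:

/* phase 1: early, before the DMA zone limits are computed */
void __init arm64_memblock_init(void)
{
        reserve_crashkernel(DMA_PHYS_LIMIT_UNKNOWN);
}

/* phase 2: after arm64_dma_phys_limit is known */
void __init bootmem_init(void)
{
        reserve_crashkernel(DMA_PHYS_LIMIT_KNOWN);
}

With DMA zones enabled and crashkernel=X,high used (the fourth table row),
the first call reserves the high region and the second adds the low region;
the other rows collapse to a single effective call.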


* Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory
  2022-06-21  6:24     ` Kefeng Wang
@ 2022-06-21  9:27       ` Baoquan He
  2022-06-21 18:04       ` Catalin Marinas
  1 sibling, 0 replies; 26+ messages in thread
From: Baoquan He @ 2022-06-21  9:27 UTC (permalink / raw)
  To: Kefeng Wang
  Cc: Zhen Lei, Catalin Marinas, Ard Biesheuvel, Mark Rutland,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Will Deacon, linux-arm-kernel, Jonathan Corbet, linux-doc,
	Randy Dunlap, Feng Zhou, Chen Zhou, John Donnelly, Dave Kleikamp,
	liushixin

On 06/21/22 at 02:24pm, Kefeng Wang wrote:
> 
> On 2022/6/21 13:33, Baoquan He wrote:
> > Hi,
> > 
> > On 06/13/22 at 04:09pm, Zhen Lei wrote:
> > > If the crashkernel has both high memory above DMA zones and low memory
> > > in DMA zones, kexec always loads the content such as Image and dtb to the
> > > high memory instead of the low memory. This means that only high memory
> > > requires write protection based on page-level mapping. The allocation of
> > > high memory does not depend on the DMA boundary. So we can reserve the
> > > high memory first even if the crashkernel reservation is deferred.
> > > 
> > > This means that the block mapping can still be performed on other kernel
> > > linear address spaces, the TLB miss rate can be reduced and the system
> > > performance will be improved.
> > Ugh, this looks a little ugly, honestly.
> > 
> > If it's for sure that arm64 can't split large page mappings of the linear
> > region, this patch is one way to optimize the linear mapping. Given that
> > kdump setup is necessary on arm64 servers, the booting speed is truly
> > impacted heavily.
> 
> Is there some conclusion or discussion that arm64 can't split large page
> mapping?

Yes, please see the commit log below.
commit d27cfa1fc823 ("arm64: mm: set the contiguous bit for kernel mappings where appropriate")

> 
> Could the crashkernel reservation (and Kfence pool) be split dynamically?

For the crashkernel region, we have arch_kexec_protect_crashkres() to secure
the region, and crash_shrink_memory() could be called to shrink it.
The crashkernel region could cross part of a block mapping or section
mapping, and since that mapping would need to be split, this will cause TLB conflicts.
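
A simplified sketch of that protection (illustrative, not the exact
in-tree code): it invalidates the linear-map pages covering each loaded
segment, which is only possible when the region is mapped at page
granularity:

static void protect_crash_segments(void)
{
        int i;

        /* make the linear-map pages of each crash segment invalid */
        for (i = 0; i < kexec_crash_image->nr_segments; i++)
                set_memory_valid(
                        __phys_to_virt(kexec_crash_image->segment[i].mem),
                        kexec_crash_image->segment[i].memsz >> PAGE_SHIFT, 0);
}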

> 
> I found Mark's reply to "arm64: remove page granularity limitation from
> KFENCE"[1],
> 
>   "We also avoid live changes from block<->table mappings, since the
>   architecture gives us very weak guarantees there and generally requires
>   a Break-Before-Make sequence (though IIRC this was tightened up
>   somewhat, so maybe going one way is supposed to work). Unless it's
>   really necessary, I'd rather not split these block mappings while
>   they're live."
> 
> Hi Mark and Catalin,  could you give some comment,  many thanks.
> 
> [1] https://lore.kernel.org/lkml/20210920101938.GA13863@C02TD0UTHF1T.local/T/#m1a7f974593f5545cbcfc0d21560df4e7926b1381
> 
> 
> > 
> > However, I would suggest leaving it as is, with the reasons below:
> > 
> > 1) It will complicate the crashkernel reservation code, which
> > is already difficult to understand.
> > 2) It can only optimize two cases: the first is CONFIG_ZONE_DMA|DMA32
> >    disabled, the other is crashkernel=,high being specified. Both
> >    cases are corner cases; most systems have CONFIG_ZONE_DMA|DMA32
> >    enabled, and most systems have crashkernel=xM, which is enough.
> >    Having them optimized won't bring benefit to most systems.
> > 3) Besides, crashkernel=,high can be handled earlier because
> >    arm64 always has memblock.bottom_up == false currently, thus we
> >    don't need to worry about the lower limit of the crashkernel,high
> >    reservation for now. If memblock.bottom_up is set to true in the future,
> >    this patch won't work any more.
> > 
> > 
> > ...
> >          crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
> >                                                 crash_base, crash_max);
> > 
> > So, in my opinion, we can leave the current NON_BLOCK|SECT mapping
> > caused by crashkernel reserving as is, since no regression is brought.
> > And meanwhile, turn to checking if there's any way to make the contiguous
> > linear mapping and later splitting work. Patches 4 and 5 in this patchset
> > don't make much sense to me, frankly speaking.
> > 
> > Thanks
> > Baoquan
> 



* Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory
  2022-06-21  7:56     ` Leizhen (ThunderTown)
@ 2022-06-21  9:35       ` Baoquan He
  0 siblings, 0 replies; 26+ messages in thread
From: Baoquan He @ 2022-06-21  9:35 UTC (permalink / raw)
  To: Leizhen (ThunderTown)
  Cc: Catalin Marinas, Ard Biesheuvel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Eric Biederman,
	Rob Herring, Frank Rowand, devicetree, Dave Young, Vivek Goyal,
	kexec, linux-kernel, Will Deacon, linux-arm-kernel,
	Jonathan Corbet, linux-doc, Randy Dunlap, Feng Zhou, Kefeng Wang,
	Chen Zhou, John Donnelly, Dave Kleikamp

On 06/21/22 at 03:56pm, Leizhen (ThunderTown) wrote:
> 
> 
> On 2022/6/21 13:33, Baoquan He wrote:
> > Hi,
> > 
> > On 06/13/22 at 04:09pm, Zhen Lei wrote:
> >> If the crashkernel has both high memory above DMA zones and low memory
> >> in DMA zones, kexec always loads the content such as Image and dtb to the
> >> high memory instead of the low memory. This means that only high memory
> >> requires write protection based on page-level mapping. The allocation of
> >> high memory does not depend on the DMA boundary. So we can reserve the
> >> high memory first even if the crashkernel reservation is deferred.
> >>
> >> This means that the block mapping can still be performed on other kernel
> >> linear address spaces, the TLB miss rate can be reduced and the system
> >> performance will be improved.
> > 
> > Ugh, this looks a little ugly, honestly.
> > 
> > If it's for sure that arm64 can't split large page mappings of the linear
> > region, this patch is one way to optimize the linear mapping. Given that
> > kdump setup is necessary on arm64 servers, the booting speed is truly
> > impacted heavily.
> 
> There is also a performance impact at runtime.

Yes, indeed, TLB flushes will happen more often.

> 
> > 
> > However, I would suggest leaving it as is, with the reasons below:
> > 
> > 1) It will complicate the crashkernel reservation code, which
> > is already difficult to understand. 
> 
> Yeah, I feel it, too.
> 
> > 2) It can only optimize two cases: the first is CONFIG_ZONE_DMA|DMA32
> >   disabled, the other is crashkernel=,high being specified. Both
> >   cases are corner cases; most systems have CONFIG_ZONE_DMA|DMA32
> >   enabled, and most systems have crashkernel=xM, which is enough.
> >   Having them optimized won't bring benefit to most systems.
> 
> The case of CONFIG_ZONE_DMA|DMA32 disabled has been resolved by
> commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for platforms with no DMA memory zones").
> Currently, the performance problem to be optimized is the case where DMA zones are enabled.

Yes, the disabled CONFIG_ZONE_DMA|DMA32 case has avoided the problem since
its boundary is already decided at that time. Crashkernel=,high can also
avoid this, benefitting from the top-down memblock allocation. However,
crashkernel=xM, which now gets the fallback support, is the main syntax
we will use, and that still has the problem.
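
Restated as a sketch, the allocation quoted earlier in the thread shows
why (an illustrative fragment; the constants are the ones used in this
patchset):

/*
 * With memblock.bottom_up == false, the allocator scans from the top of
 * memory, so a crashkernel,high reservation lands above the DMA boundary
 * even before that boundary has been computed.
 */
crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
                                       0, CRASH_ADDR_HIGH_MAX);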

> 
> 
> > 3) Besides, crashkernel=,high can be handled earlier because 
> >   arm64 always has memblock.bottom_up == false currently, thus we
> >   don't need to worry about the lower limit of the crashkernel,high
> >   reservation for now. If memblock.bottom_up is set to true in the future,
> >   this patch won't work any more.
> > 
> > 
> > ...
> >         crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
> >                                                crash_base, crash_max);
> > 
> > So, in my opinion, we can leave the current NON_BLOCK|SECT mapping
> > caused by crashkernel reserving as is, since no regression is brought.
> > And meanwhile, turn to checking if there's any way to make the contiguous
> > linear mapping and later splitting work. Patches 4 and 5 in this patchset
> > don't make much sense to me, frankly speaking.
> 
> OK. As discussed earlier, I can rethink whether there is a better way for patches 4-5,
> and this time focus on patches 1-2. In this way, all the functions are complete,
> and only optimization is left.

Sounds nice, thx.

> > 
> >>
> >> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> >> ---
> >>  arch/arm64/mm/init.c | 71 ++++++++++++++++++++++++++++++++++++++++----
> >>  1 file changed, 65 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> >> index fb24efbc46f5ef4..ae0bae2cafe6ab0 100644
> >> --- a/arch/arm64/mm/init.c
> >> +++ b/arch/arm64/mm/init.c
> >> @@ -141,15 +141,44 @@ static void __init reserve_crashkernel(int dma_state)
> >>  	unsigned long long crash_max = CRASH_ADDR_LOW_MAX;
> >>  	char *cmdline = boot_command_line;
> >>  	int dma_enabled = IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32);
> >> -	int ret;
> >> +	int ret, skip_res = 0, skip_low_res = 0;
> >>  	bool fixed_base;
> >>  
> >>  	if (!IS_ENABLED(CONFIG_KEXEC_CORE))
> >>  		return;
> >>  
> >> -	if ((!dma_enabled && (dma_state != DMA_PHYS_LIMIT_UNKNOWN)) ||
> >> -	     (dma_enabled && (dma_state != DMA_PHYS_LIMIT_KNOWN)))
> >> -		return;
> >> +	/*
> >> +	 * In the following table:
> >> +	 * X,high  means crashkernel=X,high
> >> +	 * unknown means dma_state = DMA_PHYS_LIMIT_UNKNOWN
> >> +	 * known   means dma_state = DMA_PHYS_LIMIT_KNOWN
> >> +	 *
> >> +	 * The first two columns indicate the status, and the last two
> >> +	 * columns indicate the phase in which crash high or low memory
> >> +	 * needs to be reserved.
> >> +	 *  ---------------------------------------------------
> >> +	 * | DMA enabled | X,high used |  unknown  |   known   |
> >> +	 *  ---------------------------------------------------
> >> +	 * |      N            N       |    low    |    NOP    |
> >> +	 * |      Y            N       |    NOP    |    low    |
> >> +	 * |      N            Y       |  high/low |    NOP    |
> >> +	 * |      Y            Y       |    high   |    low    |
> >> +	 *  ---------------------------------------------------
> >> +	 *
> >> +	 * But in this function, the crash high memory allocation of
> >> +	 * crashkernel=Y,high and the crash low memory allocation of
> >> +	 * crashkernel=X[@offset] for crashk_res are mixed at one place.
> >> +	 * So the table above need to be adjusted as below:
> >> +	 *  ---------------------------------------------------
> >> +	 * | DMA enabled | X,high used |  unknown  |   known   |
> >> +	 *  ---------------------------------------------------
> >> +	 * |      N            N       |    res    |    NOP    |
> >> +	 * |      Y            N       |    NOP    |    res    |
> >> +	 * |      N            Y       |res/low_res|    NOP    |
> >> +	 * |      Y            Y       |    res    |  low_res  |
> >> +	 *  ---------------------------------------------------
> >> +	 *
> >> +	 */
> >>  
> >>  	/* crashkernel=X[@offset] */
> >>  	ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
> >> @@ -169,10 +198,33 @@ static void __init reserve_crashkernel(int dma_state)
> >>  		else if (ret)
> >>  			return;
> >>  
> >> +		/* See the third row of the second table above, NOP */
> >> +		if (!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN))
> >> +			return;
> >> +
> >> +		/* See the fourth row of the second table above */
> >> +		if (dma_enabled) {
> >> +			if (dma_state == DMA_PHYS_LIMIT_UNKNOWN)
> >> +				skip_low_res = 1;
> >> +			else
> >> +				skip_res = 1;
> >> +		}
> >> +
> >>  		crash_max = CRASH_ADDR_HIGH_MAX;
> >>  	} else if (ret || !crash_size) {
> >>  		/* The specified value is invalid */
> >>  		return;
> >> +	} else {
> >> +		/* See the 1-2 rows of the second table above, NOP */
> >> +		if ((!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN)) ||
> >> +		     (dma_enabled && (dma_state == DMA_PHYS_LIMIT_UNKNOWN)))
> >> +			return;
> >> +	}
> >> +
> >> +	if (skip_res) {
> >> +		crash_base = crashk_res.start;
> >> +		crash_size = crashk_res.end - crashk_res.start + 1;
> >> +		goto check_low;
> >>  	}
> >>  
> >>  	fixed_base = !!crash_base;
> >> @@ -202,9 +254,18 @@ static void __init reserve_crashkernel(int dma_state)
> >>  		return;
> >>  	}
> >>  
> >> +	crashk_res.start = crash_base;
> >> +	crashk_res.end = crash_base + crash_size - 1;
> >> +
> >> +check_low:
> >> +	if (skip_low_res)
> >> +		return;
> >> +
> >>  	if ((crash_base >= CRASH_ADDR_LOW_MAX) &&
> >>  	     crash_low_size && reserve_crashkernel_low(crash_low_size)) {
> >>  		memblock_phys_free(crash_base, crash_size);
> >> +		crashk_res.start = 0;
> >> +		crashk_res.end = 0;
> >>  		return;
> >>  	}
> >>  
> >> @@ -219,8 +280,6 @@ static void __init reserve_crashkernel(int dma_state)
> >>  	if (crashk_low_res.end)
> >>  		kmemleak_ignore_phys(crashk_low_res.start);
> >>  
> >> -	crashk_res.start = crash_base;
> >> -	crashk_res.end = crash_base + crash_size - 1;
> >>  	insert_resource(&iomem_resource, &crashk_res);
> >>  }
> >>  
> >> -- 
> >> 2.25.1
> >>
> > 
> > .
> > 
> 
> -- 
> Regards,
>   Zhen Lei
> 



* Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory
  2022-06-21  6:24     ` Kefeng Wang
  2022-06-21  9:27       ` Baoquan He
@ 2022-06-21 18:04       ` Catalin Marinas
  2022-06-22  8:35         ` Baoquan He
  2022-06-22 12:03         ` Kefeng Wang
  1 sibling, 2 replies; 26+ messages in thread
From: Catalin Marinas @ 2022-06-21 18:04 UTC (permalink / raw)
  To: Kefeng Wang
  Cc: Baoquan He, Zhen Lei, Ard Biesheuvel, Mark Rutland,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Will Deacon, linux-arm-kernel, Jonathan Corbet, linux-doc,
	Randy Dunlap, Feng Zhou, Chen Zhou, John Donnelly, Dave Kleikamp,
	liushixin

On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote:
> On 2022/6/21 13:33, Baoquan He wrote:
> > On 06/13/22 at 04:09pm, Zhen Lei wrote:
> > > If the crashkernel has both high memory above DMA zones and low memory
> > > in DMA zones, kexec always loads the content such as Image and dtb to the
> > > high memory instead of the low memory. This means that only high memory
> > > requires write protection based on page-level mapping. The allocation of
> > > high memory does not depend on the DMA boundary. So we can reserve the
> > > high memory first even if the crashkernel reservation is deferred.
> > > 
> > > This means that the block mapping can still be performed on other kernel
> > > linear address spaces, the TLB miss rate can be reduced and the system
> > > performance will be improved.
> > 
> > Ugh, this looks a little ugly, honestly.
> > 
> > If it's for sure that arm64 can't split large page mappings of the linear
> > region, this patch is one way to optimize the linear mapping. Given that
> > kdump setup is necessary on arm64 servers, the booting speed is truly
> > impacted heavily.
> 
> Is there some conclusion or discussion that arm64 can't split large page
> mapping?
> 
> Could the crashkernel reservation (and Kfence pool) be split dynamically?
> 
> I found Mark's reply to "arm64: remove page granularity limitation from
> KFENCE"[1],
> 
>   "We also avoid live changes from block<->table mappings, since the
>   architecture gives us very weak guarantees there and generally requires
>   a Break-Before-Make sequence (though IIRC this was tightened up
>   somewhat, so maybe going one way is supposed to work). Unless it's
>   really necessary, I'd rather not split these block mappings while
>   they're live."

The problem with splitting is that you can end up with two entries in
the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
abort (but can be worse like loss of coherency).

Prior to FEAT_BBM (added in ARMv8.4), such scenario was not allowed at
all, the software would have to unmap the range, TLBI, remap. With
FEAT_BBM (level 2), we can do this without tearing the mapping down but
we still need to handle the potential TLB conflict abort. The handler
only needs a TLBI but if it touches the memory range being changed it
risks faulting again. With vmap stacks and the kernel image mapped in
the vmalloc space, we have a small window where this could be handled
but we probably can't go into the C part of the exception handling
(tracing etc. may access a kmalloc'ed object for example).

Another option is to do a stop_machine() (if multi-processor at that
point), disable the MMUs, modify the page tables, re-enable the MMU but
it's also complicated.
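
For reference, a minimal sketch of the break-before-make sequence above,
splitting one PMD block into a pre-built table of PTEs; the function name
and the pre-built new_table are assumptions, the primitives are the usual
arm64 kernel ones:

static void split_pmd_bbm(pmd_t *pmdp, unsigned long addr, pte_t *new_table)
{
        pmd_clear(pmdp);                                /* break: unmap the block */
        flush_tlb_kernel_range(addr, addr + PMD_SIZE);  /* drop stale TLB entries */
        __pmd_populate(pmdp, __pa(new_table), PMD_TYPE_TABLE); /* make: install the table */
}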

-- 
Catalin


* Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory
  2022-06-21 18:04       ` Catalin Marinas
@ 2022-06-22  8:35         ` Baoquan He
  2022-06-23 14:07           ` Catalin Marinas
  2022-06-22 12:03         ` Kefeng Wang
  1 sibling, 1 reply; 26+ messages in thread
From: Baoquan He @ 2022-06-22  8:35 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Kefeng Wang, Zhen Lei, Ard Biesheuvel, Mark Rutland,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Will Deacon, linux-arm-kernel, Jonathan Corbet, linux-doc,
	Randy Dunlap, Feng Zhou, Chen Zhou, John Donnelly, Dave Kleikamp,
	liushixin

Hi Catalin,

On 06/21/22 at 07:04pm, Catalin Marinas wrote:
> On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote:
> > On 2022/6/21 13:33, Baoquan He wrote:
> > > On 06/13/22 at 04:09pm, Zhen Lei wrote:
> > > > If the crashkernel has both high memory above DMA zones and low memory
> > > > in DMA zones, kexec always loads the content such as Image and dtb to the
> > > > high memory instead of the low memory. This means that only high memory
> > > > requires write protection based on page-level mapping. The allocation of
> > > > high memory does not depend on the DMA boundary. So we can reserve the
> > > > high memory first even if the crashkernel reservation is deferred.
> > > > 
> > > > This means that the block mapping can still be performed on other kernel
> > > > linear address spaces, the TLB miss rate can be reduced and the system
> > > > performance will be improved.
> > > 
> > > Ugh, this looks a little ugly, honestly.
> > > 
> > > If it's for sure that arm64 can't split large page mappings of the linear
> > > region, this patch is one way to optimize the linear mapping. Given that
> > > kdump setup is necessary on arm64 servers, the booting speed is truly
> > > impacted heavily.
> > 
> > Is there some conclusion or discussion that arm64 can't split large page
> > mapping?
> > 
> > Could the crashkernel reservation (and Kfence pool) be split dynamically?
> > 
> > I found Mark's reply to "arm64: remove page granularity limitation from
> > KFENCE"[1],
> > 
> >   "We also avoid live changes from block<->table mappings, since the
> >   architecture gives us very weak guarantees there and generally requires
> >   a Break-Before-Make sequence (though IIRC this was tightened up
> >   somewhat, so maybe going one way is supposed to work). Unless it's
> >   really necessary, I'd rather not split these block mappings while
> >   they're live."
> 
> The problem with splitting is that you can end up with two entries in
> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
> abort (but can be worse like loss of coherency).

Thanks for this explanation. Is this a drawback of the arm64 design? x86
code does the same thing w/o issue; is there a way to overcome this on
arm64 from the hardware or software side?

I once got an arm64 server with huge memory; with or without the crashkernel
setting, it had different boot-up times. And more frequent TLB misses and
flushes will cost performance. It is really a pity if we have a very powerful
arm64 CPU and system capacity, but are bottlenecked by this drawback.

> 
> Prior to FEAT_BBM (added in ARMv8.4), such scenario was not allowed at
> all, the software would have to unmap the range, TLBI, remap. With
> FEAT_BBM (level 2), we can do this without tearing the mapping down but
> we still need to handle the potential TLB conflict abort. The handler
> only needs a TLBI but if it touches the memory range being changed it
> risks faulting again. With vmap stacks and the kernel image mapped in
> the vmalloc space, we have a small window where this could be handled
> but we probably can't go into the C part of the exception handling
> (tracing etc. may access a kmalloc'ed object for example).
> 
> Another option is to do a stop_machine() (if multi-processor at that
> point), disable the MMUs, modify the page tables, re-enable the MMU but
> it's also complicated.
> 
> -- 
> Catalin
> 



* Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory
  2022-06-21 18:04       ` Catalin Marinas
  2022-06-22  8:35         ` Baoquan He
@ 2022-06-22 12:03         ` Kefeng Wang
  2022-06-23 10:27           ` Catalin Marinas
  1 sibling, 1 reply; 26+ messages in thread
From: Kefeng Wang @ 2022-06-22 12:03 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Baoquan He, Zhen Lei, Ard Biesheuvel, Mark Rutland,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Will Deacon, linux-arm-kernel, Jonathan Corbet, linux-doc,
	Randy Dunlap, Feng Zhou, Chen Zhou, John Donnelly, Dave Kleikamp,
	liushixin


On 2022/6/22 2:04, Catalin Marinas wrote:
> On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote:
>> On 2022/6/21 13:33, Baoquan He wrote:
>>> On 06/13/22 at 04:09pm, Zhen Lei wrote:
>>>> If the crashkernel has both high memory above DMA zones and low memory
>>>> in DMA zones, kexec always loads the content such as Image and dtb to the
>>>> high memory instead of the low memory. This means that only high memory
>>>> requires write protection based on page-level mapping. The allocation of
>>>> high memory does not depend on the DMA boundary. So we can reserve the
>>>> high memory first even if the crashkernel reservation is deferred.
>>>>
>>>> This means that the block mapping can still be performed on other kernel
>>>> linear address spaces, the TLB miss rate can be reduced and the system
>>>> performance will be improved.
>>> Ugh, this looks a little ugly, honestly.
>>>
>>> If it's for sure that arm64 can't split large page mappings of the linear
>>> region, this patch is one way to optimize the linear mapping. Given that
>>> kdump setup is necessary on arm64 servers, the booting speed is truly
>>> impacted heavily.
>> Is there some conclusion or discussion that arm64 can't split large page
>> mapping?
>>
>> Could the crashkernel reservation (and Kfence pool) be split dynamically?
>>
>> I found Mark's reply to "arm64: remove page granularity limitation from
>> KFENCE"[1],
>>
>>    "We also avoid live changes from block<->table mappings, since the
>>    architecture gives us very weak guarantees there and generally requires
>>    a Break-Before-Make sequence (though IIRC this was tightened up
>>    somewhat, so maybe going one way is supposed to work). Unless it's
>>    really necessary, I'd rather not split these block mappings while
>>    they're live."
> The problem with splitting is that you can end up with two entries in
> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
> abort (but can be worse like loss of coherency).
Thanks for your explanation,
> Prior to FEAT_BBM (added in ARMv8.4), such scenario was not allowed at
> all, the software would have to unmap the range, TLBI, remap. With
> FEAT_BBM (level 2), we can do this without tearing the mapping down but
> we still need to handle the potential TLB conflict abort. The handler
> only needs a TLBI but if it touches the memory range being changed it
> risks faulting again. With vmap stacks and the kernel image mapped in
> the vmalloc space, we have a small window where this could be handled
> but we probably can't go into the C part of the exception handling
> (tracing etc. may access a kmalloc'ed object for example).

So without FEAT_BBM, we can only guarantee the BBM sequence via
"unmap the range, TLBI, remap" or the following option; and with
FEAT_BBM (level 2), we could have an easy way to avoid TLB conflicts for
some vmalloc space, but it is still hard to deal with other scenarios?


>
> Another option is to do a stop_machine() (if multi-processor at that
> point), disable the MMUs, modify the page tables, re-enable the MMU but
> it's also complicated.
>


* Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory
  2022-06-22 12:03         ` Kefeng Wang
@ 2022-06-23 10:27           ` Catalin Marinas
  2022-06-23 14:23             ` Kefeng Wang
  0 siblings, 1 reply; 26+ messages in thread
From: Catalin Marinas @ 2022-06-23 10:27 UTC (permalink / raw)
  To: Kefeng Wang
  Cc: Baoquan He, Zhen Lei, Ard Biesheuvel, Mark Rutland,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Will Deacon, linux-arm-kernel, Jonathan Corbet, linux-doc,
	Randy Dunlap, Feng Zhou, Chen Zhou, John Donnelly, Dave Kleikamp,
	liushixin

On Wed, Jun 22, 2022 at 08:03:21PM +0800, Kefeng Wang wrote:
> On 2022/6/22 2:04, Catalin Marinas wrote:
> > On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote:
> > > On 2022/6/21 13:33, Baoquan He wrote:
> > > > On 06/13/22 at 04:09pm, Zhen Lei wrote:
> > > > > If the crashkernel has both high memory above DMA zones and low memory
> > > > > in DMA zones, kexec always loads the content such as Image and dtb to the
> > > > > high memory instead of the low memory. This means that only high memory
> > > > > requires write protection based on page-level mapping. The allocation of
> > > > > high memory does not depend on the DMA boundary. So we can reserve the
> > > > > high memory first even if the crashkernel reservation is deferred.
> > > > > 
> > > > > This means that the block mapping can still be performed on other kernel
> > > > > linear address spaces, the TLB miss rate can be reduced and the system
> > > > > performance will be improved.
> > > > Ugh, this looks a little ugly, honestly.
> > > > 
> > > > If it's for sure that arm64 can't split large page mappings of the linear
> > > > region, this patch is one way to optimize the linear mapping. Given that
> > > > kdump setup is necessary on arm64 servers, the booting speed is truly
> > > > impacted heavily.
> > > Is there some conclusion or discussion that arm64 can't split large page
> > > mapping?
> > > 
> > > Could the crashkernel reservation (and Kfence pool) be split dynamically?
> > > 
> > > I found Mark's reply to "arm64: remove page granularity limitation from
> > > KFENCE"[1],
> > > 
> > >    "We also avoid live changes from block<->table mappings, since the
> > >    architecture gives us very weak guarantees there and generally requires
> > >    a Break-Before-Make sequence (though IIRC this was tightened up
> > >    somewhat, so maybe going one way is supposed to work). Unless it's
> > >    really necessary, I'd rather not split these block mappings while
> > >    they're live."
> > The problem with splitting is that you can end up with two entries in
> > the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
> > for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
> > abort (but can be worse like loss of coherency).
> Thanks for your explanation,
> > Prior to FEAT_BBM (added in ARMv8.4), such scenario was not allowed at
> > all, the software would have to unmap the range, TLBI, remap. With
> > FEAT_BBM (level 2), we can do this without tearing the mapping down but
> > we still need to handle the potential TLB conflict abort. The handler
> > only needs a TLBI but if it touches the memory range being changed it
> > risks faulting again. With vmap stacks and the kernel image mapped in
> > the vmalloc space, we have a small window where this could be handled
> > but we probably can't go into the C part of the exception handling
> > (tracing etc. may access a kmalloc'ed object for example).
> 
> So without FEAT_BBM, we can only guarantee the BBM sequence via
> "unmap the range, TLBI, remap" or the following option;

Yes, that's the break-before-make sequence.

> and with FEAT_BBM (level 2), we could have an easy way to avoid TLB
> conflicts for some vmalloc space, but it is still hard to deal with other
> scenarios?

It's not too hard in theory. Basically there's a small risk of getting a
TLB conflict abort for the mappings you change without a BBM sequence (I
think it's nearly non-existent when going from a large block to smaller
pages, though the architecture states that it's still possible). Since
we only want to do this for the linear map, and the kernel and stack are
in the vmalloc space, we can handle such a trap as a safety measure (it
just needs a TLBI). It may help to tweak a model to force it to generate
such conflict aborts, otherwise we wouldn't be able to test the code.

It's possible that such a trap is raised at EL2 if a guest caused the
conflict abort (the architecture left this as IMP DEF). The hypervisors
may need to be taught to do a TLBI VMALLS12E1 instead of killing the
guest. I haven't checked what KVM does.
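
Spelled out, that "it just needs a TLBI" resolution would amount to
something like the fragment below in the conflict-abort handler (an
illustrative sketch, not an existing kernel handler):

static void resolve_tlb_conflict(void)
{
        /* drop all stale local EL1 TLB entries, then synchronize */
        asm volatile(
        "       tlbi    vmalle1\n"
        "       dsb     nsh\n"
        "       isb\n"
        : : : "memory");
}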

-- 
Catalin


* Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory
  2022-06-22  8:35         ` Baoquan He
@ 2022-06-23 14:07           ` Catalin Marinas
  2022-06-27  2:52             ` Baoquan He
  0 siblings, 1 reply; 26+ messages in thread
From: Catalin Marinas @ 2022-06-23 14:07 UTC (permalink / raw)
  To: Baoquan He
  Cc: Kefeng Wang, Zhen Lei, Ard Biesheuvel, Mark Rutland,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Will Deacon, linux-arm-kernel, Jonathan Corbet, linux-doc,
	Randy Dunlap, Feng Zhou, Chen Zhou, John Donnelly, Dave Kleikamp,
	liushixin

On Wed, Jun 22, 2022 at 04:35:16PM +0800, Baoquan He wrote:
> On 06/21/22 at 07:04pm, Catalin Marinas wrote:
> > The problem with splitting is that you can end up with two entries in
> > the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
> > for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
> > abort (but can be worse like loss of coherency).
> 
> Thanks for this explanation. Is this a drawback of the arm64 design? x86
> code does the same thing w/o issue; is there a way to overcome this on
> arm64 from the hardware or software side?

It is a drawback of the arm64 implementations. Having multiple TLB
entries for the same VA would need additional logic in hardware to
detect, so the microarchitects have pushed back. In ARMv8.4, some
balance was reached with FEAT_BBM so that the only visible side-effect
is a potential TLB conflict abort that could be resolved by software.

> I once got an arm64 server with huge memory; with or without the crashkernel
> setting, it had different boot-up times. And more frequent TLB misses and
> flushes will cost performance. It is really a pity if we have a very powerful
> arm64 CPU and system capacity, but are bottlenecked by this drawback.

Is it only the boot time affected or the runtime performance as well?

-- 
Catalin


* Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory
  2022-06-23 10:27           ` Catalin Marinas
@ 2022-06-23 14:23             ` Kefeng Wang
  0 siblings, 0 replies; 26+ messages in thread
From: Kefeng Wang @ 2022-06-23 14:23 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Baoquan He, Zhen Lei, Ard Biesheuvel, Mark Rutland,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Will Deacon, linux-arm-kernel, Jonathan Corbet, linux-doc,
	Randy Dunlap, Feng Zhou, Chen Zhou, John Donnelly, Dave Kleikamp,
	liushixin


On 2022/6/23 18:27, Catalin Marinas wrote:
> On Wed, Jun 22, 2022 at 08:03:21PM +0800, Kefeng Wang wrote:
>> On 2022/6/22 2:04, Catalin Marinas wrote:
>>> On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote:
>>>> On 2022/6/21 13:33, Baoquan He wrote:
>>>>> On 06/13/22 at 04:09pm, Zhen Lei wrote:
>>>>>> If the crashkernel has both high memory above DMA zones and low memory
>>>>>> in DMA zones, kexec always loads the content such as Image and dtb to the
>>>>>> high memory instead of the low memory. This means that only high memory
>>>>>> requires write protection based on page-level mapping. The allocation of
>>>>>> high memory does not depend on the DMA boundary. So we can reserve the
>>>>>> high memory first even if the crashkernel reservation is deferred.
>>>>>>
>>>>>> This means that the block mapping can still be performed on other kernel
>>>>>> linear address spaces, the TLB miss rate can be reduced and the system
>>>>>> performance will be improved.
>>>>> Ugh, this looks a little ugly, honestly.
>>>>>
>>>>> If it's for sure that arm64 can't split large page mappings of the linear
>>>>> region, this patch is one way to optimize the linear mapping. Given that
>>>>> kdump setup is necessary on arm64 servers, the booting speed is truly
>>>>> impacted heavily.
>>>> Is there some conclusion or discussion that arm64 can't split large page
>>>> mapping?
>>>>
>>>> Could the crashkernel reservation (and Kfence pool) be split dynamically?
>>>>
>>>> I found Mark's reply to "arm64: remove page granularity limitation from
>>>> KFENCE"[1],
>>>>
>>>>     "We also avoid live changes from block<->table mappings, since the
>>>>     architecture gives us very weak guarantees there and generally requires
>>>>     a Break-Before-Make sequence (though IIRC this was tightened up
>>>>     somewhat, so maybe going one way is supposed to work). Unless it's
>>>>     really necessary, I'd rather not split these block mappings while
>>>>     they're live."
>>> The problem with splitting is that you can end up with two entries in
>>> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
>>> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
>>> abort (but can be worse like loss of coherency).
>> Thanks for your explanation,
>>> Prior to FEAT_BBM (added in ARMv8.4), such scenario was not allowed at
>>> all, the software would have to unmap the range, TLBI, remap. With
>>> FEAT_BBM (level 2), we can do this without tearing the mapping down but
>>> we still need to handle the potential TLB conflict abort. The handler
>>> only needs a TLBI but if it touches the memory range being changed it
>>> risks faulting again. With vmap stacks and the kernel image mapped in
>>> the vmalloc space, we have a small window where this could be handled
>>> but we probably can't go into the C part of the exception handling
>>> (tracing etc. may access a kmalloc'ed object for example).
>> So without FEAT_BBM, we can only guarantee the BBM sequence via
>> "unmap the range, TLBI, remap" or the following option;
> Yes, that's the break-before-make sequence.
>
>> and with FEAT_BBM (level 2), we could have an easy way to avoid TLB
>> conflicts for some vmalloc space, but it is still hard to deal with other
>> scenarios?
> It's not too hard in theory. Basically there's a small risk of getting a
> TLB conflict abort for the mappings you change without a BBM sequence (I
> think it's nearly non-existent when going from a large block to smaller
> pages, though the architecture states that it's still possible). Since
> we only want to do this for the linear map, and the kernel and stack are
> in the vmalloc space, we can handle such a trap as a safety measure (it
> just needs a TLBI). It may help to tweak a model to force it to generate
> such conflict aborts, otherwise we wouldn't be able to test the code.
>
> It's possible that such a trap is raised at EL2 if a guest caused the
> conflict abort (the architecture left this as IMP DEF). The hypervisors
> may need to be taught to do a TLBI VMALLS12E1 instead of killing the
> guest. I haven't checked what KVM does.
Got it, many thanks.


* Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory
  2022-06-23 14:07           ` Catalin Marinas
@ 2022-06-27  2:52             ` Baoquan He
  2022-06-27  9:17               ` Leizhen (ThunderTown)
  0 siblings, 1 reply; 26+ messages in thread
From: Baoquan He @ 2022-06-27  2:52 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Kefeng Wang, Zhen Lei, Ard Biesheuvel, Mark Rutland,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Will Deacon, linux-arm-kernel, Jonathan Corbet, linux-doc,
	Randy Dunlap, Feng Zhou, Chen Zhou, John Donnelly, Dave Kleikamp,
	liushixin

On 06/23/22 at 03:07pm, Catalin Marinas wrote:
> On Wed, Jun 22, 2022 at 04:35:16PM +0800, Baoquan He wrote:
> > On 06/21/22 at 07:04pm, Catalin Marinas wrote:
> > > The problem with splitting is that you can end up with two entries in
> > > the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
> > > for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
> > > abort (but can be worse like loss of coherency).
> > 
> > Thanks for this explanation. Is this a drawback of the arm64 design? x86
> > code does the same thing w/o issue; is there a way to overcome this on
> > arm64 from the hardware or software side?
> 
> It is a drawback of the arm64 implementations. Having multiple TLB
> entries for the same VA would need additional logic in hardware to
> detect, so the microarchitects have pushed back. In ARMv8.4, some
> balance was reached with FEAT_BBM so that the only visible side-effect
> is a potential TLB conflict abort that could be resolved by software.

I see, thx.

> 
> > I once got an arm64 server with huge memory; with or without the crashkernel
> > setting, it had different boot-up times. And more frequent TLB misses and
> > flushes will cost performance. It is really a pity if we have a very powerful
> > arm64 CPU and system capacity, but are bottlenecked by this drawback.
> 
> Is it only the boot time affected or the runtime performance as well?

Sorry for the late reply. What I observed is serious boot-time latency
with huge memory. Since the timestamp was not available at that time,
we can't tell the number. I didn't notice the runtime performance.



* Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory
  2022-06-27  2:52             ` Baoquan He
@ 2022-06-27  9:17               ` Leizhen (ThunderTown)
  2022-06-27 10:17                 ` Baoquan He
  0 siblings, 1 reply; 26+ messages in thread
From: Leizhen (ThunderTown) @ 2022-06-27  9:17 UTC (permalink / raw)
  To: Baoquan He, Catalin Marinas
  Cc: Kefeng Wang, Ard Biesheuvel, Mark Rutland, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin,
	Eric Biederman, Rob Herring, Frank Rowand, devicetree,
	Dave Young, Vivek Goyal, kexec, linux-kernel, Will Deacon,
	linux-arm-kernel, Jonathan Corbet, linux-doc, Randy Dunlap,
	Feng Zhou, Chen Zhou, John Donnelly, Dave Kleikamp, liushixin



On 2022/6/27 10:52, Baoquan He wrote:
> On 06/23/22 at 03:07pm, Catalin Marinas wrote:
>> On Wed, Jun 22, 2022 at 04:35:16PM +0800, Baoquan He wrote:
>>> On 06/21/22 at 07:04pm, Catalin Marinas wrote:
>>>> The problem with splitting is that you can end up with two entries in
>>>> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
>>>> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
>>>> abort (but can be worse like loss of coherency).
>>>
>>> Thanks for this explanation. Is this a drawback of the arm64 design? x86
>>> code does the same thing w/o issue; is there a way to overcome this on
>>> arm64 from the hardware or software side?
>>
>> It is a drawback of the arm64 implementations. Having multiple TLB
>> entries for the same VA would need additional logic in hardware to
>> detect, so the microarchitects have pushed back. In ARMv8.4, some
>> balance was reached with FEAT_BBM so that the only visible side-effect
>> is a potential TLB conflict abort that could be resolved by software.
> 
> I see, thx.
> 
>>
>>> I once got an arm64 server with huge memory; with or without the crashkernel
>>> setting, it had different boot-up times. And more frequent TLB misses and
>>> flushes will cost performance. It is really a pity if we have a very powerful
>>> arm64 CPU and system capacity, but are bottlenecked by this drawback.
>>
>> Is it only the boot time affected or the runtime performance as well?
> 
> Sorry for the late reply. What I observed is serious boot-time latency
> with huge memory. Since the timestamp was not available at that time,
> we can't tell the number. I didn't notice the runtime performance.

There's some data here, and I see you're not on the cc list.

https://lore.kernel.org/linux-mm/1656241815-28494-1-git-send-email-guanghuifeng@linux.alibaba.com/T/

> 
> .
> 

-- 
Regards,
  Zhen Lei


* Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory
  2022-06-27  9:17               ` Leizhen (ThunderTown)
@ 2022-06-27 10:17                 ` Baoquan He
  2022-06-27 11:11                   ` Leizhen (ThunderTown)
  0 siblings, 1 reply; 26+ messages in thread
From: Baoquan He @ 2022-06-27 10:17 UTC (permalink / raw)
  To: Leizhen (ThunderTown)
  Cc: Catalin Marinas, Kefeng Wang, Ard Biesheuvel, Mark Rutland,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Will Deacon, linux-arm-kernel, Jonathan Corbet, linux-doc,
	Randy Dunlap, Feng Zhou, Chen Zhou, John Donnelly, Dave Kleikamp,
	liushixin

On 06/27/22 at 05:17pm, Leizhen (ThunderTown) wrote:
> 
> 
> On 2022/6/27 10:52, Baoquan He wrote:
> > On 06/23/22 at 03:07pm, Catalin Marinas wrote:
> >> On Wed, Jun 22, 2022 at 04:35:16PM +0800, Baoquan He wrote:
> >>> On 06/21/22 at 07:04pm, Catalin Marinas wrote:
> >>>> The problem with splitting is that you can end up with two entries in
> >>>> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
> >>>> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
> >>>> abort (but can be worse like loss of coherency).
> >>>
> >>> Thanks for this explanation. Is this a drawback of the arm64 design? x86
> >>> code does the same thing w/o issue; is there a way to overcome this on
> >>> arm64 from the hardware or software side?
> >>
> >> It is a drawback of the arm64 implementations. Having multiple TLB
> >> entries for the same VA would need additional logic in hardware to
> >> detect, so the microarchitects have pushed back. In ARMv8.4, some
> >> balance was reached with FEAT_BBM so that the only visible side-effect
> >> is a potential TLB conflict abort that could be resolved by software.
> > 
> > I see, thx.
> > 
> >>
> >>> I once got an arm64 server with huge memory; with or without the crashkernel
> >>> setting, it had different boot-up times. And more frequent TLB misses and
> >>> flushes will cost performance. It is really a pity if we have a very powerful
> >>> arm64 CPU and system capacity, but are bottlenecked by this drawback.
> >>
> >> Is it only the boot time affected or the runtime performance as well?
> > 
> > Sorry for the late reply. What I observed is serious boot-time latency
> > with huge memory. Since the timestamp was not available at that time,
> > we can't tell the number. I didn't notice the runtime performance.
> 
> There's some data here, and I see you're not on the cc list.
> 
> https://lore.kernel.org/linux-mm/1656241815-28494-1-git-send-email-guanghuifeng@linux.alibaba.com/T/

Thanks, Zhen Lei. I also saw the patch. That seems to be a good way,
since there's only one process running at that time. Not sure if there's
still a risk of multiple TLB entries for the same VA existing.



* Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory
  2022-06-27 10:17                 ` Baoquan He
@ 2022-06-27 11:11                   ` Leizhen (ThunderTown)
  0 siblings, 0 replies; 26+ messages in thread
From: Leizhen (ThunderTown) @ 2022-06-27 11:11 UTC (permalink / raw)
  To: Baoquan He
  Cc: Catalin Marinas, Kefeng Wang, Ard Biesheuvel, Mark Rutland,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Eric Biederman, Rob Herring, Frank Rowand,
	devicetree, Dave Young, Vivek Goyal, kexec, linux-kernel,
	Will Deacon, linux-arm-kernel, Jonathan Corbet, linux-doc,
	Randy Dunlap, Feng Zhou, Chen Zhou, John Donnelly, Dave Kleikamp,
	liushixin



On 2022/6/27 18:17, Baoquan He wrote:
> On 06/27/22 at 05:17pm, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2022/6/27 10:52, Baoquan He wrote:
>>> On 06/23/22 at 03:07pm, Catalin Marinas wrote:
>>>> On Wed, Jun 22, 2022 at 04:35:16PM +0800, Baoquan He wrote:
>>>>> On 06/21/22 at 07:04pm, Catalin Marinas wrote:
>>>>>> The problem with splitting is that you can end up with two entries in
>>>>>> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
>>>>>> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
>>>>>> abort (but can be worse like loss of coherency).
>>>>>
>>>>> Thanks for this explanation. Is this a drawback of the arm64 design? x86
>>>>> code does the same thing w/o issue; is there a way to overcome this on
>>>>> arm64 from the hardware or software side?
>>>>
>>>> It is a drawback of the arm64 implementations. Having multiple TLB
>>>> entries for the same VA would need additional logic in hardware to
>>>> detect, so the microarchitects have pushed back. In ARMv8.4, some
>>>> balance was reached with FEAT_BBM so that the only visible side-effect
>>>> is a potential TLB conflict abort that could be resolved by software.
>>>
>>> I see, thx.
>>>
>>>>
>>>>> I once got an arm64 server with huge memory; with or without the crashkernel
>>>>> setting, it had different boot-up times. And more frequent TLB misses and
>>>>> flushes will cost performance. It is really a pity if we have a very powerful
>>>>> arm64 CPU and system capacity, but are bottlenecked by this drawback.
>>>>
>>>> Is it only the boot time affected or the runtime performance as well?
>>>
>>> Sorry for the late reply. What I observed is serious boot-time latency
>>> with huge memory. Since the timestamp was not available at that time,
>>> we can't tell the number. I didn't notice the runtime performance.
>>
>> There's some data here, and I see you're not on the cc list.
>>
>> https://lore.kernel.org/linux-mm/1656241815-28494-1-git-send-email-guanghuifeng@linux.alibaba.com/T/
> 
> Thanks, Zhen Lei. I also saw the patch. That seems to be a good way,

Yes.

> since there's only one process running at that time. Not sure if there's
> still a risk of multiple TLB entries for the same VA existing.
> 
> .
> 

-- 
Regards,
  Zhen Lei


end of thread, other threads:[~2022-06-27 11:11 UTC | newest]

Thread overview: 26+ messages
2022-06-13  8:09 [PATCH 0/5] arm64: kdump: Function supplement and performance optimization Zhen Lei
2022-06-13  8:09 ` [PATCH 1/5] arm64: kdump: Provide default size when crashkernel=Y,low is not specified Zhen Lei
2022-06-17  2:40   ` Baoquan He
2022-06-17  7:39     ` Leizhen (ThunderTown)
2022-06-17  8:26   ` Baoquan He
2022-06-13  8:09 ` [PATCH 2/5] arm64: kdump: Support crashkernel=X fall back to reserve region above DMA zones Zhen Lei
2022-06-17  4:16   ` Baoquan He
2022-06-13  8:09 ` [PATCH 3/5] arm64: kdump: Remove some redundant checks in map_mem() Zhen Lei
2022-06-20  7:42   ` Baoquan He
2022-06-13  8:09 ` [PATCH 4/5] arm64: kdump: Decide when to reserve crash memory in reserve_crashkernel() Zhen Lei
2022-06-13  8:09 ` [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory Zhen Lei
2022-06-21  5:33   ` Baoquan He
2022-06-21  6:24     ` Kefeng Wang
2022-06-21  9:27       ` Baoquan He
2022-06-21 18:04       ` Catalin Marinas
2022-06-22  8:35         ` Baoquan He
2022-06-23 14:07           ` Catalin Marinas
2022-06-27  2:52             ` Baoquan He
2022-06-27  9:17               ` Leizhen (ThunderTown)
2022-06-27 10:17                 ` Baoquan He
2022-06-27 11:11                   ` Leizhen (ThunderTown)
2022-06-22 12:03         ` Kefeng Wang
2022-06-23 10:27           ` Catalin Marinas
2022-06-23 14:23             ` Kefeng Wang
2022-06-21  7:56     ` Leizhen (ThunderTown)
2022-06-21  9:35       ` Baoquan He
