* [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
@ 2018-08-22  3:07 ` Jia He
  0 siblings, 0 replies; 30+ messages in thread
From: Jia He @ 2018-08-22  3:07 UTC (permalink / raw)
  To: Russell King, Catalin Marinas, Will Deacon, Mark Rutland,
	Ard Biesheuvel, Andrew Morton, Michal Hocko
  Cc: Wei Yang, Kees Cook, Laura Abbott, Vladimir Murzin,
	Philip Derrin, AKASHI Takahiro, James Morse, Steve Capper,
	Gioh Kim, Vlastimil Babka, Mel Gorman, Johannes Weiner,
	Kemi Wang, Petr Tesarik, YASUAKI ISHIMATSU, Andrey Ryabinin,
	Nikolay Borisov, Daniel Jordan, Daniel Vacek, Eugeniu Rosca,
	linux-arm-kernel, linux-kernel, linux-mm, Jia He

Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
where possible") optimized the loop in memmap_init_zone(), but it could
cause a panic on some machines, so Daniel Vacek later reverted it.

However, as Daniel Vacek suggested, it is fine to use memblock to skip
gaps and find the next valid pfn on architectures that provide
CONFIG_HAVE_ARCH_PFN_VALID.

More from what Daniel said:
"On arm and arm64, memblock is used by default. But generic version of
pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
not always return the next valid one but skips more resulting in some
valid frames to be skipped (as if they were invalid). And that's why
kernel was eventually crashing on some !arm machines."

About the performance consideration, as James said in b92df1de5:
"I have tested this patch on a virtual model of a Samurai CPU with a
sparse memory map.  The kernel boot time drops from 109 to 62 seconds."
Thus it is better to retain memblock_next_valid_pfn() on arm/arm64.

Beyond retaining memblock_next_valid_pfn(), there is still some room
for improvement. With this set applied, the time overhead of memmap_init
drops from 27956us to 13537us on my armv8a server (QDF2400 with 96G
memory, page size 64k). Arm servers should benefit even more once memory
reaches the TB range.

Patch 1 introduces a new config option to make the code more generic.
Patch 2 retains memblock_next_valid_pfn() on arm and arm64; it
        originates from b92df1de5 (a conceptual sketch follows this list).
Patch 3 optimizes memblock_next_valid_pfn().
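
As a conceptual illustration of what patch 2 changes in
memmap_init_zone() (a sketch of the idea only; the real hunk is in
patch 2 below), invalid spans are skipped in one step instead of one
pfn at a time:

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (!early_pfn_valid(pfn)) {
			/* jump to just before the next valid pfn;
			 * the loop's pfn++ then lands exactly on it */
			pfn = next_valid_pfn(pfn) - 1;
			continue;
		}
		/* ... initialize the struct page for this pfn ... */
	}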

Changelog:
V11:- drop patches #4-6, refine the code
V10:- move code to memblock.c, refine the performance consideration
V9: - rebase to mmotm master, refine the log description. No major changes
V8: - introduce new config and move generic code to early_pfn.h
    - optimize memblock_next_valid_pfn as suggested by Matthew Wilcox
V7: - fix i386 compilation error, refine the commit description
V6: - simplify the code, move arm/arm64 common code to one file.
    - refine patches as suggested by Daniel Vacek and Ard Biesheuvel
V5: - further refining as suggested by Daniel Vacek. Make the code
      more arch specific for arm/arm64
V4: - refine patches as suggested by Daniel Vacek and Wei Yang
    - optimized on arm besides arm64
V3: - fix 2 issues reported by kbuild test robot
V2: - rebase to mmotm latest
    - retain memblock_next_valid_pfn on arm64
    - refine memblock_search_pfn_regions and pfn_valid_region

Jia He (3):
  arm: arm64: introduce CONFIG_HAVE_MEMBLOCK_PFN_VALID
  mm: page_alloc: remain memblock_next_valid_pfn() on arm/arm64
  mm: page_alloc: reduce unnecessary binary search in
    memblock_next_valid_pfn

 arch/arm/Kconfig       |  1 +
 arch/arm64/Kconfig     |  1 +
 include/linux/mmzone.h |  9 +++++++++
 mm/Kconfig             |  3 +++
 mm/memblock.c          | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c        |  5 ++++-
 6 files changed, 69 insertions(+), 1 deletion(-)

-- 
1.8.3.1


* [PATCH v11 1/3] arm: arm64: introduce CONFIG_HAVE_MEMBLOCK_PFN_VALID
  2018-08-22  3:07 ` Jia He
@ 2018-08-22  3:07   ` Jia He
  -1 siblings, 0 replies; 30+ messages in thread
From: Jia He @ 2018-08-22  3:07 UTC (permalink / raw)
  To: Russell King, Catalin Marinas, Will Deacon, Mark Rutland,
	Ard Biesheuvel, Andrew Morton, Michal Hocko
  Cc: Wei Yang, Kees Cook, Laura Abbott, Vladimir Murzin,
	Philip Derrin, AKASHI Takahiro, James Morse, Steve Capper,
	Gioh Kim, Vlastimil Babka, Mel Gorman, Johannes Weiner,
	Kemi Wang, Petr Tesarik, YASUAKI ISHIMATSU, Andrey Ryabinin,
	Nikolay Borisov, Daniel Jordan, Daniel Vacek, Eugeniu Rosca,
	linux-arm-kernel, linux-kernel, linux-mm, Jia He

Introduce CONFIG_HAVE_MEMBLOCK_PFN_VALID as a new config option so that
memblock_next_valid_pfn() can be moved into a generic code file. All the
later optimizations in this series are based on this config.

The memmap initialization time on arm/arm64 can benefit from this.

Signed-off-by: Jia He <jia.he@hxt-semitech.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/arm/Kconfig   | 1 +
 arch/arm64/Kconfig | 1 +
 mm/Kconfig         | 3 +++
 3 files changed, 5 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 843edfd..d3c7705 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1641,6 +1641,7 @@ config ARCH_SELECT_MEMORY_MODEL
 
 config HAVE_ARCH_PFN_VALID
 	def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM
+	select HAVE_MEMBLOCK_PFN_VALID
 
 config HAVE_GENERIC_GUP
 	def_bool y
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 42c090c..d4119e6 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -777,6 +777,7 @@ config ARCH_SELECT_MEMORY_MODEL
 
 config HAVE_ARCH_PFN_VALID
 	def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM
+	select HAVE_MEMBLOCK_PFN_VALID
 
 config HW_PERF_EVENTS
 	def_bool y
diff --git a/mm/Kconfig b/mm/Kconfig
index 94af022..28fcf54 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -137,6 +137,9 @@ config HAVE_MEMBLOCK_NODE_MAP
 config HAVE_MEMBLOCK_PHYS_MAP
 	bool
 
+config HAVE_MEMBLOCK_PFN_VALID
+	bool
+
 config HAVE_GENERIC_GUP
 	bool
 
-- 
1.8.3.1


* [PATCH v11 2/3] mm: page_alloc: remain memblock_next_valid_pfn() on arm/arm64
  2018-08-22  3:07 ` Jia He
@ 2018-08-22  3:07   ` Jia He
  -1 siblings, 0 replies; 30+ messages in thread
From: Jia He @ 2018-08-22  3:07 UTC (permalink / raw)
  To: Russell King, Catalin Marinas, Will Deacon, Mark Rutland,
	Ard Biesheuvel, Andrew Morton, Michal Hocko
  Cc: Wei Yang, Kees Cook, Laura Abbott, Vladimir Murzin,
	Philip Derrin, AKASHI Takahiro, James Morse, Steve Capper,
	Gioh Kim, Vlastimil Babka, Mel Gorman, Johannes Weiner,
	Kemi Wang, Petr Tesarik, YASUAKI ISHIMATSU, Andrey Ryabinin,
	Nikolay Borisov, Daniel Jordan, Daniel Vacek, Eugeniu Rosca,
	linux-arm-kernel, linux-kernel, linux-mm, Jia He

Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
where possible") optimized the loop in memmap_init_zone(), but it could
cause a panic on some machines, so Daniel Vacek later reverted it.

However, as Daniel Vacek suggested, it is fine to use memblock to skip
gaps and find the next valid pfn on architectures that provide
CONFIG_HAVE_ARCH_PFN_VALID. Daniel said:
"On arm and arm64, memblock is used by default. But generic version of
pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
not always return the next valid one but skips more resulting in some
valid frames to be skipped (as if they were invalid). And that's why
kernel was eventually crashing on some !arm machines."

About the performance consideration, as James said in b92df1de5:
"I have tested this patch on a virtual model of a Samurai CPU
with a sparse memory map.  The kernel boot time drops from 109 to
62 seconds."

Thus it is better to retain memblock_next_valid_pfn() on arm/arm64.

Suggested-by: Daniel Vacek <neelx@redhat.com>
Signed-off-by: Jia He <jia.he@hxt-semitech.com>
---
 include/linux/mmzone.h |  9 +++++++++
 mm/memblock.c          | 30 ++++++++++++++++++++++++++++++
 mm/page_alloc.c        |  5 ++++-
 3 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2..8e5e20b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1266,6 +1266,10 @@ static inline int pfn_present(unsigned long pfn)
 #endif
 
 #define early_pfn_valid(pfn)	pfn_valid(pfn)
+#ifdef CONFIG_HAVE_MEMBLOCK_PFN_VALID
+extern unsigned long memblock_next_valid_pfn(unsigned long pfn);
+#define next_valid_pfn(pfn)	memblock_next_valid_pfn(pfn)
+#endif
 void sparse_init(void);
 #else
 #define sparse_init()	do {} while (0)
@@ -1287,6 +1291,11 @@ struct mminit_pfnnid_cache {
 #define early_pfn_valid(pfn)	(1)
 #endif
 
+/* fallback to default definitions*/
+#ifndef next_valid_pfn
+#define next_valid_pfn(pfn)	(pfn + 1)
+#endif
+
 void memory_present(int nid, unsigned long start, unsigned long end);
 
 /*
diff --git a/mm/memblock.c b/mm/memblock.c
index 3d03866..077ca62 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1140,6 +1140,36 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
 }
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
+#ifdef CONFIG_HAVE_MEMBLOCK_PFN_VALID
+unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn)
+{
+	struct memblock_type *type = &memblock.memory;
+	unsigned int right = type->cnt;
+	unsigned int mid, left = 0;
+	phys_addr_t addr = PFN_PHYS(++pfn);
+
+	do {
+		mid = (right + left) / 2;
+
+		if (addr < type->regions[mid].base)
+			right = mid;
+		else if (addr >= (type->regions[mid].base +
+				  type->regions[mid].size))
+			left = mid + 1;
+		else {
+			/* addr is within the region, so pfn is valid */
+			return pfn;
+		}
+	} while (left < right);
+
+	if (right == type->cnt)
+		return -1UL;
+	else
+		return PHYS_PFN(type->regions[right].base);
+}
+EXPORT_SYMBOL(memblock_next_valid_pfn);
+#endif /*CONFIG_HAVE_MEMBLOCK_PFN_VALID*/
+
 static phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
 					phys_addr_t align, phys_addr_t start,
 					phys_addr_t end, int nid, ulong flags)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cd3c7b9..607deff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5485,8 +5485,11 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		if (context != MEMMAP_EARLY)
 			goto not_early;
 
-		if (!early_pfn_valid(pfn))
+		if (!early_pfn_valid(pfn)) {
+			pfn = next_valid_pfn(pfn) - 1;
 			continue;
+		}
+
 		if (!early_pfn_in_nid(pfn, nid))
 			continue;
 		if (!update_defer_init(pgdat, pfn, end_pfn, &nr_initialised))
-- 
1.8.3.1
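
For illustration only, the binary search above can be exercised in a
minimal user-space harness against a toy region table (the region
values, the 4K page size, and all names here are editorial assumptions,
not kernel code):

	#include <stdio.h>

	#define PAGE_SHIFT	12	/* assume 4K pages */
	#define PFN_PHYS(pfn)	((unsigned long long)(pfn) << PAGE_SHIFT)
	#define PHYS_PFN(addr)	((unsigned long)((addr) >> PAGE_SHIFT))

	struct region { unsigned long long base, size; };

	/* two regions with a hole from 0x2000000 to 0x3000000 */
	static struct region regions[] = {
		{ 0x0000000, 0x2000000 },
		{ 0x3000000, 0x1000000 },
	};
	static unsigned int cnt = 2;

	static unsigned long next_valid_pfn(unsigned long pfn)
	{
		unsigned int right = cnt, left = 0, mid;
		unsigned long long addr = PFN_PHYS(++pfn);

		do {
			mid = (right + left) / 2;
			if (addr < regions[mid].base)
				right = mid;
			else if (addr >= regions[mid].base + regions[mid].size)
				left = mid + 1;
			else
				return pfn;	/* pfn+1 itself is valid */
		} while (left < right);

		if (right == cnt)
			return -1UL;	/* past the last region */
		/* first pfn of the next region */
		return PHYS_PFN(regions[right].base);
	}

	int main(void)
	{
		/* 0x1fff is the last valid pfn of region 0; the search
		 * jumps across the hole to 0x3000000 >> 12 = 0x3000 */
		printf("next after 0x1fff -> %#lx\n", next_valid_pfn(0x1fff));
		/* inside a region it simply returns pfn + 1 */
		printf("next after 0x100  -> %#lx\n", next_valid_pfn(0x100));
		return 0;
	}

This prints 0x3000 and 0x101 respectively, which is exactly the
behaviour memmap_init_zone() relies on when it rewinds the returned pfn
by one before the loop increment.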


* [PATCH v11 3/3] mm: page_alloc: reduce unnecessary binary search in memblock_next_valid_pfn
  2018-08-22  3:07 ` Jia He
@ 2018-08-22  3:07   ` Jia He
  -1 siblings, 0 replies; 30+ messages in thread
From: Jia He @ 2018-08-22  3:07 UTC (permalink / raw)
  To: Russell King, Catalin Marinas, Will Deacon, Mark Rutland,
	Ard Biesheuvel, Andrew Morton, Michal Hocko
  Cc: Wei Yang, Kees Cook, Laura Abbott, Vladimir Murzin,
	Philip Derrin, AKASHI Takahiro, James Morse, Steve Capper,
	Gioh Kim, Vlastimil Babka, Mel Gorman, Johannes Weiner,
	Kemi Wang, Petr Tesarik, YASUAKI ISHIMATSU, Andrey Ryabinin,
	Nikolay Borisov, Daniel Jordan, Daniel Vacek, Eugeniu Rosca,
	linux-arm-kernel, linux-kernel, linux-mm, Jia He

Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
where possible") optimized the loop in memmap_init_zone(), but there is
still some room for improvement.

E.g. if pfn and pfn+1 are in the same memblock region, we can simply
increment pfn instead of redoing the binary search in
memblock_next_valid_pfn().

Furthermore, if the pfn falls in a gap between two memory regions, skip
directly to the next region where possible.

Attached is the memblock region information from my server.
[    0.000000] Zone ranges:
[    0.000000]   DMA32    [mem 0x0000000000200000-0x00000000ffffffff]
[    0.000000]   Normal   [mem 0x0000000100000000-0x00000017ffffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000200000-0x000000000021ffff]
[    0.000000]   node   0: [mem 0x0000000000820000-0x000000000307ffff]
[    0.000000]   node   0: [mem 0x0000000003080000-0x000000000308ffff]
[    0.000000]   node   0: [mem 0x0000000003090000-0x00000000031fffff]
[    0.000000]   node   0: [mem 0x0000000003200000-0x00000000033fffff]
[    0.000000]   node   0: [mem 0x0000000003410000-0x00000000034fffff]
[    0.000000]   node   0: [mem 0x0000000003500000-0x000000000351ffff]
[    0.000000]   node   0: [mem 0x0000000003520000-0x000000000353ffff]
[    0.000000]   node   0: [mem 0x0000000003540000-0x0000000003e3ffff]
[    0.000000]   node   0: [mem 0x0000000003e40000-0x0000000003e7ffff]
[    0.000000]   node   0: [mem 0x0000000003e80000-0x0000000003ecffff]
[    0.000000]   node   0: [mem 0x0000000003ed0000-0x0000000003ed5fff]
[    0.000000]   node   0: [mem 0x0000000003ed6000-0x0000000006eeafff]
[    0.000000]   node   0: [mem 0x0000000006eeb000-0x000000000710ffff]
[    0.000000]   node   0: [mem 0x0000000007110000-0x0000000007f0ffff]
[    0.000000]   node   0: [mem 0x0000000007f10000-0x0000000007faffff]
[    0.000000]   node   0: [mem 0x0000000007fb0000-0x000000000806ffff]
[    0.000000]   node   0: [mem 0x0000000008070000-0x00000000080affff]
[    0.000000]   node   0: [mem 0x00000000080b0000-0x000000000832ffff]
[    0.000000]   node   0: [mem 0x0000000008330000-0x000000000836ffff]
[    0.000000]   node   0: [mem 0x0000000008370000-0x000000000838ffff]
[    0.000000]   node   0: [mem 0x0000000008390000-0x00000000083a9fff]
[    0.000000]   node   0: [mem 0x00000000083aa000-0x00000000083bbfff]
[    0.000000]   node   0: [mem 0x00000000083bc000-0x00000000083fffff]
[    0.000000]   node   0: [mem 0x0000000008400000-0x000000000841ffff]
[    0.000000]   node   0: [mem 0x0000000008420000-0x000000000843ffff]
[    0.000000]   node   0: [mem 0x0000000008440000-0x000000000865ffff]
[    0.000000]   node   0: [mem 0x0000000008660000-0x000000000869ffff]
[    0.000000]   node   0: [mem 0x00000000086a0000-0x00000000086affff]
[    0.000000]   node   0: [mem 0x00000000086b0000-0x00000000086effff]
[    0.000000]   node   0: [mem 0x00000000086f0000-0x0000000008b6ffff]
[    0.000000]   node   0: [mem 0x0000000008b70000-0x0000000008bbffff]
[    0.000000]   node   0: [mem 0x0000000008bc0000-0x0000000008edffff]
[    0.000000]   node   0: [mem 0x0000000008ee0000-0x0000000008ee0fff]
[    0.000000]   node   0: [mem 0x0000000008ee1000-0x0000000008ee2fff]
[    0.000000]   node   0: [mem 0x0000000008ee3000-0x000000000decffff]
[    0.000000]   node   0: [mem 0x000000000ded0000-0x000000000defffff]
[    0.000000]   node   0: [mem 0x000000000df00000-0x000000000fffffff]
[    0.000000]   node   0: [mem 0x0000000010800000-0x0000000017feffff]
[    0.000000]   node   0: [mem 0x000000001c000000-0x000000001c00ffff]
[    0.000000]   node   0: [mem 0x000000001c010000-0x000000001c7fffff]
[    0.000000]   node   0: [mem 0x000000001c810000-0x000000007efbffff]
[    0.000000]   node   0: [mem 0x000000007efc0000-0x000000007efdffff]
[    0.000000]   node   0: [mem 0x000000007efe0000-0x000000007efeffff]
[    0.000000]   node   0: [mem 0x000000007eff0000-0x000000007effffff]
[    0.000000]   node   0: [mem 0x000000007f000000-0x00000017ffffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000200000-0x00000017ffffffff]
[    0.000000] On node 0 totalpages: 25145296
[    0.000000]   DMA32 zone: 16376 pages used for memmap
[    0.000000]   DMA32 zone: 0 pages reserved
[    0.000000]   DMA32 zone: 1028048 pages, LIFO batch:31
[    0.000000]   Normal zone: 376832 pages used for memmap
[    0.000000]   Normal zone: 24117248 pages, LIFO batch:31
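
To make the gap skipping concrete with the map above (64k pages, so
pfn = addr >> 16): the hole between [mem 0x10800000-0x17feffff] and
[mem 0x1c000000-...] spans the invalid pfns 0x17ff through 0x1bff,
roughly 0x400 pfns. Once the first invalid pfn 0x17ff is hit, a single
memblock_next_valid_pfn() call returns PHYS_PFN(0x1c000000) = 0x1c00
instead of probing every pfn in between, and with the fast path below
the per-pfn binary searches inside each region disappear as well.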

Signed-off-by: Jia He <jia.he@hxt-semitech.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
---
 mm/memblock.c | 33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 077ca62..46cb6be 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1144,28 +1144,49 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
 unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn)
 {
 	struct memblock_type *type = &memblock.memory;
+	struct memblock_region *regions = type->regions;
 	unsigned int right = type->cnt;
 	unsigned int mid, left = 0;
+	unsigned long start_pfn, end_pfn, next_start_pfn;
 	phys_addr_t addr = PFN_PHYS(++pfn);
+	static int early_region_idx __initdata_memblock = -1;
 
+	/* fast path, return pfn+1 if next pfn is in the same region */
+	if (early_region_idx != -1) {
+		start_pfn = PFN_DOWN(regions[early_region_idx].base);
+		end_pfn = PFN_DOWN(regions[early_region_idx].base +
+				regions[early_region_idx].size);
+
+		if (pfn >= start_pfn && pfn < end_pfn)
+			return pfn;
+
+		early_region_idx++;
+		next_start_pfn = PFN_DOWN(regions[early_region_idx].base);
+
+		if (pfn >= end_pfn && pfn <= next_start_pfn)
+			return next_start_pfn;
+	}
+
+	/* slow path, do the binary searching */
 	do {
 		mid = (right + left) / 2;
 
-		if (addr < type->regions[mid].base)
+		if (addr < regions[mid].base)
 			right = mid;
-		else if (addr >= (type->regions[mid].base +
-				  type->regions[mid].size))
+		else if (addr >= (regions[mid].base + regions[mid].size))
 			left = mid + 1;
 		else {
-			/* addr is within the region, so pfn is valid */
+			early_region_idx = mid;
 			return pfn;
 		}
 	} while (left < right);
 
 	if (right == type->cnt)
 		return -1UL;
-	else
-		return PHYS_PFN(type->regions[right].base);
+
+	early_region_idx = right;
+
+	return PHYS_PFN(regions[early_region_idx].base);
 }
 EXPORT_SYMBOL(memblock_next_valid_pfn);
 #endif /*CONFIG_HAVE_MEMBLOCK_PFN_VALID*/
-- 
1.8.3.1
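
Purely as an illustration again, the cached-index fast path above can
be sketched by extending the user-space harness shown after patch 2,
reusing its regions[], cnt, PHYS_PFN() and next_valid_pfn(); note that
the explicit bounds check on the region table is an addition of this
sketch:

	static int early_region_idx = -1;

	static unsigned long next_valid_pfn_cached(unsigned long pfn)
	{
		unsigned long start_pfn, end_pfn, next_start_pfn;

		pfn++;
		if (early_region_idx != -1) {
			start_pfn = PHYS_PFN(regions[early_region_idx].base);
			end_pfn = PHYS_PFN(regions[early_region_idx].base +
					   regions[early_region_idx].size);

			/* fast path: pfn is still inside the cached region */
			if (pfn >= start_pfn && pfn < end_pfn)
				return pfn;

			/* pfn sits in the gap right after the cached region:
			 * jump straight to the next region's first pfn */
			if (early_region_idx + 1 < (int)cnt) {
				early_region_idx++;
				next_start_pfn =
					PHYS_PFN(regions[early_region_idx].base);
				if (pfn >= end_pfn && pfn <= next_start_pfn)
					return next_start_pfn;
			}
		}

		/* slow path: binary search as in the patch 2 harness; the
		 * kernel patch also records the region index found there */
		return next_valid_pfn(pfn - 1);
	}

Because early memmap initialization walks pfns in ascending order,
after the first binary search most calls should be answered by the two
comparisons of the fast path, which is where the reported
27956us -> 13537us memmap_init() improvement comes from.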


* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
  2018-08-22  3:07 ` Jia He
@ 2018-09-05 21:57   ` Andrew Morton
  -1 siblings, 0 replies; 30+ messages in thread
From: Andrew Morton @ 2018-09-05 21:57 UTC (permalink / raw)
  To: Jia He
  Cc: Russell King, Catalin Marinas, Will Deacon, Mark Rutland,
	Ard Biesheuvel, Michal Hocko, Wei Yang, Kees Cook, Laura Abbott,
	Vladimir Murzin, Philip Derrin, AKASHI Takahiro, James Morse,
	Steve Capper, Gioh Kim, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Kemi Wang, Petr Tesarik, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Daniel Jordan, Daniel Vacek,
	Eugeniu Rosca, linux-arm-kernel, linux-kernel, linux-mm, Jia He

On Wed, 22 Aug 2018 11:07:14 +0800 Jia He <hejianet@gmail.com> wrote:

> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
> where possible") optimized the loop in memmap_init_zone(), but it could
> cause a panic on some machines, so Daniel Vacek later reverted it.
>
> However, as Daniel Vacek suggested, it is fine to use memblock to skip
> gaps and find the next valid pfn on architectures that provide
> CONFIG_HAVE_ARCH_PFN_VALID.
>
> More from what Daniel said:
> "On arm and arm64, memblock is used by default. But generic version of
> pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
> not always return the next valid one but skips more resulting in some
> valid frames to be skipped (as if they were invalid). And that's why
> kernel was eventually crashing on some !arm machines."
>
> About the performance consideration, as James said in b92df1de5:
> "I have tested this patch on a virtual model of a Samurai CPU with a
> sparse memory map.  The kernel boot time drops from 109 to 62 seconds."
> Thus it is better to retain memblock_next_valid_pfn() on arm/arm64.
>
> Beyond retaining memblock_next_valid_pfn(), there is still some room
> for improvement. With this set applied, the time overhead of memmap_init
> drops from 27956us to 13537us on my armv8a server (QDF2400 with 96G
> memory, page size 64k). Arm servers should benefit even more once memory
> reaches the TB range.

Thanks.  I switched to v11.  It would be nice to see some confirmation
from ARM people please?


* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
  2018-09-05 21:57   ` Andrew Morton
@ 2018-09-06 10:47     ` Will Deacon
  -1 siblings, 0 replies; 30+ messages in thread
From: Will Deacon @ 2018-09-06 10:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jia He, Russell King, Catalin Marinas, Mark Rutland,
	Ard Biesheuvel, Michal Hocko, Wei Yang, Kees Cook, Laura Abbott,
	Vladimir Murzin, Philip Derrin, AKASHI Takahiro, James Morse,
	Steve Capper, Gioh Kim, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Kemi Wang, Petr Tesarik, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Daniel Jordan, Daniel Vacek,
	Eugeniu Rosca, linux-arm-kernel, linux-kernel, linux-mm, Jia He

On Wed, Sep 05, 2018 at 02:57:55PM -0700, Andrew Morton wrote:
> On Wed, 22 Aug 2018 11:07:14 +0800 Jia He <hejianet@gmail.com> wrote:
> 
> > Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
> > where possible") optimized the loop in memmap_init_zone(). But it causes
> > possible panic bug. So Daniel Vacek reverted it later.
> > 
> > But as suggested by Daniel Vacek, it is fine to using memblock to skip
> > gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
> > 
> > More from what Daniel said:
> > "On arm and arm64, memblock is used by default. But generic version of
> > pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
> > not always return the next valid one but skips more resulting in some
> > valid frames to be skipped (as if they were invalid). And that's why
> > kernel was eventually crashing on some !arm machines."
> > 
> > About the performance consideration:
> > As said by James in b92df1de5,
> > "I have tested this patch on a virtual model of a Samurai CPU with a
> > sparse memory map.  The kernel boot time drops from 109 to 62 seconds."
> > Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
> > 
> > Besides we can remain memblock_next_valid_pfn, there is still some room
> > for improvement. After this set, I can see the time overhead of memmap_init
> > is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
> > memory, pagesize 64k). I believe arm server will benefit more if memory is
> > larger than TBs
> 
> Thanks.  I switched to v11.  It would be nice to see some confirmation
> from ARM people please?

I'll take a look...

Will

* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
  2018-08-22  3:07 ` Jia He
@ 2018-09-06 11:24   ` Ard Biesheuvel
  -1 siblings, 0 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2018-09-06 11:24 UTC (permalink / raw)
  To: Jia He
  Cc: Russell King, Catalin Marinas, Will Deacon, Mark Rutland,
	Andrew Morton, Michal Hocko, Wei Yang, Kees Cook, Laura Abbott,
	Vladimir Murzin, Philip Derrin, AKASHI Takahiro, James Morse,
	Steve Capper, Gioh Kim, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Kemi Wang, Petr Tesarik, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Daniel Jordan, Daniel Vacek,
	Eugeniu Rosca, linux-arm-kernel, Linux Kernel Mailing List,
	Linux-MM, Jia He

On 22 August 2018 at 05:07, Jia He <hejianet@gmail.com> wrote:
> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
> where possible") optimized the loop in memmap_init_zone(), but it could
> cause a panic on some machines, so Daniel Vacek later reverted it.
>
> However, as Daniel Vacek suggested, it is fine to use memblock to skip
> gaps and find the next valid pfn on architectures that provide
> CONFIG_HAVE_ARCH_PFN_VALID.
>
> More from what Daniel said:
> "On arm and arm64, memblock is used by default. But generic version of
> pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
> not always return the next valid one but skips more resulting in some
> valid frames to be skipped (as if they were invalid). And that's why
> kernel was eventually crashing on some !arm machines."
>
> About the performance consideration, as James said in b92df1de5:
> "I have tested this patch on a virtual model of a Samurai CPU with a
> sparse memory map.  The kernel boot time drops from 109 to 62 seconds."
> Thus it is better to retain memblock_next_valid_pfn() on arm/arm64.
>
> Beyond retaining memblock_next_valid_pfn(), there is still some room
> for improvement. With this set applied, the time overhead of memmap_init
> drops from 27956us to 13537us on my armv8a server (QDF2400 with 96G
> memory, page size 64k). Arm servers should benefit even more once memory
> reaches the TB range.
>

OK so we can summarize the benefits of this series as follows:
- boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
- boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
*milliseconds*

Google was not very helpful in figuring out what a Samurai CPU is and
why we should care about the boot time of Linux running on a virtual
model of it, and the 15 ms speedup is not that compelling either.

Apologies to Jia that it took 11 revisions to reach this conclusion,
but in /my/ opinion, tweaking the fragile memblock/pfn handling code
for this reason is totally unjustified, and we're better off
disregarding these patches.





> Patch 1 introduces new config to make codes more generic
> Patch 2 remains the memblock_next_valid_pfn on arm and arm64,this patch is
>         originated from b92df1de5
> Patch 3 optimizes the memblock_next_valid_pfn()
>
> Changelog:
> V11:- drop patch#4-6, refine the codes
> V10:- move codes to memblock.c, refine the performance consideration
> V9: - rebase to mmotm master, refine the log description. No major changes
> V8: - introduce new config and move generic code to early_pfn.h
>     - optimize memblock_next_valid_pfn as suggested by Matthew Wilcox
> V7: - fix i386 compilation error. refine the commit description
> V6: - simplify the codes, move arm/arm64 common codes to one file.
>     - refine patches as suggested by Danial Vacek and Ard Biesheuvel
> V5: - further refining as suggested by Danial Vacek. Make codes
>       arm/arm64 more arch specific
> V4: - refine patches as suggested by Danial Vacek and Wei Yang
>     - optimized on arm besides arm64
> V3: - fix 2 issues reported by kbuild test robot
> V2: - rebase to mmotm latest
>     - remain memblock_next_valid_pfn on arm64
>     - refine memblock_search_pfn_regions and pfn_valid_region
>
> Jia He (3):
>   arm: arm64: introduce CONFIG_HAVE_MEMBLOCK_PFN_VALID
>   mm: page_alloc: remain memblock_next_valid_pfn() on arm/arm64
>   mm: page_alloc: reduce unnecessary binary search in
>     memblock_next_valid_pfn
>
>  arch/arm/Kconfig       |  1 +
>  arch/arm64/Kconfig     |  1 +
>  include/linux/mmzone.h |  9 +++++++++
>  mm/Kconfig             |  3 +++
>  mm/memblock.c          | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/page_alloc.c        |  5 ++++-
>  6 files changed, 69 insertions(+), 1 deletion(-)
>
> --
> 1.8.3.1
>

* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
  2018-09-06 11:24   ` Ard Biesheuvel
@ 2018-09-07 14:44     ` Will Deacon
  -1 siblings, 0 replies; 30+ messages in thread
From: Will Deacon @ 2018-09-07 14:44 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Jia He, Russell King, Catalin Marinas, Mark Rutland,
	Andrew Morton, Michal Hocko, Wei Yang, Kees Cook, Laura Abbott,
	Vladimir Murzin, Philip Derrin, AKASHI Takahiro, James Morse,
	Steve Capper, Gioh Kim, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Kemi Wang, Petr Tesarik, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Daniel Jordan, Daniel Vacek,
	Eugeniu Rosca, linux-arm-kernel, Linux Kernel Mailing List,
	Linux-MM, Jia He

On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
> On 22 August 2018 at 05:07, Jia He <hejianet@gmail.com> wrote:
> > Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
> > where possible") optimized the loop in memmap_init_zone(), but it could
> > cause a panic on some machines, so Daniel Vacek later reverted it.
> >
> > However, as Daniel Vacek suggested, it is fine to use memblock to skip
> > gaps and find the next valid pfn on architectures that provide
> > CONFIG_HAVE_ARCH_PFN_VALID.
> >
> > More from what Daniel said:
> > "On arm and arm64, memblock is used by default. But generic version of
> > pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
> > not always return the next valid one but skips more resulting in some
> > valid frames to be skipped (as if they were invalid). And that's why
> > kernel was eventually crashing on some !arm machines."
> >
> > About the performance consideration, as James said in b92df1de5:
> > "I have tested this patch on a virtual model of a Samurai CPU with a
> > sparse memory map.  The kernel boot time drops from 109 to 62 seconds."
> > Thus it is better to retain memblock_next_valid_pfn() on arm/arm64.
> >
> > Beyond retaining memblock_next_valid_pfn(), there is still some room
> > for improvement. With this set applied, the time overhead of memmap_init
> > drops from 27956us to 13537us on my armv8a server (QDF2400 with 96G
> > memory, page size 64k). Arm servers should benefit even more once memory
> > reaches the TB range.
> >
> 
> OK so we can summarize the benefits of this series as follows:
> - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
> - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
> *milliseconds*
> 
> Google was not very helpful in figuring out what a Samurai CPU is and
> why we should care about the boot time of Linux running on a virtual
> model of it, and the 15 ms speedup is not that compelling either.
> 
> Apologies to Jia that it took 11 revisions to reach this conclusion,
> but in /my/ opinion, tweaking the fragile memblock/pfn handling code
> for this reason is totally unjustified, and we're better off
> disregarding these patches.

Oh, we're talking about a *simulator* for the significant boot time
improvement here? I didn't realise that, so I agree that the premise of
this patch set looks pretty questionable given how much "fun" we've had
with the memmap on arm and arm64.

Will

* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
  2018-09-07 14:44     ` Will Deacon
@ 2018-09-14 18:50       ` Eugeniu Rosca
  0 siblings, 0 replies; 30+ messages in thread
From: Eugeniu Rosca @ 2018-09-14 18:50 UTC (permalink / raw)
  To: Jia He, Will Deacon, Ard Biesheuvel
  Cc: Russell King, Catalin Marinas, Mark Rutland, Andrew Morton,
	Michal Hocko, Wei Yang, Kees Cook, Laura Abbott, Vladimir Murzin,
	Philip Derrin, AKASHI Takahiro, James Morse, Steve Capper,
	Gioh Kim, Vlastimil Babka, Mel Gorman, Johannes Weiner,
	Kemi Wang, Petr Tesarik, YASUAKI ISHIMATSU, Andrey Ryabinin,
	Nikolay Borisov, Daniel Jordan, Daniel Vacek, linux-arm-kernel,
	Linux Kernel Mailing List, Linux-MM, Jia He, George G. Davis,
	Vladimir Zapolskiy, Andy Lowe, linux-renesas-soc, Eugeniu Rosca,
	Eugeniu Rosca

+ Renesas people

Hello Will, hello Ard, 

On Fri, Sep 07, 2018 at 03:44:47PM +0100, Will Deacon wrote:
> On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
> > OK so we can summarize the benefits of this series as follows:
> > - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
> > - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
> > *milliseconds*
> > 
> > Google was not very helpful in figuring out what a Samurai CPU is and
> > why we should care about the boot time of Linux running on a virtual
> > model of it, and the 15 ms speedup is not that compelling either.
> > 
> > Apologies to Jia that it took 11 revisions to reach this conclusion,
> > but in /my/ opinion, tweaking the fragile memblock/pfn handling code
> > for this reason is totally unjustified, and we're better off
> > disregarding these patches.
> 
> Oh, we're talking about a *simulator* for the significant boot time
> improvement here? I didn't realise that, so I agree that the premise of
> this patch set looks pretty questionable given how much "fun" we've had
> with the memmap on arm and arm64.
> 
> Will

Similar to https://lkml.org/lkml/2018/1/24/420, my measurements show that
the boot time of an R-Car H3-ES2.0 Salvator-X (with 4GiB of RAM) decreases
by ~135-140ms with this patch set applied on top of v4.19-rc3.

I agree that in the Desktop realm you would barely perceive the 140ms
difference, but saving 140ms on an automotive SoC (designed for products
which must comply with the 2s-to-rear-view-camera NHTSA US regulation) *is*
significant.

FWIW, cppcheck and `checkpatch --strict` report style issues for
patches #2 and #3. I hope these can be fixed so the review process
can continue? From a functional standpoint, I did some dynamic testing on
H3-Salvator-X with UBSAN/KASAN=y and didn't observe any regressions, so:

Tested-by: Eugeniu Rosca <erosca@de.adit-jv.com>

Best regards,
Eugeniu.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
  2018-09-07 14:44     ` Will Deacon
@ 2019-06-08  4:22       ` Hanjun Guo
  0 siblings, 0 replies; 30+ messages in thread
From: Hanjun Guo @ 2019-06-08  4:22 UTC (permalink / raw)
  To: Will Deacon, Ard Biesheuvel
  Cc: Mark Rutland, Michal Hocko, Catalin Marinas, Wei Yang, Linux-MM,
	Jia He, Eugeniu Rosca, Petr Tesarik, Nikolay Borisov,
	Russell King, Daniel Jordan, AKASHI Takahiro, Gioh Kim,
	Andrey Ryabinin, Laura Abbott, Daniel Vacek, Mel Gorman,
	Vladimir Murzin, Kees Cook, Philip Derrin, YASUAKI ISHIMATSU,
	Jia He, Kemi Wang, Vlastimil Babka, linux-arm-kernel,
	Steve Capper, Linux Kernel Mailing List, James Morse,
	Johannes Weiner, Andrew Morton

Hi Ard, Will,

This week we were trying to debug a time-consuming issue in mem_init(),
and arrived at a solution similar to this one from Jia He, so I would like
to bring this thread back; please see my detailed test results below.

On 2018/9/7 22:44, Will Deacon wrote:
> On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
>> On 22 August 2018 at 05:07, Jia He <hejianet@gmail.com> wrote:
>>> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
>>> where possible") optimized the loop in memmap_init_zone(). But it causes
>>> possible panic bug. So Daniel Vacek reverted it later.
>>>
>>> But as suggested by Daniel Vacek, it is fine to using memblock to skip
>>> gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
>>>
>>> More from what Daniel said:
>>> "On arm and arm64, memblock is used by default. But generic version of
>>> pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
>>> not always return the next valid one but skips more resulting in some
>>> valid frames to be skipped (as if they were invalid). And that's why
>>> kernel was eventually crashing on some !arm machines."
>>>
>>> About the performance consideration:
>>> As said by James in b92df1de5,
>>> "I have tested this patch on a virtual model of a Samurai CPU with a
>>> sparse memory map.  The kernel boot time drops from 109 to 62 seconds."
>>> Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
>>>
>>> Besides we can remain memblock_next_valid_pfn, there is still some room
>>> for improvement. After this set, I can see the time overhead of memmap_init
>>> is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
>>> memory, pagesize 64k). I believe arm server will benefit more if memory is
>>> larger than TBs
>>>
>>
>> OK so we can summarize the benefits of this series as follows:
>> - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
>> - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
>> *milliseconds*
>>
>> Google was not very helpful in figuring out what a Samurai CPU is and
>> why we should care about the boot time of Linux running on a virtual
>> model of it, and the 15 ms speedup is not that compelling either.

Testing this patch set on top of a Kunpeng 920 based ARM64 server with
384G of memory in total, we got the times below:

             without this patch set      with this patch set
mem_init()        13310ms                      1415ms

So we got about an 8x speedup on this machine, which is very impressive.

The time consumed is related to the memory DIMM size and to where those
memory DIMMs are located in the slots. In the above case we are using 16G
memory DIMMs. We also tested 1T of memory with 64G per memory DIMM on
another ARM64 machine, and the time consumed dropped from 20s to 2s (I
think this is related to the firmware implementation).

>>
>> Apologies to Jia that it took 11 revisions to reach this conclusion,
>> but in /my/ opinion, tweaking the fragile memblock/pfn handling code
>> for this reason is totally unjustified, and we're better off
>> disregarding these patches.

Indeed this patch set has a bug. For example, if we have 3 regions
[a, b], [c, d], [e, f] and the pfn's address is bigger than the end
address of the last region, we will increase early_region_idx to the
region count, which is out of bounds of the regions array. Fixed by the
patch below,

 mm/memblock.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 8279295..8283bf0 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1252,13 +1252,17 @@ unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn)
 		if (pfn >= start_pfn && pfn < end_pfn)
 			return pfn;

-		early_region_idx++;
+		/* try slow path */
+		if (++early_region_idx == type->cnt)
+			goto slow_path;
+
 		next_start_pfn = PFN_DOWN(regions[early_region_idx].base);

 		if (pfn >= end_pfn && pfn <= next_start_pfn)
 			return next_start_pfn;
 	}

+slow_path:
 	/* slow path, do the binary searching */
 	do {
 		mid = (right + left) / 2;
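
To make the out-of-bounds case concrete, below is a small self-contained
userspace sketch of the fixed control flow (the region values, types and
helper names are invented for illustration; only the fast-path/slow-path
structure and the added bound check mirror the hunk above). Built with a
plain C compiler it prints 16, 40 and 96 for the three calls in main():

/* Illustrative sketch, not the verbatim kernel code. */
#include <stdio.h>

struct region { unsigned long start, end; };	/* pfn range [start, end) */

static struct region regions[] = { {10, 20}, {40, 50}, {80, 90} };
static const int region_cnt = 3;
static int early_region_idx;	/* cached across calls, as in the patch */

static unsigned long next_valid_pfn(unsigned long pfn)
{
	int left = 0, right = region_cnt - 1, mid;

	pfn++;	/* we want the first valid pfn strictly after 'pfn' */

	/* fast path: callers walk pfns in order, so the answer is almost
	 * always in the cached region or in the gap right after it */
	if (pfn >= regions[early_region_idx].start &&
	    pfn <  regions[early_region_idx].end)
		return pfn;

	if (pfn >= regions[early_region_idx].end) {
		/* the fix: without this bound check, the index could
		 * reach region_cnt and the gap check below would read
		 * one element past the end of regions[] */
		if (early_region_idx + 1 == region_cnt)
			goto slow_path;
		if (pfn <= regions[early_region_idx + 1].start)
			return regions[++early_region_idx].start;
	}

slow_path:
	/* binary search for the first region whose end is above pfn */
	while (left < right) {
		mid = (left + right) / 2;
		if (regions[mid].end <= pfn)
			left = mid + 1;
		else
			right = mid;
	}
	early_region_idx = left;
	if (pfn < regions[left].start)
		pfn = regions[left].start;
	return pfn;	/* a pfn past the last region passes through */
}

int main(void)
{
	printf("%lu\n", next_valid_pfn(15));	/* 16: still inside [10, 20) */
	printf("%lu\n", next_valid_pfn(25));	/* 40: in a gap, snap to next region */
	printf("%lu\n", next_valid_pfn(95));	/* 96: past the last region */
	return 0;
}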

Given the really impressive speedup on our ARM64 server system, could you
reconsider this patch set for merging? If you want more data, I'm willing
to clarify and run more tests.

Thanks
Hanjun


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
  2019-06-08  4:22       ` Hanjun Guo
@ 2019-06-10 13:16         ` Ard Biesheuvel
  0 siblings, 0 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2019-06-10 13:16 UTC (permalink / raw)
  To: Hanjun Guo
  Cc: Will Deacon, Ard Biesheuvel, Mark Rutland, Michal Hocko,
	Catalin Marinas, Kemi Wang, Wei Yang, Linux-MM, Eugeniu Rosca,
	Petr Tesarik, Nikolay Borisov, Russell King, Daniel Jordan,
	AKASHI Takahiro, Mel Gorman, Andrey Ryabinin, Laura Abbott,
	Daniel Vacek, Vladimir Murzin, Kees Cook, Vlastimil Babka,
	Johannes Weiner, YASUAKI ISHIMATSU, Jia He, Jia He, Gioh Kim,
	linux-arm-kernel, Steve Capper, Linux Kernel Mailing List,
	James Morse, Philip Derrin, Andrew Morton

On Sat, 8 Jun 2019 at 06:22, Hanjun Guo <guohanjun@huawei.com> wrote:
>
> Hi Ard, Will,
>
> This week we were trying to debug an issue of time consuming in mem_init(),
> and leading to this similar solution form Jia He, so I would like to bring this
> thread back, please see my detail test result below.
>
> On 2018/9/7 22:44, Will Deacon wrote:
> > On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
> >> On 22 August 2018 at 05:07, Jia He <hejianet@gmail.com> wrote:
> >>> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
> >>> where possible") optimized the loop in memmap_init_zone(). But it causes
> >>> possible panic bug. So Daniel Vacek reverted it later.
> >>>
> >>> But as suggested by Daniel Vacek, it is fine to using memblock to skip
> >>> gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
> >>>
> >>> More from what Daniel said:
> >>> "On arm and arm64, memblock is used by default. But generic version of
> >>> pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
> >>> not always return the next valid one but skips more resulting in some
> >>> valid frames to be skipped (as if they were invalid). And that's why
> >>> kernel was eventually crashing on some !arm machines."
> >>>
> >>> About the performance consideration:
> >>> As said by James in b92df1de5,
> >>> "I have tested this patch on a virtual model of a Samurai CPU with a
> >>> sparse memory map.  The kernel boot time drops from 109 to 62 seconds."
> >>> Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
> >>>
> >>> Besides we can remain memblock_next_valid_pfn, there is still some room
> >>> for improvement. After this set, I can see the time overhead of memmap_init
> >>> is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
> >>> memory, pagesize 64k). I believe arm server will benefit more if memory is
> >>> larger than TBs
> >>>
> >>
> >> OK so we can summarize the benefits of this series as follows:
> >> - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
> >> - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
> >> *milliseconds*
> >>
> >> Google was not very helpful in figuring out what a Samurai CPU is and
> >> why we should care about the boot time of Linux running on a virtual
> >> model of it, and the 15 ms speedup is not that compelling either.
>
> Testing this patch set on top of Kunpeng 920 based ARM64 server, with
> 384G memory in total, we got the time consuming below
>
>              without this patch set      with this patch set
> mem_init()        13310ms                      1415ms
>
> So we got about 8x speedup on this machine, which is very impressive.
>

Yes, this is impressive. But does it matter in the grand scheme of
things? How much time does this system take to arrive at this point
from power on?

> The time consuming is related the memory DIMM size and where to locate those
> memory DIMMs in the slots. In above case, we are using 16G memory DIMM.
> We also tested 1T memory with 64G size for each memory DIMM on another ARM64
> machine, the time consuming reduced from 20s to 2s (I think it's related to
> firmware implementations).
>

I agree that this optimization looks good in isolation, but the fact
that you spotted a bug justifies my skepticism at the time. On the
other hand, now that we have several independent reports (from you,
but also from the Renesas folks) that the speedup is worthwhile for
real-world use cases, I think it does make sense to revisit it.

So what I would like to see is the patch set being proposed again,
with the new data points added for documentation. Also, the commit
logs need to be crystal clear about how the meaning of PFN validity
differs between ARM and other architectures, and why the assumptions
that the optimization is based on are guaranteed to hold.
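
For reference, the distinction looks roughly like this in the v4.19-era
sources (paraphrased from memory, so treat it as a sketch rather than
verbatim kernel code): with CONFIG_HAVE_ARCH_PFN_VALID, arm64 answers
pfn_valid() from the memblock map, while the generic sparsemem version
only checks that the enclosing mem section exists, so a whole section can
look valid even when it contains holes.

/* arm64 with CONFIG_HAVE_ARCH_PFN_VALID (paraphrased): validity follows
 * the memblock memory map, so gaps between regions report invalid */
int pfn_valid(unsigned long pfn)
{
	return memblock_is_map_memory(pfn << PAGE_SHIFT);
}

/* generic sparsemem fallback (paraphrased): validity is only tracked
 * per section, so any pfn inside a partially populated section still
 * reports valid */
static inline int pfn_valid(unsigned long pfn)
{
	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
		return 0;
	return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
}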



> >>
> >> Apologies to Jia that it took 11 revisions to reach this conclusion,
> >> but in /my/ opinion, tweaking the fragile memblock/pfn handling code
> >> for this reason is totally unjustified, and we're better off
> >> disregarding these patches.
>
> Indeed this patch set has a bug, For exampe, if we have 3 regions which
> is [a, b] [c, d] [e, f] if address of pfn is bigger than the end address of
> last region, we will increase early_region_idx to count of region, which is
> out of bound of the regions. Fixed by patch below,
>
>  mm/memblock.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 8279295..8283bf0 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1252,13 +1252,17 @@ unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn)
>                 if (pfn >= start_pfn && pfn < end_pfn)
>                         return pfn;
>
> -               early_region_idx++;
> +               /* try slow path */
> +               if (++early_region_idx == type->cnt)
> +                       goto slow_path;
> +
>                 next_start_pfn = PFN_DOWN(regions[early_region_idx].base);
>
>                 if (pfn >= end_pfn && pfn <= next_start_pfn)
>                         return next_start_pfn;
>         }
>
> +slow_path:
>         /* slow path, do the binary searching */
>         do {
>                 mid = (right + left) / 2;
>
> As the really impressive speedup on our ARM64 server system, could you reconsider
> this patch set for merge? if you want more data I'm willing to clarify and give
> more test.
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
  2019-06-10 13:16         ` Ard Biesheuvel
@ 2019-06-11 15:18           ` Hanjun Guo
  0 siblings, 0 replies; 30+ messages in thread
From: Hanjun Guo @ 2019-06-11 15:18 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Will Deacon, Ard Biesheuvel, Mark Rutland, Michal Hocko,
	Catalin Marinas, Kemi Wang, Wei Yang, Linux-MM, Eugeniu Rosca,
	Petr Tesarik, Nikolay Borisov, Russell King, Daniel Jordan,
	AKASHI Takahiro, Mel Gorman, Andrey Ryabinin, Laura Abbott,
	Daniel Vacek, Vladimir Murzin, Kees Cook, Vlastimil Babka,
	Johannes Weiner, YASUAKI ISHIMATSU, Jia He, Jia He, Gioh Kim,
	linux-arm-kernel, Steve Capper, Linux Kernel Mailing List,
	James Morse, Philip Derrin, Andrew Morton

Hello Ard,

Thanks for the reply, please see my comments inline.

On 2019/6/10 21:16, Ard Biesheuvel wrote:
> On Sat, 8 Jun 2019 at 06:22, Hanjun Guo <guohanjun@huawei.com> wrote:
>>
>> Hi Ard, Will,
>>
>> This week we were trying to debug an issue of time consuming in mem_init(),
>> and leading to this similar solution form Jia He, so I would like to bring this
>> thread back, please see my detail test result below.
>>
>> On 2018/9/7 22:44, Will Deacon wrote:
>>> On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
>>>> On 22 August 2018 at 05:07, Jia He <hejianet@gmail.com> wrote:
>>>>> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
>>>>> where possible") optimized the loop in memmap_init_zone(). But it causes
>>>>> possible panic bug. So Daniel Vacek reverted it later.
>>>>>
>>>>> But as suggested by Daniel Vacek, it is fine to using memblock to skip
>>>>> gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
>>>>>
>>>>> More from what Daniel said:
>>>>> "On arm and arm64, memblock is used by default. But generic version of
>>>>> pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
>>>>> not always return the next valid one but skips more resulting in some
>>>>> valid frames to be skipped (as if they were invalid). And that's why
>>>>> kernel was eventually crashing on some !arm machines."
>>>>>
>>>>> About the performance consideration:
>>>>> As said by James in b92df1de5,
>>>>> "I have tested this patch on a virtual model of a Samurai CPU with a
>>>>> sparse memory map.  The kernel boot time drops from 109 to 62 seconds."
>>>>> Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
>>>>>
>>>>> Besides we can remain memblock_next_valid_pfn, there is still some room
>>>>> for improvement. After this set, I can see the time overhead of memmap_init
>>>>> is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
>>>>> memory, pagesize 64k). I believe arm server will benefit more if memory is
>>>>> larger than TBs
>>>>>
>>>>
>>>> OK so we can summarize the benefits of this series as follows:
>>>> - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
>>>> - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
>>>> *milliseconds*
>>>>
>>>> Google was not very helpful in figuring out what a Samurai CPU is and
>>>> why we should care about the boot time of Linux running on a virtual
>>>> model of it, and the 15 ms speedup is not that compelling either.
>>
>> Testing this patch set on top of Kunpeng 920 based ARM64 server, with
>> 384G memory in total, we got the time consuming below
>>
>>              without this patch set      with this patch set
>> mem_init()        13310ms                      1415ms
>>
>> So we got about 8x speedup on this machine, which is very impressive.
>>
> 
> Yes, this is impressive. But does it matter in the grand scheme of
> things? 

It matters for this machine: it is used for storage, and there is a
watchdog, so the time consumed here ends up triggering the watchdog.

> How much time does this system take to arrive at this point
> from power on?

Sorry, I don't have such data: the arch timer is not initialized at that
point, so there are no timestamps to look at, but I read the cycle counter
from the arch timer before and after the time-consuming function to work
out how much time was consumed.
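
Roughly like the sketch below (illustrative only; the helper names are
made up and this is not the exact code I used):

/* Bracketing mem_init() with raw arm64 arch timer reads, which works
 * before any clocksource is registered. */
static u64 __init boot_cycles(void)
{
	u64 cval;

	/* the isb keeps the counter read from being speculated early */
	asm volatile("isb; mrs %0, cntvct_el0" : "=r" (cval));
	return cval;
}

static void __init timed_mem_init(void)
{
	u64 t0, t1, freq;

	t0 = boot_cycles();
	mem_init();
	t1 = boot_cycles();

	asm volatile("mrs %0, cntfrq_el0" : "=r" (freq));
	pr_info("mem_init: %llu us\n", (t1 - t0) * USEC_PER_SEC / freq);
}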

> 
>> The time consuming is related the memory DIMM size and where to locate those
>> memory DIMMs in the slots. In above case, we are using 16G memory DIMM.
>> We also tested 1T memory with 64G size for each memory DIMM on another ARM64
>> machine, the time consuming reduced from 20s to 2s (I think it's related to
>> firmware implementations).
>>
> 
> I agree that this optimization looks good in isolation, but the fact
> that you spotted a bug justifies my skepticism at the time. On the
> other hand, now that we have several independent reports (from you,
> but also from the Renesas folks) that the speedup is worthwhile for
> real world use cases, I think it does make sense to revisit it.

Thank you very much for taking care of this :)

> 
> So what I would like to see is the patch set being proposed again,
> with the new data points added for documentation. Also, the commit
> logs need to crystal clear about how the meaning of PFN validity
> differs between ARM and other architectures, and why the assumptions
> that the optimization is based on are guaranteed to hold.

I think Jia He no longer works for HXT; if you don't mind, I can repost
this patch set with Jia He's authorship unchanged.

Thanks
Hanjun


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
  2019-06-11 15:18           ` Hanjun Guo
@ 2019-06-12  1:05             ` Jia He
  0 siblings, 0 replies; 30+ messages in thread
From: Jia He @ 2019-06-12  1:05 UTC (permalink / raw)
  To: Hanjun Guo, Ard Biesheuvel
  Cc: Will Deacon, Ard Biesheuvel, Mark Rutland, Michal Hocko,
	Catalin Marinas, Kemi Wang, Wei Yang, Linux-MM, Eugeniu Rosca,
	Petr Tesarik, Nikolay Borisov, Russell King, Daniel Jordan,
	AKASHI Takahiro, Mel Gorman, Andrey Ryabinin, Laura Abbott,
	Daniel Vacek, Vladimir Murzin, Kees Cook, Vlastimil Babka,
	Johannes Weiner, YASUAKI ISHIMATSU, Jia He, Gioh Kim,
	linux-arm-kernel, Steve Capper, Linux Kernel Mailing List,
	James Morse, Philip Derrin, Andrew Morton

Hi Hanjun

On 2019/6/11 23:18, Hanjun Guo wrote:
> Hello Ard,
>
> Thanks for the reply, please see my comments inline.
>
> On 2019/6/10 21:16, Ard Biesheuvel wrote:
>> On Sat, 8 Jun 2019 at 06:22, Hanjun Guo <guohanjun@huawei.com> wrote:
>>> Hi Ard, Will,
>>>
>>> This week we were trying to debug an issue of time consuming in mem_init(),
>>> and leading to this similar solution form Jia He, so I would like to bring this
>>> thread back, please see my detail test result below.
>>>
>>> On 2018/9/7 22:44, Will Deacon wrote:
>>>> On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
>>>>> On 22 August 2018 at 05:07, Jia He <hejianet@gmail.com> wrote:
>>>>>> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
>>>>>> where possible") optimized the loop in memmap_init_zone(). But it causes
>>>>>> possible panic bug. So Daniel Vacek reverted it later.
>>>>>>
>>>>>> But as suggested by Daniel Vacek, it is fine to using memblock to skip
>>>>>> gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
>>>>>>
>>>>>> More from what Daniel said:
>>>>>> "On arm and arm64, memblock is used by default. But generic version of
>>>>>> pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
>>>>>> not always return the next valid one but skips more resulting in some
>>>>>> valid frames to be skipped (as if they were invalid). And that's why
>>>>>> kernel was eventually crashing on some !arm machines."
>>>>>>
>>>>>> About the performance consideration:
>>>>>> As said by James in b92df1de5,
>>>>>> "I have tested this patch on a virtual model of a Samurai CPU with a
>>>>>> sparse memory map.  The kernel boot time drops from 109 to 62 seconds."
>>>>>> Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
>>>>>>
>>>>>> Besides we can remain memblock_next_valid_pfn, there is still some room
>>>>>> for improvement. After this set, I can see the time overhead of memmap_init
>>>>>> is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
>>>>>> memory, pagesize 64k). I believe arm server will benefit more if memory is
>>>>>> larger than TBs
>>>>>>
>>>>> OK so we can summarize the benefits of this series as follows:
>>>>> - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
>>>>> - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
>>>>> *milliseconds*
>>>>>
>>>>> Google was not very helpful in figuring out what a Samurai CPU is and
>>>>> why we should care about the boot time of Linux running on a virtual
>>>>> model of it, and the 15 ms speedup is not that compelling either.
>>> Testing this patch set on top of Kunpeng 920 based ARM64 server, with
>>> 384G memory in total, we got the time consuming below
>>>
>>>               without this patch set      with this patch set
>>> mem_init()        13310ms                      1415ms
>>>
>>> So we got about 8x speedup on this machine, which is very impressive.
>>>
>> Yes, this is impressive. But does it matter in the grand scheme of
>> things?
> It matters for this machine, because it's for storage and there is
> a watchdog and the time consuming triggers the watchdog.
>
>> How much time does this system take to arrive at this point
>> from power on?
> Sorry, I don't have such data, as the arch timer is not initialized
> and I didn't see the time stamp at this point, but I read the cycles
> from arch timer before and after the time consuming function to get
> how much time consumed.
>
>>> The time consuming is related the memory DIMM size and where to locate those
>>> memory DIMMs in the slots. In above case, we are using 16G memory DIMM.
>>> We also tested 1T memory with 64G size for each memory DIMM on another ARM64
>>> machine, the time consuming reduced from 20s to 2s (I think it's related to
>>> firmware implementations).
>>>
>> I agree that this optimization looks good in isolation, but the fact
>> that you spotted a bug justifies my skepticism at the time. On the
>> other hand, now that we have several independent reports (from you,
>> but also from the Renesas folks) that the speedup is worthwhile for
>> real world use cases, I think it does make sense to revisit it.
> Thank you very much for taking care of this :)
>
>> So what I would like to see is the patch set being proposed again,
>> with the new data points added for documentation. Also, the commit
>> logs need to crystal clear about how the meaning of PFN validity
>> differs between ARM and other architectures, and why the assumptions
>> that the optimization is based on are guaranteed to hold.
> I think Jia He no longer works for HXT, if don't mind, I can repost
> this patch set with Jia He's authority unchanged.
Ok, I don't mind that; thanks for your follow-up :)

---
Cheers,
Justin (Jia He)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
  2019-06-12  1:05             ` Jia He
@ 2019-06-12 12:48               ` Hanjun Guo
  -1 siblings, 0 replies; 30+ messages in thread
From: Hanjun Guo @ 2019-06-12 12:48 UTC (permalink / raw)
  To: Jia He, Ard Biesheuvel
  Cc: Will Deacon, Ard Biesheuvel, Mark Rutland, Michal Hocko,
	Catalin Marinas, Kemi Wang, Wei Yang, Linux-MM, Eugeniu Rosca,
	Petr Tesarik, Nikolay Borisov, Russell King, Daniel Jordan,
	AKASHI Takahiro, Mel Gorman, Andrey Ryabinin, Laura Abbott,
	Daniel Vacek, Vladimir Murzin, Kees Cook, Vlastimil Babka,
	Johannes Weiner, YASUAKI ISHIMATSU, Jia He, Gioh Kim,
	linux-arm-kernel, Steve Capper, Linux Kernel Mailing List,
	James Morse, Philip Derrin, Andrew Morton

On 2019/6/12 9:05, Jia He wrote:
>>
>>> So what I would like to see is the patch set being proposed again,
>>> with the new data points added for documentation. Also, the commit
>>> logs need to be crystal clear about how the meaning of PFN validity
>>> differs between ARM and other architectures, and why the assumptions
>>> that the optimization is based on are guaranteed to hold.
>> I think Jia He no longer works for HXT; if you don't mind, I can repost
>> this patch set with Jia He's authorship unchanged.
> Ok, I don't mind that, thanks for your follow-up :)

That's great. I will prepare a new version with Ard's comments addressed,
then repost.

Thanks
Hanjun
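
The "reduce unnecessary binary search" patch Hanjun will repost relies
on memmap_init_zone() querying pfns in ascending order. A rough sketch
of the idea, based on the v11 cover letter rather than the final code
(early_region_idx caches the last region that matched, so the binary
search over memblock.memory only reruns after crossing a region
boundary):

    #include <linux/kernel.h>
    #include <linux/memblock.h>
    #include <linux/pfn.h>

    static int early_region_idx __initdata = -1;

    ulong __init_memblock memblock_next_valid_pfn(ulong pfn)
    {
            struct memblock_type *type = &memblock.memory;
            ulong start_pfn, end_pfn;
            ulong left = 0, right = type->cnt, mid;

            pfn++;

            /* Fast path: the next pfn is still inside the cached region. */
            if (early_region_idx >= 0) {
                    start_pfn = PFN_DOWN(type->regions[early_region_idx].base);
                    end_pfn = PFN_DOWN(type->regions[early_region_idx].base +
                                       type->regions[early_region_idx].size);
                    if (pfn >= start_pfn && pfn < end_pfn)
                            return pfn;
            }

            /* Slow path: binary-search the first region ending above pfn. */
            while (left < right) {
                    mid = (left + right) / 2;
                    if (PFN_DOWN(type->regions[mid].base +
                                 type->regions[mid].size) <= pfn)
                            left = mid + 1;
                    else
                            right = mid;
            }
            if (left == type->cnt)          /* ran past the last region */
                    return ULONG_MAX;

            early_region_idx = left;
            /* pfn is inside this region, or in the hole just before it. */
            return max(pfn, (ulong)PFN_UP(type->regions[left].base));
    }

The fast path makes the common case O(1) instead of O(log n), which is
where the memmap_init() savings reported earlier in this thread come
from.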




end of thread

Thread overview: 30+ messages
2018-08-22  3:07 [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64 Jia He
2018-08-22  3:07 ` [PATCH v11 1/3] arm: arm64: introduce CONFIG_HAVE_MEMBLOCK_PFN_VALID Jia He
2018-08-22  3:07 ` [PATCH v11 2/3] mm: page_alloc: remain memblock_next_valid_pfn() on arm/arm64 Jia He
2018-08-22  3:07 ` [PATCH v11 3/3] mm: page_alloc: reduce unnecessary binary search in memblock_next_valid_pfn Jia He
2018-09-05 21:57 ` [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64 Andrew Morton
2018-09-06 10:47   ` Will Deacon
2018-09-06 11:24 ` Ard Biesheuvel
2018-09-07 14:44   ` Will Deacon
2018-09-14 18:50     ` Eugeniu Rosca
2019-06-08  4:22     ` Hanjun Guo
2019-06-10 13:16       ` Ard Biesheuvel
2019-06-11 15:18         ` Hanjun Guo
2019-06-12  1:05           ` Jia He
2019-06-12 12:48             ` Hanjun Guo
