linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
@ 2020-03-28 18:31 Hoan Tran
  2020-03-28 18:31 ` [PATCH v3 1/5] " Hoan Tran
                   ` (6 more replies)
  0 siblings, 7 replies; 36+ messages in thread
From: Hoan Tran @ 2020-03-28 18:31 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Andrew Morton, Michal Hocko,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Mike Rapoport,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger
  Cc: open list:MEMORY MANAGEMENT, linux-arm-kernel, linux-s390,
	sparclinux, x86, linuxppc-dev, linux-kernel, lho, mmorana,
	Hoan Tran

In a NUMA layout where nodes have memory ranges that span across other
nodes, the mm subsystem can detect the memory node id incorrectly.

For example, with the layout below:
Node 0 address: 0000 xxxx 0000 xxxx
Node 1 address: xxxx 1111 xxxx 1111

Note:
 - Memory addresses run from low to high
 - 0/1: Node id
 - x: Invalid memory of a node

When mm probes the memory map without the CONFIG_NODES_SPAN_OTHER_NODES
config, it only checks the memory validity but not the node id.
Because of that, Node 1 also claims the memory of node 0, as shown below,
when it scans from the start address to the end address of node 1.

Node 0 address: 0000 xxxx xxxx xxxx
Node 1 address: xxxx 1111 1111 1111

This layout could occur on any architecture. Most architectures enable
this config by default with CONFIG_NUMA. This patch enables the
CONFIG_NODES_SPAN_OTHER_NODES behaviour, i.e. the early_pfn_in_nid()
check, by default for NUMA.
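
The check this config gates is small; simplified from the mm/page_alloc.c
touched by patch 1, it is roughly:

#ifdef CONFIG_NODES_SPAN_OTHER_NODES
/* Only safe to use early in boot when initialisation is single-threaded */
static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
{
	return early_pfn_to_nid(pfn) == node;
}
#else
/* without the config, every valid pfn is assumed to be on the node */
static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
{
	return true;
}
#endif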

v3:
 * Revise the patch description

V2:
 * Revise the patch description

Hoan Tran (5):
  mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
  x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
  sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
  s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES

 arch/powerpc/Kconfig | 9 ---------
 arch/s390/Kconfig    | 8 --------
 arch/sparc/Kconfig   | 9 ---------
 arch/x86/Kconfig     | 9 ---------
 mm/page_alloc.c      | 2 +-
 5 files changed, 1 insertion(+), 36 deletions(-)

-- 
1.8.3.1



* [PATCH v3 1/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-28 18:31 [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA Hoan Tran
@ 2020-03-28 18:31 ` Hoan Tran
  2020-03-28 18:31 ` [PATCH v3 2/5] powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES Hoan Tran
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 36+ messages in thread
From: Hoan Tran @ 2020-03-28 18:31 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Andrew Morton, Michal Hocko,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Mike Rapoport,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger
  Cc: open list:MEMORY MANAGEMENT, linux-arm-kernel, linux-s390,
	sparclinux, x86, linuxppc-dev, linux-kernel, lho, mmorana,
	Hoan Tran

In a NUMA layout where nodes have memory ranges that span across other
nodes, the mm subsystem can detect the memory node id incorrectly.

For example, with the layout below:
Node 0 address: 0000 xxxx 0000 xxxx
Node 1 address: xxxx 1111 xxxx 1111

Note:
 - Memory addresses run from low to high
 - 0/1: Node id
 - x: Invalid memory of a node

When mm probes the memory map without the CONFIG_NODES_SPAN_OTHER_NODES
config, it only checks the memory validity but not the node id.
Because of that, Node 1 also claims the memory of node 0, as shown below,
when it scans from the start address to the end address of node 1.

Node 0 address: 0000 xxxx xxxx xxxx
Node 1 address: xxxx 1111 1111 1111

This layout could occur on any architecture. Most architectures enable
this config by default with CONFIG_NUMA. This patch enables the
CONFIG_NODES_SPAN_OTHER_NODES behaviour, i.e. the early_pfn_in_nid()
check, by default for NUMA.

Signed-off-by: Hoan Tran <Hoan@os.amperecomputing.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d047bf7..948c1c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1467,7 +1467,7 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
 }
 #endif
 
-#ifdef CONFIG_NODES_SPAN_OTHER_NODES
+#ifdef CONFIG_NUMA
 /* Only safe to use early in boot when initialisation is single-threaded */
 static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
 {
-- 
1.8.3.1



* [PATCH v3 2/5] powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
  2020-03-28 18:31 [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA Hoan Tran
  2020-03-28 18:31 ` [PATCH v3 1/5] " Hoan Tran
@ 2020-03-28 18:31 ` Hoan Tran
  2020-03-28 18:31 ` [PATCH v3 3/5] x86: " Hoan Tran
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 36+ messages in thread
From: Hoan Tran @ 2020-03-28 18:31 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Andrew Morton, Michal Hocko,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Mike Rapoport,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger
  Cc: open list:MEMORY MANAGEMENT, linux-arm-kernel, linux-s390,
	sparclinux, x86, linuxppc-dev, linux-kernel, lho, mmorana,
	Hoan Tran

Remove CONFIG_NODES_SPAN_OTHER_NODES as the check it guards is now
enabled by default with NUMA.

Signed-off-by: Hoan Tran <Hoan@os.amperecomputing.com>
---
 arch/powerpc/Kconfig | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index e2a4121..4af2699 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -686,15 +686,6 @@ config ARCH_MEMORY_PROBE
 	def_bool y
 	depends on MEMORY_HOTPLUG
 
-# Some NUMA nodes have memory ranges that span
-# other nodes.  Even though a pfn is valid and
-# between a node's start and end pfns, it may not
-# reside on that node.  See memmap_init_zone()
-# for details.
-config NODES_SPAN_OTHER_NODES
-	def_bool y
-	depends on NEED_MULTIPLE_NODES
-
 config STDBINUTILS
 	bool "Using standard binutils settings"
 	depends on 44x
-- 
1.8.3.1



* [PATCH v3 3/5] x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
  2020-03-28 18:31 [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA Hoan Tran
  2020-03-28 18:31 ` [PATCH v3 1/5] " Hoan Tran
  2020-03-28 18:31 ` [PATCH v3 2/5] powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES Hoan Tran
@ 2020-03-28 18:31 ` Hoan Tran
  2020-03-28 18:31 ` [PATCH v3 4/5] sparc: " Hoan Tran
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 36+ messages in thread
From: Hoan Tran @ 2020-03-28 18:31 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Andrew Morton, Michal Hocko,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Mike Rapoport,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger
  Cc: open list:MEMORY MANAGEMENT, linux-arm-kernel, linux-s390,
	sparclinux, x86, linuxppc-dev, linux-kernel, lho, mmorana,
	Hoan Tran

Remove CONFIG_NODES_SPAN_OTHER_NODES as the check it guards is now
enabled by default with NUMA. The check is now also enabled for x86
(32-bit) configurations and no longer depends on the X86_64_ACPI_NUMA
config. Because of that, on NUMA-enabled systems, the early_pfn_in_nid()
function is called by memmap_init_zone() during boot time.
It doesn't affect performance at run time.
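
The relevant call site in memmap_init_zone() is roughly the following
(a simplified sketch, details elided):

	for (pfn = start_pfn; pfn < end_pfn; ) {
		/* ... pfn validity and deferred-init checks elided ... */

		/* skip pfns that are valid but belong to another node */
		if (!early_pfn_in_nid(pfn, nid)) {
			pfn++;
			continue;
		}

		/* ... initialise the struct page for this pfn ... */
		pfn++;
	}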

Signed-off-by: Hoan Tran <Hoan@os.amperecomputing.com>
---
 arch/x86/Kconfig | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5e89499..a938738 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1581,15 +1581,6 @@ config X86_64_ACPI_NUMA
 	---help---
 	  Enable ACPI SRAT based node topology detection.
 
-# Some NUMA nodes have memory ranges that span
-# other nodes.  Even though a pfn is valid and
-# between a node's start and end pfns, it may not
-# reside on that node.  See memmap_init_zone()
-# for details.
-config NODES_SPAN_OTHER_NODES
-	def_bool y
-	depends on X86_64_ACPI_NUMA
-
 config NUMA_EMU
 	bool "NUMA emulation"
 	depends on NUMA
-- 
1.8.3.1



* [PATCH v3 4/5] sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
  2020-03-28 18:31 [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA Hoan Tran
                   ` (2 preceding siblings ...)
  2020-03-28 18:31 ` [PATCH v3 3/5] x86: " Hoan Tran
@ 2020-03-28 18:31 ` Hoan Tran
  2020-03-28 18:31 ` [PATCH v3 5/5] s390: " Hoan Tran
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 36+ messages in thread
From: Hoan Tran @ 2020-03-28 18:31 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Andrew Morton, Michal Hocko,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Mike Rapoport,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger
  Cc: open list:MEMORY MANAGEMENT, linux-arm-kernel, linux-s390,
	sparclinux, x86, linuxppc-dev, linux-kernel, lho, mmorana,
	Hoan Tran

Remove CONFIG_NODES_SPAN_OTHER_NODES as the check it guards is now
enabled by default with NUMA.

Signed-off-by: Hoan Tran <Hoan@os.amperecomputing.com>
---
 arch/sparc/Kconfig | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index eb24cb1..6fc615e 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -292,15 +292,6 @@ config NODES_SHIFT
 	  Specify the maximum number of NUMA Nodes available on the target
 	  system.  Increases memory reserved to accommodate various tables.
 
-# Some NUMA nodes have memory ranges that span
-# other nodes.  Even though a pfn is valid and
-# between a node's start and end pfns, it may not
-# reside on that node.  See memmap_init_zone()
-# for details.
-config NODES_SPAN_OTHER_NODES
-	def_bool y
-	depends on NEED_MULTIPLE_NODES
-
 config ARCH_SPARSEMEM_ENABLE
 	def_bool y if SPARC64
 	select SPARSEMEM_VMEMMAP_ENABLE
-- 
1.8.3.1



* [PATCH v3 5/5] s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
  2020-03-28 18:31 [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA Hoan Tran
                   ` (3 preceding siblings ...)
  2020-03-28 18:31 ` [PATCH v3 4/5] sparc: " Hoan Tran
@ 2020-03-28 18:31 ` Hoan Tran
  2020-03-29  0:19 ` [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA Baoquan He
  2020-03-30  7:42 ` Michal Hocko
  6 siblings, 0 replies; 36+ messages in thread
From: Hoan Tran @ 2020-03-28 18:31 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Andrew Morton, Michal Hocko,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Mike Rapoport,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger
  Cc: open list:MEMORY MANAGEMENT, linux-arm-kernel, linux-s390,
	sparclinux, x86, linuxppc-dev, linux-kernel, lho, mmorana,
	Hoan Tran

Remove CONFIG_NODES_SPAN_OTHER_NODES as the check it guards is now
enabled by default with NUMA.

Signed-off-by: Hoan Tran <Hoan@os.amperecomputing.com>
---
 arch/s390/Kconfig | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index bc88841..d86066e 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -449,14 +449,6 @@ config NR_CPUS
 config HOTPLUG_CPU
 	def_bool y
 
-# Some NUMA nodes have memory ranges that span
-# other nodes.	Even though a pfn is valid and
-# between a node's start and end pfns, it may not
-# reside on that node.	See memmap_init_zone()
-# for details.
-config NODES_SPAN_OTHER_NODES
-	def_bool NUMA
-
 config NUMA
 	bool "NUMA support"
 	depends on SCHED_TOPOLOGY
-- 
1.8.3.1



* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-28 18:31 [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA Hoan Tran
                   ` (4 preceding siblings ...)
  2020-03-28 18:31 ` [PATCH v3 5/5] s390: " Hoan Tran
@ 2020-03-29  0:19 ` Baoquan He
  2020-03-30  7:44   ` Michal Hocko
  2020-03-30  7:42 ` Michal Hocko
  6 siblings, 1 reply; 36+ messages in thread
From: Baoquan He @ 2020-03-29  0:19 UTC (permalink / raw)
  To: Hoan Tran
  Cc: Catalin Marinas, Will Deacon, Andrew Morton, Michal Hocko,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Mike Rapoport,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On 03/28/20 at 11:31am, Hoan Tran wrote:
> In a NUMA layout where nodes have memory ranges that span across other
> nodes, the mm subsystem can detect the memory node id incorrectly.
> 
> For example, with the layout below:
> Node 0 address: 0000 xxxx 0000 xxxx
> Node 1 address: xxxx 1111 xxxx 1111

Sorry, I have read this example several times, but I still don't get
what it means. Can it be given with real hex addresses as an example? I
mean just using a memory layout you have seen on some systems. The
change looks interesting though.

> 
> Note:
>  - Memory addresses run from low to high
>  - 0/1: Node id
>  - x: Invalid memory of a node
> 
> When mm probes the memory map without the CONFIG_NODES_SPAN_OTHER_NODES
> config, it only checks the memory validity but not the node id.
> Because of that, Node 1 also claims the memory of node 0, as shown below,
> when it scans from the start address to the end address of node 1.
> 
> Node 0 address: 0000 xxxx xxxx xxxx
> Node 1 address: xxxx 1111 1111 1111
> 
> This layout could occur on any architecture. Most architectures enable
> this config by default with CONFIG_NUMA. This patch enables the
> CONFIG_NODES_SPAN_OTHER_NODES behaviour, i.e. the early_pfn_in_nid()
> check, by default for NUMA.
> 
> v3:
>  * Revise the patch description
> 
> V2:
>  * Revise the patch description
> 
> Hoan Tran (5):
>   mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
>   powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
>   x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
>   sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
>   s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> 
>  arch/powerpc/Kconfig | 9 ---------
>  arch/s390/Kconfig    | 8 --------
>  arch/sparc/Kconfig   | 9 ---------
>  arch/x86/Kconfig     | 9 ---------
>  mm/page_alloc.c      | 2 +-
>  5 files changed, 1 insertion(+), 36 deletions(-)
> 
> -- 
> 1.8.3.1
> 
> 



* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-28 18:31 [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA Hoan Tran
                   ` (5 preceding siblings ...)
  2020-03-29  0:19 ` [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA Baoquan He
@ 2020-03-30  7:42 ` Michal Hocko
  2020-03-30  8:16   ` Baoquan He
                     ` (3 more replies)
  6 siblings, 4 replies; 36+ messages in thread
From: Michal Hocko @ 2020-03-30  7:42 UTC (permalink / raw)
  To: Hoan Tran
  Cc: Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka,
	Oscar Salvador, Pavel Tatashin, Mike Rapoport, Alexander Duyck,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> In a NUMA layout where nodes have memory ranges that span across other
> nodes, the mm subsystem can detect the memory node id incorrectly.
> 
> For example, with the layout below:
> Node 0 address: 0000 xxxx 0000 xxxx
> Node 1 address: xxxx 1111 xxxx 1111
> 
> Note:
>  - Memory addresses run from low to high
>  - 0/1: Node id
>  - x: Invalid memory of a node
> 
> When mm probes the memory map without the CONFIG_NODES_SPAN_OTHER_NODES
> config, it only checks the memory validity but not the node id.
> Because of that, Node 1 also claims the memory of node 0, as shown below,
> when it scans from the start address to the end address of node 1.
> 
> Node 0 address: 0000 xxxx xxxx xxxx
> Node 1 address: xxxx 1111 1111 1111
> 
> This layout could occur on any architecture. Most architectures enable
> this config by default with CONFIG_NUMA. This patch enables the
> CONFIG_NODES_SPAN_OTHER_NODES behaviour, i.e. the early_pfn_in_nid()
> check, by default for NUMA.

I am not opposed to this at all. It reduces the config space and that is
a good thing on its own. The history has shown that memory layout might
be really wild wrt NUMA. The config is only used for early_pfn_in_nid()
which is clearly an overkill.

Your description doesn't really explain why this is safe though. The
history of this config is somewhat messy. Mike has tried
to remove it in a94b3ab7eab4 ("[PATCH] mm: remove arch independent
NODES_SPAN_OTHER_NODES") just for it to be reintroduced by 7516795739bd
("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
reasoning whatsoever. This doesn't make it easy to see whether the
reasons for reintroduction are still there. Maybe there are some subtle
dependencies. I do not see any TBH, but those might be buried deep in
arch-specific code.

> v3:
>  * Revise the patch description
> 
> V2:
>  * Revise the patch description
> 
> Hoan Tran (5):
>   mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
>   powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
>   x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
>   sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
>   s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> 
>  arch/powerpc/Kconfig | 9 ---------
>  arch/s390/Kconfig    | 8 --------
>  arch/sparc/Kconfig   | 9 ---------
>  arch/x86/Kconfig     | 9 ---------
>  mm/page_alloc.c      | 2 +-
>  5 files changed, 1 insertion(+), 36 deletions(-)
> 
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-29  0:19 ` [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA Baoquan He
@ 2020-03-30  7:44   ` Michal Hocko
  2020-03-30  8:04     ` Baoquan He
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2020-03-30  7:44 UTC (permalink / raw)
  To: Baoquan He
  Cc: Hoan Tran, Catalin Marinas, Will Deacon, Andrew Morton,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Mike Rapoport,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On Sun 29-03-20 08:19:24, Baoquan He wrote:
> On 03/28/20 at 11:31am, Hoan Tran wrote:
> > In a NUMA layout where nodes have memory ranges that span across other
> > nodes, the mm subsystem can detect the memory node id incorrectly.
> > 
> > For example, with the layout below:
> > Node 0 address: 0000 xxxx 0000 xxxx
> > Node 1 address: xxxx 1111 xxxx 1111
> 
> Sorry, I have read this example several times, but I still don't get
> what it means. Can it be given with real hex addresses as an example? I
> mean just using a memory layout you have seen on some systems. The
> change looks interesting though.

Does this make it more clear?
           physical address range and its node association
         [0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1]
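
Or, with made-up hex ranges (purely illustrative, not taken from a real
system):

Node 0: 0x0000000000-0x0040000000 and 0x0080000000-0x00c0000000
Node 1: 0x0040000000-0x0080000000 and 0x00c0000000-0x0100000000

Here a pfn walk from node 1's start (0x0040000000) to its end
(0x0100000000) crosses a range that actually belongs to node 0.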
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-30  7:44   ` Michal Hocko
@ 2020-03-30  8:04     ` Baoquan He
  0 siblings, 0 replies; 36+ messages in thread
From: Baoquan He @ 2020-03-30  8:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hoan Tran, Catalin Marinas, Will Deacon, Andrew Morton,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Mike Rapoport,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On 03/30/20 at 09:44am, Michal Hocko wrote:
> On Sun 29-03-20 08:19:24, Baoquan He wrote:
> > On 03/28/20 at 11:31am, Hoan Tran wrote:
> > > In a NUMA layout where nodes have memory ranges that span across other
> > > nodes, the mm subsystem can detect the memory node id incorrectly.
> > > 
> > > For example, with the layout below:
> > > Node 0 address: 0000 xxxx 0000 xxxx
> > > Node 1 address: xxxx 1111 xxxx 1111
> > 
> > Sorry, I have read this example several times, but I still don't get
> > what it means. Can it be given with real hex addresses as an example? I
> > mean just using a memory layout you have seen on some systems. The
> > change looks interesting though.
> 
> Does this make it more clear?
>            physical address range and its node association
>          [0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1]

I read it again later and have got what Hoan is trying to say, thanks.

I think the change in this patchset makes sense. I still have some
concerns though; let me comment in the other thread.



* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-30  7:42 ` Michal Hocko
@ 2020-03-30  8:16   ` Baoquan He
  2020-03-30  8:28     ` Baoquan He
  2020-03-30  9:21   ` Mike Rapoport
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 36+ messages in thread
From: Baoquan He @ 2020-03-30  8:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hoan Tran, Catalin Marinas, Will Deacon, Andrew Morton,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Mike Rapoport,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On 03/30/20 at 09:42am, Michal Hocko wrote:
> On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > In a NUMA layout where nodes have memory ranges that span across other
> > nodes, the mm subsystem can detect the memory node id incorrectly.
> > 
> > For example, with the layout below:
> > Node 0 address: 0000 xxxx 0000 xxxx
> > Node 1 address: xxxx 1111 xxxx 1111
> > 
> > Note:
> >  - Memory addresses run from low to high
> >  - 0/1: Node id
> >  - x: Invalid memory of a node
> > 
> > When mm probes the memory map without the CONFIG_NODES_SPAN_OTHER_NODES
> > config, it only checks the memory validity but not the node id.
> > Because of that, Node 1 also claims the memory of node 0, as shown below,
> > when it scans from the start address to the end address of node 1.
> > 
> > Node 0 address: 0000 xxxx xxxx xxxx
> > Node 1 address: xxxx 1111 1111 1111
> > 
> > This layout could occur on any architecture. Most architectures enable
> > this config by default with CONFIG_NUMA. This patch enables the
> > CONFIG_NODES_SPAN_OTHER_NODES behaviour, i.e. the early_pfn_in_nid()
> > check, by default for NUMA.
> 
> I am not opposed to this at all. It reduces the config space and that is
> a good thing on its own. The history has shown that memory layout might
> be really wild wrt NUMA. The config is only used for early_pfn_in_nid()
> which is clearly an overkill.
> 
> Your description doesn't really explain why this is safe though. The
> history of this config is somewhat messy. Mike has tried
> to remove it in a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> NODES_SPAN_OTHER_NODES") just for it to be reintroduced by 7516795739bd
> ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> reasoning whatsoever. This doesn't make it easy to see whether the
> reasons for reintroduction are still there. Maybe there are some subtle
> dependencies. I do not see any TBH, but those might be buried deep in
> arch-specific code.

Yeah, since early_pfnnid_cache was added, we do not need to worry about
the performance. But when I read the mem init code on x86 again, I do see
there is code to handle node overlapping, e.g. in numa_cleanup_meminfo(),
when storing the node id into memblock. But the thing is, if we have
encountered node overlapping, we just return ahead of time and leave
something uninitialized. I am wondering if a system with node
overlapping can still run healthily.
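
For reference, the cached lookup is roughly this (simplified from
mm/page_alloc.c):

int __meminit __early_pfn_to_nid(unsigned long pfn,
				 struct mminit_pfnnid_cache *state)
{
	unsigned long start_pfn, end_pfn;
	int nid;

	/* fast path: most pfns hit the last-used memblock range */
	if (state->last_start <= pfn && pfn < state->last_end)
		return state->last_nid;

	/* slow path: look the pfn up in the memblock node map */
	nid = memblock_search_pfn_nid(pfn, &start_pfn, &end_pfn);
	if (nid != NUMA_NO_NODE) {
		state->last_start = start_pfn;
		state->last_end = end_pfn;
		state->last_nid = nid;
	}

	return nid;
}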

> 
> > v3:
> >  * Revise the patch description
> > 
> > V2:
> >  * Revise the patch description
> > 
> > Hoan Tran (5):
> >   mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
> >   powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> > 
> >  arch/powerpc/Kconfig | 9 ---------
> >  arch/s390/Kconfig    | 8 --------
> >  arch/sparc/Kconfig   | 9 ---------
> >  arch/x86/Kconfig     | 9 ---------
> >  mm/page_alloc.c      | 2 +-
> >  5 files changed, 1 insertion(+), 36 deletions(-)
> > 
> > -- 
> > 1.8.3.1
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs
> 



* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-30  8:16   ` Baoquan He
@ 2020-03-30  8:28     ` Baoquan He
  0 siblings, 0 replies; 36+ messages in thread
From: Baoquan He @ 2020-03-30  8:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hoan Tran, Catalin Marinas, Will Deacon, Andrew Morton,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Mike Rapoport,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On 03/30/20 at 04:16pm, Baoquan He wrote:
> On 03/30/20 at 09:42am, Michal Hocko wrote:
> > On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > > In a NUMA layout where nodes have memory ranges that span across other
> > > nodes, the mm subsystem can detect the memory node id incorrectly.
> > > 
> > > For example, with the layout below:
> > > Node 0 address: 0000 xxxx 0000 xxxx
> > > Node 1 address: xxxx 1111 xxxx 1111
> > > 
> > > Note:
> > >  - Memory addresses run from low to high
> > >  - 0/1: Node id
> > >  - x: Invalid memory of a node
> > > 
> > > When mm probes the memory map without the CONFIG_NODES_SPAN_OTHER_NODES
> > > config, it only checks the memory validity but not the node id.
> > > Because of that, Node 1 also claims the memory of node 0, as shown below,
> > > when it scans from the start address to the end address of node 1.
> > > 
> > > Node 0 address: 0000 xxxx xxxx xxxx
> > > Node 1 address: xxxx 1111 1111 1111
> > > 
> > > This layout could occur on any architecture. Most architectures enable
> > > this config by default with CONFIG_NUMA. This patch enables the
> > > CONFIG_NODES_SPAN_OTHER_NODES behaviour, i.e. the early_pfn_in_nid()
> > > check, by default for NUMA.
> > 
> > I am not opposed to this at all. It reduces the config space and that is
> > a good thing on its own. The history has shown that memory layout might
> > be really wild wrt NUMA. The config is only used for early_pfn_in_nid()
> > which is clearly an overkill.
> > 
> > Your description doesn't really explain why this is safe though. The
> > history of this config is somewhat messy. Mike has tried
> > to remove it in a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> > NODES_SPAN_OTHER_NODES") just for it to be reintroduced by 7516795739bd
> > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> > reasoning whatsoever. This doesn't make it easy to see whether the
> > reasons for reintroduction are still there. Maybe there are some subtle
> > dependencies. I do not see any TBH, but those might be buried deep in
> > arch-specific code.
> 
> Yeah, since early_pfnnid_cache was added, we do not need to worry about
> the performance. But when I read the mem init code on x86 again, I do see
> there is code to handle node overlapping, e.g. in numa_cleanup_meminfo(),
> when storing the node id into memblock. But the thing is, if we have
> encountered node overlapping, we just return ahead of time and leave
> something uninitialized. I am wondering if a system with node
> overlapping can still run healthily.

Ok, I didn't read the code carefully. That code handles the case where
memblock regions with different node ids overlap, and it needs to
return. In the example Hoan gave, there is no problem; the system can
run well. Please ignore the above comment.



* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-30  7:42 ` Michal Hocko
  2020-03-30  8:16   ` Baoquan He
@ 2020-03-30  9:21   ` Mike Rapoport
  2020-03-30  9:58     ` Michal Hocko
  2020-03-30  9:26   ` [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA Baoquan He
  2020-03-30 17:51   ` Mike Rapoport
  3 siblings, 1 reply; 36+ messages in thread
From: Mike Rapoport @ 2020-03-30  9:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hoan Tran, Catalin Marinas, Will Deacon, Andrew Morton,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Alexander Duyck,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > In a NUMA layout where nodes have memory ranges that span across other
> > nodes, the mm subsystem can detect the memory node id incorrectly.
> > 
> > For example, with the layout below:
> > Node 0 address: 0000 xxxx 0000 xxxx
> > Node 1 address: xxxx 1111 xxxx 1111
> > 
> > Note:
> >  - Memory addresses run from low to high
> >  - 0/1: Node id
> >  - x: Invalid memory of a node
> > 
> > When mm probes the memory map without the CONFIG_NODES_SPAN_OTHER_NODES
> > config, it only checks the memory validity but not the node id.
> > Because of that, Node 1 also claims the memory of node 0, as shown below,
> > when it scans from the start address to the end address of node 1.
> > 
> > Node 0 address: 0000 xxxx xxxx xxxx
> > Node 1 address: xxxx 1111 1111 1111
> > 
> > This layout could occur on any architecture. Most architectures enable
> > this config by default with CONFIG_NUMA. This patch enables the
> > CONFIG_NODES_SPAN_OTHER_NODES behaviour, i.e. the early_pfn_in_nid()
> > check, by default for NUMA.
> 
> I am not opposed to this at all. It reduces the config space and that is
> a good thing on its own. The history has shown that memory layout might
> be really wild wrt NUMA. The config is only used for early_pfn_in_nid()
> which is clearly an overkill.
> 
> Your description doesn't really explain why this is safe though. The
> history of this config is somewhat messy. Mike has tried
> to remove it in a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> NODES_SPAN_OTHER_NODES") just for it to be reintroduced by 7516795739bd
> ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> reasoning whatsoever. This doesn't make it easy to see whether the
> reasons for reintroduction are still there. Maybe there are some subtle
> dependencies. I do not see any TBH, but those might be buried deep in
> arch-specific code.

Well, back then early_pfn_in_nid() was arch-dependent; today everyone
except ia64 relies on HAVE_MEMBLOCK_NODE_MAP. So, if the memblock node map
is correct, using CONFIG_NUMA instead of CONFIG_NODES_SPAN_OTHER_NODES
would only mean that early_pfn_in_nid() will cost several cycles more on
architectures that didn't select CONFIG_NODES_SPAN_OTHER_NODES (i.e. arm64
and sh).
Again, ia64 is an exception here.
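
The node map itself is populated early at boot by arch code, roughly
like this (illustrative; base, size and nid come from firmware tables):

	memblock_add(base, size);
	memblock_set_node(base, size, &memblock.memory, nid);

early_pfn_to_nid() then resolves a pfn against exactly this map.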


> > v3:
> >  * Revise the patch description
> > 
> > V2:
> >  * Revise the patch description
> > 
> > Hoan Tran (5):
> >   mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
> >   powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> > 
> >  arch/powerpc/Kconfig | 9 ---------
> >  arch/s390/Kconfig    | 8 --------
> >  arch/sparc/Kconfig   | 9 ---------
> >  arch/x86/Kconfig     | 9 ---------
> >  mm/page_alloc.c      | 2 +-
> >  5 files changed, 1 insertion(+), 36 deletions(-)
> > 
> > -- 
> > 1.8.3.1
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Sincerely yours,
Mike.



* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-30  7:42 ` Michal Hocko
  2020-03-30  8:16   ` Baoquan He
  2020-03-30  9:21   ` Mike Rapoport
@ 2020-03-30  9:26   ` Baoquan He
  2020-03-30 17:51   ` Mike Rapoport
  3 siblings, 0 replies; 36+ messages in thread
From: Baoquan He @ 2020-03-30  9:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hoan Tran, Catalin Marinas, Will Deacon, Andrew Morton,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Mike Rapoport,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On 03/30/20 at 09:42am, Michal Hocko wrote:
> On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > In a NUMA layout where nodes have memory ranges that span across other
> > nodes, the mm subsystem can detect the memory node id incorrectly.
> > 
> > For example, with the layout below:
> > Node 0 address: 0000 xxxx 0000 xxxx
> > Node 1 address: xxxx 1111 xxxx 1111
> > 
> > Note:
> >  - Memory addresses run from low to high
> >  - 0/1: Node id
> >  - x: Invalid memory of a node
> > 
> > When mm probes the memory map without the CONFIG_NODES_SPAN_OTHER_NODES
> > config, it only checks the memory validity but not the node id.
> > Because of that, Node 1 also claims the memory of node 0, as shown below,
> > when it scans from the start address to the end address of node 1.
> > 
> > Node 0 address: 0000 xxxx xxxx xxxx
> > Node 1 address: xxxx 1111 1111 1111
> > 
> > This layout could occur on any architecture. Most architectures enable
> > this config by default with CONFIG_NUMA. This patch enables the
> > CONFIG_NODES_SPAN_OTHER_NODES behaviour, i.e. the early_pfn_in_nid()
> > check, by default for NUMA.
> 
> I am not opposed to this at all. It reduces the config space and that is
> a good thing on its own. The history has shown that memory layout might
> be really wild wrt NUMA. The config is only used for early_pfn_in_nid()
> which is clearly an overkill.
> 
> Your description doesn't really explain why this is safe though. The
> history of this config is somewhat messy. Mike has tried
> to remove it in a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> NODES_SPAN_OTHER_NODES") just for it to be reintroduced by 7516795739bd
> ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> reasoning whatsoever. This doesn't make it easy to see whether the
> reasons for reintroduction are still there. Maybe there are some subtle
> dependencies. I do not see any TBH, but those might be buried deep in
> arch-specific code.

Since NODES_SPAN_OTHER_NODES depends on NUMA on all arches,
replacing it with CONFIG_NUMA seems to carry no risk. Only for those
arches which didn't have CONFIG_NODES_SPAN_OTHER_NODES before does it
involve a tiny performance degradation. Besides, s390 has already
removed support for NODES_SPAN_OTHER_NODES:

commit 701dc81e7412daaf3c5bf4bc55d35c8b1525112a
Author: Heiko Carstens <heiko.carstens@de.ibm.com>
Date:   Wed Feb 19 13:29:15 2020 +0100

    s390/mm: remove fake numa support

> 
> > v3:
> >  * Revise the patch description
> > 
> > V2:
> >  * Revise the patch description
> > 
> > Hoan Tran (5):
> >   mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
> >   powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> > 
> >  arch/powerpc/Kconfig | 9 ---------
> >  arch/s390/Kconfig    | 8 --------
> >  arch/sparc/Kconfig   | 9 ---------
> >  arch/x86/Kconfig     | 9 ---------
> >  mm/page_alloc.c      | 2 +-
> >  5 files changed, 1 insertion(+), 36 deletions(-)
> > 
> > -- 
> > 1.8.3.1
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs
> 



* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-30  9:21   ` Mike Rapoport
@ 2020-03-30  9:58     ` Michal Hocko
  2020-03-30 10:26       ` Mike Rapoport
  2020-03-31 21:56       ` [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP (was: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA) Mike Rapoport
  0 siblings, 2 replies; 36+ messages in thread
From: Michal Hocko @ 2020-03-30  9:58 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Hoan Tran, Catalin Marinas, Will Deacon, Andrew Morton,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Alexander Duyck,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On Mon 30-03-20 12:21:27, Mike Rapoport wrote:
> On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> > On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > > In a NUMA layout where nodes have memory ranges that span across other
> > > nodes, the mm subsystem can detect the memory node id incorrectly.
> > > 
> > > For example, with the layout below:
> > > Node 0 address: 0000 xxxx 0000 xxxx
> > > Node 1 address: xxxx 1111 xxxx 1111
> > > 
> > > Note:
> > >  - Memory addresses run from low to high
> > >  - 0/1: Node id
> > >  - x: Invalid memory of a node
> > > 
> > > When mm probes the memory map without the CONFIG_NODES_SPAN_OTHER_NODES
> > > config, it only checks the memory validity but not the node id.
> > > Because of that, Node 1 also claims the memory of node 0, as shown below,
> > > when it scans from the start address to the end address of node 1.
> > > 
> > > Node 0 address: 0000 xxxx xxxx xxxx
> > > Node 1 address: xxxx 1111 1111 1111
> > > 
> > > This layout could occur on any architecture. Most architectures enable
> > > this config by default with CONFIG_NUMA. This patch enables the
> > > CONFIG_NODES_SPAN_OTHER_NODES behaviour, i.e. the early_pfn_in_nid()
> > > check, by default for NUMA.
> > 
> > I am not opposed to this at all. It reduces the config space and that is
> > a good thing on its own. The history has shown that memory layout might
> > be really wild wrt NUMA. The config is only used for early_pfn_in_nid()
> > which is clearly an overkill.
> > 
> > Your description doesn't really explain why this is safe though. The
> > history of this config is somewhat messy. Mike has tried
> > to remove it in a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> > NODES_SPAN_OTHER_NODES") just for it to be reintroduced by 7516795739bd
> > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> > reasoning whatsoever. This doesn't make it easy to see whether the
> > reasons for reintroduction are still there. Maybe there are some subtle
> > dependencies. I do not see any TBH, but those might be buried deep in
> > arch-specific code.
> 
> Well, back then early_pfn_in_nid() was arch-dependent; today everyone
> except ia64 relies on HAVE_MEMBLOCK_NODE_MAP.

What would it take to make ia64 use HAVE_MEMBLOCK_NODE_MAP? I would
really love to see that thing go away. It is causing problems when
people try to use the memblock API.

> So, if the memblock node map
> is correct, using CONFIG_NUMA instead of CONFIG_NODES_SPAN_OTHER_NODES
> would only mean that early_pfn_in_nid() will cost several cycles more on
> architectures that didn't select CONFIG_NODES_SPAN_OTHER_NODES (i.e. arm64
> and sh).

Do we have any idea how much of an overhead that is? Because this runs
for each pfn, so it can accumulate a lot!

> Again, ia64 is an exception here.

Thanks for the clarification!
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-30  9:58     ` Michal Hocko
@ 2020-03-30 10:26       ` Mike Rapoport
  2020-03-30 10:43         ` Baoquan He
  2020-03-31 21:56       ` [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP (was: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA) Mike Rapoport
  1 sibling, 1 reply; 36+ messages in thread
From: Mike Rapoport @ 2020-03-30 10:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hoan Tran, Catalin Marinas, Will Deacon, Andrew Morton,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Alexander Duyck,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On Mon, Mar 30, 2020 at 11:58:43AM +0200, Michal Hocko wrote:
> On Mon 30-03-20 12:21:27, Mike Rapoport wrote:
> > On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> > > On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > > > In a NUMA layout where nodes have memory ranges that span across other
> > > > nodes, the mm subsystem can detect the memory node id incorrectly.
> > > > 
> > > > For example, with the layout below:
> > > > Node 0 address: 0000 xxxx 0000 xxxx
> > > > Node 1 address: xxxx 1111 xxxx 1111
> > > > 
> > > > Note:
> > > >  - Memory addresses run from low to high
> > > >  - 0/1: Node id
> > > >  - x: Invalid memory of a node
> > > > 
> > > > When mm probes the memory map without the CONFIG_NODES_SPAN_OTHER_NODES
> > > > config, it only checks the memory validity but not the node id.
> > > > Because of that, Node 1 also claims the memory of node 0, as shown below,
> > > > when it scans from the start address to the end address of node 1.
> > > > 
> > > > Node 0 address: 0000 xxxx xxxx xxxx
> > > > Node 1 address: xxxx 1111 1111 1111
> > > > 
> > > > This layout could occur on any architecture. Most architectures enable
> > > > this config by default with CONFIG_NUMA. This patch enables the
> > > > CONFIG_NODES_SPAN_OTHER_NODES behaviour, i.e. the early_pfn_in_nid()
> > > > check, by default for NUMA.
> > > 
> > > I am not opposed to this at all. It reduces the config space and that is
> > > a good thing on its own. The history has shown that memory layout might
> > > be really wild wrt NUMA. The config is only used for early_pfn_in_nid()
> > > which is clearly an overkill.
> > > 
> > > Your description doesn't really explain why this is safe though. The
> > > history of this config is somewhat messy. Mike has tried
> > > to remove it in a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> > > NODES_SPAN_OTHER_NODES") just for it to be reintroduced by 7516795739bd
> > > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> > > reasoning whatsoever. This doesn't make it easy to see whether the
> > > reasons for reintroduction are still there. Maybe there are some subtle
> > > dependencies. I do not see any TBH, but those might be buried deep in
> > > arch-specific code.
> > 
> > Well, back then early_pfn_in_nid() was arch-dependent; today everyone
> > except ia64 relies on HAVE_MEMBLOCK_NODE_MAP.
> 
> What would it take to make ia64 use HAVE_MEMBLOCK_NODE_MAP? I would
> really love to see that thing go away. It is causing problems when
> people try to use the memblock API.

Sorry, my bad, ia64 does not have NODES_SPAN_OTHER_NODES, but it does have
HAVE_MEMBLOCK_NODE_MAP.

I remember I tried killing HAVE_MEMBLOCK_NODE_MAP, but I ran into
some problems and then got distracted. I too would like to see
HAVE_MEMBLOCK_NODE_MAP go away; maybe I'll take another look at it.
 
> > So, if the memblock node map
> > is correct, using CONFIG_NUMA instead of CONFIG_NODES_SPAN_OTHER_NODES
> > would only mean that early_pfn_in_nid() will cost several cycles more on
> > architectures that didn't select CONFIG_NODES_SPAN_OTHER_NODES (i.e. arm64
> > and sh).
> 
> Do we have any idea how much of an overhead that is? Because this runs
> for each pfn, so it can accumulate a lot!

It's O(log(N)) where N is the number of memory banks (i.e. memblock.memory.cnt).
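
The slow path is a binary search over memblock.memory, roughly
(simplified from mm/memblock.c):

static int __init_memblock memblock_search(struct memblock_type *type,
					   phys_addr_t addr)
{
	unsigned int left = 0, right = type->cnt;

	do {
		unsigned int mid = (right + left) / 2;

		if (addr < type->regions[mid].base)
			right = mid;
		else if (addr >= (type->regions[mid].base +
				  type->regions[mid].size))
			left = mid + 1;
		else
			return mid;	/* addr falls inside this region */
	} while (left < right);
	return -1;
}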
 
> > Again, ia64 is an exception here.
> 
> Thanks for the clarification!
> -- 
> Michal Hocko
> SUSE Labs

-- 
Sincerely yours,
Mike.



* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-30 10:26       ` Mike Rapoport
@ 2020-03-30 10:43         ` Baoquan He
  0 siblings, 0 replies; 36+ messages in thread
From: Baoquan He @ 2020-03-30 10:43 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Michal Hocko, Hoan Tran, Catalin Marinas, Will Deacon,
	Andrew Morton, Vlastimil Babka, Oscar Salvador, Pavel Tatashin,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On 03/30/20 at 01:26pm, Mike Rapoport wrote:
> On Mon, Mar 30, 2020 at 11:58:43AM +0200, Michal Hocko wrote:
> > On Mon 30-03-20 12:21:27, Mike Rapoport wrote:
> > > On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> > > > On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > > > > In a NUMA layout where nodes have memory ranges that span across other
> > > > > nodes, the mm subsystem can detect the memory node id incorrectly.
> > > > > 
> > > > > For example, with the layout below:
> > > > > Node 0 address: 0000 xxxx 0000 xxxx
> > > > > Node 1 address: xxxx 1111 xxxx 1111
> > > > > 
> > > > > Note:
> > > > >  - Memory addresses run from low to high
> > > > >  - 0/1: Node id
> > > > >  - x: Invalid memory of a node
> > > > > 
> > > > > When mm probes the memory map without the CONFIG_NODES_SPAN_OTHER_NODES
> > > > > config, it only checks the memory validity but not the node id.
> > > > > Because of that, Node 1 also claims the memory of node 0, as shown below,
> > > > > when it scans from the start address to the end address of node 1.
> > > > > 
> > > > > Node 0 address: 0000 xxxx xxxx xxxx
> > > > > Node 1 address: xxxx 1111 1111 1111
> > > > > 
> > > > > This layout could occur on any architecture. Most architectures enable
> > > > > this config by default with CONFIG_NUMA. This patch enables the
> > > > > CONFIG_NODES_SPAN_OTHER_NODES behaviour, i.e. the early_pfn_in_nid()
> > > > > check, by default for NUMA.
> > > > 
> > > > I am not opposed to this at all. It reduces the config space and that is
> > > > a good thing on its own. The history has shown that memory layout might
> > > > be really wild wrt NUMA. The config is only used for early_pfn_in_nid()
> > > > which is clearly an overkill.
> > > > 
> > > > Your description doesn't really explain why this is safe though. The
> > > > history of this config is somewhat messy. Mike has tried
> > > > to remove it in a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> > > > NODES_SPAN_OTHER_NODES") just for it to be reintroduced by 7516795739bd
> > > > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> > > > reasoning whatsoever. This doesn't make it easy to see whether the
> > > > reasons for reintroduction are still there. Maybe there are some subtle
> > > > dependencies. I do not see any TBH, but those might be buried deep in
> > > > arch-specific code.
> > > 
> > > Well, back then early_pfn_in_nid() was arch-dependent; today everyone
> > > except ia64 relies on HAVE_MEMBLOCK_NODE_MAP.
> > 
> > What would it take to make ia64 use HAVE_MEMBLOCK_NODE_MAP? I would
> > really love to see that thing go away. It is causing problems when
> > people try to use the memblock API.
> 
> Sorry, my bad, ia64 does not have NODES_SPAN_OTHER_NODES, but it does have
> HAVE_MEMBLOCK_NODE_MAP.
> 
> I remember I tried killing HAVE_MEMBLOCK_NODE_MAP, but I ran into
> some problems and then got distracted. I too would like to see
> HAVE_MEMBLOCK_NODE_MAP go away; maybe I'll take another look at it.
>  
> > > So, if the memblock node map
> > > is correct, using CONFIG_NUMA instead of CONFIG_NODES_SPAN_OTHER_NODES
> > > would only mean that early_pfn_in_nid() will cost several cycles more on
> > > architectures that didn't select CONFIG_NODES_SPAN_OTHER_NODES (i.e. arm64
> > > and sh).
> > 
> > Do we have any idea how much of an overhead that is? Because this runs
> > for each pfn, so it can accumulate a lot!
> 
> It's O(log(N)) where N is the number of memory banks (i.e. memblock.memory.cnt).

This is for the node id search. But early_pfn_in_nid() is called for
each pfn; that is the big cost, I think. Otherwise, it may be optimized
into a no-op.

>  
> > > Again, ia64 is an exception here.
> > 
> > Thanks for the clarification!
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> -- 
> Sincerely yours,
> Mike.
> 
> 



* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-30  7:42 ` Michal Hocko
                     ` (2 preceding siblings ...)
  2020-03-30  9:26   ` [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA Baoquan He
@ 2020-03-30 17:51   ` Mike Rapoport
  2020-03-30 18:23     ` Michal Hocko
  3 siblings, 1 reply; 36+ messages in thread
From: Mike Rapoport @ 2020-03-30 17:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hoan Tran, Catalin Marinas, Will Deacon, Andrew Morton,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Alexander Duyck,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > In a NUMA layout where nodes have memory ranges that span across other
> > nodes, the mm subsystem can detect the memory node id incorrectly.
> > 
> > For example, with the layout below:
> > Node 0 address: 0000 xxxx 0000 xxxx
> > Node 1 address: xxxx 1111 xxxx 1111
> > 
> > Note:
> >  - Memory addresses run from low to high
> >  - 0/1: Node id
> >  - x: Invalid memory of a node
> > 
> > When mm probes the memory map without the CONFIG_NODES_SPAN_OTHER_NODES
> > config, it only checks the memory validity but not the node id.
> > Because of that, Node 1 also claims the memory of node 0, as shown below,
> > when it scans from the start address to the end address of node 1.
> > 
> > Node 0 address: 0000 xxxx xxxx xxxx
> > Node 1 address: xxxx 1111 1111 1111
> > 
> > This layout could occur on any architecture. Most architectures enable
> > this config by default with CONFIG_NUMA. This patch enables the
> > CONFIG_NODES_SPAN_OTHER_NODES behaviour, i.e. the early_pfn_in_nid()
> > check, by default for NUMA.
> 
> I am not opposed to this at all. It reduces the config space and that is
> a good thing on its own. The history has shown that memory layout might
> be really wild wrt NUMA. The config is only used for early_pfn_in_nid()
> which is clearly an overkill.
> 
> Your description doesn't really explain why this is safe though. The
> history of this config is somewhat messy. Mike has tried
> to remove it in a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> NODES_SPAN_OTHER_NODES") just for it to be reintroduced by 7516795739bd
> ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> reasoning whatsoever. This doesn't make it easy to see whether the
> reasons for reintroduction are still there. Maybe there are some subtle
> dependencies. I do not see any TBH, but those might be buried deep in
> arch-specific code.

I've looked at this a bit more and it seems that the check for
early_pfn_in_nid() in memmap_init_zone() can simply be removed.

The commits you've mentioned came way before the addition of
HAVE_MEMBLOCK_NODE_MAP and the whole infrastructure that calculates zone
sizes and boundaries based on the memblock node map.
So memmap_init_zone() is called when zone boundaries are already
within a node.

I don't have access to machines with a memory layout that required this
check in the first place, so if anybody who does could test the change
below on such a machine, it would be great.


diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..6d3eb0901864 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5908,10 +5908,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
                                pfn = next_pfn(pfn);
                                continue;
                        }
-                       if (!early_pfn_in_nid(pfn, nid)) {
-                               pfn++;
-                               continue;
-                       }
                        if (overlap_memmap_init(zone, &pfn))
                                continue;
                        if (defer_init(nid, pfn, end_pfn))

 
> > v3:
> >  * Revise the patch description
> > 
> > V2:
> >  * Revise the patch description
> > 
> > Hoan Tran (5):
> >   mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
> >   powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> > 
> >  arch/powerpc/Kconfig | 9 ---------
> >  arch/s390/Kconfig    | 8 --------
> >  arch/sparc/Kconfig   | 9 ---------
> >  arch/x86/Kconfig     | 9 ---------
> >  mm/page_alloc.c      | 2 +-
> >  5 files changed, 1 insertion(+), 36 deletions(-)
> > 
> > -- 
> > 1.8.3.1
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Sincerely yours,
Mike.



* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-30 17:51   ` Mike Rapoport
@ 2020-03-30 18:23     ` Michal Hocko
  2020-03-31  8:14       ` Mike Rapoport
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2020-03-30 18:23 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Hoan Tran, Catalin Marinas, Will Deacon, Andrew Morton,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Alexander Duyck,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On Mon 30-03-20 20:51:00, Mike Rapoport wrote:
> On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> > On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > > In NUMA layout which nodes have memory ranges that span across other nodes,
> > > the mm driver can detect the memory node id incorrectly.
> > > 
> > > For example, with layout below
> > > Node 0 address: 0000 xxxx 0000 xxxx
> > > Node 1 address: xxxx 1111 xxxx 1111
> > > 
> > > Note:
> > >  - Memory from low to high
> > >  - 0/1: Node id
> > >  - x: Invalid memory of a node
> > > 
> > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES
> > > config, mm only checks the memory validity but not the node id.
> > > Because of that, Node 1 also detects the memory from node 0 as below
> > > when it scans from the start address to the end address of node 1.
> > > 
> > > Node 0 address: 0000 xxxx xxxx xxxx
> > > Node 1 address: xxxx 1111 1111 1111
> > > 
> > > This layout could occur on any architecture. Most of them enables
> > > this config by default with CONFIG_NUMA. This patch, by default, enables
> > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA.
> > 
> > I am not opposed to this at all. It reduces the config space and that is
> > a good thing on its own. The history has shown that memory layout might
> > be really wild wrt NUMA. The config is only used for early_pfn_in_nid
> > which is clearly overkill.
> > 
> > Your description doesn't really explain why this is safe though. The
> > history of this config is somewhat messy. Mike tried to remove it in
> > a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> > NODES_SPAN_OTHER_NODES") just for it to be reintroduced by 7516795739bd
> > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> > reasoning whatsoever. This doesn't make it easy to see whether the
> > reasons for the reintroduction are still there. Maybe there are some
> > subtle dependencies. I do not see any TBH, but they might be buried
> > deep in arch-specific code.
> 
> I've looked at this a bit more and it seems that the check for
> early_pfn_in_nid() in memmap_init_zone() can be simply removed.
> 
> The commits you've mentioned were way before the addition of
> HAVE_MEMBLOCK_NODE_MAP and the whole infrastructure that calculates zone
> sizes and boundaries based on the memblock node map.
> So, memmap_init_zone() is called when the zone boundaries are already
> within a node.

But zones from different nodes might overlap in the pfn range. And this
check is there to skip over those overlapping areas. The only way I can
see to get rid of this check is to do a different pfn walk that goes
through the memblock ranges, which are guaranteed to belong to a single
node.
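
Roughly something like this (a completely untested sketch;
zone_start_pfn/zone_end_pfn are made-up names for the zone's boundaries,
and it assumes for_each_mem_pfn_range() is available, i.e.
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y):

	unsigned long spfn, epfn;
	int i;

	/* walk only this node's memblock ranges, clamped to the zone */
	for_each_mem_pfn_range(i, nid, &spfn, &epfn, NULL) {
		spfn = clamp(spfn, zone_start_pfn, zone_end_pfn);
		epfn = clamp(epfn, zone_start_pfn, zone_end_pfn);
		if (spfn < epfn)
			memmap_init_zone(epfn - spfn, nid, zone, spfn,
					 MEMMAP_EARLY, NULL);
	}
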
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-30 18:23     ` Michal Hocko
@ 2020-03-31  8:14       ` Mike Rapoport
  2020-03-31  8:55         ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Mike Rapoport @ 2020-03-31  8:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hoan Tran, Catalin Marinas, Will Deacon, Andrew Morton,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Alexander Duyck,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On Mon, Mar 30, 2020 at 08:23:01PM +0200, Michal Hocko wrote:
> On Mon 30-03-20 20:51:00, Mike Rapoport wrote:
> > On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> > > On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > > > In NUMA layout which nodes have memory ranges that span across other nodes,
> > > > the mm driver can detect the memory node id incorrectly.
> > > > 
> > > > For example, with layout below
> > > > Node 0 address: 0000 xxxx 0000 xxxx
> > > > Node 1 address: xxxx 1111 xxxx 1111
> > > > 
> > > > Note:
> > > >  - Memory from low to high
> > > >  - 0/1: Node id
> > > >  - x: Invalid memory of a node
> > > > 
> > > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES
> > > > config, mm only checks the memory validity but not the node id.
> > > > Because of that, Node 1 also detects the memory from node 0 as below
> > > > when it scans from the start address to the end address of node 1.
> > > > 
> > > > Node 0 address: 0000 xxxx xxxx xxxx
> > > > Node 1 address: xxxx 1111 1111 1111
> > > > 
> > > > This layout could occur on any architecture. Most of them enables
> > > > this config by default with CONFIG_NUMA. This patch, by default, enables
> > > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA.
> > > 
> > > I am not opposed to this at all. It reduces the config space and that is
> > > a good thing on its own. The history has shown that memory layout might
> > > be really wild wrt NUMA. The config is only used for early_pfn_in_nid
> > > which is clearly overkill.
> > > 
> > > Your description doesn't really explain why this is safe though. The
> > > history of this config is somewhat messy. Mike tried to remove it in
> > > a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> > > NODES_SPAN_OTHER_NODES") just for it to be reintroduced by 7516795739bd
> > > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> > > reasoning whatsoever. This doesn't make it easy to see whether the
> > > reasons for the reintroduction are still there. Maybe there are some
> > > subtle dependencies. I do not see any TBH, but they might be buried
> > > deep in arch-specific code.
> > 
> > I've looked at this a bit more and it seems that the check for
> > early_pfn_in_nid() in memmap_init_zone() can be simply removed.
> > 
> > The commits you've mentioned were way before the addition of
> > HAVE_MEMBLOCK_NODE_MAP and the whole infrastructure that calculates zone
> > sizes and boundaries based on the memblock node map.
> > So, memmap_init_zone() is called when the zone boundaries are already
> > within a node.
> 
> But zones from different nodes might overlap in the pfn range. And this
> check is there to skip over those overlapping areas.

Maybe I misread the code, but I don't see how this could happen. In the
HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
calculate_node_totalpages() that ensures that node->node_zones are entirely
within the node because this is checked in zone_spanned_pages_in_node().
So, for zones from different nodes to overlap in the pfn range, the nodes
themselves would have to overlap. Is that even possible?
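
The check I mean is essentially this clamping (paraphrasing
zone_spanned_pages_in_node() in mm/page_alloc.c):

	/* Move the zone boundaries inside the node if necessary */
	*zone_end_pfn = min(*zone_end_pfn, node_end_pfn);
	*zone_start_pfn = max(*zone_start_pfn, node_start_pfn);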


> The only way I can see to get rid of this check is to do a different
> pfn walk that goes through the memblock ranges, which are guaranteed to
> belong to a single node.
> -- 
> Michal Hocko
> SUSE Labs

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-31  8:14       ` Mike Rapoport
@ 2020-03-31  8:55         ` Michal Hocko
  2020-03-31 14:03           ` Baoquan He
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2020-03-31  8:55 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Hoan Tran, Catalin Marinas, Will Deacon, Andrew Morton,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Alexander Duyck,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
> On Mon, Mar 30, 2020 at 08:23:01PM +0200, Michal Hocko wrote:
> > On Mon 30-03-20 20:51:00, Mike Rapoport wrote:
> > > On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> > > > On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > > > > In NUMA layout which nodes have memory ranges that span across other nodes,
> > > > > the mm driver can detect the memory node id incorrectly.
> > > > > 
> > > > > For example, with layout below
> > > > > Node 0 address: 0000 xxxx 0000 xxxx
> > > > > Node 1 address: xxxx 1111 xxxx 1111
> > > > > 
> > > > > Note:
> > > > >  - Memory from low to high
> > > > >  - 0/1: Node id
> > > > >  - x: Invalid memory of a node
> > > > > 
> > > > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES
> > > > > config, mm only checks the memory validity but not the node id.
> > > > > Because of that, Node 1 also detects the memory from node 0 as below
> > > > > when it scans from the start address to the end address of node 1.
> > > > > 
> > > > > Node 0 address: 0000 xxxx xxxx xxxx
> > > > > Node 1 address: xxxx 1111 1111 1111
> > > > > 
> > > > > This layout could occur on any architecture. Most of them enables
> > > > > this config by default with CONFIG_NUMA. This patch, by default, enables
> > > > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA.
> > > > 
> > > > I am not opposed to this at all. It reduces the config space and that is
> > > > a good thing on its own. The history has shown that memory layout might
> > > > be really wild wrt NUMA. The config is only used for early_pfn_in_nid
> > > > which is clearly overkill.
> > > > 
> > > > Your description doesn't really explain why this is safe though. The
> > > > history of this config is somewhat messy. Mike tried to remove it in
> > > > a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> > > > NODES_SPAN_OTHER_NODES") just for it to be reintroduced by 7516795739bd
> > > > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> > > > reasoning whatsoever. This doesn't make it easy to see whether the
> > > > reasons for the reintroduction are still there. Maybe there are some
> > > > subtle dependencies. I do not see any TBH, but they might be buried
> > > > deep in arch-specific code.
> > > 
> > > I've looked at this a bit more and it seems that the check for
> > > early_pfn_in_nid() in memmap_init_zone() can be simply removed.
> > > 
> > > The commits you've mentioned were way before the addition of
> > > HAVE_MEMBLOCK_NODE_MAP and the whole infrastructure that calculates zone
> > > sizes and boundaries based on the memblock node map.
> > > So, memmap_init_zone() is called when the zone boundaries are already
> > > within a node.
> > 
> > But zones from different nodes might overlap in the pfn range. And this
> > check is there to skip over those overlapping areas.
> 
> > Maybe I misread the code, but I don't see how this could happen. In the
> HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
> calculate_node_totalpages() that ensures that node->node_zones are entirely
> within the node because this is checked in zone_spanned_pages_in_node().

zone_spanned_pages_in_node() does check that the zone boundaries are
within the node boundaries. But that doesn't really tell us anything
about other potential zones interleaving with the physical memory range.
zone->spanned_pages simply gives the physical range for the zone
including holes. Interleaving nodes are essentially a hole
(__absent_pages_in_range() is going to skip those).

That means that free_area_init_core() simply goes over the whole
physical zone range, including holes, and that is why we need to check
for both physical and logical holes (aka other nodes).
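
To make that concrete with a made-up layout:

	Node 0 memory: pfns 0x00000-0x0ffff and 0x20000-0x2ffff
	Node 1 memory: pfns 0x10000-0x1ffff and 0x30000-0x3ffff

Node 0's zone then spans pfns 0x00000-0x2ffff, i.e. spanned_pages is
0x30000 while only 0x20000 pages are present; the 0x10000-0x1ffff range
is a hole from node 0's point of view even though it is perfectly valid
memory - it just belongs to node 1, and the pfn walk still visits it.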

Life would be so much easier if the whole thing simply iterated
over memblocks...

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-31  8:55         ` Michal Hocko
@ 2020-03-31 14:03           ` Baoquan He
  2020-03-31 14:21             ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Baoquan He @ 2020-03-31 14:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Hoan Tran, Catalin Marinas, Will Deacon,
	Andrew Morton, Vlastimil Babka, Oscar Salvador, Pavel Tatashin,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

Hi Michal,

On 03/31/20 at 10:55am, Michal Hocko wrote:
> On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
> > Maybe I misread the code, but I don't see how this could happen. In the
> > HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
> > calculate_node_totalpages() that ensures that node->node_zones are entirely
> > within the node because this is checked in zone_spanned_pages_in_node().
> 
> zone_spanned_pages_in_node() does check that the zone boundaries are
> within the node boundaries. But that doesn't really tell us anything
> about other potential zones interleaving with the physical memory range.
> zone->spanned_pages simply gives the physical range for the zone
> including holes. Interleaving nodes are essentially a hole
> (__absent_pages_in_range() is going to skip those).
> 
> That means that free_area_init_core() simply goes over the whole
> physical zone range, including holes, and that is why we need to check
> for both physical and logical holes (aka other nodes).
> 
> Life would be so much easier if the whole thing simply iterated
> over memblocks...

Iterating over memblocks sounds like a great idea. I tried putting the
memblock iteration in the upper layer, memmap_init(), which is used for
boot memory only anyway. Do you think it's doable and OK? If yes, I can
work out a formal patch to make this simpler, as you said. The draft code
is below. Like this it reuses the existing code and involves little change.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 138a56c0f48f..558d421f294b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6007,14 +6007,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		 * function.  They do not exist on hotplugged memory.
 		 */
 		if (context == MEMMAP_EARLY) {
-			if (!early_pfn_valid(pfn)) {
-				pfn = next_pfn(pfn);
-				continue;
-			}
-			if (!early_pfn_in_nid(pfn, nid)) {
-				pfn++;
-				continue;
-			}
 			if (overlap_memmap_init(zone, &pfn))
 				continue;
 			if (defer_init(nid, pfn, end_pfn))
@@ -6130,9 +6122,17 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 }
 
 void __meminit __weak memmap_init(unsigned long size, int nid,
-				  unsigned long zone, unsigned long start_pfn)
+				  unsigned long zone, unsigned long range_start_pfn)
 {
-	memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
+	unsigned long start_pfn, end_pfn;
+	unsigned long range_end_pfn = range_start_pfn + size;
+	int i;
+	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
+		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
+		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
+		if (end_pfn > start_pfn)
+			memmap_init_zone(end_pfn - start_pfn, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
+	}
 }
 
 static int zone_batchsize(struct zone *zone)


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-31 14:03           ` Baoquan He
@ 2020-03-31 14:21             ` Michal Hocko
  2020-03-31 14:31               ` Baoquan He
  2020-04-09 16:27               ` Mike Rapoport
  0 siblings, 2 replies; 36+ messages in thread
From: Michal Hocko @ 2020-03-31 14:21 UTC (permalink / raw)
  To: Baoquan He
  Cc: Mike Rapoport, Hoan Tran, Catalin Marinas, Will Deacon,
	Andrew Morton, Vlastimil Babka, Oscar Salvador, Pavel Tatashin,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On Tue 31-03-20 22:03:32, Baoquan He wrote:
> Hi Michal,
> 
> On 03/31/20 at 10:55am, Michal Hocko wrote:
> > On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
> > > Maybe I misread the code, but I don't see how this could happen. In the
> > > HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
> > > calculate_node_totalpages() that ensures that node->node_zones are entirely
> > > within the node because this is checked in zone_spanned_pages_in_node().
> > 
> > zone_spanned_pages_in_node() does check that the zone boundaries are
> > within the node boundaries. But that doesn't really tell us anything
> > about other potential zones interleaving with the physical memory range.
> > zone->spanned_pages simply gives the physical range for the zone
> > including holes. Interleaving nodes are essentially a hole
> > (__absent_pages_in_range() is going to skip those).
> > 
> > That means that free_area_init_core() simply goes over the whole
> > physical zone range, including holes, and that is why we need to check
> > for both physical and logical holes (aka other nodes).
> > 
> > Life would be so much easier if the whole thing simply iterated
> > over memblocks...
> 
> Iterating over memblocks sounds like a great idea. I tried putting the
> memblock iteration in the upper layer, memmap_init(), which is used for
> boot memory only anyway. Do you think it's doable and OK? If yes, I can
> work out a formal patch to make this simpler, as you said. The draft code
> is below. Like this it reuses the existing code and involves little change.

Doing this would be a step in the right direction! I haven't checked the
code very closely though. The below sounds way too simple to be true, I
am afraid. First, for_each_mem_pfn_range() is available only for
CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep
saying that I really hate that being conditional). Also, I haven't really
checked the deferred initialization path - I have a very vague
recollection that it has been converted to the memblock API but I have
happily dropped all that memory.
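
For reference, the iterator is currently guarded like this in
include/linux/memblock.h (abridged):

	#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
	...
	#define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid)		\
		for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid); \
		     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
	#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */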
 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 138a56c0f48f..558d421f294b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6007,14 +6007,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>  		 * function.  They do not exist on hotplugged memory.
>  		 */
>  		if (context == MEMMAP_EARLY) {
> -			if (!early_pfn_valid(pfn)) {
> -				pfn = next_pfn(pfn);
> -				continue;
> -			}
> -			if (!early_pfn_in_nid(pfn, nid)) {
> -				pfn++;
> -				continue;
> -			}
>  			if (overlap_memmap_init(zone, &pfn))
>  				continue;
>  			if (defer_init(nid, pfn, end_pfn))
> @@ -6130,9 +6122,17 @@ static void __meminit zone_init_free_lists(struct zone *zone)
>  }
>  
>  void __meminit __weak memmap_init(unsigned long size, int nid,
> -				  unsigned long zone, unsigned long start_pfn)
> +				  unsigned long zone, unsigned long range_start_pfn)
>  {
> -	memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
> +	unsigned long start_pfn, end_pfn;
> +	unsigned long range_end_pfn = range_start_pfn + size;
> +	int i;
> +	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> +		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> +		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> +		if (end_pfn > start_pfn)
> +			memmap_init_zone(end_pfn - start_pfn, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
> +	}
>  }
>  
>  static int zone_batchsize(struct zone *zone)

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-31 14:21             ` Michal Hocko
@ 2020-03-31 14:31               ` Baoquan He
  2020-04-03  4:46                 ` Hoan Tran
  2020-04-09 16:27               ` Mike Rapoport
  1 sibling, 1 reply; 36+ messages in thread
From: Baoquan He @ 2020-03-31 14:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Hoan Tran, Catalin Marinas, Will Deacon,
	Andrew Morton, Vlastimil Babka, Oscar Salvador, Pavel Tatashin,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On 03/31/20 at 04:21pm, Michal Hocko wrote:
> On Tue 31-03-20 22:03:32, Baoquan He wrote:
> > Hi Michal,
> > 
> > On 03/31/20 at 10:55am, Michal Hocko wrote:
> > > On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
> > > > Maybe I misread the code, but I don't see how this could happen. In the
> > > > HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
> > > > calculate_node_totalpages() that ensures that node->node_zones are entirely
> > > > within the node because this is checked in zone_spanned_pages_in_node().
> > > 
> > > zone_spanned_pages_in_node() does check that the zone boundaries are
> > > within the node boundaries. But that doesn't really tell us anything
> > > about other potential zones interleaving with the physical memory range.
> > > zone->spanned_pages simply gives the physical range for the zone
> > > including holes. Interleaving nodes are essentially a hole
> > > (__absent_pages_in_range() is going to skip those).
> > > 
> > > That means that free_area_init_core() simply goes over the whole
> > > physical zone range, including holes, and that is why we need to check
> > > for both physical and logical holes (aka other nodes).
> > > 
> > > Life would be so much easier if the whole thing simply iterated
> > > over memblocks...
> > 
> > Iterating over memblocks sounds like a great idea. I tried putting the
> > memblock iteration in the upper layer, memmap_init(), which is used for
> > boot memory only anyway. Do you think it's doable and OK? If yes, I can
> > work out a formal patch to make this simpler, as you said. The draft code
> > is below. Like this it reuses the existing code and involves little change.
> 
> Doing this would be a step in the right direction! I haven't checked the
> code very closely though. The below sounds way too simple to be true, I
> am afraid. First, for_each_mem_pfn_range() is available only for
> CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep
> saying that I really hate that being conditional). Also, I haven't really
> checked the deferred initialization path - I have a very vague
> recollection that it has been converted to the memblock API but I have
> happily dropped all that memory.

Thanks for your quick response and for pointing out the remaining
suspect aspects. I will investigate what you mentioned and see whether
they have an impact.

>  
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 138a56c0f48f..558d421f294b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6007,14 +6007,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> >  		 * function.  They do not exist on hotplugged memory.
> >  		 */
> >  		if (context == MEMMAP_EARLY) {
> > -			if (!early_pfn_valid(pfn)) {
> > -				pfn = next_pfn(pfn);
> > -				continue;
> > -			}
> > -			if (!early_pfn_in_nid(pfn, nid)) {
> > -				pfn++;
> > -				continue;
> > -			}
> >  			if (overlap_memmap_init(zone, &pfn))
> >  				continue;
> >  			if (defer_init(nid, pfn, end_pfn))
> > @@ -6130,9 +6122,17 @@ static void __meminit zone_init_free_lists(struct zone *zone)
> >  }
> >  
> >  void __meminit __weak memmap_init(unsigned long size, int nid,
> > -				  unsigned long zone, unsigned long start_pfn)
> > +				  unsigned long zone, unsigned long range_start_pfn)
> >  {
> > -	memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
> > +	unsigned long start_pfn, end_pfn;
> > +	unsigned long range_end_pfn = range_start_pfn + size;
> > +	int i;
> > +	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > +		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> > +		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> > +		if (end_pfn > start_pfn)
> > +			memmap_init_zone(end_pfn - start_pfn, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
> > +	}
> >  }
> >  
> >  static int zone_batchsize(struct zone *zone)
> 
> -- 
> Michal Hocko
> SUSE Labs
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP (was: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA)
  2020-03-30  9:58     ` Michal Hocko
  2020-03-30 10:26       ` Mike Rapoport
@ 2020-03-31 21:56       ` Mike Rapoport
  2020-04-01  5:42         ` Baoquan He
  1 sibling, 1 reply; 36+ messages in thread
From: Mike Rapoport @ 2020-03-31 21:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hoan Tran, Catalin Marinas, Will Deacon, Andrew Morton,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Alexander Duyck,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On Mon, Mar 30, 2020 at 11:58:43AM +0200, Michal Hocko wrote:
> 
> What would it take to make ia64 use HAVE_MEMBLOCK_NODE_MAP? I would
> really love to see that thing go away. It is causing problems when
> people try to use memblock api.

Well, it's a small patch in the end :)

Currently, all NUMA architectures enable
CONFIG_HAVE_MEMBLOCK_NODE_MAP and use free_area_init_nodes() to
initialize the node and zone structures.

On the other hand, the systems that don't have
CONFIG_HAVE_MEMBLOCK_NODE_MAP use free_area_init() or free_area_init_node()
for this purpose.

With these assumptions, it's possible to select the functions that
calculate spanned and absent pages at runtime.

This patch builds for arm and x86-64 and boots on qemu-system for both.

From f907df987db4d6735c4940b30cfb4764fc0007d4 Mon Sep 17 00:00:00 2001
From: Mike Rapoport <rppt@linux.ibm.com>
Date: Wed, 1 Apr 2020 00:27:17 +0300
Subject: [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP option

The CONFIG_HAVE_MEMBLOCK_NODE_MAP option is used to differentiate the
initialization of the node and zone structures between the systems that
have a region-to-node mapping in memblock and those that don't.

Currently all the NUMA architectures enable this option, and for the
non-NUMA systems we can presume that all the memory belongs to node 0;
therefore the compile-time configuration option is not required.

Still, free_area_init_node() must have a backward-compatible version
because its semantics with and without CONFIG_HAVE_MEMBLOCK_NODE_MAP are
different. Once all the architectures are converted from
free_area_init() and free_area_init_node() to free_area_init_nodes(),
the entire compatibility layer can be dropped.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 include/linux/memblock.h |  15 ------
 include/linux/mm.h       |  16 +-----
 include/linux/mmzone.h   |  13 -----
 mm/memblock.c            |   9 +---
 mm/memory_hotplug.c      |   4 --
 mm/page_alloc.c          | 103 ++++++++++++++++++++++-----------------
 6 files changed, 61 insertions(+), 99 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 079d17d96410..9de81112447e 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -50,9 +50,7 @@ struct memblock_region {
 	phys_addr_t base;
 	phys_addr_t size;
 	enum memblock_flags flags;
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	int nid;
-#endif
 };
 
 /**
@@ -215,7 +213,6 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
 	return m->flags & MEMBLOCK_NOMAP;
 }
 
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
 			    unsigned long  *end_pfn);
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
@@ -234,7 +231,6 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
 #define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid)		\
 	for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid); \
 	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
-#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
@@ -310,7 +306,6 @@ void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
 	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved,	\
 			       nid, flags, p_start, p_end, p_nid)
 
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int memblock_set_node(phys_addr_t base, phys_addr_t size,
 		      struct memblock_type *type, int nid);
 
@@ -323,16 +318,6 @@ static inline int memblock_get_region_node(const struct memblock_region *r)
 {
 	return r->nid;
 }
-#else
-static inline void memblock_set_region_node(struct memblock_region *r, int nid)
-{
-}
-
-static inline int memblock_get_region_node(const struct memblock_region *r)
-{
-	return 0;
-}
-#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 /* Flags for memblock allocation APIs */
 #define MEMBLOCK_ALLOC_ANYWHERE	(~(phys_addr_t)0)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c54fb96cb1e6..368a45d4696a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2125,9 +2125,8 @@ static inline unsigned long get_num_physpages(void)
 	return phys_pages;
 }
 
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 /*
- * With CONFIG_HAVE_MEMBLOCK_NODE_MAP set, an architecture may initialise its
+ * Using memblock node mappings, an architecture may initialise its
  * zones, allocate the backing mem_map and account for memory holes in a more
  * architecture independent manner. This is a substitute for creating the
  * zone_sizes[] and zholes_size[] arrays and passing them to
@@ -2148,9 +2147,6 @@ static inline unsigned long get_num_physpages(void)
  * registered physical page range.  Similarly
  * sparse_memory_present_with_active_regions() calls memory_present() for
  * each range when SPARSEMEM is enabled.
- *
- * See mm/page_alloc.c for more information on each function exposed by
- * CONFIG_HAVE_MEMBLOCK_NODE_MAP.
  */
 extern void free_area_init_nodes(unsigned long *max_zone_pfn);
 unsigned long node_map_pfn_alignment(void);
@@ -2165,22 +2161,12 @@ extern void free_bootmem_with_active_regions(int nid,
 						unsigned long max_low_pfn);
 extern void sparse_memory_present_with_active_regions(int nid);
 
-#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
-#if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
-    !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
-static inline int __early_pfn_to_nid(unsigned long pfn,
-					struct mminit_pfnnid_cache *state)
-{
-	return 0;
-}
-#else
 /* please see mm/page_alloc.c */
 extern int __meminit early_pfn_to_nid(unsigned long pfn);
 /* there is a per-arch backend function. */
 extern int __meminit __early_pfn_to_nid(unsigned long pfn,
 					struct mminit_pfnnid_cache *state);
-#endif
 
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 462f6873905a..4422d1961d0e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -917,11 +917,7 @@ extern int movable_zone;
 #ifdef CONFIG_HIGHMEM
 static inline int zone_movable_is_highmem(void)
 {
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	return movable_zone == ZONE_HIGHMEM;
-#else
-	return (ZONE_MOVABLE - 1) == ZONE_HIGHMEM;
-#endif
 }
 #endif
 
@@ -1121,15 +1117,6 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 #include <asm/sparsemem.h>
 #endif
 
-#if !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && \
-	!defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
-static inline unsigned long early_pfn_to_nid(unsigned long pfn)
-{
-	BUILD_BUG_ON(IS_ENABLED(CONFIG_NUMA));
-	return 0;
-}
-#endif
-
 #ifdef CONFIG_FLATMEM
 #define pfn_to_nid(pfn)		(0)
 #endif
diff --git a/mm/memblock.c b/mm/memblock.c
index eba94ee3de0b..819441133a21 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -620,9 +620,7 @@ static int __init_memblock memblock_add_range(struct memblock_type *type,
 		 * area, insert that portion.
 		 */
 		if (rbase > base) {
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 			WARN_ON(nid != memblock_get_region_node(rgn));
-#endif
 			WARN_ON(flags != rgn->flags);
 			nr_new++;
 			if (insert)
@@ -1197,7 +1195,6 @@ void __init_memblock __next_mem_range_rev(u64 *idx, int nid,
 	*idx = ULLONG_MAX;
 }
 
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 /*
  * Common iterator interface used to define for_each_mem_pfn_range().
  */
@@ -1258,7 +1255,7 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
 	memblock_merge_regions(type);
 	return 0;
 }
-#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 /**
  * __next_mem_pfn_range_in_zone - iterator for for_each_*_range_in_zone()
@@ -1797,7 +1794,6 @@ bool __init_memblock memblock_is_map_memory(phys_addr_t addr)
 	return !memblock_is_nomap(&memblock.memory.regions[i]);
 }
 
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
 			 unsigned long *start_pfn, unsigned long *end_pfn)
 {
@@ -1812,7 +1808,6 @@ int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
 
 	return type->regions[mid].nid;
 }
-#endif
 
 /**
  * memblock_is_region_memory - check if a region is a subset of memory
@@ -1903,11 +1898,9 @@ static void __init_memblock memblock_dump(struct memblock_type *type)
 		size = rgn->size;
 		end = base + size - 1;
 		flags = rgn->flags;
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 		if (memblock_get_region_node(rgn) != MAX_NUMNODES)
 			snprintf(nid_buf, sizeof(nid_buf), " on node %d",
 				 memblock_get_region_node(rgn));
-#endif
 		pr_info(" %s[%#x]\t[%pa-%pa], %pa bytes%s flags: %#x\n",
 			type->name, idx, &base, &end, &size, nid_buf, flags);
 	}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 19389cdc16a5..dc8828b087bf 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1366,11 +1366,7 @@ check_pages_isolated_cb(unsigned long start_pfn, unsigned long nr_pages,
 
 static int __init cmdline_parse_movable_node(char *p)
 {
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	movable_node_enabled = true;
-#else
-	pr_warn("movable_node parameter depends on CONFIG_HAVE_MEMBLOCK_NODE_MAP to work properly\n");
-#endif
 	return 0;
 }
 early_param("movable_node", cmdline_parse_movable_node);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..84a69d6e7e61 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -335,7 +335,6 @@ static unsigned long nr_kernel_pages __initdata;
 static unsigned long nr_all_pages __initdata;
 static unsigned long dma_reserve __initdata;
 
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 static unsigned long arch_zone_lowest_possible_pfn[MAX_NR_ZONES] __initdata;
 static unsigned long arch_zone_highest_possible_pfn[MAX_NR_ZONES] __initdata;
 static unsigned long required_kernelcore __initdata;
@@ -348,7 +347,6 @@ static bool mirrored_kernelcore __meminitdata;
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
 EXPORT_SYMBOL(movable_zone);
-#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 #if MAX_NUMNODES > 1
 unsigned int nr_node_ids __read_mostly = MAX_NUMNODES;
@@ -1447,9 +1445,6 @@ void __free_pages_core(struct page *page, unsigned int order)
 	__free_pages(page, order);
 }
 
-#if defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) || \
-	defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
-
 static struct mminit_pfnnid_cache early_pfnnid_cache __meminitdata;
 
 int __meminit early_pfn_to_nid(unsigned long pfn)
@@ -1465,7 +1460,6 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
 
 	return nid;
 }
-#endif
 
 #ifdef CONFIG_NODES_SPAN_OTHER_NODES
 /* Only safe to use early in boot when initialisation is single-threaded */
@@ -5828,7 +5822,6 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
 static bool __meminit
 overlap_memmap_init(unsigned long zone, unsigned long *pfn)
 {
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	static struct memblock_region *r;
 
 	if (mirrored_kernelcore && zone == ZONE_MOVABLE) {
@@ -5844,7 +5837,7 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
 			return true;
 		}
 	}
-#endif
+
 	return false;
 }
 
@@ -6227,7 +6220,6 @@ void __meminit init_currently_empty_zone(struct zone *zone,
 	zone->initialized = 1;
 }
 
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
 
 /*
@@ -6503,8 +6495,7 @@ static unsigned long __init zone_absent_pages_in_node(int nid,
 	return nr_absent;
 }
 
-#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
-static inline unsigned long __init zone_spanned_pages_in_node(int nid,
+static inline unsigned long __init compat_zone_spanned_pages_in_node(int nid,
 					unsigned long zone_type,
 					unsigned long node_start_pfn,
 					unsigned long node_end_pfn,
@@ -6523,7 +6514,7 @@ static inline unsigned long __init zone_spanned_pages_in_node(int nid,
 	return zones_size[zone_type];
 }
 
-static inline unsigned long __init zone_absent_pages_in_node(int nid,
+static inline unsigned long __init compat_zone_absent_pages_in_node(int nid,
 						unsigned long zone_type,
 						unsigned long node_start_pfn,
 						unsigned long node_end_pfn,
@@ -6535,13 +6526,12 @@ static inline unsigned long __init zone_absent_pages_in_node(int nid,
 	return zholes_size[zone_type];
 }
 
-#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
-
 static void __init calculate_node_totalpages(struct pglist_data *pgdat,
 						unsigned long node_start_pfn,
 						unsigned long node_end_pfn,
 						unsigned long *zones_size,
-						unsigned long *zholes_size)
+						unsigned long *zholes_size,
+						bool compat)
 {
 	unsigned long realtotalpages = 0, totalpages = 0;
 	enum zone_type i;
@@ -6549,17 +6539,38 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 		unsigned long zone_start_pfn, zone_end_pfn;
+		unsigned long spanned, absent;
 		unsigned long size, real_size;
 
-		size = zone_spanned_pages_in_node(pgdat->node_id, i,
-						  node_start_pfn,
-						  node_end_pfn,
-						  &zone_start_pfn,
-						  &zone_end_pfn,
-						  zones_size);
-		real_size = size - zone_absent_pages_in_node(pgdat->node_id, i,
-						  node_start_pfn, node_end_pfn,
-						  zholes_size);
+		if (compat) {
+			spanned = compat_zone_spanned_pages_in_node(
+						pgdat->node_id, i,
+						node_start_pfn,
+						node_end_pfn,
+						&zone_start_pfn,
+						&zone_end_pfn,
+						zones_size);
+			absent = compat_zone_absent_pages_in_node(
+						pgdat->node_id, i,
+						node_start_pfn,
+						node_end_pfn,
+						zholes_size);
+		} else {
+			spanned = zone_spanned_pages_in_node(pgdat->node_id, i,
+						node_start_pfn,
+						node_end_pfn,
+						&zone_start_pfn,
+						&zone_end_pfn,
+						zones_size);
+			absent = zone_absent_pages_in_node(pgdat->node_id, i,
+						node_start_pfn,
+						node_end_pfn,
+						zholes_size);
+		}
+
+		size = spanned;
+		real_size = size - absent;
+
 		if (size)
 			zone->zone_start_pfn = zone_start_pfn;
 		else
@@ -6859,10 +6870,8 @@ static void __ref alloc_node_mem_map(struct pglist_data *pgdat)
 	 */
 	if (pgdat == NODE_DATA(0)) {
 		mem_map = NODE_DATA(0)->node_mem_map;
-#if defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) || defined(CONFIG_FLATMEM)
 		if (page_to_pfn(mem_map) != pgdat->node_start_pfn)
 			mem_map -= offset;
-#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 	}
 #endif
 }
@@ -6879,9 +6888,10 @@ static inline void pgdat_set_deferred_range(pg_data_t *pgdat)
 static inline void pgdat_set_deferred_range(pg_data_t *pgdat) {}
 #endif
 
-void __init free_area_init_node(int nid, unsigned long *zones_size,
-				   unsigned long node_start_pfn,
-				   unsigned long *zholes_size)
+void __init __free_area_init_node(int nid, unsigned long *zones_size,
+				  unsigned long node_start_pfn,
+				  unsigned long *zholes_size,
+				  bool compat)
 {
 	pg_data_t *pgdat = NODE_DATA(nid);
 	unsigned long start_pfn = 0;
@@ -6893,16 +6903,17 @@ void __init free_area_init_node(int nid, unsigned long *zones_size,
 	pgdat->node_id = nid;
 	pgdat->node_start_pfn = node_start_pfn;
 	pgdat->per_cpu_nodestats = NULL;
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-	get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
-	pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
-		(u64)start_pfn << PAGE_SHIFT,
-		end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
-#else
-	start_pfn = node_start_pfn;
-#endif
+	if (!compat) {
+		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
+		pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
+			(u64)start_pfn << PAGE_SHIFT,
+			end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
+	} else {
+		start_pfn = node_start_pfn;
+	}
+
 	calculate_node_totalpages(pgdat, start_pfn, end_pfn,
-				  zones_size, zholes_size);
+				  zones_size, zholes_size, compat);
 
 	alloc_node_mem_map(pgdat);
 	pgdat_set_deferred_range(pgdat);
@@ -6910,6 +6921,14 @@ void __init free_area_init_node(int nid, unsigned long *zones_size,
 	free_area_init_core(pgdat);
 }
 
+void __init free_area_init_node(int nid, unsigned long *zones_size,
+				unsigned long node_start_pfn,
+				unsigned long *zholes_size)
+{
+	__free_area_init_node(nid, zones_size, node_start_pfn, zholes_size,
+			      true);
+}
+
 #if !defined(CONFIG_FLAT_NODE_MEM_MAP)
 /*
  * Initialize all valid struct pages in the range [spfn, epfn) and mark them
@@ -6993,8 +7012,6 @@ static inline void __init init_unavailable_mem(void)
 }
 #endif /* !CONFIG_FLAT_NODE_MEM_MAP */
 
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-
 #if MAX_NUMNODES > 1
 /*
  * Figure out the number of possible node ids.
@@ -7423,8 +7440,8 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 	init_unavailable_mem();
 	for_each_online_node(nid) {
 		pg_data_t *pgdat = NODE_DATA(nid);
-		free_area_init_node(nid, NULL,
-				find_min_pfn_for_node(nid), NULL);
+		__free_area_init_node(nid, NULL,
+				      find_min_pfn_for_node(nid), NULL, false);
 
 		/* Any memory on that node */
 		if (pgdat->node_present_pages)
@@ -7489,8 +7506,6 @@ static int __init cmdline_parse_movablecore(char *p)
 early_param("kernelcore", cmdline_parse_kernelcore);
 early_param("movablecore", cmdline_parse_movablecore);
 
-#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
-
 void adjust_managed_page_count(struct page *page, long count)
 {
 	atomic_long_add(count, &page_zone(page)->managed_pages);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP (was: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA)
  2020-03-31 21:56       ` [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP (was: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA) Mike Rapoport
@ 2020-04-01  5:42         ` Baoquan He
  2020-04-01  7:51           ` Mike Rapoport
  0 siblings, 1 reply; 36+ messages in thread
From: Baoquan He @ 2020-04-01  5:42 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Michal Hocko, Hoan Tran, Catalin Marinas, Will Deacon,
	Andrew Morton, Vlastimil Babka, Oscar Salvador, Pavel Tatashin,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On 04/01/20 at 12:56am, Mike Rapoport wrote:
> On Mon, Mar 30, 2020 at 11:58:43AM +0200, Michal Hocko wrote:
> > 
> > What would it take to make ia64 use HAVE_MEMBLOCK_NODE_MAP? I would
> > really love to see that thing go away. It is causing problems when
> > people try to use memblock api.
> 
> Well, it's a small patch in the end :)
> 
> Currently, all NUMA architectures enable
> CONFIG_HAVE_MEMBLOCK_NODE_MAP and use free_area_init_nodes() to
> initialize the node and zone structures.

I did some investigation; there are nine arches that have a NUMA config.
Among them, alpha doesn't have HAVE_MEMBLOCK_NODE_MAP support. The
interesting thing is that there are two arches, microblaze and riscv,
which have the HAVE_MEMBLOCK_NODE_MAP config but no NUMA config.
Obviously, adding the HAVE_MEMBLOCK_NODE_MAP config to riscv and
microblaze was not carefully considered.

arch/alpha/Kconfig:config NUMA
arch/arm64/Kconfig:config NUMA
arch/ia64/Kconfig:config NUMA
arch/mips/Kconfig:config NUMA
arch/powerpc/Kconfig:config NUMA
arch/s390/Kconfig:config NUMA
arch/sh/mm/Kconfig:config NUMA
arch/sparc/Kconfig:config NUMA
arch/x86/Kconfig:config NUMA

From the above information, we can remove HAVE_MEMBLOCK_NODE_MAP and
replace it with CONFIG_NUMA. It sounds more sensible to store the nid in
memblock only when NUMA support is enabled.
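
If we went that way, the region definition quoted below would become
something like this (sketch):

	struct memblock_region {
		phys_addr_t base;
		phys_addr_t size;
		enum memblock_flags flags;
	#ifdef CONFIG_NUMA
		int nid;
	#endif
	};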


> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 079d17d96410..9de81112447e 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -50,9 +50,7 @@ struct memblock_region {
>  	phys_addr_t base;
>  	phys_addr_t size;
>  	enum memblock_flags flags;
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  	int nid;
> -#endif

I didn't look into the other changes very carefully, but I feel that
enabling the memblock node map for all arches looks a little radical.
After all, many arches don't even have NUMA support.

>  };
>  
>  /**
> @@ -215,7 +213,6 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
>  	return m->flags & MEMBLOCK_NOMAP;
>  }
>  
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
>  			    unsigned long  *end_pfn);
>  void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
> @@ -234,7 +231,6 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
>  #define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid)		\
>  	for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid); \
>  	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
> -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>  
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> @@ -310,7 +306,6 @@ void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
>  	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved,	\
>  			       nid, flags, p_start, p_end, p_nid)
>  
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  int memblock_set_node(phys_addr_t base, phys_addr_t size,
>  		      struct memblock_type *type, int nid);
>  
> @@ -323,16 +318,6 @@ static inline int memblock_get_region_node(const struct memblock_region *r)
>  {
>  	return r->nid;
>  }
> -#else
> -static inline void memblock_set_region_node(struct memblock_region *r, int nid)
> -{
> -}
> -
> -static inline int memblock_get_region_node(const struct memblock_region *r)
> -{
> -	return 0;
> -}
> -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>  
>  /* Flags for memblock allocation APIs */
>  #define MEMBLOCK_ALLOC_ANYWHERE	(~(phys_addr_t)0)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c54fb96cb1e6..368a45d4696a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2125,9 +2125,8 @@ static inline unsigned long get_num_physpages(void)
>  	return phys_pages;
>  }
>  
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  /*
> - * With CONFIG_HAVE_MEMBLOCK_NODE_MAP set, an architecture may initialise its
> + * Using memblock node mappings, an architecture may initialise its
>   * zones, allocate the backing mem_map and account for memory holes in a more
>   * architecture independent manner. This is a substitute for creating the
>   * zone_sizes[] and zholes_size[] arrays and passing them to
> @@ -2148,9 +2147,6 @@ static inline unsigned long get_num_physpages(void)
>   * registered physical page range.  Similarly
>   * sparse_memory_present_with_active_regions() calls memory_present() for
>   * each range when SPARSEMEM is enabled.
> - *
> - * See mm/page_alloc.c for more information on each function exposed by
> - * CONFIG_HAVE_MEMBLOCK_NODE_MAP.
>   */
>  extern void free_area_init_nodes(unsigned long *max_zone_pfn);
>  unsigned long node_map_pfn_alignment(void);
> @@ -2165,22 +2161,12 @@ extern void free_bootmem_with_active_regions(int nid,
>  						unsigned long max_low_pfn);
>  extern void sparse_memory_present_with_active_regions(int nid);
>  
> -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>  
> -#if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
> -    !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
> -static inline int __early_pfn_to_nid(unsigned long pfn,
> -					struct mminit_pfnnid_cache *state)
> -{
> -	return 0;
> -}
> -#else
>  /* please see mm/page_alloc.c */
>  extern int __meminit early_pfn_to_nid(unsigned long pfn);
>  /* there is a per-arch backend function. */
>  extern int __meminit __early_pfn_to_nid(unsigned long pfn,
>  					struct mminit_pfnnid_cache *state);
> -#endif
>  
>  extern void set_dma_reserve(unsigned long new_dma_reserve);
>  extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 462f6873905a..4422d1961d0e 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -917,11 +917,7 @@ extern int movable_zone;
>  #ifdef CONFIG_HIGHMEM
>  static inline int zone_movable_is_highmem(void)
>  {
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  	return movable_zone == ZONE_HIGHMEM;
> -#else
> -	return (ZONE_MOVABLE - 1) == ZONE_HIGHMEM;
> -#endif
>  }
>  #endif
>  
> @@ -1121,15 +1117,6 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
>  #include <asm/sparsemem.h>
>  #endif
>  
> -#if !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && \
> -	!defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
> -static inline unsigned long early_pfn_to_nid(unsigned long pfn)
> -{
> -	BUILD_BUG_ON(IS_ENABLED(CONFIG_NUMA));
> -	return 0;
> -}
> -#endif
> -
>  #ifdef CONFIG_FLATMEM
>  #define pfn_to_nid(pfn)		(0)
>  #endif
> diff --git a/mm/memblock.c b/mm/memblock.c
> index eba94ee3de0b..819441133a21 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -620,9 +620,7 @@ static int __init_memblock memblock_add_range(struct memblock_type *type,
>  		 * area, insert that portion.
>  		 */
>  		if (rbase > base) {
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  			WARN_ON(nid != memblock_get_region_node(rgn));
> -#endif
>  			WARN_ON(flags != rgn->flags);
>  			nr_new++;
>  			if (insert)
> @@ -1197,7 +1195,6 @@ void __init_memblock __next_mem_range_rev(u64 *idx, int nid,
>  	*idx = ULLONG_MAX;
>  }
>  
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  /*
>   * Common iterator interface used to define for_each_mem_pfn_range().
>   */
> @@ -1258,7 +1255,7 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
>  	memblock_merge_regions(type);
>  	return 0;
>  }
> -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> +
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  /**
>   * __next_mem_pfn_range_in_zone - iterator for for_each_*_range_in_zone()
> @@ -1797,7 +1794,6 @@ bool __init_memblock memblock_is_map_memory(phys_addr_t addr)
>  	return !memblock_is_nomap(&memblock.memory.regions[i]);
>  }
>  
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
>  			 unsigned long *start_pfn, unsigned long *end_pfn)
>  {
> @@ -1812,7 +1808,6 @@ int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
>  
>  	return type->regions[mid].nid;
>  }
> -#endif
>  
>  /**
>   * memblock_is_region_memory - check if a region is a subset of memory
> @@ -1903,11 +1898,9 @@ static void __init_memblock memblock_dump(struct memblock_type *type)
>  		size = rgn->size;
>  		end = base + size - 1;
>  		flags = rgn->flags;
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  		if (memblock_get_region_node(rgn) != MAX_NUMNODES)
>  			snprintf(nid_buf, sizeof(nid_buf), " on node %d",
>  				 memblock_get_region_node(rgn));
> -#endif
>  		pr_info(" %s[%#x]\t[%pa-%pa], %pa bytes%s flags: %#x\n",
>  			type->name, idx, &base, &end, &size, nid_buf, flags);
>  	}
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 19389cdc16a5..dc8828b087bf 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1366,11 +1366,7 @@ check_pages_isolated_cb(unsigned long start_pfn, unsigned long nr_pages,
>  
>  static int __init cmdline_parse_movable_node(char *p)
>  {
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  	movable_node_enabled = true;
> -#else
> -	pr_warn("movable_node parameter depends on CONFIG_HAVE_MEMBLOCK_NODE_MAP to work properly\n");
> -#endif
>  	return 0;
>  }
>  early_param("movable_node", cmdline_parse_movable_node);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3c4eb750a199..84a69d6e7e61 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -335,7 +335,6 @@ static unsigned long nr_kernel_pages __initdata;
>  static unsigned long nr_all_pages __initdata;
>  static unsigned long dma_reserve __initdata;
>  
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  static unsigned long arch_zone_lowest_possible_pfn[MAX_NR_ZONES] __initdata;
>  static unsigned long arch_zone_highest_possible_pfn[MAX_NR_ZONES] __initdata;
>  static unsigned long required_kernelcore __initdata;
> @@ -348,7 +347,6 @@ static bool mirrored_kernelcore __meminitdata;
>  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
>  int movable_zone;
>  EXPORT_SYMBOL(movable_zone);
> -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>  
>  #if MAX_NUMNODES > 1
>  unsigned int nr_node_ids __read_mostly = MAX_NUMNODES;
> @@ -1447,9 +1445,6 @@ void __free_pages_core(struct page *page, unsigned int order)
>  	__free_pages(page, order);
>  }
>  
> -#if defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) || \
> -	defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
> -
>  static struct mminit_pfnnid_cache early_pfnnid_cache __meminitdata;
>  
>  int __meminit early_pfn_to_nid(unsigned long pfn)
> @@ -1465,7 +1460,6 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
>  
>  	return nid;
>  }
> -#endif
>  
>  #ifdef CONFIG_NODES_SPAN_OTHER_NODES
>  /* Only safe to use early in boot when initialisation is single-threaded */
> @@ -5828,7 +5822,6 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
>  static bool __meminit
>  overlap_memmap_init(unsigned long zone, unsigned long *pfn)
>  {
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  	static struct memblock_region *r;
>  
>  	if (mirrored_kernelcore && zone == ZONE_MOVABLE) {
> @@ -5844,7 +5837,7 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
>  			return true;
>  		}
>  	}
> -#endif
> +
>  	return false;
>  }
>  
> @@ -6227,7 +6220,6 @@ void __meminit init_currently_empty_zone(struct zone *zone,
>  	zone->initialized = 1;
>  }
>  
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
>  
>  /*
> @@ -6503,8 +6495,7 @@ static unsigned long __init zone_absent_pages_in_node(int nid,
>  	return nr_absent;
>  }
>  
> -#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> -static inline unsigned long __init zone_spanned_pages_in_node(int nid,
> +static inline unsigned long __init compat_zone_spanned_pages_in_node(int nid,
>  					unsigned long zone_type,
>  					unsigned long node_start_pfn,
>  					unsigned long node_end_pfn,
> @@ -6523,7 +6514,7 @@ static inline unsigned long __init zone_spanned_pages_in_node(int nid,
>  	return zones_size[zone_type];
>  }
>  
> -static inline unsigned long __init zone_absent_pages_in_node(int nid,
> +static inline unsigned long __init compat_zone_absent_pages_in_node(int nid,
>  						unsigned long zone_type,
>  						unsigned long node_start_pfn,
>  						unsigned long node_end_pfn,
> @@ -6535,13 +6526,12 @@ static inline unsigned long __init zone_absent_pages_in_node(int nid,
>  	return zholes_size[zone_type];
>  }
>  
> -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> -
>  static void __init calculate_node_totalpages(struct pglist_data *pgdat,
>  						unsigned long node_start_pfn,
>  						unsigned long node_end_pfn,
>  						unsigned long *zones_size,
> -						unsigned long *zholes_size)
> +						unsigned long *zholes_size,
> +						bool compat)
>  {
>  	unsigned long realtotalpages = 0, totalpages = 0;
>  	enum zone_type i;
> @@ -6549,17 +6539,38 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
>  	for (i = 0; i < MAX_NR_ZONES; i++) {
>  		struct zone *zone = pgdat->node_zones + i;
>  		unsigned long zone_start_pfn, zone_end_pfn;
> +		unsigned long spanned, absent;
>  		unsigned long size, real_size;
>  
> -		size = zone_spanned_pages_in_node(pgdat->node_id, i,
> -						  node_start_pfn,
> -						  node_end_pfn,
> -						  &zone_start_pfn,
> -						  &zone_end_pfn,
> -						  zones_size);
> -		real_size = size - zone_absent_pages_in_node(pgdat->node_id, i,
> -						  node_start_pfn, node_end_pfn,
> -						  zholes_size);
> +		if (compat) {
> +			spanned = compat_zone_spanned_pages_in_node(
> +						pgdat->node_id, i,
> +						node_start_pfn,
> +						node_end_pfn,
> +						&zone_start_pfn,
> +						&zone_end_pfn,
> +						zones_size);
> +			absent = compat_zone_absent_pages_in_node(
> +						pgdat->node_id, i,
> +						node_start_pfn,
> +						node_end_pfn,
> +						zholes_size);
> +		} else {
> +			spanned = zone_spanned_pages_in_node(pgdat->node_id, i,
> +						node_start_pfn,
> +						node_end_pfn,
> +						&zone_start_pfn,
> +						&zone_end_pfn,
> +						zones_size);
> +			absent = zone_absent_pages_in_node(pgdat->node_id, i,
> +						node_start_pfn,
> +						node_end_pfn,
> +						zholes_size);
> +		}
> +
> +		size = spanned;
> +		real_size = size - absent;
> +
>  		if (size)
>  			zone->zone_start_pfn = zone_start_pfn;
>  		else
> @@ -6859,10 +6870,8 @@ static void __ref alloc_node_mem_map(struct pglist_data *pgdat)
>  	 */
>  	if (pgdat == NODE_DATA(0)) {
>  		mem_map = NODE_DATA(0)->node_mem_map;
> -#if defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) || defined(CONFIG_FLATMEM)
>  		if (page_to_pfn(mem_map) != pgdat->node_start_pfn)
>  			mem_map -= offset;
> -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>  	}
>  #endif
>  }
> @@ -6879,9 +6888,10 @@ static inline void pgdat_set_deferred_range(pg_data_t *pgdat)
>  static inline void pgdat_set_deferred_range(pg_data_t *pgdat) {}
>  #endif
>  
> -void __init free_area_init_node(int nid, unsigned long *zones_size,
> -				   unsigned long node_start_pfn,
> -				   unsigned long *zholes_size)
> +void __init __free_area_init_node(int nid, unsigned long *zones_size,
> +				  unsigned long node_start_pfn,
> +				  unsigned long *zholes_size,
> +				  bool compat)
>  {
>  	pg_data_t *pgdat = NODE_DATA(nid);
>  	unsigned long start_pfn = 0;
> @@ -6893,16 +6903,17 @@ void __init free_area_init_node(int nid, unsigned long *zones_size,
>  	pgdat->node_id = nid;
>  	pgdat->node_start_pfn = node_start_pfn;
>  	pgdat->per_cpu_nodestats = NULL;
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> -	get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> -	pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
> -		(u64)start_pfn << PAGE_SHIFT,
> -		end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
> -#else
> -	start_pfn = node_start_pfn;
> -#endif
> +	if (!compat) {
> +		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> +		pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
> +			(u64)start_pfn << PAGE_SHIFT,
> +			end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
> +	} else {
> +		start_pfn = node_start_pfn;
> +	}
> +
>  	calculate_node_totalpages(pgdat, start_pfn, end_pfn,
> -				  zones_size, zholes_size);
> +				  zones_size, zholes_size, compat);
>  
>  	alloc_node_mem_map(pgdat);
>  	pgdat_set_deferred_range(pgdat);
> @@ -6910,6 +6921,14 @@ void __init free_area_init_node(int nid, unsigned long *zones_size,
>  	free_area_init_core(pgdat);
>  }
>  
> +void __init free_area_init_node(int nid, unsigned long *zones_size,
> +				unsigned long node_start_pfn,
> +				unsigned long *zholes_size)
> +{
> +	__free_area_init_node(nid, zones_size, node_start_pfn, zholes_size,
> +			      true);
> +}
> +
>  #if !defined(CONFIG_FLAT_NODE_MEM_MAP)
>  /*
>   * Initialize all valid struct pages in the range [spfn, epfn) and mark them
> @@ -6993,8 +7012,6 @@ static inline void __init init_unavailable_mem(void)
>  }
>  #endif /* !CONFIG_FLAT_NODE_MEM_MAP */
>  
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> -
>  #if MAX_NUMNODES > 1
>  /*
>   * Figure out the number of possible node ids.
> @@ -7423,8 +7440,8 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
>  	init_unavailable_mem();
>  	for_each_online_node(nid) {
>  		pg_data_t *pgdat = NODE_DATA(nid);
> -		free_area_init_node(nid, NULL,
> -				find_min_pfn_for_node(nid), NULL);
> +		__free_area_init_node(nid, NULL,
> +				      find_min_pfn_for_node(nid), NULL, false);
>  
>  		/* Any memory on that node */
>  		if (pgdat->node_present_pages)
> @@ -7489,8 +7506,6 @@ static int __init cmdline_parse_movablecore(char *p)
>  early_param("kernelcore", cmdline_parse_kernelcore);
>  early_param("movablecore", cmdline_parse_movablecore);
>  
> -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> -
>  void adjust_managed_page_count(struct page *page, long count)
>  {
>  	atomic_long_add(count, &page_zone(page)->managed_pages);
> -- 
> 2.25.1
> 
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP (was: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA)
  2020-04-01  5:42         ` Baoquan He
@ 2020-04-01  7:51           ` Mike Rapoport
  2020-04-02  8:01             ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Mike Rapoport @ 2020-04-01  7:51 UTC (permalink / raw)
  To: Baoquan He
  Cc: Michal Hocko, Hoan Tran, Catalin Marinas, Will Deacon,
	Andrew Morton, Vlastimil Babka, Oscar Salvador, Pavel Tatashin,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

Hi,

On Wed, Apr 01, 2020 at 01:42:27PM +0800, Baoquan He wrote:
> On 04/01/20 at 12:56am, Mike Rapoport wrote:
> > On Mon, Mar 30, 2020 at 11:58:43AM +0200, Michal Hocko wrote:
> > > 
> > > What would it take to make ia64 use HAVE_MEMBLOCK_NODE_MAP? I would
> > > really love to see that thing go away. It is causing problems when
> > > people try to use memblock api.
> > 
> > Well, it's a small patch in the end :)
> > 
> > Currently all NUMA architectures enable CONFIG_HAVE_MEMBLOCK_NODE_MAP
> > and use free_area_init_nodes() to initialize node and zone structures.
> 
> I did some investigation; there are nine ARCHes that have a NUMA config,
> and among them, alpha doesn't have HAVE_MEMBLOCK_NODE_MAP support.

alpha has had NUMA marked as BROKEN for a long time now. I doubt anybody is
going to fix that.

> The interesting thing is that there are two ARCHes which have the
> HAVE_MEMBLOCK_NODE_MAP config but don't add a NUMA config: they are
> microblaze and riscv. Obviously adding the HAVE_MEMBLOCK_NODE_MAP config
> to riscv and microblaze was not carefully considered.

Well, microblaze probably copied it and forgot to remove it, and riscv might
get actual NUMA at some point.

> arch/alpha/Kconfig:config NUMA
> arch/arm64/Kconfig:config NUMA
> arch/ia64/Kconfig:config NUMA
> arch/mips/Kconfig:config NUMA
> arch/powerpc/Kconfig:config NUMA
> arch/s390/Kconfig:config NUMA
> arch/sh/mm/Kconfig:config NUMA
> arch/sparc/Kconfig:config NUMA
> arch/x86/Kconfig:config NUMA
> 
> From the above information, we can remove HAVE_MEMBLOCK_NODE_MAP and
> replace it with CONFIG_NUMA. It sounds more sensible to store the nid into
> memblock only when NUMA support is enabled.
 
Replacing CONFIG_HAVE_MEMBLOCK_NODE_MAP with CONFIG_NUMA will work, but
this will not help clean up the whole node/zone initialization mess, and
we'll be stuck with two implementations.

The overhead of enabling HAVE_MEMBLOCK_NODE_MAP is only at init time, as
most architectures will discard the entire memblock anyway, so having it in
a UMA arch won't be a problem. The only exception is arm, which uses
memblock for pfn_valid(); here we may also think about a solution to
compensate for the addition of nid to the memblock structures.
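
For reference, arm's memblock-based pfn_valid() is essentially the below
(a simplified sketch from memory, not the exact code in arch/arm/mm/init.c):

	/* arm answers pfn_valid() with a memblock lookup (simplified) */
	int pfn_valid(unsigned long pfn)
	{
		return memblock_is_map_memory(__pfn_to_phys(pfn));
	}

so any growth of struct memblock_region is also paid on every pfn_valid()
lookup there, not only at boot.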

We also may want to make all the movable_node and movable_core
functionality dependent on CONFIG_NUMA.
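
Something along these lines, mirroring the hunk the patch below touches in
mm/memory_hotplug.c (an untested sketch of the idea, not a tested change):

	static int __init cmdline_parse_movable_node(char *p)
	{
	#ifdef CONFIG_NUMA
		movable_node_enabled = true;
	#else
		pr_warn("movable_node parameter only makes sense on NUMA kernels\n");
	#endif
		return 0;
	}
	early_param("movable_node", cmdline_parse_movable_node);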
 
> > diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> > index 079d17d96410..9de81112447e 100644
> > --- a/include/linux/memblock.h
> > +++ b/include/linux/memblock.h
> > @@ -50,9 +50,7 @@ struct memblock_region {
> >  	phys_addr_t base;
> >  	phys_addr_t size;
> >  	enum memblock_flags flags;
> > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> >  	int nid;
> > -#endif
> 
> I didn't look into the other changes very carefully, but I feel enabling
> the memblock node map for all ARCHes looks a little radical. After all,
> many ARCHes don't even have NUMA support.
> 
> >  };
> >  
> >  /**
> > @@ -215,7 +213,6 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
> >  	return m->flags & MEMBLOCK_NOMAP;
> >  }
> >  
> > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> >  int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
> >  			    unsigned long  *end_pfn);
> >  void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
> > @@ -234,7 +231,6 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
> >  #define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid)		\
> >  	for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid); \
> >  	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
> > -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> >  
> >  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> >  void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> > @@ -310,7 +306,6 @@ void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> >  	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved,	\
> >  			       nid, flags, p_start, p_end, p_nid)
> >  
> > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> >  int memblock_set_node(phys_addr_t base, phys_addr_t size,
> >  		      struct memblock_type *type, int nid);
> >  
> > @@ -323,16 +318,6 @@ static inline int memblock_get_region_node(const struct memblock_region *r)
> >  {
> >  	return r->nid;
> >  }
> > -#else
> > -static inline void memblock_set_region_node(struct memblock_region *r, int nid)
> > -{
> > -}
> > -
> > -static inline int memblock_get_region_node(const struct memblock_region *r)
> > -{
> > -	return 0;
> > -}
> > -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> >  
> >  /* Flags for memblock allocation APIs */
> >  #define MEMBLOCK_ALLOC_ANYWHERE	(~(phys_addr_t)0)
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index c54fb96cb1e6..368a45d4696a 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2125,9 +2125,8 @@ static inline unsigned long get_num_physpages(void)
> >  	return phys_pages;
> >  }
> >  
> > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> >  /*
> > - * With CONFIG_HAVE_MEMBLOCK_NODE_MAP set, an architecture may initialise its
> > + * Using memblock node mappings, an architecture may initialise its
> >   * zones, allocate the backing mem_map and account for memory holes in a more
> >   * architecture independent manner. This is a substitute for creating the
> >   * zone_sizes[] and zholes_size[] arrays and passing them to
> > @@ -2148,9 +2147,6 @@ static inline unsigned long get_num_physpages(void)
> >   * registered physical page range.  Similarly
> >   * sparse_memory_present_with_active_regions() calls memory_present() for
> >   * each range when SPARSEMEM is enabled.
> > - *
> > - * See mm/page_alloc.c for more information on each function exposed by
> > - * CONFIG_HAVE_MEMBLOCK_NODE_MAP.
> >   */
> >  extern void free_area_init_nodes(unsigned long *max_zone_pfn);
> >  unsigned long node_map_pfn_alignment(void);
> > @@ -2165,22 +2161,12 @@ extern void free_bootmem_with_active_regions(int nid,
> >  						unsigned long max_low_pfn);
> >  extern void sparse_memory_present_with_active_regions(int nid);
> >  
> > -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> >  
> > -#if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
> > -    !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
> > -static inline int __early_pfn_to_nid(unsigned long pfn,
> > -					struct mminit_pfnnid_cache *state)
> > -{
> > -	return 0;
> > -}
> > -#else
> >  /* please see mm/page_alloc.c */
> >  extern int __meminit early_pfn_to_nid(unsigned long pfn);
> >  /* there is a per-arch backend function. */
> >  extern int __meminit __early_pfn_to_nid(unsigned long pfn,
> >  					struct mminit_pfnnid_cache *state);
> > -#endif
> >  
> >  extern void set_dma_reserve(unsigned long new_dma_reserve);
> >  extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 462f6873905a..4422d1961d0e 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -917,11 +917,7 @@ extern int movable_zone;
> >  #ifdef CONFIG_HIGHMEM
> >  static inline int zone_movable_is_highmem(void)
> >  {
> > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> >  	return movable_zone == ZONE_HIGHMEM;
> > -#else
> > -	return (ZONE_MOVABLE - 1) == ZONE_HIGHMEM;
> > -#endif
> >  }
> >  #endif
> >  
> > @@ -1121,15 +1117,6 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
> >  #include <asm/sparsemem.h>
> >  #endif
> >  
> > -#if !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && \
> > -	!defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
> > -static inline unsigned long early_pfn_to_nid(unsigned long pfn)
> > -{
> > -	BUILD_BUG_ON(IS_ENABLED(CONFIG_NUMA));
> > -	return 0;
> > -}
> > -#endif
> > -
> >  #ifdef CONFIG_FLATMEM
> >  #define pfn_to_nid(pfn)		(0)
> >  #endif
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index eba94ee3de0b..819441133a21 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -620,9 +620,7 @@ static int __init_memblock memblock_add_range(struct memblock_type *type,
> >  		 * area, insert that portion.
> >  		 */
> >  		if (rbase > base) {
> > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> >  			WARN_ON(nid != memblock_get_region_node(rgn));
> > -#endif
> >  			WARN_ON(flags != rgn->flags);
> >  			nr_new++;
> >  			if (insert)
> > @@ -1197,7 +1195,6 @@ void __init_memblock __next_mem_range_rev(u64 *idx, int nid,
> >  	*idx = ULLONG_MAX;
> >  }
> >  
> > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> >  /*
> >   * Common iterator interface used to define for_each_mem_pfn_range().
> >   */
> > @@ -1258,7 +1255,7 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
> >  	memblock_merge_regions(type);
> >  	return 0;
> >  }
> > -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> > +
> >  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> >  /**
> >   * __next_mem_pfn_range_in_zone - iterator for for_each_*_range_in_zone()
> > @@ -1797,7 +1794,6 @@ bool __init_memblock memblock_is_map_memory(phys_addr_t addr)
> >  	return !memblock_is_nomap(&memblock.memory.regions[i]);
> >  }
> >  
> > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> >  int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
> >  			 unsigned long *start_pfn, unsigned long *end_pfn)
> >  {
> > @@ -1812,7 +1808,6 @@ int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
> >  
> >  	return type->regions[mid].nid;
> >  }
> > -#endif
> >  
> >  /**
> >   * memblock_is_region_memory - check if a region is a subset of memory
> > @@ -1903,11 +1898,9 @@ static void __init_memblock memblock_dump(struct memblock_type *type)
> >  		size = rgn->size;
> >  		end = base + size - 1;
> >  		flags = rgn->flags;
> > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> >  		if (memblock_get_region_node(rgn) != MAX_NUMNODES)
> >  			snprintf(nid_buf, sizeof(nid_buf), " on node %d",
> >  				 memblock_get_region_node(rgn));
> > -#endif
> >  		pr_info(" %s[%#x]\t[%pa-%pa], %pa bytes%s flags: %#x\n",
> >  			type->name, idx, &base, &end, &size, nid_buf, flags);
> >  	}
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 19389cdc16a5..dc8828b087bf 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1366,11 +1366,7 @@ check_pages_isolated_cb(unsigned long start_pfn, unsigned long nr_pages,
> >  
> >  static int __init cmdline_parse_movable_node(char *p)
> >  {
> > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> >  	movable_node_enabled = true;
> > -#else
> > -	pr_warn("movable_node parameter depends on CONFIG_HAVE_MEMBLOCK_NODE_MAP to work properly\n");
> > -#endif
> >  	return 0;
> >  }
> >  early_param("movable_node", cmdline_parse_movable_node);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3c4eb750a199..84a69d6e7e61 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -335,7 +335,6 @@ static unsigned long nr_kernel_pages __initdata;
> >  static unsigned long nr_all_pages __initdata;
> >  static unsigned long dma_reserve __initdata;
> >  
> > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> >  static unsigned long arch_zone_lowest_possible_pfn[MAX_NR_ZONES] __initdata;
> >  static unsigned long arch_zone_highest_possible_pfn[MAX_NR_ZONES] __initdata;
> >  static unsigned long required_kernelcore __initdata;
> > @@ -348,7 +347,6 @@ static bool mirrored_kernelcore __meminitdata;
> >  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
> >  int movable_zone;
> >  EXPORT_SYMBOL(movable_zone);
> > -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> >  
> >  #if MAX_NUMNODES > 1
> >  unsigned int nr_node_ids __read_mostly = MAX_NUMNODES;
> > @@ -1447,9 +1445,6 @@ void __free_pages_core(struct page *page, unsigned int order)
> >  	__free_pages(page, order);
> >  }
> >  
> > -#if defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) || \
> > -	defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
> > -
> >  static struct mminit_pfnnid_cache early_pfnnid_cache __meminitdata;
> >  
> >  int __meminit early_pfn_to_nid(unsigned long pfn)
> > @@ -1465,7 +1460,6 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
> >  
> >  	return nid;
> >  }
> > -#endif
> >  
> >  #ifdef CONFIG_NODES_SPAN_OTHER_NODES
> >  /* Only safe to use early in boot when initialisation is single-threaded */
> > @@ -5828,7 +5822,6 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
> >  static bool __meminit
> >  overlap_memmap_init(unsigned long zone, unsigned long *pfn)
> >  {
> > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> >  	static struct memblock_region *r;
> >  
> >  	if (mirrored_kernelcore && zone == ZONE_MOVABLE) {
> > @@ -5844,7 +5837,7 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
> >  			return true;
> >  		}
> >  	}
> > -#endif
> > +
> >  	return false;
> >  }
> >  
> > @@ -6227,7 +6220,6 @@ void __meminit init_currently_empty_zone(struct zone *zone,
> >  	zone->initialized = 1;
> >  }
> >  
> > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> >  #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
> >  
> >  /*
> > @@ -6503,8 +6495,7 @@ static unsigned long __init zone_absent_pages_in_node(int nid,
> >  	return nr_absent;
> >  }
> >  
> > -#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> > -static inline unsigned long __init zone_spanned_pages_in_node(int nid,
> > +static inline unsigned long __init compat_zone_spanned_pages_in_node(int nid,
> >  					unsigned long zone_type,
> >  					unsigned long node_start_pfn,
> >  					unsigned long node_end_pfn,
> > @@ -6523,7 +6514,7 @@ static inline unsigned long __init zone_spanned_pages_in_node(int nid,
> >  	return zones_size[zone_type];
> >  }
> >  
> > -static inline unsigned long __init zone_absent_pages_in_node(int nid,
> > +static inline unsigned long __init compat_zone_absent_pages_in_node(int nid,
> >  						unsigned long zone_type,
> >  						unsigned long node_start_pfn,
> >  						unsigned long node_end_pfn,
> > @@ -6535,13 +6526,12 @@ static inline unsigned long __init zone_absent_pages_in_node(int nid,
> >  	return zholes_size[zone_type];
> >  }
> >  
> > -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> > -
> >  static void __init calculate_node_totalpages(struct pglist_data *pgdat,
> >  						unsigned long node_start_pfn,
> >  						unsigned long node_end_pfn,
> >  						unsigned long *zones_size,
> > -						unsigned long *zholes_size)
> > +						unsigned long *zholes_size,
> > +						bool compat)
> >  {
> >  	unsigned long realtotalpages = 0, totalpages = 0;
> >  	enum zone_type i;
> > @@ -6549,17 +6539,38 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
> >  	for (i = 0; i < MAX_NR_ZONES; i++) {
> >  		struct zone *zone = pgdat->node_zones + i;
> >  		unsigned long zone_start_pfn, zone_end_pfn;
> > +		unsigned long spanned, absent;
> >  		unsigned long size, real_size;
> >  
> > -		size = zone_spanned_pages_in_node(pgdat->node_id, i,
> > -						  node_start_pfn,
> > -						  node_end_pfn,
> > -						  &zone_start_pfn,
> > -						  &zone_end_pfn,
> > -						  zones_size);
> > -		real_size = size - zone_absent_pages_in_node(pgdat->node_id, i,
> > -						  node_start_pfn, node_end_pfn,
> > -						  zholes_size);
> > +		if (compat) {
> > +			spanned = compat_zone_spanned_pages_in_node(
> > +						pgdat->node_id, i,
> > +						node_start_pfn,
> > +						node_end_pfn,
> > +						&zone_start_pfn,
> > +						&zone_end_pfn,
> > +						zones_size);
> > +			absent = compat_zone_absent_pages_in_node(
> > +						pgdat->node_id, i,
> > +						node_start_pfn,
> > +						node_end_pfn,
> > +						zholes_size);
> > +		} else {
> > +			spanned = zone_spanned_pages_in_node(pgdat->node_id, i,
> > +						node_start_pfn,
> > +						node_end_pfn,
> > +						&zone_start_pfn,
> > +						&zone_end_pfn,
> > +						zones_size);
> > +			absent = zone_absent_pages_in_node(pgdat->node_id, i,
> > +						node_start_pfn,
> > +						node_end_pfn,
> > +						zholes_size);
> > +		}
> > +
> > +		size = spanned;
> > +		real_size = size - absent;
> > +
> >  		if (size)
> >  			zone->zone_start_pfn = zone_start_pfn;
> >  		else
> > @@ -6859,10 +6870,8 @@ static void __ref alloc_node_mem_map(struct pglist_data *pgdat)
> >  	 */
> >  	if (pgdat == NODE_DATA(0)) {
> >  		mem_map = NODE_DATA(0)->node_mem_map;
> > -#if defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) || defined(CONFIG_FLATMEM)
> >  		if (page_to_pfn(mem_map) != pgdat->node_start_pfn)
> >  			mem_map -= offset;
> > -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> >  	}
> >  #endif
> >  }
> > @@ -6879,9 +6888,10 @@ static inline void pgdat_set_deferred_range(pg_data_t *pgdat)
> >  static inline void pgdat_set_deferred_range(pg_data_t *pgdat) {}
> >  #endif
> >  
> > -void __init free_area_init_node(int nid, unsigned long *zones_size,
> > -				   unsigned long node_start_pfn,
> > -				   unsigned long *zholes_size)
> > +void __init __free_area_init_node(int nid, unsigned long *zones_size,
> > +				  unsigned long node_start_pfn,
> > +				  unsigned long *zholes_size,
> > +				  bool compat)
> >  {
> >  	pg_data_t *pgdat = NODE_DATA(nid);
> >  	unsigned long start_pfn = 0;
> > @@ -6893,16 +6903,17 @@ void __init free_area_init_node(int nid, unsigned long *zones_size,
> >  	pgdat->node_id = nid;
> >  	pgdat->node_start_pfn = node_start_pfn;
> >  	pgdat->per_cpu_nodestats = NULL;
> > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> > -	get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> > -	pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
> > -		(u64)start_pfn << PAGE_SHIFT,
> > -		end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
> > -#else
> > -	start_pfn = node_start_pfn;
> > -#endif
> > +	if (!compat) {
> > +		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> > +		pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
> > +			(u64)start_pfn << PAGE_SHIFT,
> > +			end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
> > +	} else {
> > +		start_pfn = node_start_pfn;
> > +	}
> > +
> >  	calculate_node_totalpages(pgdat, start_pfn, end_pfn,
> > -				  zones_size, zholes_size);
> > +				  zones_size, zholes_size, compat);
> >  
> >  	alloc_node_mem_map(pgdat);
> >  	pgdat_set_deferred_range(pgdat);
> > @@ -6910,6 +6921,14 @@ void __init free_area_init_node(int nid, unsigned long *zones_size,
> >  	free_area_init_core(pgdat);
> >  }
> >  
> > +void __init free_area_init_node(int nid, unsigned long *zones_size,
> > +				unsigned long node_start_pfn,
> > +				unsigned long *zholes_size)
> > +{
> > +	__free_area_init_node(nid, zones_size, node_start_pfn, zholes_size,
> > +			      true);
> > +}
> > +
> >  #if !defined(CONFIG_FLAT_NODE_MEM_MAP)
> >  /*
> >   * Initialize all valid struct pages in the range [spfn, epfn) and mark them
> > @@ -6993,8 +7012,6 @@ static inline void __init init_unavailable_mem(void)
> >  }
> >  #endif /* !CONFIG_FLAT_NODE_MEM_MAP */
> >  
> > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> > -
> >  #if MAX_NUMNODES > 1
> >  /*
> >   * Figure out the number of possible node ids.
> > @@ -7423,8 +7440,8 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
> >  	init_unavailable_mem();
> >  	for_each_online_node(nid) {
> >  		pg_data_t *pgdat = NODE_DATA(nid);
> > -		free_area_init_node(nid, NULL,
> > -				find_min_pfn_for_node(nid), NULL);
> > +		__free_area_init_node(nid, NULL,
> > +				      find_min_pfn_for_node(nid), NULL, false);
> >  
> >  		/* Any memory on that node */
> >  		if (pgdat->node_present_pages)
> > @@ -7489,8 +7506,6 @@ static int __init cmdline_parse_movablecore(char *p)
> >  early_param("kernelcore", cmdline_parse_kernelcore);
> >  early_param("movablecore", cmdline_parse_movablecore);
> >  
> > -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> > -
> >  void adjust_managed_page_count(struct page *page, long count)
> >  {
> >  	atomic_long_add(count, &page_zone(page)->managed_pages);
> > -- 
> > 2.25.1
> > 
> > 
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP (was: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA)
  2020-04-01  7:51           ` Mike Rapoport
@ 2020-04-02  8:01             ` Michal Hocko
  2020-04-09 14:41               ` Baoquan He
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2020-04-02  8:01 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Baoquan He, Hoan Tran, Catalin Marinas, Will Deacon,
	Andrew Morton, Vlastimil Babka, Oscar Salvador, Pavel Tatashin,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On Wed 01-04-20 10:51:55, Mike Rapoport wrote:
> Hi,
> 
> On Wed, Apr 01, 2020 at 01:42:27PM +0800, Baoquan He wrote:
[...]
> > From the above information, we can remove HAVE_MEMBLOCK_NODE_MAP and
> > replace it with CONFIG_NUMA. It sounds more sensible to store the nid into
> > memblock only when NUMA support is enabled.
>  
> Replacing CONFIG_HAVE_MEMBLOCK_NODE_MAP with CONFIG_NUMA will work, but
> this will not help clean up the whole node/zone initialization mess, and
> we'll be stuck with two implementations.

Yeah, this is far from optimal.

> The overhead of enabling HAVE_MEMBLOCK_NODE_MAP is only at init time, as
> most architectures will discard the entire memblock anyway, so having it in
> a UMA arch won't be a problem. The only exception is arm, which uses
> memblock for pfn_valid(); here we may also think about a solution to
> compensate for the addition of nid to the memblock structures.

Well, we can make memblock_region->nid defined only for CONFIG_NUMA.
memblock_get_region_node would then unconditionally return 0 on UMA.
Essentially the same way we do NUMA for other MM code. I only see a few
direct usages of region->nid.
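
I.e. something along these lines (untested, just to illustrate the shape):

	struct memblock_region {
		phys_addr_t base;
		phys_addr_t size;
		enum memblock_flags flags;
	#ifdef CONFIG_NUMA
		int nid;
	#endif
	};

	#ifdef CONFIG_NUMA
	static inline void memblock_set_region_node(struct memblock_region *r,
						    int nid)
	{
		r->nid = nid;
	}

	static inline int memblock_get_region_node(const struct memblock_region *r)
	{
		return r->nid;
	}
	#else
	static inline void memblock_set_region_node(struct memblock_region *r,
						    int nid)
	{
	}

	static inline int memblock_get_region_node(const struct memblock_region *r)
	{
		return 0;
	}
	#endif

That would keep the UMA side completely free of nid while all the callers
stay unchanged.
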
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-31 14:31               ` Baoquan He
@ 2020-04-03  4:46                 ` Hoan Tran
  2020-04-03  7:09                   ` Baoquan He
  0 siblings, 1 reply; 36+ messages in thread
From: Hoan Tran @ 2020-04-03  4:46 UTC (permalink / raw)
  To: Baoquan He, Michal Hocko
  Cc: Mike Rapoport, Catalin Marinas, Will Deacon, Andrew Morton,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Alexander Duyck,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

Hi All,

On 3/31/20 7:31 AM, Baoquan He wrote:
> On 03/31/20 at 04:21pm, Michal Hocko wrote:
>> On Tue 31-03-20 22:03:32, Baoquan He wrote:
>>> Hi Michal,
>>>
>>> On 03/31/20 at 10:55am, Michal Hocko wrote:
>>>> On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
>>>>> Maybe I mis-read the code, but I don't see how this could happen. In the
>>>>> HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
>>>>> calculate_node_totalpages() that ensures that node->node_zones are entirely
>>>>> within the node because this is checked in zone_spanned_pages_in_node().
>>>>
>>>> zone_spanned_pages_in_node does check that the zone boundaries are within the
>>>> node boundaries. But that doesn't really tell anything about other
>>>> potential zones interleaving with the physical memory range.
>>>> zone->spanned_pages simply gives the physical range for the zone
>>>> including holes. Interleaving nodes are essentially a hole
>>>> (__absent_pages_in_range is going to skip those).
>>>>
>>>> That means that free_area_init_core simply goes over the whole
>>>> physical zone range, including holes, and that is why we need to check
>>>> both for physical and logical holes (aka other nodes).
>>>>
>>>> The life would be so much easier if the whole thing would simply iterate
>>>> over memblocks...
>>>
>>> The memblock iterating sounds like a great idea. I tried putting the
>>> memblock iterating in the upper layer, memmap_init(), which is used for
>>> boot mem only anyway. Do you think it's doable and OK? If yes, I can
>>> work out a formal patch to make this simpler as you said. The draft code
>>> is as below. Like this it uses the existing code and involves little change.
>>
>> Doing this would be a step in the right direction! I haven't checked the
>> code very closely though. The below sounds way too simple to be true, I
>> am afraid. First for_each_mem_pfn_range is available only for
>> CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep
>> saying that I really hate that being conditional). Also I haven't really
>> checked the deferred initialization path - I have a very vague
>> recollection that it has been converted to the memblock api but I have
>> happily dropped all that memory.
> 
> Thanks for your quick response and for pointing out the remaining
> suspect aspects; I will investigate what you mentioned and see if they
> have any impact.

I would like to check whether we still move on with my patch to remove
CONFIG_NODES_SPAN_OTHER_NODES and have another patch on top of it?

Thanks
Hoan

> 
>>   
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 138a56c0f48f..558d421f294b 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -6007,14 +6007,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>>   		 * function.  They do not exist on hotplugged memory.
>>>   		 */
>>>   		if (context == MEMMAP_EARLY) {
>>> -			if (!early_pfn_valid(pfn)) {
>>> -				pfn = next_pfn(pfn);
>>> -				continue;
>>> -			}
>>> -			if (!early_pfn_in_nid(pfn, nid)) {
>>> -				pfn++;
>>> -				continue;
>>> -			}
>>>   			if (overlap_memmap_init(zone, &pfn))
>>>   				continue;
>>>   			if (defer_init(nid, pfn, end_pfn))
>>> @@ -6130,9 +6122,17 @@ static void __meminit zone_init_free_lists(struct zone *zone)
>>>   }
>>>   
>>>   void __meminit __weak memmap_init(unsigned long size, int nid,
>>> -				  unsigned long zone, unsigned long start_pfn)
>>> +				  unsigned long zone, unsigned long range_start_pfn)
>>>   {
>>> -	memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
>>> +	unsigned long start_pfn, end_pfn;
>>> +	unsigned long range_end_pfn = range_start_pfn + size;
>>> +	int i;
>>> +	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>>> +		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
>>> +		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
>>> +		if (end_pfn > start_pfn)
>>> +			memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
>>> +	}
>>>   }
>>>   
>>>   static int zone_batchsize(struct zone *zone)
>>
>> -- 
>> Michal Hocko
>> SUSE Labs
>>
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-04-03  4:46                 ` Hoan Tran
@ 2020-04-03  7:09                   ` Baoquan He
  2020-04-03 16:36                     ` Hoan Tran
  0 siblings, 1 reply; 36+ messages in thread
From: Baoquan He @ 2020-04-03  7:09 UTC (permalink / raw)
  To: Hoan Tran
  Cc: Michal Hocko, Mike Rapoport, Catalin Marinas, Will Deacon,
	Andrew Morton, Vlastimil Babka, Oscar Salvador, Pavel Tatashin,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On 04/02/20 at 09:46pm, Hoan Tran wrote:
> Hi All,
> 
> On 3/31/20 7:31 AM, Baoquan He wrote:
> > On 03/31/20 at 04:21pm, Michal Hocko wrote:
> > > On Tue 31-03-20 22:03:32, Baoquan He wrote:
> > > > Hi Michal,
> > > > 
> > > > On 03/31/20 at 10:55am, Michal Hocko wrote:
> > > > > On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
> > > > > > Maybe I mis-read the code, but I don't see how this could happen. In the
> > > > > > HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
> > > > > > calculate_node_totalpages() that ensures that node->node_zones are entirely
> > > > > > within the node because this is checked in zone_spanned_pages_in_node().
> > > > > 
> > > > > zone_spanned_pages_in_node does check that the zone boundaries are within the
> > > > > node boundaries. But that doesn't really tell anything about other
> > > > > potential zones interleaving with the physical memory range.
> > > > > zone->spanned_pages simply gives the physical range for the zone
> > > > > including holes. Interleaving nodes are essentially a hole
> > > > > (__absent_pages_in_range is going to skip those).
> > > > > 
> > > > > That means that free_area_init_core simply goes over the whole
> > > > > physical zone range, including holes, and that is why we need to check
> > > > > both for physical and logical holes (aka other nodes).
> > > > > 
> > > > > The life would be so much easier if the whole thing would simply iterate
> > > > > over memblocks...
> > > > 
> > > > The memblock iterating sounds like a great idea. I tried putting the
> > > > memblock iterating in the upper layer, memmap_init(), which is used for
> > > > boot mem only anyway. Do you think it's doable and OK? If yes, I can
> > > > work out a formal patch to make this simpler as you said. The draft code
> > > > is as below. Like this it uses the existing code and involves little change.
> > > 
> > > Doing this would be a step in the right direction! I haven't checked the
> > > code very closely though. The below sounds way too simple to be true, I
> > > am afraid. First for_each_mem_pfn_range is available only for
> > > CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep
> > > saying that I really hate that being conditional). Also I haven't really
> > > checked the deferred initialization path - I have a very vague
> > > recollection that it has been converted to the memblock api but I have
> > > happily dropped all that memory.
> > 
> > Thanks for your quick response and for pointing out the remaining
> > suspect aspects; I will investigate what you mentioned and see if they
> > have any impact.
> 
> I would like to check whether we still move on with my patch to remove
> CONFIG_NODES_SPAN_OTHER_NODES and have another patch on top of it?

I think we would like to replace CONFIG_NODES_SPAN_OTHER_NODES with
CONFIG_NUMA and just let UMA return 0 as the node id, as Michal replied in
another mail. Anyway, your patches 2~5 are still needed on top of this new
plan.
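
To illustrate the idea, the CONFIG_NODES_SPAN_OTHER_NODES helper in
mm/page_alloc.c would collapse to something like this (just a sketch, not
a tested patch):

	/* Only safe to use early in boot when initialisation is single-threaded */
	static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
	{
	#ifdef CONFIG_NUMA
		return early_pfn_to_nid(pfn) == node;
	#else
		return true;	/* on UMA every pfn belongs to node 0 */
	#endif
	}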


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-04-03  7:09                   ` Baoquan He
@ 2020-04-03 16:36                     ` Hoan Tran
  0 siblings, 0 replies; 36+ messages in thread
From: Hoan Tran @ 2020-04-03 16:36 UTC (permalink / raw)
  To: Baoquan He
  Cc: Michal Hocko, Mike Rapoport, Catalin Marinas, Will Deacon,
	Andrew Morton, Vlastimil Babka, Oscar Salvador, Pavel Tatashin,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

Hi,


On 4/3/20 12:09 AM, Baoquan He wrote:
> On 04/02/20 at 09:46pm, Hoan Tran wrote:
>> Hi All,
>>
>> On 3/31/20 7:31 AM, Baoquan He wrote:
>>> On 03/31/20 at 04:21pm, Michal Hocko wrote:
>>>> On Tue 31-03-20 22:03:32, Baoquan He wrote:
>>>>> Hi Michal,
>>>>>
>>>>> On 03/31/20 at 10:55am, Michal Hocko wrote:
>>>>>> On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
>>>>>>> Maybe I mis-read the code, but I don't see how this could happen. In the
>>>>>>> HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
>>>>>>> calculate_node_totalpages() that ensures that node->node_zones are entirely
>>>>>>> within the node because this is checked in zone_spanned_pages_in_node().
>>>>>>
>>>>>> zone_spanned_pages_in_node does check that the zone boundaries are within the
>>>>>> node boundaries. But that doesn't really tell anything about other
>>>>>> potential zones interleaving with the physical memory range.
>>>>>> zone->spanned_pages simply gives the physical range for the zone
>>>>>> including holes. Interleaving nodes are essentially a hole
>>>>>> (__absent_pages_in_range is going to skip those).
>>>>>>
>>>>>> That means that free_area_init_core simply goes over the whole
>>>>>> physical zone range, including holes, and that is why we need to check
>>>>>> both for physical and logical holes (aka other nodes).
>>>>>>
>>>>>> The life would be so much easier if the whole thing would simply iterate
>>>>>> over memblocks...
>>>>>
>>>>> The memblock iterating sounds like a great idea. I tried putting the
>>>>> memblock iterating in the upper layer, memmap_init(), which is used for
>>>>> boot mem only anyway. Do you think it's doable and OK? If yes, I can
>>>>> work out a formal patch to make this simpler as you said. The draft code
>>>>> is as below. Like this it uses the existing code and involves little change.
>>>>
>>>> Doing this would be a step in the right direction! I haven't checked the
>>>> code very closely though. The below sounds way too simple to be true, I
>>>> am afraid. First for_each_mem_pfn_range is available only for
>>>> CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep
>>>> saying that I really hate that being conditional). Also I haven't really
>>>> checked the deferred initialization path - I have a very vague
>>>> recollection that it has been converted to the memblock api but I have
>>>> happily dropped all that memory.
>>>
>>> Thanks for your quick response and for pointing out the remaining
>>> suspect aspects; I will investigate what you mentioned and see if they
>>> have any impact.
>>
>> I would like to check whether we still move on with my patch to remove
>> CONFIG_NODES_SPAN_OTHER_NODES and have another patch on top of it?
> 
> I think we would like to replace CONFIG_NODES_SPAN_OTHER_NODES with
> CONFIG_NUMA and just let UMA return 0 as the node id, as Michal replied in
> another mail. Anyway, your patches 2~5 are still needed on top of this new
> plan.

Got it. Thanks for the quick response.

Regards
Hoan
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP (was: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA)
  2020-04-02  8:01             ` Michal Hocko
@ 2020-04-09 14:41               ` Baoquan He
  2020-04-09 15:33                 ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Baoquan He @ 2020-04-09 14:41 UTC (permalink / raw)
  To: Mike Rapoport, Michal Hocko
  Cc: Hoan Tran, Catalin Marinas, Will Deacon, Andrew Morton,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Alexander Duyck,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On 04/02/20 at 10:01am, Michal Hocko wrote:
> On Wed 01-04-20 10:51:55, Mike Rapoport wrote:
> > Hi,
> > 
> > On Wed, Apr 01, 2020 at 01:42:27PM +0800, Baoquan He wrote:
> [...]
> > > From the above information, we can remove HAVE_MEMBLOCK_NODE_MAP and
> > > replace it with CONFIG_NUMA. It sounds more sensible to store the nid into
> > > memblock only when NUMA support is enabled.
> >  
> > Replacing CONFIG_HAVE_MEMBLOCK_NODE_MAP with CONFIG_NUMA will work, but
> > this will not help clean up the whole node/zone initialization mess, and
> > we'll be stuck with two implementations.
> 
> Yeah, this is far from optimal.
> 
> > The overhead of enabling HAVE_MEMBLOCK_NODE_MAP is only at init time, as
> > most architectures will discard the entire memblock anyway, so having it in
> > a UMA arch won't be a problem. The only exception is arm, which uses
> > memblock for pfn_valid(); here we may also think about a solution to
> > compensate for the addition of nid to the memblock structures.
> 
> Well, we can make memblock_region->nid defined only for CONFIG_NUMA.
> memblock_get_region_node would then unconditionally return 0 on UMA.
> Essentially the same way we do NUMA for other MM code. I only see a few
> direct usages of region->nid.

I checked the code again; it seems HAVE_MEMBLOCK_NODE_MAP is selected
directly by all ARCHes which support it. That means HAVE_MEMBLOCK_NODE_MAP
is enabled by default on those ARCHes and has no dependency on CONFIG_NUMA
at all. E.g. on x86, the generic code path just calls free_area_init_nodes(),
while free_area_init_nodes() is defined inside the
CONFIG_HAVE_MEMBLOCK_NODE_MAP ifdeffery. So I tend to agree with Mike that
we should remove HAVE_MEMBLOCK_NODE_MAP first on all ARCHes. We can check
whether it's worth defining memblock_region->nid only for the CONFIG_NUMA
case after HAVE_MEMBLOCK_NODE_MAP is removed.

config X86
        def_bool y
	...
	select HAVE_MEMBLOCK_NODE_MAP
	...


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP (was: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA)
  2020-04-09 14:41               ` Baoquan He
@ 2020-04-09 15:33                 ` Michal Hocko
  2020-04-10  6:46                   ` Baoquan He
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2020-04-09 15:33 UTC (permalink / raw)
  To: Baoquan He
  Cc: Mike Rapoport, Hoan Tran, Catalin Marinas, Will Deacon,
	Andrew Morton, Vlastimil Babka, Oscar Salvador, Pavel Tatashin,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On Thu 09-04-20 22:41:19, Baoquan He wrote:
> On 04/02/20 at 10:01am, Michal Hocko wrote:
> > On Wed 01-04-20 10:51:55, Mike Rapoport wrote:
> > > Hi,
> > > 
> > > On Wed, Apr 01, 2020 at 01:42:27PM +0800, Baoquan He wrote:
> > [...]
> > > > From the above information, we can remove HAVE_MEMBLOCK_NODE_MAP and
> > > > replace it with CONFIG_NUMA. It sounds more sensible to store the nid into
> > > > memblock only when NUMA support is enabled.
> > >  
> > > Replacing CONFIG_HAVE_MEMBLOCK_NODE_MAP with CONFIG_NUMA will work, but
> > > this will not help clean up the whole node/zone initialization mess, and
> > > we'll be stuck with two implementations.
> > 
> > Yeah, this is far from optimal.
> > 
> > > The overhead of enabling HAVE_MEMBLOCK_NODE_MAP is only at init time, as
> > > most architectures will discard the entire memblock anyway, so having it in
> > > a UMA arch won't be a problem. The only exception is arm, which uses
> > > memblock for pfn_valid(); here we may also think about a solution to
> > > compensate for the addition of nid to the memblock structures.
> > 
> > Well, we can make memblock_region->nid defined only for CONFIG_NUMA.
> > memblock_get_region_node would then unconditionally return 0 on UMA.
> > Essentially the same way we do NUMA for other MM code. I only see a few
> > direct usages of region->nid.
> 
> I checked the code again; it seems HAVE_MEMBLOCK_NODE_MAP is selected
> directly by all ARCHes which support it. That means HAVE_MEMBLOCK_NODE_MAP
> is enabled by default on those ARCHes and has no dependency on CONFIG_NUMA
> at all. E.g. on x86, the generic code path just calls free_area_init_nodes(),
> while free_area_init_nodes() is defined inside the
> CONFIG_HAVE_MEMBLOCK_NODE_MAP ifdeffery. So I tend to agree with Mike that
> we should remove HAVE_MEMBLOCK_NODE_MAP first on all ARCHes. We can check
> whether it's worth defining memblock_region->nid only for the CONFIG_NUMA
> case after HAVE_MEMBLOCK_NODE_MAP is removed.

This can surely go in separate patches. What I meant to say is that
region->nid is by definition 0 on !CONFIG_NUMA.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-03-31 14:21             ` Michal Hocko
  2020-03-31 14:31               ` Baoquan He
@ 2020-04-09 16:27               ` Mike Rapoport
  2020-04-10  6:50                 ` Baoquan He
  1 sibling, 1 reply; 36+ messages in thread
From: Mike Rapoport @ 2020-04-09 16:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Baoquan He, Hoan Tran, Catalin Marinas, Will Deacon,
	Andrew Morton, Vlastimil Babka, Oscar Salvador, Pavel Tatashin,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On Tue, Mar 31, 2020 at 04:21:38PM +0200, Michal Hocko wrote:
> On Tue 31-03-20 22:03:32, Baoquan He wrote:
> > Hi Michal,
> > 
> > On 03/31/20 at 10:55am, Michal Hocko wrote:
> > > On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
> > > > Maybe I mis-read the code, but I don't see how this could happen. In the
> > > > HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
> > > > calculate_node_totalpages() that ensures that node->node_zones are entirely
> > > > within the node because this is checked in zone_spanned_pages_in_node().
> > > 
> > > zone_spanned_pages_in_node does check that the zone boundaries are within the
> > > node boundaries. But that doesn't really tell anything about other
> > > potential zones interleaving with the physical memory range.
> > > zone->spanned_pages simply gives the physical range for the zone
> > > including holes. Interleaving nodes are essentially a hole
> > > (__absent_pages_in_range is going to skip those).
> > > 
> > > That means that free_area_init_core simply goes over the whole
> > > physical zone range, including holes, and that is why we need to check
> > > both for physical and logical holes (aka other nodes).
> > > 
> > > The life would be so much easier if the whole thing would simply iterate
> > > over memblocks...
> > 
> > The memblock iterating sounds like a great idea. I tried putting the
> > memblock iterating in the upper layer, memmap_init(), which is used for
> > boot mem only anyway. Do you think it's doable and OK? If yes, I can
> > work out a formal patch to make this simpler as you said. The draft code
> > is as below. Like this it uses the existing code and involves little change.
> 
> Doing this would be a step in the right direction! I haven't checked the
> code very closely though. The below sounds way too simple to be true, I
> am afraid. First for_each_mem_pfn_range is available only for
> CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep
> saying that I really hate that being conditional). Also I haven't really
> checked the deferred initialization path - I have a very vague
> recollection that it has been converted to the memblock api but I have
> happily dropped all that memory.

Baoquan's patch almost did it, at least for the simple case of qemu with 2
nodes. It's only missing the adjustment to the size passed to
memmap_init_zone(), as it may change because of clamping.
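
Concretely, the inner call in that memmap_init() loop needs to pass the
clamped range size rather than the zone-wide one; roughly (untested):

	void __meminit __weak memmap_init(unsigned long size, int nid,
					  unsigned long zone,
					  unsigned long range_start_pfn)
	{
		unsigned long start_pfn, end_pfn;
		unsigned long range_end_pfn = range_start_pfn + size;
		int i;

		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
			/* clamp the memblock range to the zone range */
			start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
			end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
			/* pass the clamped size, not the original 'size' */
			if (end_pfn > start_pfn)
				memmap_init_zone(end_pfn - start_pfn, nid, zone,
						 start_pfn, MEMMAP_EARLY, NULL);
		}
	}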

I've drafted something that removes HAVE_MEMBLOCK_NODE_MAP and added this
patch there [1]. For the several memory configurations I could emulate with
qemu, it worked.
I'm going to wait a bit to see if kbuild is happy and then I'll send the
patches.

Baoquan, I took the liberty of adding your SoB; I hope you don't mind.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=memblock/all-have-node-map 
  
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 138a56c0f48f..558d421f294b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6007,14 +6007,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> >  		 * function.  They do not exist on hotplugged memory.
> >  		 */
> >  		if (context == MEMMAP_EARLY) {
> > -			if (!early_pfn_valid(pfn)) {
> > -				pfn = next_pfn(pfn);
> > -				continue;
> > -			}
> > -			if (!early_pfn_in_nid(pfn, nid)) {
> > -				pfn++;
> > -				continue;
> > -			}
> >  			if (overlap_memmap_init(zone, &pfn))
> >  				continue;
> >  			if (defer_init(nid, pfn, end_pfn))
> > @@ -6130,9 +6122,17 @@ static void __meminit zone_init_free_lists(struct zone *zone)
> >  }
> >  
> >  void __meminit __weak memmap_init(unsigned long size, int nid,
> > -				  unsigned long zone, unsigned long start_pfn)
> > +				  unsigned long zone, unsigned long range_start_pfn)
> >  {
> > -	memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
> > +	unsigned long start_pfn, end_pfn;
> > +	unsigned long range_end_pfn = range_start_pfn + size;
> > +	int i;
> > +	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > +		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> > +		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> > +		if (end_pfn > start_pfn)
> > +			memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
> > +	}
> >  }
> >  
> >  static int zone_batchsize(struct zone *zone)
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP (was: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA)
  2020-04-09 15:33                 ` Michal Hocko
@ 2020-04-10  6:46                   ` Baoquan He
  0 siblings, 0 replies; 36+ messages in thread
From: Baoquan He @ 2020-04-10  6:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Hoan Tran, Catalin Marinas, Will Deacon,
	Andrew Morton, Vlastimil Babka, Oscar Salvador, Pavel Tatashin,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On 04/09/20 at 05:33pm, Michal Hocko wrote:
> On Thu 09-04-20 22:41:19, Baoquan He wrote:
> > On 04/02/20 at 10:01am, Michal Hocko wrote:
> > > On Wed 01-04-20 10:51:55, Mike Rapoport wrote:
> > > > Hi,
> > > > 
> > > > On Wed, Apr 01, 2020 at 01:42:27PM +0800, Baoquan He wrote:
> > > [...]
> > > > > From the above information, we can remove HAVE_MEMBLOCK_NODE_MAP and
> > > > > replace it with CONFIG_NUMA. It sounds more sensible to store the nid into
> > > > > memblock only when NUMA support is enabled.
> > > >  
> > > > Replacing CONFIG_HAVE_MEMBLOCK_NODE_MAP with CONFIG_NUMA will work, but
> > > > this will not help clean up the whole node/zone initialization mess, and
> > > > we'll be stuck with two implementations.
> > > 
> > > Yeah, this is far from optimal.
> > > 
> > > > The overhead of enabling HAVE_MEMBLOCK_NODE_MAP is only at init time, as
> > > > most architectures will discard the entire memblock anyway, so having it in
> > > > a UMA arch won't be a problem. The only exception is arm, which uses
> > > > memblock for pfn_valid(); here we may also think about a solution to
> > > > compensate for the addition of nid to the memblock structures.
> > > 
> > > Well, we can make memblock_region->nid defined only for CONFIG_NUMA.
> > > memblock_get_region_node would then unconditionally return 0 on UMA.
> > > Essentially the same way we do NUMA for other MM code. I only see a few
> > > direct usages of region->nid.
> > 
> > I checked the code again; it seems HAVE_MEMBLOCK_NODE_MAP is selected
> > directly by all ARCHes which support it. That means HAVE_MEMBLOCK_NODE_MAP
> > is enabled by default on those ARCHes and has no dependency on CONFIG_NUMA
> > at all. E.g. on x86, the generic code path just calls free_area_init_nodes(),
> > while free_area_init_nodes() is defined inside the
> > CONFIG_HAVE_MEMBLOCK_NODE_MAP ifdeffery. So I tend to agree with Mike that
> > we should remove HAVE_MEMBLOCK_NODE_MAP first on all ARCHes. We can check
> > whether it's worth defining memblock_region->nid only for the CONFIG_NUMA
> > case after HAVE_MEMBLOCK_NODE_MAP is removed.
> 
> This can surely go in separate patches. What I meant to say is that
> region->nid is by definition 0 on !CONFIG_NUMA.

I see, thanks.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
  2020-04-09 16:27               ` Mike Rapoport
@ 2020-04-10  6:50                 ` Baoquan He
  0 siblings, 0 replies; 36+ messages in thread
From: Baoquan He @ 2020-04-10  6:50 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Michal Hocko, Hoan Tran, Catalin Marinas, Will Deacon,
	Andrew Morton, Vlastimil Babka, Oscar Salvador, Pavel Tatashin,
	Alexander Duyck, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, David S. Miller, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, open list:MEMORY MANAGEMENT,
	linux-arm-kernel, linux-s390, sparclinux, x86, linuxppc-dev,
	linux-kernel, lho, mmorana

On 04/09/20 at 07:27pm, Mike Rapoport wrote:
> On Tue, Mar 31, 2020 at 04:21:38PM +0200, Michal Hocko wrote:
> > On Tue 31-03-20 22:03:32, Baoquan He wrote:
> > > Hi Michal,
> > > 
> > > On 03/31/20 at 10:55am, Michal Hocko wrote:
> > > > On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
> > > > > Maybe I mis-read the code, but I don't see how this could happen. In the
> > > > > HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
> > > > > calculate_node_totalpages() that ensures that node->node_zones are entirely
> > > > > within the node because this is checked in zone_spanned_pages_in_node().
> > > > 
> > > > zone_spanned_pages_in_node does check that the zone boundaries are within the
> > > > node boundaries. But that doesn't really tell anything about other
> > > > potential zones interleaving with the physical memory range.
> > > > zone->spanned_pages simply gives the physical range for the zone
> > > > including holes. Interleaving nodes are essentially a hole
> > > > (__absent_pages_in_range is going to skip those).
> > > > 
> > > > That means that free_area_init_core simply goes over the whole
> > > > physical zone range, including holes, and that is why we need to check
> > > > both for physical and logical holes (aka other nodes).
> > > > 
> > > > The life would be so much easier if the whole thing would simply iterate
> > > > over memblocks...
> > > 
> > > The memblock iterating sounds like a great idea. I tried putting the
> > > memblock iterating in the upper layer, memmap_init(), which is used for
> > > boot mem only anyway. Do you think it's doable and OK? If yes, I can
> > > work out a formal patch to make this simpler as you said. The draft code
> > > is as below. Like this it uses the existing code and involves little change.
> > 
> > Doing this would be a step in the right direction! I haven't checked the
> > code very closely though. The below sounds way too simple to be true, I am
> > afraid. First, for_each_mem_pfn_range is available only for
> > CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep
> > saying that I really hate that being conditional). Also, I haven't really
> > checked the deferred initialization path - I have a very vague
> > recollection that it has been converted to the memblock API, but I have
> > happily dropped all that memory.
> 
> Baoquan's patch almost did it, at least for the simple case of qemu with 2
> nodes. It's only missing the adjustment to the size passed to
> memmap_init_zone(), as it may change because of the clamping.

Right, the size needs to be adjusted after the start and end clamping.
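
For reference, here is a sketch of the loop with that adjustment folded in,
i.e. passing the clamped range's size instead of the zone-wide one. The names
and flow follow the draft quoted below; the end_pfn - start_pfn argument is
the assumed fix, not code taken from this thread:

void __meminit __weak memmap_init(unsigned long size, int nid,
				  unsigned long zone, unsigned long range_start_pfn)
{
	unsigned long start_pfn, end_pfn;
	unsigned long range_end_pfn = range_start_pfn + size;
	int i;

	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
		/* Clamp each memblock range to the zone being initialized. */
		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);

		/* Pass the clamped size, not the zone-wide one. */
		if (end_pfn > start_pfn)
			memmap_init_zone(end_pfn - start_pfn, nid, zone,
					 start_pfn, MEMMAP_EARLY, NULL);
	}
}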

> 
> I've drafted something that removes HAVE_MEMBLOCK_NODE_MAP and added this
> patch there [1]. It worked for the several memory configurations I could
> emulate with qemu.
> I'm going to wait a bit to see if kbuild is happy, and then I'll send the
> patches.
> 
> Baoquan, I took the liberty of adding your SoB; I hope you don't mind.
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=memblock/all-have-node-map 

Of course not. Thanks for doing this; I look forward to seeing your formal
patchset when it's ready.
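
As background for the hunk quoted below that drops the early_pfn checks:
roughly, early_pfn_in_nid() guarded against "logical holes", i.e. pfns inside
the zone's span that belong to another node. A simplified sketch of its check
(illustrative only; the real implementation goes through a cached lookup):

static bool early_pfn_in_nid(unsigned long pfn, int node)
{
	int nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);

	/* Unknown ownership (nid < 0) is treated as a match. */
	return nid < 0 || nid == node;
}

Once memmap_init() iterates over the node's memblock ranges directly, every
pfn handed to memmap_init_zone() is already known to belong to the node, so
this per-pfn check becomes unnecessary.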

>   
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 138a56c0f48f..558d421f294b 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -6007,14 +6007,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> > >  		 * function.  They do not exist on hotplugged memory.
> > >  		 */
> > >  		if (context == MEMMAP_EARLY) {
> > > -			if (!early_pfn_valid(pfn)) {
> > > -				pfn = next_pfn(pfn);
> > > -				continue;
> > > -			}
> > > -			if (!early_pfn_in_nid(pfn, nid)) {
> > > -				pfn++;
> > > -				continue;
> > > -			}
> > >  			if (overlap_memmap_init(zone, &pfn))
> > >  				continue;
> > >  			if (defer_init(nid, pfn, end_pfn))
> > > @@ -6130,9 +6122,17 @@ static void __meminit zone_init_free_lists(struct zone *zone)
> > >  }
> > >  
> > >  void __meminit __weak memmap_init(unsigned long size, int nid,
> > > -				  unsigned long zone, unsigned long start_pfn)
> > > +				  unsigned long zone, unsigned long range_start_pfn)
> > >  {
> > > -	memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
> > > +	unsigned long start_pfn, end_pfn;
> > > +	unsigned long range_end_pfn = range_start_pfn + size;
> > > +	int i;
> > > +	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > > +		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> > > +		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> > > +		if (end_pfn > start_pfn)
> > > +			memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
> > > +	}
> > >  }
> > >  
> > >  static int zone_batchsize(struct zone *zone)
> > 
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> -- 
> Sincerely yours,
> Mike.
> 
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2020-04-10  6:50 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-28 18:31 [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA Hoan Tran
2020-03-28 18:31 ` [PATCH v3 1/5] " Hoan Tran
2020-03-28 18:31 ` [PATCH v3 2/5] powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES Hoan Tran
2020-03-28 18:31 ` [PATCH v3 3/5] x86: " Hoan Tran
2020-03-28 18:31 ` [PATCH v3 4/5] sparc: " Hoan Tran
2020-03-28 18:31 ` [PATCH v3 5/5] s390: " Hoan Tran
2020-03-29  0:19 ` [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA Baoquan He
2020-03-30  7:44   ` Michal Hocko
2020-03-30  8:04     ` Baoquan He
2020-03-30  7:42 ` Michal Hocko
2020-03-30  8:16   ` Baoquan He
2020-03-30  8:28     ` Baoquan He
2020-03-30  9:21   ` Mike Rapoport
2020-03-30  9:58     ` Michal Hocko
2020-03-30 10:26       ` Mike Rapoport
2020-03-30 10:43         ` Baoquan He
2020-03-31 21:56       ` [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP (was: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA) Mike Rapoport
2020-04-01  5:42         ` Baoquan He
2020-04-01  7:51           ` Mike Rapoport
2020-04-02  8:01             ` Michal Hocko
2020-04-09 14:41               ` Baoquan He
2020-04-09 15:33                 ` Michal Hocko
2020-04-10  6:46                   ` Baoquan He
2020-03-30  9:26   ` [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA Baoquan He
2020-03-30 17:51   ` Mike Rapoport
2020-03-30 18:23     ` Michal Hocko
2020-03-31  8:14       ` Mike Rapoport
2020-03-31  8:55         ` Michal Hocko
2020-03-31 14:03           ` Baoquan He
2020-03-31 14:21             ` Michal Hocko
2020-03-31 14:31               ` Baoquan He
2020-04-03  4:46                 ` Hoan Tran
2020-04-03  7:09                   ` Baoquan He
2020-04-03 16:36                     ` Hoan Tran
2020-04-09 16:27               ` Mike Rapoport
2020-04-10  6:50                 ` Baoquan He

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).