* [PATCH 0/3] Offline memoryless cpuless node 0
@ 2020-03-11 11:02 Srikar Dronamraju
  2020-03-11 11:02 ` [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus Srikar Dronamraju
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2020-03-11 11:02 UTC (permalink / raw)
  To: Andrew Morton, Michael Ellerman
  Cc: Srikar Dronamraju, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Linus Torvalds

A Linux kernel configured with CONFIG_NUMA on a system with multiple
possible nodes marks node 0 as online at boot. However, in practice
there are systems which have node 0 as memoryless and cpuless.

This can cause
1. numa_balancing to be enabled on systems with only one online node.
2. Existence of a dummy (cpuless and memoryless) node, which can confuse
users/scripts looking at the output of lscpu / numactl.

This patchset corrects this anomaly.

This should only affect systems that have CONFIG_HAVE_MEMORYLESS_NODES.
Currently only two architectures, ia64 and powerpc, have this config.
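
The state these patches change can also be inspected with a trivial
userspace reader along the following lines (illustrative only; nothing
in the series depends on it, it just dumps the same sysfs/procfs
entries shown below):

#include <stdio.h>

/* Print one sysfs/procfs entry, silently skipping files that don't exist. */
static void show(const char *path)
{
	char buf[256];
	FILE *f = fopen(path, "r");

	if (!f)
		return;
	if (fgets(buf, sizeof(buf), f))
		printf("%-45s %s", path, buf);
	fclose(f);
}

int main(void)
{
	show("/sys/devices/system/node/online");
	show("/sys/devices/system/node/has_cpu");
	show("/sys/devices/system/node/has_memory");
	show("/proc/sys/kernel/numa_balancing");
	return 0;
}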

v5.6-rc4
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
------------------
 /sys/devices/system/node/online:            0,2
 /proc/sys/kernel/numa_balancing:            1
 /sys/devices/system/node/has_cpu:           2
 /sys/devices/system/node/has_memory:        2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:          0-31

v5.6-rc4 + patches
------------------
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
------------------
/sys/devices/system/node/online:            2
/proc/sys/kernel/numa_balancing:            0
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>

Srikar Dronamraju (3):
  powerpc/numa: Set numa_node for all possible cpus
  powerpc/numa: Prefer node id queried from vphn
  mm/page_alloc: Keep memoryless cpuless node 0 offline

 arch/powerpc/mm/numa.c | 32 ++++++++++++++++++++++----------
 mm/page_alloc.c        |  4 +++-
 2 files changed, 25 insertions(+), 11 deletions(-)

-- 
1.8.3.1


* [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-03-11 11:02 [PATCH 0/3] Offline memoryless cpuless node 0 Srikar Dronamraju
@ 2020-03-11 11:02 ` Srikar Dronamraju
  2020-03-11 11:57   ` Michal Hocko
  2020-03-11 11:02 ` [PATCH 2/3] powerpc/numa: Prefer node id queried from vphn Srikar Dronamraju
  2020-03-11 11:02 ` [PATCH 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline Srikar Dronamraju
  2 siblings, 1 reply; 24+ messages in thread
From: Srikar Dronamraju @ 2020-03-11 11:02 UTC (permalink / raw)
  To: Andrew Morton, Michael Ellerman
  Cc: Srikar Dronamraju, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Linus Torvalds

A Powerpc system with multiple possible nodes and with CONFIG_NUMA
enabled always used to have a node 0, even if node 0 does not have any
cpus or memory attached to it. As per PAPR, the node affinity of a cpu
is only available once it is present / online. For all cpus that are
possible but not present, cpu_to_node() would point to node 0.

To ensure a cpuless, memoryless dummy node is not onlined, powerpc
needs to make sure the cpu_to_node() of all possible but not present
cpus is set to a proper node.
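
For illustration only (not part of this patch), the mapping being
changed here can be dumped with a throwaway debug loop such as the one
below, e.g. at the end of mem_topology_setup(); with the patch applied,
possible but not present cpus report first_online_node instead of 0:

	int cpu;

	/* Dump the cpu -> node mapping for every possible cpu. */
	for_each_possible_cpu(cpu)
		pr_info("cpu %d -> node %d (present=%d)\n",
			cpu, cpu_to_node(cpu), cpu_present(cpu));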

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8a399db..54dcd49 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -931,8 +931,20 @@ void __init mem_topology_setup(void)
 
 	reset_numa_cpu_lookup_table();
 
-	for_each_present_cpu(cpu)
-		numa_setup_cpu(cpu);
+	for_each_possible_cpu(cpu) {
+		/*
+		 * Powerpc with CONFIG_NUMA always used to have a node 0,
+		 * even if it was memoryless or cpuless. For all cpus that
+		 * are possible but not present, cpu_to_node() would point
+		 * to node 0. To remove a cpuless, memoryless dummy node,
+		 * powerpc need to make sure all possible but not present
+		 * cpu_to_node are set to a proper node.
+		 */
+		if (cpu_present(cpu))
+			numa_setup_cpu(cpu);
+		else
+			set_cpu_numa_node(cpu, first_online_node);
+	}
 }
 
 void __init initmem_init(void)
-- 
1.8.3.1


* [PATCH 2/3] powerpc/numa: Prefer node id queried from vphn
  2020-03-11 11:02 [PATCH 0/3] Offline memoryless cpuless node 0 Srikar Dronamraju
  2020-03-11 11:02 ` [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus Srikar Dronamraju
@ 2020-03-11 11:02 ` Srikar Dronamraju
  2020-03-11 11:02 ` [PATCH 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline Srikar Dronamraju
  2 siblings, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2020-03-11 11:02 UTC (permalink / raw)
  To: Andrew Morton, Michael Ellerman
  Cc: Srikar Dronamraju, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Linus Torvalds

The node id queried from the static device tree may not be correct.
For example, it may always show 0 on a shared processor. Hence prefer
the node id queried from vphn, and fall back to the device-tree-based
node id if the vphn query fails.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 54dcd49..8735fed 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -719,20 +719,20 @@ static int __init parse_numa_properties(void)
 	 */
 	for_each_present_cpu(i) {
 		struct device_node *cpu;
-		int nid;
-
-		cpu = of_get_cpu_node(i, NULL);
-		BUG_ON(!cpu);
-		nid = of_node_to_nid_single(cpu);
-		of_node_put(cpu);
+		int nid = vphn_get_nid(i);
 
 		/*
 		 * Don't fall back to default_nid yet -- we will plug
 		 * cpus into nodes once the memory scan has discovered
 		 * the topology.
 		 */
-		if (nid < 0)
-			continue;
+		if (nid == NUMA_NO_NODE) {
+			cpu = of_get_cpu_node(i, NULL);
+			if (cpu) {
+				nid = of_node_to_nid_single(cpu);
+				of_node_put(cpu);
+			}
+		}
 		node_set_online(nid);
 	}
 
-- 
1.8.3.1


* [PATCH 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-03-11 11:02 [PATCH 0/3] Offline memoryless cpuless node 0 Srikar Dronamraju
  2020-03-11 11:02 ` [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus Srikar Dronamraju
  2020-03-11 11:02 ` [PATCH 2/3] powerpc/numa: Prefer node id queried from vphn Srikar Dronamraju
@ 2020-03-11 11:02 ` Srikar Dronamraju
  2020-03-15 14:20   ` Christopher Lameter
  2 siblings, 1 reply; 24+ messages in thread
From: Srikar Dronamraju @ 2020-03-11 11:02 UTC (permalink / raw)
  To: Andrew Morton, Michael Ellerman
  Cc: Srikar Dronamraju, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Linus Torvalds

Currently, a Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes marks node 0 as online at boot. However, in practice
there are systems which have node 0 as memoryless and cpuless.

This can cause numa_balancing to be enabled on systems with only one
node with memory and CPUs. The existence of this cpuless, memoryless
dummy node can confuse users/scripts looking at the output of lscpu /
numactl.

Let's stop assuming that node 0 is always online.
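
As a consumer-side illustration (not part of this patch), code that
needs "some node that is definitely online" should pick it from the
node_states bitmaps rather than hard-coding node 0, e.g.:

	/* Illustrative sketch only: on the box below this picks node 2. */
	int nid = first_online_node;
	struct page *page = alloc_pages_node(nid, GFP_KERNEL, 0);

	if (page)
		__free_pages(page, 0);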

v5.6-rc4
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
------------------
 /sys/devices/system/node/online:            0,2
 /proc/sys/kernel/numa_balancing:            1
 /sys/devices/system/node/has_cpu:           2
 /sys/devices/system/node/has_memory:        2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:          0-31

v5.6-rc4 + patch
------------------
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
------------------
/sys/devices/system/node/online:            2
/proc/sys/kernel/numa_balancing:            0
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb75..68e635f4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -116,8 +116,10 @@ struct pcpu_drain {
  */
 nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 	[N_POSSIBLE] = NODE_MASK_ALL,
+#ifdef CONFIG_NUMA
+	[N_ONLINE] = NODE_MASK_NONE,
+#else
 	[N_ONLINE] = { { [0] = 1UL } },
-#ifndef CONFIG_NUMA
 	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
 #ifdef CONFIG_HIGHMEM
 	[N_HIGH_MEMORY] = { { [0] = 1UL } },
-- 
1.8.3.1


* Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-03-11 11:02 ` [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus Srikar Dronamraju
@ 2020-03-11 11:57   ` Michal Hocko
  2020-03-12  5:27     ` Srikar Dronamraju
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2020-03-11 11:57 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, Michael Ellerman, linuxppc-dev, linux-mm,
	linux-kernel, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Linus Torvalds

On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> A Powerpc system with multiple possible nodes and with CONFIG_NUMA
> enabled always used to have a node 0, even if node 0 does not any cpus
> or memory attached to it. As per PAPR, node affinity of a cpu is only
> available once its present / online. For all cpus that are possible but
> not present, cpu_to_node() would point to node 0.
> 
> To ensure a cpuless, memoryless dummy node is not online, powerpc need
> to make sure all possible but not present cpu_to_node are set to a
> proper node.

Just curious, is this somehow related to
http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?

> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/numa.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 8a399db..54dcd49 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -931,8 +931,20 @@ void __init mem_topology_setup(void)
>  
>  	reset_numa_cpu_lookup_table();
>  
> -	for_each_present_cpu(cpu)
> -		numa_setup_cpu(cpu);
> +	for_each_possible_cpu(cpu) {
> +		/*
> +		 * Powerpc with CONFIG_NUMA always used to have a node 0,
> +		 * even if it was memoryless or cpuless. For all cpus that
> +		 * are possible but not present, cpu_to_node() would point
> +		 * to node 0. To remove a cpuless, memoryless dummy node,
> +		 * powerpc need to make sure all possible but not present
> +		 * cpu_to_node are set to a proper node.
> +		 */
> +		if (cpu_present(cpu))
> +			numa_setup_cpu(cpu);
> +		else
> +			set_cpu_numa_node(cpu, first_online_node);
> +	}
>  }
>  
>  void __init initmem_init(void)
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-03-11 11:57   ` Michal Hocko
@ 2020-03-12  5:27     ` Srikar Dronamraju
  2020-03-12  8:23       ` Sachin Sant
  0 siblings, 1 reply; 24+ messages in thread
From: Srikar Dronamraju @ 2020-03-12  5:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Michael Ellerman, linuxppc-dev, linux-mm,
	linux-kernel, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Linus Torvalds

* Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:

> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> > A Powerpc system with multiple possible nodes and with CONFIG_NUMA
> > enabled always used to have a node 0, even if node 0 does not any cpus
> > or memory attached to it. As per PAPR, node affinity of a cpu is only
> > available once its present / online. For all cpus that are possible but
> > not present, cpu_to_node() would point to node 0.
> > 
> > To ensure a cpuless, memoryless dummy node is not online, powerpc need
> > to make sure all possible but not present cpu_to_node are set to a
> > proper node.
> 
> Just curious, is this somehow related to
> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?
> 

The issue I am trying to fix is a known issue on Powerpc for many years,
so this is surely not a problem introduced by a75056fc1e7c
("mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node").

I tried v5.6-rc4 + a75056fc1e7c but didn't face any issues booting the
kernel. I will work with Sachin/Abdul (the reporters of the issue).


> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: linux-mm@kvack.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> > Cc: Christopher Lameter <cl@linux.com>
> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/mm/numa.c | 16 ++++++++++++++--
> >  1 file changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 8a399db..54dcd49 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -931,8 +931,20 @@ void __init mem_topology_setup(void)
> >  
> >  	reset_numa_cpu_lookup_table();
> >  
> > -	for_each_present_cpu(cpu)
> > -		numa_setup_cpu(cpu);
> > +	for_each_possible_cpu(cpu) {
> > +		/*
> > +		 * Powerpc with CONFIG_NUMA always used to have a node 0,
> > +		 * even if it was memoryless or cpuless. For all cpus that
> > +		 * are possible but not present, cpu_to_node() would point
> > +		 * to node 0. To remove a cpuless, memoryless dummy node,
> > +		 * powerpc need to make sure all possible but not present
> > +		 * cpu_to_node are set to a proper node.
> > +		 */
> > +		if (cpu_present(cpu))
> > +			numa_setup_cpu(cpu);
> > +		else
> > +			set_cpu_numa_node(cpu, first_online_node);
> > +	}
> >  }
> >  
> >  void __init initmem_init(void)
> > -- 
> > 1.8.3.1
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-03-12  5:27     ` Srikar Dronamraju
@ 2020-03-12  8:23       ` Sachin Sant
  2020-03-12  9:30         ` Vlastimil Babka
  0 siblings, 1 reply; 24+ messages in thread
From: Sachin Sant @ 2020-03-12  8:23 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Michal Hocko, Linus Torvalds, LKML, linux-mm, Mel Gorman,
	Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Vlastimil Babka

[-- Attachment #1: Type: text/plain, Size: 4183 bytes --]



> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> 
> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:
> 
>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
>>> A Powerpc system with multiple possible nodes and with CONFIG_NUMA
>>> enabled always used to have a node 0, even if node 0 does not any cpus
>>> or memory attached to it. As per PAPR, node affinity of a cpu is only
>>> available once its present / online. For all cpus that are possible but
>>> not present, cpu_to_node() would point to node 0.
>>> 
>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
>>> to make sure all possible but not present cpu_to_node are set to a
>>> proper node.
>> 
>> Just curious, is this somehow related to
>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?
>> 
> 
> The issue I am trying to fix is a known issue in Powerpc since many years.
> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
> shrinker_map on appropriate NUMA node"). 
> 
> I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the
> kernel. Will work with Sachin/Abdul (reporters of the issue).
> 

I applied this 3-patch series on top of the March 11 next tree (commit
d44a64766795). The kernel still fails to boot with the same call trace.

[    6.159357] BUG: Kernel NULL pointer dereference on read at 0x000073b0
[    6.159363] Faulting instruction address: 0xc0000000003d7174
[    6.159368] Oops: Kernel access of bad area, sig: 11 [#1]
[    6.159372] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[    6.159378] Modules linked in:
[    6.159382] CPU: 17 PID: 1 Comm: systemd Not tainted 5.6.0-rc5-next-20200311-autotest+ #1
[    6.159388] NIP:  c0000000003d7174 LR: c0000000003d7714 CTR: c000000000400e70
[    6.159393] REGS: c0000008b36836d0 TRAP: 0300   Not tainted  (5.6.0-rc5-next-20200311-autotest+)
[    6.159398] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24004848  XER: 00000000
[    6.159406] CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1
[    6.159406] GPR00: c0000000003d7714 c0000008b3683960 c00000000155e300 c0000008b301f500
[    6.159406] GPR04: 0000000000000dc0 0000000000000000 c0000000003456f8 c0000008bb198620
[    6.159406] GPR08: 00000008ba0f0000 0000000000000001 0000000000000000 0000000000000000
[    6.159406] GPR12: 0000000024004848 c00000001ec55e00 0000000000000000 0000000000000000
[    6.159406] GPR16: c0000008b0a82048 c000000001595898 c000000001750ca8 0000000000000002
[    6.159406] GPR20: c000000001750cb8 c000000001624478 0000000fffffffe0 5deadbeef0000122
[    6.159406] GPR24: 0000000000000001 0000000000000dc0 0000000000000000 c0000000003456f8
[    6.159406] GPR28: c0000008b301f500 c0000008bb198620 0000000000000000 c00c000002285a40
[    6.159453] NIP [c0000000003d7174] ___slab_alloc+0x1f4/0x760
[    6.159458] LR [c0000000003d7714] __slab_alloc+0x34/0x60
[    6.159462] Call Trace:
[    6.159465] [c0000008b3683a40] [c0000008b3683a70] 0xc0000008b3683a70
[    6.159471] [c0000008b3683a70] [c0000000003d8b20] __kmalloc_node+0x110/0x490
[    6.159477] [c0000008b3683af0] [c0000000003456f8] kvmalloc_node+0x58/0x110
[    6.159483] [c0000008b3683b30] [c000000000400f78] mem_cgroup_css_online+0x108/0x270
[    6.159489] [c0000008b3683b90] [c000000000236ed8] online_css+0x48/0xd0
[    6.159494] [c0000008b3683bc0] [c00000000023ffac] cgroup_apply_control_enable+0x2ec/0x4d0
[    6.159501] [c0000008b3683ca0] [c0000000002437c8] cgroup_mkdir+0x228/0x5f0
[    6.159506] [c0000008b3683d10] [c000000000521780] kernfs_iop_mkdir+0x90/0xf0
[    6.159512] [c0000008b3683d50] [c00000000043f670] vfs_mkdir+0x110/0x230
[    6.159517] [c0000008b3683da0] [c000000000443150] do_mkdirat+0xb0/0x1a0
[    6.159523] [c0000008b3683e20] [c00000000000b278] system_call+0x5c/0x68
[    6.159527] Instruction dump:
[    6.159531] 7c421378 e95f0000 714a0001 4082fff0 4bffff64 60000000 60000000 faa10088
[    6.159538] 3ea2000c 3ab56178 7b4a1f24 7d55502a <e94a73b0> 2faa0000 409e0394 3d02002a
[    6.159545] ---[ end trace 36d65cb66091a5b6 ]---

Boot log attached.

Thanks
-Sachin

[-- Attachment #2: memory-less-node-boot.log --]
[-- Type: application/octet-stream, Size: 19236 bytes --]

# kexec -e
[ 4149.149473] kexec_core: Starting new kernel
[ 4149.169501] kexec: waiting for cpu 2 (physical 2) to enter 1 state
[ 4149.169512] kexec: waiting for cpu 23 (physical 23) to enter 1 state
[ 4149.169521] kexec: waiting for cpu 1 (physical 1) to enter 2 state
[ 4149.169596] kexec: waiting for cpu 2 (physical 2) to enter 2 state
[ 4149.169610] kexec: waiting for cpu 3 (physical 3) to enter 2 state
[ 4149.169620] kexec: waiting for cpu 8 (physical 8) to enter 2 state
[ 4149.333175] kexec: Starting switchover sequence.
I'm in purgatory
[    0.000000] hash-mmu: Page sizes from device-tree:
[    0.000000] hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
[    0.000000] hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7
[    0.000000] hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
[    0.000000] hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1
[    0.000000] hash-mmu: base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8
[    0.000000] hash-mmu: base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0
[    0.000000] hash-mmu: base_shift=34: shift=34, sllp=0x0120, avpnm=0x000007ff, tlbiel=0, penc=3
[    0.000000] Using 1TB segments
[    0.000000] hash-mmu: Initializing hash mmu with SLB
[    0.000000] Linux version 5.6.0-rc5-next-20200311-autotest+ (root@ltc-zzci-2.aus.stglabs.ibm.com) (gcc version 8.3.1 20190507 (Red Hat 8.3.1-4) (GCC)) #1 SMP Thu Mar 12 03:03:59 CDT 2020
[    0.000000] Found initrd at 0xc000000003350000:0xc000000004d9808f
[    0.000000] Using pSeries machine description
[    0.000000] printk: bootconsole [udbg0] enabled
[    0.000000] Partition configured for 32 cpus.
[    0.000000] CPU maps initialized for 8 threads per core
[    0.000000] -----------------------------------------------------
[    0.000000] phys_mem_size     = 0x8c0000000
[    0.000000] dcache_bsize      = 0x80
[    0.000000] icache_bsize      = 0x80
[    0.000000] cpu_features      = 0x0001c07f8f5f91a7
[    0.000000]   possible        = 0x0003fbffcf5fb1a7
[    0.000000]   always          = 0x00000003800081a1
[    0.000000] cpu_user_features = 0xdc0065c2 0xefe00000
[    0.000000] mmu_features      = 0x7c006001
[    0.000000] firmware_features = 0x00000097c45bfc57
[    0.000000] vmalloc start     = 0xc008000000000000
[    0.000000] IO start          = 0xc00a000000000000
[    0.000000] vmemmap start     = 0xc00c000000000000
[    0.000000] hash-mmu: ppc64_pft_size    = 0x1c
[    0.000000] hash-mmu: htab_hash_mask    = 0x1fffff
[    0.000000] -----------------------------------------------------
[    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
[    0.000000] rfi-flush: fallback displacement flush available
[    0.000000] rfi-flush: mttrig type flush available
[    0.000000] link-stack-flush: software flush enabled.
[    0.000000] count-cache-flush: software flush disabled.
[    0.000000] stf-barrier: eieio barrier available
[    0.000000] lpar: H_BLOCK_REMOVE supports base psize:0 psize:0 block size:8
[    0.000000] lpar: H_BLOCK_REMOVE supports base psize:0 psize:2 block size:8
[    0.000000] lpar: H_BLOCK_REMOVE supports base psize:0 psize:10 block size:8
[    0.000000] lpar: H_BLOCK_REMOVE supports base psize:2 psize:2 block size:8
[    0.000000] lpar: H_BLOCK_REMOVE supports base psize:2 psize:10 block size:8
[    0.000000] PPC64 nvram contains 15360 bytes
[    0.000000] barrier-nospec: using ORI speculation barrier
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x00000008bfffffff]
[    0.000000]   Device   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   1: [mem 0x0000000000000000-0x00000008bfffffff]
[    0.000000] Initmem setup node 1 [mem 0x0000000000000000-0x00000008bfffffff]
[    0.000000] percpu: Embedded 11 pages/cpu s624024 r0 d96872 u1048576
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 572880
[    0.000000] Policy zone: Normal
[    0.000000] Kernel command line: root=UUID=681ebf25-b7c8-49b9-b247-35a96bc8183f 
[    0.000000] Dentry cache hash table entries: 8388608 (order: 10, 67108864 bytes, linear)
[    0.000000] Inode-cache hash table entries: 4194304 (order: 9, 33554432 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 36388480K/36700160K available (11840K kernel code, 1728K rwdata, 3712K rodata, 4992K init, 2845K bss, 311680K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=128, Order=0-3, MinObjects=0, CPUs=32, Nodes=32
[    0.000000] ftrace: allocating 29890 entries in 11 pages
[    0.000000] ftrace: allocated 11 pages with 3 groups
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu: 	RCU restricting CPUs from NR_CPUS=2048 to nr_cpu_ids=32.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
[    0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=32
[    0.000000] NR_IRQS: 512, nr_irqs: 512, preallocated irqs: 16
[    0.000000] xive: Using IRQ range [94000-9401f]
[    0.000000] xive: Interrupt handling initialized with spapr backend
[    0.000000] xive: Using priority 7 for all interrupts
[    0.000000] xive: Using 64kB queues
[    0.000000] rcu: 	Offload RCU callbacks from CPUs: (none).
[    0.000000] random: get_random_u64 called from start_kernel+0x748/0x9a4 with crng_init=0
[    0.000001] time_init: 56 bit decrementer (max: 7fffffffffffff)
[    0.000065] clocksource: timebase: mask: 0xffffffffffffffff max_cycles: 0x761537d007, max_idle_ns: 440795202126 ns
[    0.000173] clocksource: timebase mult[1f40000] shift[24] registered
[    0.000299] Console: colour dummy device 80x25
[    0.000350] printk: console [hvc0] enabled
[    0.000350] printk: console [hvc0] enabled
[    0.000397] printk: bootconsole [udbg0] disabled
[    0.000397] printk: bootconsole [udbg0] disabled
[    0.000472] pid_max: default: 32768 minimum: 301
[    0.000621] Mount-cache hash table entries: 131072 (order: 4, 1048576 bytes, linear)
[    0.000690] Mountpoint-cache hash table entries: 131072 (order: 4, 1048576 bytes, linear)
[    0.001491] EEH: pSeries platform initialized
[    0.001499] POWER9 performance monitor hardware support registered
[    0.001531] rcu: Hierarchical SRCU implementation.
[    0.002227] smp: Bringing up secondary CPUs ...
[    0.011406] smp: Brought up 1 node, 32 CPUs
[    0.011413] numa: Node 1 CPUs: 0-31
[    0.011417] Using small cores at SMT level
[    0.011420] Using shared cache scheduler topology
[    0.012363] devtmpfs: initialized
[    0.015539] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    0.015550] futex hash table entries: 8192 (order: 4, 1048576 bytes, linear)
[    0.015773] thermal_sys: Registered thermal governor 'fair_share'
[    0.015774] thermal_sys: Registered thermal governor 'step_wise'
[    0.015880] NET: Registered protocol family 16
[    0.016023] audit: initializing netlink subsys (disabled)
[    0.016073] audit: type=2000 audit(1584000985.010:1): state=initialized audit_enabled=0 res=1
[    0.016180] cpuidle: using governor menu
[    0.016353] pstore: Registered nvram as persistent store backend
[    0.020998] PCI: Probing PCI hardware
[    0.021005] EEH: No capable adapters found: recovery disabled.
[    0.021071] pseries-rng: Registering arch random hook.
[    0.022801] HugeTLB registered 16.0 MiB page size, pre-allocated 0 pages
[    0.022808] HugeTLB registered 16.0 GiB page size, pre-allocated 0 pages
[    0.263342] random: fast init done
[    0.264479] iommu: Default domain type: Translated 
[    0.264517] vgaarb: loaded
[    0.264595] SCSI subsystem initialized
[    0.264626] usbcore: registered new interface driver usbfs
[    0.264635] usbcore: registered new interface driver hub
[    0.264714] usbcore: registered new device driver usb
[    0.264835] EDAC MC: Ver: 3.0.0
[    0.265112] clocksource: Switched to clocksource timebase
[    0.276114] VFS: Disk quotas dquot_6.6.0
[    0.276139] VFS: Dquot-cache hash table entries: 8192 (order 0, 65536 bytes)
[    0.277659] NET: Registered protocol family 2
[    0.277834] tcp_listen_portaddr_hash hash table entries: 32768 (order: 3, 524288 bytes, linear)
[    0.277888] TCP established hash table entries: 524288 (order: 6, 4194304 bytes, linear)
[    0.278586] TCP bind hash table entries: 65536 (order: 4, 1048576 bytes, linear)
[    0.278680] TCP: Hash tables configured (established 524288 bind 65536)
[    0.278718] UDP hash table entries: 32768 (order: 4, 1048576 bytes, linear)
[    0.278817] UDP-Lite hash table entries: 32768 (order: 4, 1048576 bytes, linear)
[    0.279011] NET: Registered protocol family 1
[    0.279020] PCI: CLS 0 bytes, default 128
[    0.279056] Trying to unpack rootfs image as initramfs...
[    0.690890] Freeing initrd memory: 26880K
[    0.693555] IOMMU table initialized, virtual merging enabled
[    0.713723] hv-24x7: read 1530 catalog entries, created 537 event attrs (0 failures), 275 descs
[    0.714748] workingset: timestamp_bits=38 max_order=20 bucket_order=0
[    0.715826] zbud: loaded
[    0.725026] NET: Registered protocol family 38
[    0.725033] Key type asymmetric registered
[    0.725037] Asymmetric key parser 'x509' registered
[    0.725046] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 249)
[    0.725138] io scheduler mq-deadline registered
[    0.725143] io scheduler kyber registered
[    0.725573] atomic64_test: passed
[    0.725607] PowerPC PowerNV PCI Hotplug Driver version: 0.1
[    0.725872] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[    0.726066] Non-volatile memory driver v1.3
[    0.726089] Linux agpgart interface v0.103
[    6.005301] tpm_ibmvtpm 30000003: CRQ initialization completed
[    6.005309] tpm_ibmvtpm 30000003: ibmvtpm device is not ready
[    6.005310] tpm_ibmvtpm 30000003: ibmvtpm device is not ready
[    6.005540] rdac: device handler registered
[    6.005579] hp_sw: device handler registered
[    6.005583] emc: device handler registered
[    6.005651] alua: device handler registered
[    6.005730] libphy: Fixed MDIO Bus: probed
[    6.005763] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[    6.005773] ehci-pci: EHCI PCI platform driver
[    6.005781] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[    6.005790] ohci-pci: OHCI PCI platform driver
[    6.005797] uhci_hcd: USB Universal Host Controller Interface driver
[    6.005828] usbcore: registered new interface driver usbserial_generic
[    6.005835] usbserial: USB Serial support registered for generic
[    6.005882] mousedev: PS/2 mouse device common for all mice
[    6.005988] rtc-generic rtc-generic: registered as rtc0
[    6.006323] nx_compress_pseries ibm,compression-v1: nx842_OF_upd: max_sync_size new:65536 old:0
[    6.006332] nx_compress_pseries ibm,compression-v1: nx842_OF_upd: max_sync_sg new:510 old:0
[    6.006339] nx_compress_pseries ibm,compression-v1: nx842_OF_upd: max_sg_len new:4080 old:0
[    6.006396] alg: No test for 842 (842-nx)
[    6.007511] hid: raw HID events driver (C) Jiri Kosina
[    6.007602] usbcore: registered new interface driver usbhid
[    6.007605] usbhid: USB HID core driver
[    6.007657] drop_monitor: Initializing network drop monitor service
[    6.007735] Initializing XFRM netlink socket
[    6.007866] NET: Registered protocol family 10
[    6.008125] Segment Routing with IPv6
[    6.008141] NET: Registered protocol family 17
[    6.008645] registered taskstats version 1
[    6.008677] zswap: loaded using pool lzo/zbud
[    6.008809] pstore: Using crash dump compression: deflate
[    6.012620] Key type big_key registered
[    6.012774] rtc-generic rtc-generic: setting system clock to 2020-03-12T08:16:31 UTC (1584000991)
[    6.014119] Freeing unused kernel memory: 4992K
[    6.014124] Kernel memory protection not selected by kernel config.
[    6.014128] Run /init as init process
[    6.023254] systemd[1]: systemd 239 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=legacy)
[    6.023441] systemd[1]: Detected architecture ppc64-le.
[    6.023447] systemd[1]: Running in initial RAM disk.

Welcome to Red Hat Enterprise Linux 8.1 Beta (Ootpa) dracut-049-26.git20190806.el8 (Initramfs)!

[    6.105290] systemd[1]: Set hostname to <ltc-zzci-2.aus.stglabs.ibm.com>.
[    6.157928] random: systemd: uninitialized urandom read (16 bytes read)
[    6.157996] systemd[1]: Listening on udev Kernel Socket.
[  OK  ] Listening on udev Kernel Socket.
[    6.158142] random: systemd: uninitialized urandom read (16 bytes read)
[    6.158225] systemd[1]: Listening on Journal Socket.
[  OK  ] Listening on Journal Socket.
[    6.158356] random: systemd: uninitialized urandom read (16 bytes read)
[    6.158368] systemd[1]: Reached target Timers.
[  OK  ] Reached target Timers.
[    6.159357] BUG: Kernel NULL pointer dereference on read at 0x000073b0
[    6.159363] Faulting instruction address: 0xc0000000003d7174
[    6.159368] Oops: Kernel access of bad area, sig: 11 [#1]
[    6.159372] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[    6.159378] Modules linked in:
[    6.159382] CPU: 17 PID: 1 Comm: systemd Not tainted 5.6.0-rc5-next-20200311-autotest+ #1
[    6.159388] NIP:  c0000000003d7174 LR: c0000000003d7714 CTR: c000000000400e70
[    6.159393] REGS: c0000008b36836d0 TRAP: 0300   Not tainted  (5.6.0-rc5-next-20200311-autotest+)
[    6.159398] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24004848  XER: 00000000
[    6.159406] CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1 
[    6.159406] GPR00: c0000000003d7714 c0000008b3683960 c00000000155e300 c0000008b301f500 
[    6.159406] GPR04: 0000000000000dc0 0000000000000000 c0000000003456f8 c0000008bb198620 
[    6.159406] GPR08: 00000008ba0f0000 0000000000000001 0000000000000000 0000000000000000 
[    6.159406] GPR12: 0000000024004848 c00000001ec55e00 0000000000000000 0000000000000000 
[    6.159406] GPR16: c0000008b0a82048 c000000001595898 c000000001750ca8 0000000000000002 
[    6.159406] GPR20: c000000001750cb8 c000000001624478 0000000fffffffe0 5deadbeef0000122 
[    6.159406] GPR24: 0000000000000001 0000000000000dc0 0000000000000000 c0000000003456f8 
[    6.159406] GPR28: c0000008b301f500 c0000008bb198620 0000000000000000 c00c000002285a40 
[    6.159453] NIP [c0000000003d7174] ___slab_alloc+0x1f4/0x760
[    6.159458] LR [c0000000003d7714] __slab_alloc+0x34/0x60
[    6.159462] Call Trace:
[    6.159465] [c0000008b3683a40] [c0000008b3683a70] 0xc0000008b3683a70
[    6.159471] [c0000008b3683a70] [c0000000003d8b20] __kmalloc_node+0x110/0x490
[    6.159477] [c0000008b3683af0] [c0000000003456f8] kvmalloc_node+0x58/0x110
[    6.159483] [c0000008b3683b30] [c000000000400f78] mem_cgroup_css_online+0x108/0x270
[    6.159489] [c0000008b3683b90] [c000000000236ed8] online_css+0x48/0xd0
[    6.159494] [c0000008b3683bc0] [c00000000023ffac] cgroup_apply_control_enable+0x2ec/0x4d0
[    6.159501] [c0000008b3683ca0] [c0000000002437c8] cgroup_mkdir+0x228/0x5f0
[    6.159506] [c0000008b3683d10] [c000000000521780] kernfs_iop_mkdir+0x90/0xf0
[    6.159512] [c0000008b3683d50] [c00000000043f670] vfs_mkdir+0x110/0x230
[    6.159517] [c0000008b3683da0] [c000000000443150] do_mkdirat+0xb0/0x1a0
[    6.159523] [c0000008b3683e20] [c00000000000b278] system_call+0x5c/0x68
[    6.159527] Instruction dump:
[    6.159531] 7c421378 e95f0000 714a0001 4082fff0 4bffff64 60000000 60000000 faa10088 
[    6.159538] 3ea2000c 3ab56178 7b4a1f24 7d55502a <e94a73b0> 2faa0000 409e0394 3d02002a 
[    6.159545] ---[ end trace 36d65cb66091a5b6 ]---
[    6.161610] 
[    7.161622] Kernel panic - not syncing: Fatal exception
[    7.169280] ------------[ cut here ]------------
[    7.169289] WARNING: CPU: 17 PID: 1 at drivers/tty/vt/vt.c:4266 do_unblank_screen+0x190/0x250
[    7.169297] Modules linked in:
[    7.169303] CPU: 17 PID: 1 Comm: systemd Tainted: G      D           5.6.0-rc5-next-20200311-autotest+ #1
[    7.169312] NIP:  c0000000006ed370 LR: c0000000006ed35c CTR: c000000000b7b960
[    7.169320] REGS: c0000008b36831b0 TRAP: 0700   Tainted: G      D            (5.6.0-rc5-next-20200311-autotest+)
[    7.169330] MSR:  8000000000021033 <SF,ME,IR,DR,RI,LE>  CR: 28002242  XER: 2004000c
[    7.169341] CFAR: c0000000001c8948 IRQMASK: 3 
[    7.169341] GPR00: c0000000006ed35c c0000008b3683440 c00000000155e300 0000000000000000 
[    7.169341] GPR04: 0000000000000003 c0000008b06c200e 0000000000001dd7 c0000008b3683380 
[    7.169341] GPR08: c000000001423760 0000000000000000 0000000000000000 c0000008b36831ff 
[    7.169341] GPR12: 0000000028002448 c00000001ec55e00 0000000000000000 0000000000000000 
[    7.169341] GPR16: c0000008b0a82048 c000000001595898 c000000001750ca8 0000000000000002 
[    7.169341] GPR20: c000000001750cb8 c000000001624478 0000000fffffffe0 5deadbeef0000122 
[    7.169341] GPR24: 0000000000000001 0000000000000dc0 c00000000142c830 c0000000003456f8 
[    7.169341] GPR28: c000000001636f58 c000000001636f80 0000000000000000 c000000001745a88 
[    7.169409] NIP [c0000000006ed370] do_unblank_screen+0x190/0x250
[    7.169417] LR [c0000000006ed35c] do_unblank_screen+0x17c/0x250
[    7.169423] Call Trace:
[    7.169428] [c0000008b3683440] [c0000000006ed38c] do_unblank_screen+0x1ac/0x250 (unreliable)
[    7.169439] [c0000008b36834c0] [c00000000013eb24] panic+0x1e8/0x414
[    7.169447] [c0000008b3683560] [c00000000002c71c] oops_end+0x1ac/0x1b0
[    7.169455] [c0000008b36835e0] [c0000000000868e0] bad_page_fault+0x190/0x1e0
[    7.169464] [c0000008b3683660] [c00000000000a8a4] handle_page_fault+0x2c/0x30
[    7.169475] --- interrupt: 300 at ___slab_alloc+0x1f4/0x760
[    7.169475]     LR = __slab_alloc+0x34/0x60
[    7.169484] [c0000008b3683960] [0000000000000000] 0x0 (unreliable)
[    7.169491] [c0000008b3683a40] [c0000008b3683a70] 0xc0000008b3683a70
[    7.169500] [c0000008b3683a70] [c0000000003d8b20] __kmalloc_node+0x110/0x490
[    7.169509] [c0000008b3683af0] [c0000000003456f8] kvmalloc_node+0x58/0x110
[    7.169516] [c0000008b3683b30] [c000000000400f78] mem_cgroup_css_online+0x108/0x270
[    7.169525] [c0000008b3683b90] [c000000000236ed8] online_css+0x48/0xd0
[    7.169533] [c0000008b3683bc0] [c00000000023ffac] cgroup_apply_control_enable+0x2ec/0x4d0
[    7.169542] [c0000008b3683ca0] [c0000000002437c8] cgroup_mkdir+0x228/0x5f0
[    7.169550] [c0000008b3683d10] [c000000000521780] kernfs_iop_mkdir+0x90/0xf0
[    7.169559] [c0000008b3683d50] [c00000000043f670] vfs_mkdir+0x110/0x230
[    7.169567] [c0000008b3683da0] [c000000000443150] do_mkdirat+0xb0/0x1a0
[    7.169575] [c0000008b3683e20] [c00000000000b278] system_call+0x5c/0x68
[    7.169581] Instruction dump:
[    7.169586] 4e800020 60000000 60000000 60000000 7c0802a6 f8010090 4badb5e1 60000000 
[    7.169597] 813f0000 7d231b78 2f830000 409e0034 <0fe00000> e8010090 7c0803a6 4bfffeac 
[    7.169608] ---[ end trace 36d65cb66091a5b7 ]---
[    7.169615] Rebooting in 10 seconds..

* Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-03-12  8:23       ` Sachin Sant
@ 2020-03-12  9:30         ` Vlastimil Babka
  2020-03-12 13:14           ` Srikar Dronamraju
  0 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2020-03-12  9:30 UTC (permalink / raw)
  To: Sachin Sant, Srikar Dronamraju
  Cc: Michal Hocko, Linus Torvalds, LKML, linux-mm, Mel Gorman,
	Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter

On 3/12/20 9:23 AM, Sachin Sant wrote:
> 
> 
>> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
>> 
>> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:
>> 
>>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
>>>> A Powerpc system with multiple possible nodes and with CONFIG_NUMA
>>>> enabled always used to have a node 0, even if node 0 does not any cpus
>>>> or memory attached to it. As per PAPR, node affinity of a cpu is only
>>>> available once its present / online. For all cpus that are possible but
>>>> not present, cpu_to_node() would point to node 0.
>>>> 
>>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
>>>> to make sure all possible but not present cpu_to_node are set to a
>>>> proper node.
>>> 
>>> Just curious, is this somehow related to
>>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?
>>> 
>> 
>> The issue I am trying to fix is a known issue in Powerpc since many years.
>> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
>> shrinker_map on appropriate NUMA node"). 
>> 
>> I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the
>> kernel. Will work with Sachin/Abdul (reporters of the issue).
>> 
> 
> I applied this 3 patch series on top of March 11 next tree (commit d44a64766795 )
> The kernel still fails to boot with same call trace.

Yeah when I skimmed the patches, I don't think they address the issue where
node_to_mem_node(0) = 0 [1]. You could reapply the debug print patch to verify,
but it seems very likely. So I'm not surprised you get the same trace.

[1] https://lore.kernel.org/linux-next/9a86f865-50b5-7483-9257-dbb08fecd62b@suse.cz/

> [    6.159357] BUG: Kernel NULL pointer dereference on read at 0x000073b0
> [    6.159363] Faulting instruction address: 0xc0000000003d7174
> [    6.159368] Oops: Kernel access of bad area, sig: 11 [#1]
> [    6.159372] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> [    6.159378] Modules linked in:
> [    6.159382] CPU: 17 PID: 1 Comm: systemd Not tainted 5.6.0-rc5-next-20200311-autotest+ #1
> [    6.159388] NIP:  c0000000003d7174 LR: c0000000003d7714 CTR: c000000000400e70
> [    6.159393] REGS: c0000008b36836d0 TRAP: 0300   Not tainted  (5.6.0-rc5-next-20200311-autotest+)
> [    6.159398] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24004848  XER: 00000000
> [    6.159406] CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1
> [    6.159406] GPR00: c0000000003d7714 c0000008b3683960 c00000000155e300 c0000008b301f500
> [    6.159406] GPR04: 0000000000000dc0 0000000000000000 c0000000003456f8 c0000008bb198620
> [    6.159406] GPR08: 00000008ba0f0000 0000000000000001 0000000000000000 0000000000000000
> [    6.159406] GPR12: 0000000024004848 c00000001ec55e00 0000000000000000 0000000000000000
> [    6.159406] GPR16: c0000008b0a82048 c000000001595898 c000000001750ca8 0000000000000002
> [    6.159406] GPR20: c000000001750cb8 c000000001624478 0000000fffffffe0 5deadbeef0000122
> [    6.159406] GPR24: 0000000000000001 0000000000000dc0 0000000000000000 c0000000003456f8
> [    6.159406] GPR28: c0000008b301f500 c0000008bb198620 0000000000000000 c00c000002285a40
> [    6.159453] NIP [c0000000003d7174] ___slab_alloc+0x1f4/0x760
> [    6.159458] LR [c0000000003d7714] __slab_alloc+0x34/0x60
> [    6.159462] Call Trace:
> [    6.159465] [c0000008b3683a40] [c0000008b3683a70] 0xc0000008b3683a70
> [    6.159471] [c0000008b3683a70] [c0000000003d8b20] __kmalloc_node+0x110/0x490
> [    6.159477] [c0000008b3683af0] [c0000000003456f8] kvmalloc_node+0x58/0x110
> [    6.159483] [c0000008b3683b30] [c000000000400f78] mem_cgroup_css_online+0x108/0x270
> [    6.159489] [c0000008b3683b90] [c000000000236ed8] online_css+0x48/0xd0
> [    6.159494] [c0000008b3683bc0] [c00000000023ffac] cgroup_apply_control_enable+0x2ec/0x4d0
> [    6.159501] [c0000008b3683ca0] [c0000000002437c8] cgroup_mkdir+0x228/0x5f0
> [    6.159506] [c0000008b3683d10] [c000000000521780] kernfs_iop_mkdir+0x90/0xf0
> [    6.159512] [c0000008b3683d50] [c00000000043f670] vfs_mkdir+0x110/0x230
> [    6.159517] [c0000008b3683da0] [c000000000443150] do_mkdirat+0xb0/0x1a0
> [    6.159523] [c0000008b3683e20] [c00000000000b278] system_call+0x5c/0x68
> [    6.159527] Instruction dump:
> [    6.159531] 7c421378 e95f0000 714a0001 4082fff0 4bffff64 60000000 60000000 faa10088
> [    6.159538] 3ea2000c 3ab56178 7b4a1f24 7d55502a <e94a73b0> 2faa0000 409e0394 3d02002a
> [    6.159545] ---[ end trace 36d65cb66091a5b6 ]---
> 
> Boot log attached.
> 
> Thanks
> -Sachin
> 


* Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-03-12  9:30         ` Vlastimil Babka
@ 2020-03-12 13:14           ` Srikar Dronamraju
  2020-03-12 13:51             ` Vlastimil Babka
  0 siblings, 1 reply; 24+ messages in thread
From: Srikar Dronamraju @ 2020-03-12 13:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Sachin Sant, Michal Hocko, Linus Torvalds, LKML, linux-mm,
	Mel Gorman, Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter

* Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]:

> On 3/12/20 9:23 AM, Sachin Sant wrote:
> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:
> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
> >>>> to make sure all possible but not present cpu_to_node are set to a
> >>>> proper node.
> >>> 
> >>> Just curious, is this somehow related to
> >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?
> >>> 
> >> 
> >> The issue I am trying to fix is a known issue in Powerpc since many years.
> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
> >> shrinker_map on appropriate NUMA node"). 
> >> 
> >> I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the
> >> kernel. Will work with Sachin/Abdul (reporters of the issue).

I had used v1 and not v2. So my mistake.

> > I applied this 3 patch series on top of March 11 next tree (commit d44a64766795 )
> > The kernel still fails to boot with same call trace.
> 

While I am not an expert in the slub area, I looked at the patch
a75056fc1e7c and had some thoughts on why this could be causing this issue.

On the system where the crash happens, the number of possible nodes is much
greater than the number of online nodes. The pgdat, i.e. the NODE_DATA, is
only available for online nodes.

With a75056fc1e7c, memcg_alloc_shrinker_maps() ends up calling kzalloc_node()
for all possible nodes, and in ___slab_alloc() we end up looking at
node_present_pages(), which is NODE_DATA(nid)->node_present_pages.
I.e. for a node whose pgdat struct is not allocated, we end up dereferencing
a NULL pointer.
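
For reference, node_present_pages() is just (roughly, from
include/linux/mmzone.h):

	#define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)

so for a possible node whose pgdat was never allocated, NODE_DATA(nid) is
NULL and the read faults at the offset of node_present_pages within
pg_data_t, which is consistent with the 0x73b0 faulting address in the
oops above.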

Also, for a memoryless/cpuless node or for possible but not present nodes,
node_to_mem_node(node) will still end up as node (at least on powerpc).

I tried with this hunk below and it works.

But I am not sure if we need the same check at other places where
node_present_pages() is being called.

diff --git a/mm/slub.c b/mm/slub.c
index 626cbcbd977f..bddb93bed55e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2571,9 +2571,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	if (unlikely(!node_match(page, node))) {
 		int searchnode = node;
 
-		if (node != NUMA_NO_NODE && !node_present_pages(node))
-			searchnode = node_to_mem_node(node);
-
+		if (node != NUMA_NO_NODE) {
+			if (!node_online(node) || !node_present_pages(node)) {
+				searchnode = node_to_mem_node(node);
+				if (!node_online(searchnode))
+					searchnode = first_online_node;
+			}
+		}
 		if (unlikely(!node_match(page, searchnode))) {
 			stat(s, ALLOC_NODE_MISMATCH);
 			deactivate_slab(s, page, c->freelist, c);

> > 
> 

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-03-12 13:14           ` Srikar Dronamraju
@ 2020-03-12 13:51             ` Vlastimil Babka
  2020-03-12 16:13               ` Srikar Dronamraju
  0 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2020-03-12 13:51 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Sachin Sant, Michal Hocko, Linus Torvalds, LKML, linux-mm,
	Mel Gorman, Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Joonsoo Kim

On 3/12/20 2:14 PM, Srikar Dronamraju wrote:
> * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]:
> 
>> On 3/12/20 9:23 AM, Sachin Sant wrote:
>> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
>> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:
>> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
>> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
>> >>>> to make sure all possible but not present cpu_to_node are set to a
>> >>>> proper node.
>> >>> 
>> >>> Just curious, is this somehow related to
>> >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?
>> >>> 
>> >> 
>> >> The issue I am trying to fix is a known issue in Powerpc since many years.
>> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
>> >> shrinker_map on appropriate NUMA node"). 
>> >> 
>> >> I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the
>> >> kernel. Will work with Sachin/Abdul (reporters of the issue).
> 
> I had used v1 and not v2. So my mistake.
> 
>> > I applied this 3 patch series on top of March 11 next tree (commit d44a64766795 )
>> > The kernel still fails to boot with same call trace.
>> 
> 
> While I am not an expert in the slub area, I looked at the patch
> a75056fc1e7c and had some thoughts on why this could be causing this issue.
> 
> On the system where the crash happens, the possible number of nodes is much
> greater than the number of onlined nodes. The pdgat or the NODE_DATA is only
> available for onlined nodes.
> 
> With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node
> for all possible nodes and in ___slab_alloc we end up looking at the
> node_present_pages which is NODE_DATA(nid)->node_present_pages.
> i.e for a node whose pdgat struct is not allocated, we are trying to
> dereference.

From what we saw, the pgdat does exist; the problem is that slab's per-node
data doesn't exist for a node that doesn't have present pages, as it would be
a waste of memory.

Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In
Sachin's first report [1] we have

[    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
[    0.000000] numa:     NODE_DATA(0) on node 1
[    0.000000] numa:   NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff]

But in this thread, with your patches Sachin reports:

[    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]

So I assume it's just node 1. In that case, node_present_pages is really dangerous.

[1]
https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/

> Also for a memoryless/cpuless node or possible but not present nodes,
> node_to_mem_node(node) will still end up as node (atleast on powerpc).

I think that's the place where this would be best to fix.

> I tried with this hunk below and it works.
> 
> But I am not sure if we need to check at other places were
> node_present_pages is being called.

I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it
return only nodes that are online with present memory?
CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add
support for node_to_mem_node() to determine the fallback node")

I think we do need well defined and documented rules around node_to_mem_node(),
cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so
everyone handles it the same, safe way.

> diff --git a/mm/slub.c b/mm/slub.c
> index 626cbcbd977f..bddb93bed55e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2571,9 +2571,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	if (unlikely(!node_match(page, node))) {
>  		int searchnode = node;
>  
> -		if (node != NUMA_NO_NODE && !node_present_pages(node))
> -			searchnode = node_to_mem_node(node);
> -
> +		if (node != NUMA_NO_NODE) {
> +			if (!node_online(node) || !node_present_pages(node)) {
> +				searchnode = node_to_mem_node(node);
> +				if (!node_online(searchnode))
> +					searchnode = first_online_node;
> +			}
> +		}
>  		if (unlikely(!node_match(page, searchnode))) {
>  			stat(s, ALLOC_NODE_MISMATCH);
>  			deactivate_slab(s, page, c->freelist, c);
> 
>> > 
>> 
> 


* Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-03-12 13:51             ` Vlastimil Babka
@ 2020-03-12 16:13               ` Srikar Dronamraju
  2020-03-12 16:41                 ` Vlastimil Babka
  0 siblings, 1 reply; 24+ messages in thread
From: Srikar Dronamraju @ 2020-03-12 16:13 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Sachin Sant, Michal Hocko, Linus Torvalds, LKML, linux-mm,
	Mel Gorman, Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Joonsoo Kim, Kirill Tkhai

* Vlastimil Babka <vbabka@suse.cz> [2020-03-12 14:51:38]:

> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]:
> > 
> >> On 3/12/20 9:23 AM, Sachin Sant wrote:
> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> >> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:
> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> >> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
> >> >>>> to make sure all possible but not present cpu_to_node are set to a
> >> >>>> proper node.
> >> >>> 
> >> >>> Just curious, is this somehow related to
> >> >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?
> >> >>> 
> >> >> 
> >> >> The issue I am trying to fix is a known issue in Powerpc since many years.
> >> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
> >> >> shrinker_map on appropriate NUMA node"). 
> >> >> 
> > 
> > While I am not an expert in the slub area, I looked at the patch
> > a75056fc1e7c and had some thoughts on why this could be causing this issue.
> > 
> > On the system where the crash happens, the possible number of nodes is much
> > greater than the number of onlined nodes. The pdgat or the NODE_DATA is only
> > available for onlined nodes.
> > 
> > With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node
> > for all possible nodes and in ___slab_alloc we end up looking at the
> > node_present_pages which is NODE_DATA(nid)->node_present_pages.
> > i.e for a node whose pdgat struct is not allocated, we are trying to
> > dereference.
> 
> From what we saw, the pgdat does exist, the problem is that slab's per-node data
> doesn't exist for a node that doesn't have present pages, as it would be a waste
> of memory.

Just to be clear, before my 3 patches to fix the dummy node:
srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
0-31
srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
0-1

> 
> Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In
> Sachin's first report [1] we have
> 
> [    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
> [    0.000000] numa:     NODE_DATA(0) on node 1
> [    0.000000] numa:   NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff]
> 

So even though a pgdat exists for nodes 0 and 1, there is no pgdat for the
remaining 30 nodes.
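
(On powerpc, NODE_DATA(nid) is just node_data[nid], so for such a node the
macro below dereferences a NULL pointer. Shown only to illustrate the crash;
the definitions are paraphrased from memory, not exact quotes of the headers.)

/* include/linux/mmzone.h */
#define node_present_pages(nid)        (NODE_DATA(nid)->node_present_pages)

/* arch/powerpc/include/asm/mmzone.h */
#define NODE_DATA(nid)                 (node_data[nid])   /* NULL if no pgdat was ever allocated */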

> But in this thread, with your patches Sachin reports:

and with my patches
srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
0-31
srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
1

> 
> [    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
> 

so we only see one pgdat.

> So I assume it's just node 1. In that case, node_present_pages is really dangerous.
> 
> [1]
> https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
> 
> > Also for a memoryless/cpuless node or possible but not present nodes,
> > node_to_mem_node(node) will still end up as node (atleast on powerpc).
> 
> I think that's the place where this would be best to fix.
> 

Maybe. I thought about it, but the current set_numa_mem() semantics suit a
memoryless node that has CPUs, not arbitrary possible nodes. We could have up
to 256 possible nodes with only 2 nodes (1,2) having CPUs and only 1 node (1)
having memory. node_to_mem_node() returns whatever was set via set_numa_mem(),
and set_numa_mem() essentially says: set the numa_mem node of the current
memoryless node to the parameter passed.

But how do we set numa_mem for all the other 253 possible nodes, which will
probably still have the default of 0?

Should we introduce another API so that we can update numa_mem for all
possible nodes?
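
Purely as an illustration of what I mean (this helper does not exist today;
the name and the use of first_memory_node here are only a sketch):

/*
 * Hypothetical helper, not an existing API: give every possible node a sane
 * numa_mem fallback, not just the nodes that currently have CPUs.
 */
static void __init set_numa_mem_for_possible_nodes(void)
{
        int nid;

        for_each_node(nid) {            /* all possible nodes */
                if (!node_state(nid, N_MEMORY))
                        _node_numa_mem_[nid] = first_memory_node;
        }
}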

> > I tried with this hunk below and it works.
> > 
> > But I am not sure if we need to check at other places were
> > node_present_pages is being called.
> 
> I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it
> return only nodes that are online with present memory?
> CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add
> support for node_to_mem_node() to determine the fallback node")
> 

Agree 

> I think we do need well defined and documented rules around node_to_mem_node(),
> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so
> everyone handles it the same, safe way.
> 

Another option would be to tweak Kirill Tkhai's patch such that we call
kvmalloc_node()/kzalloc_node() when the node is online and fall back to plain
kvmalloc()/kvzalloc() when the node is offline.
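
Roughly like this in memcg_alloc_shrinker_maps() -- an untested sketch, and
the surrounding code is written from memory rather than copied from the tree:

        for_each_node(nid) {
                /*
                 * Only ask for node-local memory when the node is online;
                 * otherwise let the allocator pick any node, so we never
                 * touch a pgdat that was never allocated.
                 */
                if (node_online(nid))
                        map = kvzalloc_node(sizeof(*map) + size, GFP_KERNEL, nid);
                else
                        map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
                ...
        }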

> > diff --git a/mm/slub.c b/mm/slub.c
> > index 626cbcbd977f..bddb93bed55e 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2571,9 +2571,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >  	if (unlikely(!node_match(page, node))) {
> >  		int searchnode = node;
> >  
> > -		if (node != NUMA_NO_NODE && !node_present_pages(node))
> > -			searchnode = node_to_mem_node(node);
> > -
> > +		if (node != NUMA_NO_NODE) {
> > +			if (!node_online(node) || !node_present_pages(node)) {
> > +				searchnode = node_to_mem_node(node);
> > +				if (!node_online(searchnode))
> > +					searchnode = first_online_node;
> > +			}
> > +		}
> >  		if (unlikely(!node_match(page, searchnode))) {
> >  			stat(s, ALLOC_NODE_MISMATCH);
> >  			deactivate_slab(s, page, c->freelist, c);

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-03-12 16:13               ` Srikar Dronamraju
@ 2020-03-12 16:41                 ` Vlastimil Babka
  2020-03-13  9:47                   ` Joonsoo Kim
                                     ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Vlastimil Babka @ 2020-03-12 16:41 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Sachin Sant, Michal Hocko, Linus Torvalds, LKML, linux-mm,
	Mel Gorman, Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Joonsoo Kim, Kirill Tkhai

On 3/12/20 5:13 PM, Srikar Dronamraju wrote:
> * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 14:51:38]:
> 
>> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]:
>> > 
>> >> On 3/12/20 9:23 AM, Sachin Sant wrote:
>> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
>> >> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:
>> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
>> >> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
>> >> >>>> to make sure all possible but not present cpu_to_node are set to a
>> >> >>>> proper node.
>> >> >>> 
>> >> >>> Just curious, is this somehow related to
>> >> >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?
>> >> >>> 
>> >> >> 
>> >> >> The issue I am trying to fix is a known issue in Powerpc since many years.
>> >> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
>> >> >> shrinker_map on appropriate NUMA node"). 
>> >> >> 
>> > 
>> > While I am not an expert in the slub area, I looked at the patch
>> > a75056fc1e7c and had some thoughts on why this could be causing this issue.
>> > 
>> > On the system where the crash happens, the possible number of nodes is much
>> > greater than the number of onlined nodes. The pdgat or the NODE_DATA is only
>> > available for onlined nodes.
>> > 
>> > With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node
>> > for all possible nodes and in ___slab_alloc we end up looking at the
>> > node_present_pages which is NODE_DATA(nid)->node_present_pages.
>> > i.e for a node whose pdgat struct is not allocated, we are trying to
>> > dereference.
>> 
>> From what we saw, the pgdat does exist, the problem is that slab's per-node data
>> doesn't exist for a node that doesn't have present pages, as it would be a waste
>> of memory.
> 
> Just to be clear
> Before my 3 patches to fix dummy node:
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
> 0-31
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
> 0-1

OK

>> 
>> Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In
>> Sachin's first report [1] we have
>> 
>> [    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
>> [    0.000000] numa:     NODE_DATA(0) on node 1
>> [    0.000000] numa:   NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff]
>> 
> 
> So even if pgdat would exist for nodes 0 and 1, there is no pgdat for the
> rest 30 nodes.

I see. Perhaps node_present_pages(node) is not safe in SLUB then and it should
check online first, as you suggested.

>> But in this thread, with your patches Sachin reports:
> 
> and with my patches
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
> 0-31
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
> 1
> 
>> 
>> [    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
>> 
> 
> so we only see one pgdat.
> 
>> So I assume it's just node 1. In that case, node_present_pages is really dangerous.
>> 
>> [1]
>> https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
>> 
>> > Also for a memoryless/cpuless node or possible but not present nodes,
>> > node_to_mem_node(node) will still end up as node (atleast on powerpc).
>> 
>> I think that's the place where this would be best to fix.
>> 
> 
> Maybe. I thought about it but the current set_numa_mem semantics are apt
> for memoryless cpu node and not for possible nodes.  We could have upto 256
> possible nodes and only 2 nodes (1,2) with cpu and 1 node (1) with memory.
> node_to_mem_node seems to return what is set in set_numa_mem().
> set_numa_mem() seems to say set my numa_mem node for the current memoryless
> node to the param passed.
> 
> But how do we set numa_mem for all the other 253 possible nodes, which
> probably will have 0 as default?
> 
> Should we introduce another API such that we could update for all possible
> nodes?

If we want to rely on node_to_mem_node() to give us something safe for each
possible node, then probably it would have to be like that, yeah.

>> > I tried with this hunk below and it works.
>> > 
>> > But I am not sure if we need to check at other places were
>> > node_present_pages is being called.
>> 
>> I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it
>> return only nodes that are online with present memory?
>> CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add
>> support for node_to_mem_node() to determine the fallback node")
>> 
> 
> Agree 
> 
>> I think we do need well defined and documented rules around node_to_mem_node(),
>> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so
>> everyone handles it the same, safe way.

So let's try to brainstorm what this would look like. What I mean are some
rules like the ones below, even if some details in my current understanding
are most likely incorrect:

with nid present in:
N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online
node with memory so that we don't require everyone to search for it in slightly
different ways
N_ONLINE - pgdat must exist, there doesn't have to be present memory,
node_to_mem_node() still has to return something else (?)
N_NORMAL_MEMORY - there is present memory, node_to_mem_node() returns itself
N_HIGH_MEMORY - node has present high memory

> 
> Other option would be to tweak Kirill Tkhai's patch such that we call
> kvmalloc_node()/kzalloc_node() if node is online and call kvmalloc/kvzalloc
> if the node is offline.

I really would like a solution that hides these ugly details from callers so
they don't have to work around the APIs we provide. kvmalloc_node() really
shouldn't crash, and it should fall back automatically if we don't give it
__GFP_THISNODE.

However, taking a step back, memcg_alloc_shrinker_maps() is probably rather
wasteful on systems with 256 possible nodes and only a few present, by
allocating effectively dead structures for each memcg.

SLUB tries to be smart, so it allocates the per-node per-cache structures only
when the node goes online in slab_mem_going_online_callback(). This is why
there's a crash when such non-existing structures are accessed for a node that's
not online, and why they shouldn't be accessed.

Perhaps memcg should do the same on-demand allocation, if possible.
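
For illustration only, the shape of that would mirror SLUB's hotplug
callback; memcg_hotplug_callback() and memcg_alloc_node_structs() below are
made-up names, not existing functions:

static int memcg_hotplug_callback(struct notifier_block *nb,
                                  unsigned long action, void *arg)
{
        struct memory_notify *mn = arg;

        /*
         * Allocate the per-node memcg structures only when a node is about
         * to come online for the first time.
         */
        if (action == MEM_GOING_ONLINE && mn->status_change_nid >= 0)
                if (memcg_alloc_node_structs(mn->status_change_nid))
                        return NOTIFY_BAD;

        return NOTIFY_OK;
}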

>> > diff --git a/mm/slub.c b/mm/slub.c
>> > index 626cbcbd977f..bddb93bed55e 100644
>> > --- a/mm/slub.c
>> > +++ b/mm/slub.c
>> > @@ -2571,9 +2571,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>> >  	if (unlikely(!node_match(page, node))) {
>> >  		int searchnode = node;
>> >  
>> > -		if (node != NUMA_NO_NODE && !node_present_pages(node))
>> > -			searchnode = node_to_mem_node(node);
>> > -
>> > +		if (node != NUMA_NO_NODE) {
>> > +			if (!node_online(node) || !node_present_pages(node)) {
>> > +				searchnode = node_to_mem_node(node);
>> > +				if (!node_online(searchnode))
>> > +					searchnode = first_online_node;
>> > +			}
>> > +		}
>> >  		if (unlikely(!node_match(page, searchnode))) {
>> >  			stat(s, ALLOC_NODE_MISMATCH);
>> >  			deactivate_slab(s, page, c->freelist, c);
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-03-12 16:41                 ` Vlastimil Babka
@ 2020-03-13  9:47                   ` Joonsoo Kim
  2020-03-13 11:04                     ` Srikar Dronamraju
  2020-03-13 11:22                   ` Srikar Dronamraju
  2020-03-16  9:06                   ` Michal Hocko
  2 siblings, 1 reply; 24+ messages in thread
From: Joonsoo Kim @ 2020-03-13  9:47 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Srikar Dronamraju, Sachin Sant, Michal Hocko, Linus Torvalds,
	LKML, Linux Memory Management List, Mel Gorman,
	Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Joonsoo Kim, Kirill Tkhai

On Fri, Mar 13, 2020 at 1:42 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 3/12/20 5:13 PM, Srikar Dronamraju wrote:
> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 14:51:38]:
> >
> >> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]:
> >> >
> >> >> On 3/12/20 9:23 AM, Sachin Sant wrote:
> >> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> >> >> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:
> >> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> >> >> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
> >> >> >>>> to make sure all possible but not present cpu_to_node are set to a
> >> >> >>>> proper node.
> >> >> >>>
> >> >> >>> Just curious, is this somehow related to
> >> >> >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?
> >> >> >>>
> >> >> >>
> >> >> >> The issue I am trying to fix is a known issue in Powerpc since many years.
> >> >> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
> >> >> >> shrinker_map on appropriate NUMA node").
> >> >> >>
> >> >
> >> > While I am not an expert in the slub area, I looked at the patch
> >> > a75056fc1e7c and had some thoughts on why this could be causing this issue.
> >> >
> >> > On the system where the crash happens, the possible number of nodes is much
> >> > greater than the number of onlined nodes. The pdgat or the NODE_DATA is only
> >> > available for onlined nodes.
> >> >
> >> > With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node
> >> > for all possible nodes and in ___slab_alloc we end up looking at the
> >> > node_present_pages which is NODE_DATA(nid)->node_present_pages.
> >> > i.e for a node whose pdgat struct is not allocated, we are trying to
> >> > dereference.
> >>
> >> From what we saw, the pgdat does exist, the problem is that slab's per-node data
> >> doesn't exist for a node that doesn't have present pages, as it would be a waste
> >> of memory.
> >
> > Just to be clear
> > Before my 3 patches to fix dummy node:
> > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
> > 0-31
> > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
> > 0-1
>
> OK
>
> >>
> >> Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In
> >> Sachin's first report [1] we have
> >>
> >> [    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
> >> [    0.000000] numa:     NODE_DATA(0) on node 1
> >> [    0.000000] numa:   NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff]
> >>
> >
> > So even if pgdat would exist for nodes 0 and 1, there is no pgdat for the
> > rest 30 nodes.
>
> I see. Perhaps node_present_pages(node) is not safe in SLUB then and it should
> check online first, as you suggested.
>
> >> But in this thread, with your patches Sachin reports:
> >
> > and with my patches
> > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
> > 0-31
> > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
> > 1
> >
> >>
> >> [    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
> >>
> >
> > so we only see one pgdat.
> >
> >> So I assume it's just node 1. In that case, node_present_pages is really dangerous.
> >>
> >> [1]
> >> https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
> >>
> >> > Also for a memoryless/cpuless node or possible but not present nodes,
> >> > node_to_mem_node(node) will still end up as node (atleast on powerpc).
> >>
> >> I think that's the place where this would be best to fix.
> >>
> >
> > Maybe. I thought about it but the current set_numa_mem semantics are apt
> > for memoryless cpu node and not for possible nodes.  We could have upto 256
> > possible nodes and only 2 nodes (1,2) with cpu and 1 node (1) with memory.
> > node_to_mem_node seems to return what is set in set_numa_mem().
> > set_numa_mem() seems to say set my numa_mem node for the current memoryless
> > node to the param passed.
> >
> > But how do we set numa_mem for all the other 253 possible nodes, which
> > probably will have 0 as default?
> >
> > Should we introduce another API such that we could update for all possible
> > nodes?
>
> If we want to rely on node_to_mem_node() to give us something safe for each
> possible node, then probably it would have to be like that, yeah.
>
> >> > I tried with this hunk below and it works.
> >> >
> >> > But I am not sure if we need to check at other places were
> >> > node_present_pages is being called.
> >>
> >> I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it
> >> return only nodes that are online with present memory?
> >> CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add
> >> support for node_to_mem_node() to determine the fallback node")
> >>
> >
> > Agree

I've lost all memory of it. :)
Anyway, how about this?

1. make node_present_pages() safer

static inline unsigned long node_present_pages(int nid)
{
	if (!node_online(nid))
		return 0;
	return NODE_DATA(nid)->node_present_pages;
}

2. make node_to_mem_node() safer for all cases

In ppc arch's mem_topology_setup(void):

for_each_present_cpu(cpu) {
	numa_setup_cpu(cpu);
	mem_node = node_to_mem_node(numa_mem_id());
	if (!node_present_pages(mem_node))
		_node_numa_mem_[numa_mem_id()] = first_online_node;
}

With these two changes, we can use node_present_pages() and node_to_mem_node()
as intended.
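
For example, the check in ___slab_alloc() quoted earlier in this thread could
then stay in its current simple form, because node_present_pages() would no
longer dereference a missing pgdat:

	if (node != NUMA_NO_NODE && !node_present_pages(node))
		searchnode = node_to_mem_node(node);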

Thanks.

> >> I think we do need well defined and documented rules around node_to_mem_node(),
> >> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so
> >> everyone handles it the same, safe way.
>
> So let's try to brainstorm how this would look like? What I mean are some rules
> like below, even if some details in my current understanding are most likely
> incorrect:
>
> with nid present in:
> N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online
> node with memory so that we don't require everyone to search for it in slightly
> different ways
> N_ONLINE - pgdat must exist, there doesn't have to be present memory,
> node_to_mem_node() still has to return something else (?)
> N_NORMAL_MEMORY - there is present memory, node_to_mem_node() returns itself
> N_HIGH_MEMORY - node has present high memory

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-03-13  9:47                   ` Joonsoo Kim
@ 2020-03-13 11:04                     ` Srikar Dronamraju
  2020-03-13 11:38                       ` Vlastimil Babka
  0 siblings, 1 reply; 24+ messages in thread
From: Srikar Dronamraju @ 2020-03-13 11:04 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Vlastimil Babka, Sachin Sant, Michal Hocko, Linus Torvalds, LKML,
	Linux Memory Management List, Mel Gorman, Kirill A. Shutemov,
	Andrew Morton, linuxppc-dev, Christopher Lameter, Joonsoo Kim,
	Kirill Tkhai

* Joonsoo Kim <js1304@gmail.com> [2020-03-13 18:47:49]:

> > >>
> > >> > Also for a memoryless/cpuless node or possible but not present nodes,
> > >> > node_to_mem_node(node) will still end up as node (atleast on powerpc).
> > >>
> > >> I think that's the place where this would be best to fix.
> > >>
> > >
> > > Maybe. I thought about it but the current set_numa_mem semantics are apt
> > > for memoryless cpu node and not for possible nodes.  We could have upto 256
> > > possible nodes and only 2 nodes (1,2) with cpu and 1 node (1) with memory.
> > > node_to_mem_node seems to return what is set in set_numa_mem().
> > > set_numa_mem() seems to say set my numa_mem node for the current memoryless
> > > node to the param passed.
> > >
> > > But how do we set numa_mem for all the other 253 possible nodes, which
> > > probably will have 0 as default?
> > >
> > > Should we introduce another API such that we could update for all possible
> > > nodes?
> >
> > If we want to rely on node_to_mem_node() to give us something safe for each
> > possible node, then probably it would have to be like that, yeah.
> >
> > >> > I tried with this hunk below and it works.
> > >> >
> > >> > But I am not sure if we need to check at other places were
> > >> > node_present_pages is being called.
> > >>
> > >> I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it
> > >> return only nodes that are online with present memory?
> > >> CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add
> > >> support for node_to_mem_node() to determine the fallback node")
> > >>
> > >
> > > Agree
> 
> I lost all the memory about it. :)
> Anyway, how about this?
> 
> 1. make node_present_pages() safer
> static inline node_present_pages(nid)
> {
> if (!node_online(nid)) return 0;
> return (NODE_DATA(nid)->node_present_pages);
> }
> 

Yes this would help.

> 2. make node_to_mem_node() safer for all cases
> In ppc arch's mem_topology_setup(void)
> for_each_present_cpu(cpu) {
>  numa_setup_cpu(cpu);
>  mem_node = node_to_mem_node(numa_mem_id());
>  if (!node_present_pages(mem_node)) {
>   _node_numa_mem_[numa_mem_id()] = first_online_node;
>  }
> }
> 

But here, as discussed above, we miss the case of possible but not present
nodes. For such nodes, the above change may not update anything, so they would
still have 0. And node 0 itself can be possible but not present.

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-03-12 16:41                 ` Vlastimil Babka
  2020-03-13  9:47                   ` Joonsoo Kim
@ 2020-03-13 11:22                   ` Srikar Dronamraju
  2020-03-16  9:06                   ` Michal Hocko
  2 siblings, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2020-03-13 11:22 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Sachin Sant, Michal Hocko, Linus Torvalds, LKML, linux-mm,
	Mel Gorman, Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Joonsoo Kim, Kirill Tkhai

* Vlastimil Babka <vbabka@suse.cz> [2020-03-12 17:41:58]:

> On 3/12/20 5:13 PM, Srikar Dronamraju wrote:
> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 14:51:38]:
> > 
> >> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]:
> >> > 
> >> >> On 3/12/20 9:23 AM, Sachin Sant wrote:
> >> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> >> >> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:
> >> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> >> I think we do need well defined and documented rules around node_to_mem_node(),
> >> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so
> >> everyone handles it the same, safe way.
> 
> So let's try to brainstorm how this would look like? What I mean are some rules
> like below, even if some details in my current understanding are most likely
> incorrect:
> 

Agree.

> with nid present in:
> N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online
> node with memory so that we don't require everyone to search for it in slightly
> different ways
> N_ONLINE - pgdat must exist, there doesn't have to be present memory,
> node_to_mem_node() still has to return something else (?)

Right, I think this has been taken care of already.

> N_NORMAL_MEMORY - there is present memory, node_to_mem_node() returns itself
> N_HIGH_MEMORY - node has present high memory
> 

I don't see any problems with the above two either. That leaves us with N_POSSIBLE.

> > 
> > Other option would be to tweak Kirill Tkhai's patch such that we call
> > kvmalloc_node()/kzalloc_node() if node is online and call kvmalloc/kvzalloc
> > if the node is offline.
> 
> I really would like a solution that hides these ugly details from callers so
> they don't have to workaround the APIs we provide. kvmalloc_node() really
> shouldn't crash, and it should fallback automatically if we don't give it
> __GFP_THISNODE
> 

Agreed that it's better to make APIs robust where possible.

> However, taking a step back, memcg_alloc_shrinker_maps() is probably rather
> wasteful on systems with 256 possible nodes and only few present, by allocating
> effectively dead structures for each memcg.
> 

If we don't allocate now, we would have to allocate them when we online the
nodes. To me it looks better to allocate as soon as the nodes are onlined.

> SLUB tries to be smart, so it allocates the per-node per-cache structures only
> when the node goes online in slab_mem_going_online_callback(). This is why
> there's a crash when such non-existing structures are accessed for a node that's
> not online, and why they shouldn't be accessed.
> 
> Perhaps memcg should do the same on-demand allocation, if possible.
> 

Right.

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-03-13 11:04                     ` Srikar Dronamraju
@ 2020-03-13 11:38                       ` Vlastimil Babka
  2020-03-16  8:15                         ` Joonsoo Kim
  0 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2020-03-13 11:38 UTC (permalink / raw)
  To: Srikar Dronamraju, Joonsoo Kim
  Cc: Sachin Sant, Michal Hocko, Linus Torvalds, LKML,
	Linux Memory Management List, Mel Gorman, Kirill A. Shutemov,
	Andrew Morton, linuxppc-dev, Christopher Lameter, Joonsoo Kim,
	Kirill Tkhai, Michael Ellerman

On 3/13/20 12:04 PM, Srikar Dronamraju wrote:
>> I lost all the memory about it. :)
>> Anyway, how about this?
>> 
>> 1. make node_present_pages() safer
>> static inline node_present_pages(nid)
>> {
>> if (!node_online(nid)) return 0;
>> return (NODE_DATA(nid)->node_present_pages);
>> }
>> 
> 
> Yes this would help.

Looks good, yeah.

>> 2. make node_to_mem_node() safer for all cases
>> In ppc arch's mem_topology_setup(void)
>> for_each_present_cpu(cpu) {
>>  numa_setup_cpu(cpu);
>>  mem_node = node_to_mem_node(numa_mem_id());
>>  if (!node_present_pages(mem_node)) {
>>   _node_numa_mem_[numa_mem_id()] = first_online_node;
>>  }
>> }
>> 
> 
> But here as discussed above, we miss the case of possible but not present nodes.
> For such nodes, the above change may not update, resulting in they still
> having 0. And node 0 can be only possible but not present.

So is there another way to do the setup so that node_to_mem_node() returns an
online+present node when called for any possible node?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-03-11 11:02 ` [PATCH 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline Srikar Dronamraju
@ 2020-03-15 14:20   ` Christopher Lameter
  2020-03-16  8:54     ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Christopher Lameter @ 2020-03-15 14:20 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, Michael Ellerman, linuxppc-dev, linux-mm,
	linux-kernel, Michal Hocko, Mel Gorman, Vlastimil Babka,
	Kirill A. Shutemov, Linus Torvalds

On Wed, 11 Mar 2020, Srikar Dronamraju wrote:

> Currently Linux kernel with CONFIG_NUMA on a system with multiple
> possible nodes, marks node 0 as online at boot.  However in practice,
> there are systems which have node 0 as memoryless and cpuless.

Would it not be better and simpler to require that node 0 always has
memory (and processors)? A minimum operational set?

We can dynamically number the nodes, right? So just make sure that the
firmware properly creates memory on node 0?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-03-13 11:38                       ` Vlastimil Babka
@ 2020-03-16  8:15                         ` Joonsoo Kim
  0 siblings, 0 replies; 24+ messages in thread
From: Joonsoo Kim @ 2020-03-16  8:15 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Srikar Dronamraju, Sachin Sant, Michal Hocko, Linus Torvalds,
	LKML, Linux Memory Management List, Mel Gorman,
	Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Joonsoo Kim, Kirill Tkhai, Michael Ellerman

On Fri, Mar 13, 2020 at 8:38 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 3/13/20 12:04 PM, Srikar Dronamraju wrote:
> >> I lost all the memory about it. :)
> >> Anyway, how about this?
> >>
> >> 1. make node_present_pages() safer
> >> static inline node_present_pages(nid)
> >> {
> >> if (!node_online(nid)) return 0;
> >> return (NODE_DATA(nid)->node_present_pages);
> >> }
> >>
> >
> > Yes this would help.
>
> Looks good, yeah.
>
> >> 2. make node_to_mem_node() safer for all cases
> >> In ppc arch's mem_topology_setup(void)
> >> for_each_present_cpu(cpu) {
> >>  numa_setup_cpu(cpu);
> >>  mem_node = node_to_mem_node(numa_mem_id());
> >>  if (!node_present_pages(mem_node)) {
> >>   _node_numa_mem_[numa_mem_id()] = first_online_node;
> >>  }
> >> }
> >>
> >
> > But here as discussed above, we miss the case of possible but not present nodes.
> > For such nodes, the above change may not update, resulting in they still
> > having 0. And node 0 can be only possible but not present.

Oops, I didn't read the full thread, so I missed this case.

> So is there other way to do the setup so that node_to_mem_node() returns an
> online+present node when called for any possible node?

Two changes seem to be sufficient.

1. initialize every node's _node_numa_mem_[] to first_online_node in
mem_topology_setup()
2. in set_cpu_numa_mem(), replace the node with an online+present node for
_node_numa_mem_[]:

 static inline void set_cpu_numa_mem(int cpu, int node)
 {
        per_cpu(_numa_mem_, cpu) = node;
+       if (!node_present_pages(node))
+               node = first_online_node;
        _node_numa_mem_[cpu_to_node(cpu)] = node;
 }
 #endif

With these two changes, we can safely call node_to_mem_node() anywhere.

Thanks.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-03-15 14:20   ` Christopher Lameter
@ 2020-03-16  8:54     ` Michal Hocko
  2020-03-18  7:50       ` Srikar Dronamraju
  2020-03-18 18:57       ` Christopher Lameter
  0 siblings, 2 replies; 24+ messages in thread
From: Michal Hocko @ 2020-03-16  8:54 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Srikar Dronamraju, Andrew Morton, Michael Ellerman, linuxppc-dev,
	linux-mm, linux-kernel, Mel Gorman, Vlastimil Babka,
	Kirill A. Shutemov, Linus Torvalds

On Sun 15-03-20 14:20:05, Christopher Lameter wrote:
> On Wed, 11 Mar 2020, Srikar Dronamraju wrote:
> 
> > Currently Linux kernel with CONFIG_NUMA on a system with multiple
> > possible nodes, marks node 0 as online at boot.  However in practice,
> > there are systems which have node 0 as memoryless and cpuless.
> 
> Would it not be better and simpler to require that node 0 always has
> memory (and processors)? A  mininum operational set?

I do not think you can simply ignore reality. I cannot say that I am
a fan of memoryless/cpuless NUMA configurations, but they are a sad
reality of different LPAR configurations. We have to deal with them.
Besides that, I do not really see any strong technical argument for not
supporting those crippled configurations. We do have zonelists that
allow reasonable decisions to be made on memoryless nodes. So no, I do
not think that this is a viable approach.

> We can dynamically number the nodes right? So just make sure that the
> firmware properly creates memory on node 0?

Are you suggesting that the OS would renumber NUMA nodes coming
from FW just to guarantee that node 0 exists? If so, I believe this is
really a bad idea, because it would make matching the HW/LPAR
configuration to the resulting memory layout really hard to follow.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-03-12 16:41                 ` Vlastimil Babka
  2020-03-13  9:47                   ` Joonsoo Kim
  2020-03-13 11:22                   ` Srikar Dronamraju
@ 2020-03-16  9:06                   ` Michal Hocko
  2020-03-17 13:44                     ` Vlastimil Babka
  2 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2020-03-16  9:06 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Srikar Dronamraju, Sachin Sant, Linus Torvalds, LKML, linux-mm,
	Mel Gorman, Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Joonsoo Kim, Kirill Tkhai

On Thu 12-03-20 17:41:58, Vlastimil Babka wrote:
[...]
> with nid present in:
> N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online

I would rather have a dummy pgdat for those. Have a look at 
$ git grep "NODE_DATA.*->" | wc -l
63

Who knows how many more we have there. I haven't looked more closely.
Besides that, what is the real reason not to have a pgdat there, and to
force all users of a $random node (from those the platform considers
possible) into special casing? Is it memory overhead? Is that really a thing?

Somebody has suggested tweaking some of the low level routines to do the
special casing, but I really have to say I do not like that. We shouldn't
use the first online node or anything like that. We should simply always
follow the topology presented by FW, and for that we need to have a pgdat.
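
To make that concrete, a very rough sketch of the boot-time direction;
allocate_offline_pgdat() is a made-up name for whatever arch/core hook would
do the allocation, not a real function:

	/*
	 * Sketch: make NODE_DATA(nid) valid for every possible node, even
	 * those with no memory and no CPUs, so nobody has to special case.
	 */
	for_each_node(nid) {
		if (node_online(nid))
			continue;
		allocate_offline_pgdat(nid);	/* hypothetical: empty pgdat */
	}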
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-03-16  9:06                   ` Michal Hocko
@ 2020-03-17 13:44                     ` Vlastimil Babka
  2020-03-17 14:01                       ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2020-03-17 13:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Srikar Dronamraju, Sachin Sant, Linus Torvalds, LKML, linux-mm,
	Mel Gorman, Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Joonsoo Kim, Kirill Tkhai

On 3/16/20 10:06 AM, Michal Hocko wrote:
> On Thu 12-03-20 17:41:58, Vlastimil Babka wrote:
> [...]
>> with nid present in:
>> N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online
> 
> I would rather have a dummy pgdat for those. Have a look at 
> $ git grep "NODE_DATA.*->" | wc -l
> 63
> 
> Who knows how many else we have there. I haven't looked more closely.
> Besides that what is a real reason to not have pgdat ther and force all
> users of a $random node from those that the platform considers possible
> for special casing? Is that a memory overhead? Is that really a thing?

I guess we can ignore the memory overhead. There might only be some concern
that for nodes that are initially offline, we will allocate the pgdat on a
different node, and after they come online, it will stay on that other node
with more access latency from the local CPUs. If we only allocate for online
nodes, the pgdat can always be local? But I guess it doesn't matter that much.

> Somebody has suggested to tweak some of the low level routines to do the
> special casing but I really have to say I do not like that. We shouldn't
> use the first online node or anything like that. We should simply always
> follow the topology presented by FW and of that we need to have a pgdat.
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-03-17 13:44                     ` Vlastimil Babka
@ 2020-03-17 14:01                       ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2020-03-17 14:01 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Srikar Dronamraju, Sachin Sant, Linus Torvalds, LKML, linux-mm,
	Mel Gorman, Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Joonsoo Kim, Kirill Tkhai

On Tue 17-03-20 14:44:45, Vlastimil Babka wrote:
> On 3/16/20 10:06 AM, Michal Hocko wrote:
> > On Thu 12-03-20 17:41:58, Vlastimil Babka wrote:
> > [...]
> >> with nid present in:
> >> N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online
> > 
> > I would rather have a dummy pgdat for those. Have a look at 
> > $ git grep "NODE_DATA.*->" | wc -l
> > 63
> > 
> > Who knows how many else we have there. I haven't looked more closely.
> > Besides that what is a real reason to not have pgdat ther and force all
> > users of a $random node from those that the platform considers possible
> > for special casing? Is that a memory overhead? Is that really a thing?
> 
> I guess we can ignore memory overhead. I guess there only might be some concern
> that for nodes that are initially offline, we will allocate the pgdat on a
> different node, and after they are online, it will stay on a different node with
> more access latency from local cpus. If we only allocate for online nodes, it
> can always be local? But I guess it doesn't matter that much.

This is not the case even now because of a chicken-and-egg problem. You need
memory to allocate from, and that memory has to be managed somewhere per node
(pgdat). Keep in mind we do not have the bootmem allocator for hotplug. Have
a look at hotadd_new_pgdat() and when it is called. There are some attempts
to allocate the memmap from the hotplugged memory, but I am not sure we can
do the whole thing without a pgdat in place. If we can, then we can come up
with something to replace the pgdat magic. But still, I am not even sure this
is something we really have to optimize for.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-03-16  8:54     ` Michal Hocko
@ 2020-03-18  7:50       ` Srikar Dronamraju
  2020-03-18 18:57       ` Christopher Lameter
  1 sibling, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2020-03-18  7:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christopher Lameter, Andrew Morton, Michael Ellerman,
	linuxppc-dev, linux-mm, linux-kernel, Mel Gorman,
	Vlastimil Babka, Kirill A. Shutemov, Linus Torvalds

* Michal Hocko <mhocko@kernel.org> [2020-03-16 09:54:25]:

> On Sun 15-03-20 14:20:05, Cristopher Lameter wrote:
> > On Wed, 11 Mar 2020, Srikar Dronamraju wrote:
> > 
> > > Currently Linux kernel with CONFIG_NUMA on a system with multiple
> > > possible nodes, marks node 0 as online at boot.  However in practice,
> > > there are systems which have node 0 as memoryless and cpuless.
> > 
> > Would it not be better and simpler to require that node 0 always has
> > memory (and processors)? A  mininum operational set?
> 
> I do not think you can simply ignore the reality. I cannot say that I am
> a fan of memoryless/cpuless numa configurations but they are a sad
> reality of different LPAR configurations. We have to deal with them.
> Besides that I do not really see any strong technical arguments to lack
> a support for those crippled configurations. We do have zonelists that
> allow to do reasonable decisions on memoryless nodes. So no, I do not
> think that this is a viable approach.
> 

I agree with Michal: the kernel should accept reality and work with
different LPAR configurations.

> > We can dynamically number the nodes right? So just make sure that the
> > firmware properly creates memory on node 0?
> 
> Are you suggesting that the OS would renumber NUMA nodes coming
> from FW just to satisfy node 0 existence? If yes then I believe this is
> really a bad idea because it would make HW/LPAR configuration matching
> to the resulting memory layout really hard to follow.
> 
> -- 
> Michal Hocko
> SUSE Labs

Michal, Vlastimil, Christoph and others, do you have any more comments,
suggestions or other feedback? If not, can you please add your Reviewed-by,
Acked-by, etc.?

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-03-16  8:54     ` Michal Hocko
  2020-03-18  7:50       ` Srikar Dronamraju
@ 2020-03-18 18:57       ` Christopher Lameter
  1 sibling, 0 replies; 24+ messages in thread
From: Christopher Lameter @ 2020-03-18 18:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Srikar Dronamraju, Andrew Morton, Michael Ellerman, linuxppc-dev,
	linux-mm, linux-kernel, Mel Gorman, Vlastimil Babka,
	Kirill A. Shutemov, Linus Torvalds

On Mon, 16 Mar 2020, Michal Hocko wrote:

> > We can dynamically number the nodes right? So just make sure that the
> > firmware properly creates memory on node 0?
>
> Are you suggesting that the OS would renumber NUMA nodes coming
> from FW just to satisfy node 0 existence? If yes then I believe this is
> really a bad idea because it would make HW/LPAR configuration matching
> to the resulting memory layout really hard to follow.

NUMA nodes are created by the OS based on information provided by the
firmware. Either the FW would need to ensure that a viable node 0 exists,
or the bootstrap arch code could set things up to the same effect.


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2020-03-18 18:57 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-11 11:02 [PATCH 0/3] Offline memoryless cpuless node 0 Srikar Dronamraju
2020-03-11 11:02 ` [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus Srikar Dronamraju
2020-03-11 11:57   ` Michal Hocko
2020-03-12  5:27     ` Srikar Dronamraju
2020-03-12  8:23       ` Sachin Sant
2020-03-12  9:30         ` Vlastimil Babka
2020-03-12 13:14           ` Srikar Dronamraju
2020-03-12 13:51             ` Vlastimil Babka
2020-03-12 16:13               ` Srikar Dronamraju
2020-03-12 16:41                 ` Vlastimil Babka
2020-03-13  9:47                   ` Joonsoo Kim
2020-03-13 11:04                     ` Srikar Dronamraju
2020-03-13 11:38                       ` Vlastimil Babka
2020-03-16  8:15                         ` Joonsoo Kim
2020-03-13 11:22                   ` Srikar Dronamraju
2020-03-16  9:06                   ` Michal Hocko
2020-03-17 13:44                     ` Vlastimil Babka
2020-03-17 14:01                       ` Michal Hocko
2020-03-11 11:02 ` [PATCH 2/3] powerpc/numa: Prefer node id queried from vphn Srikar Dronamraju
2020-03-11 11:02 ` [PATCH 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline Srikar Dronamraju
2020-03-15 14:20   ` Christopher Lameter
2020-03-16  8:54     ` Michal Hocko
2020-03-18  7:50       ` Srikar Dronamraju
2020-03-18 18:57       ` Christopher Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).