LKML Archive on lore.kernel.org
* [PATCH v2 0/3] Offline memoryless cpuless node 0
@ 2020-04-28  9:38 Srikar Dronamraju
  2020-04-28  9:38 ` [PATCH v2 1/3] powerpc/numa: Set numa_node for all possible cpus Srikar Dronamraju
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Srikar Dronamraju @ 2020-04-28  9:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Srikar Dronamraju, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds

Changelog v1 -> v2:
- Rebased to v5.7-rc3
- Updated the changelog.

A Linux kernel configured with CONFIG_NUMA on a system with multiple
possible nodes marks node 0 as online at boot. However, in practice
there are systems where node 0 is memoryless and cpuless.

This can cause:
1. numa_balancing to be enabled on systems with only one online node.
2. The existence of a dummy (cpuless and memoryless) node, which can
confuse users/scripts looking at the output of lscpu/numactl.

This patchset corrects that anomaly.

This should only affect systems that have CONFIG_MEMORYLESS_NODES.
Currently only two architectures, ia64 and powerpc, have this config.

Note: Patch 3 in this series depends on patches 1 and 2.
Without patches 1 and 2, patch 3 might crash on powerpc.

v5.7-rc3
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
------------------
 /sys/devices/system/node/online:            0,2
 /proc/sys/kernel/numa_balancing:            1
 /sys/devices/system/node/has_cpu:           2
 /sys/devices/system/node/has_memory:        2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:          0-31

v5.7-rc3 + patches
------------------
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
------------------
/sys/devices/system/node/online:            2
/proc/sys/kernel/numa_balancing:            0
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>

Srikar Dronamraju (3):
  powerpc/numa: Set numa_node for all possible cpus
  powerpc/numa: Prefer node id queried from vphn
  mm/page_alloc: Keep memoryless cpuless node 0 offline

 arch/powerpc/mm/numa.c | 32 ++++++++++++++++++++++----------
 mm/page_alloc.c        |  4 +++-
 2 files changed, 25 insertions(+), 11 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-04-28  9:38 [PATCH v2 0/3] Offline memoryless cpuless node 0 Srikar Dronamraju
@ 2020-04-28  9:38 ` Srikar Dronamraju
  2020-04-28  9:38 ` [PATCH v2 2/3] powerpc/numa: Prefer node id queried from vphn Srikar Dronamraju
  2020-04-28  9:38 ` [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline Srikar Dronamraju
  2 siblings, 0 replies; 17+ messages in thread
From: Srikar Dronamraju @ 2020-04-28  9:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Srikar Dronamraju, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds

A powerpc system with multiple possible nodes and CONFIG_NUMA enabled
has always had a node 0, even if node 0 has no cpus or memory attached
to it. As per PAPR, the node affinity of a cpu is only available once it
is present/online. For all cpus that are possible but not present,
cpu_to_node() would point to node 0.

To ensure a cpuless, memoryless dummy node is not onlined, powerpc needs
to make sure cpu_to_node() for all possible but not present cpus is set
to a proper node.
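The mapping rule the patch establishes can be sketched as a small
user-space model. This is not kernel code: NR_CPUS_MODEL,
FIRST_ONLINE_NODE, firmware_nid() and the lookup array are invented
stand-ins for the kernel's cpu masks and NUMA helpers.

```c
/*
 * User-space sketch, NOT kernel code: models how the reworked
 * mem_topology_setup() assigns a node to every possible cpu.
 */
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL     8
#define FIRST_ONLINE_NODE 2   /* e.g. a machine whose only real node is 2 */

/* model of the per-cpu numa lookup table the kernel populates */
static int numa_cpu_lookup[NR_CPUS_MODEL];

/* stand-in for numa_setup_cpu(): firmware-reported node of a present cpu */
static int firmware_nid(int cpu)
{
	(void)cpu;
	return FIRST_ONLINE_NODE;
}

/*
 * Present cpus get their firmware-reported node; possible-but-not-present
 * cpus are pointed at the first online node instead of silently
 * defaulting to node 0.
 */
static void model_mem_topology_setup(const bool present[NR_CPUS_MODEL])
{
	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
		if (present[cpu])
			numa_cpu_lookup[cpu] = firmware_nid(cpu);
		else
			numa_cpu_lookup[cpu] = FIRST_ONLINE_NODE;
	}
}
```

With this rule, no possible cpu ever maps to the dummy node 0, so a
later lookup on a not-yet-present cpu cannot online it by accident.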

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
- Rebased to v5.7-rc3

 arch/powerpc/mm/numa.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 9fcf2d195830..b3615b7fdbdf 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -931,8 +931,20 @@ void __init mem_topology_setup(void)
 
 	reset_numa_cpu_lookup_table();
 
-	for_each_present_cpu(cpu)
-		numa_setup_cpu(cpu);
+	for_each_possible_cpu(cpu) {
+		/*
+		 * Powerpc with CONFIG_NUMA always used to have a node 0,
+		 * even if it was memoryless or cpuless. For all cpus that
+		 * are possible but not present, cpu_to_node() would point
+		 * to node 0. To remove a cpuless, memoryless dummy node,
+		 * powerpc need to make sure all possible but not present
+		 * cpu_to_node are set to a proper node.
+		 */
+		if (cpu_present(cpu))
+			numa_setup_cpu(cpu);
+		else
+			set_cpu_numa_node(cpu, first_online_node);
+	}
 }
 
 void __init initmem_init(void)
-- 
2.20.1



* [PATCH v2 2/3] powerpc/numa: Prefer node id queried from vphn
  2020-04-28  9:38 [PATCH v2 0/3] Offline memoryless cpuless node 0 Srikar Dronamraju
  2020-04-28  9:38 ` [PATCH v2 1/3] powerpc/numa: Set numa_node for all possible cpus Srikar Dronamraju
@ 2020-04-28  9:38 ` Srikar Dronamraju
  2020-04-29  6:52   ` Gautham R Shenoy
  2020-04-28  9:38 ` [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline Srikar Dronamraju
  2 siblings, 1 reply; 17+ messages in thread
From: Srikar Dronamraju @ 2020-04-28  9:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Srikar Dronamraju, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds

The node id queried from the static device tree may not be correct.
For example, it may always show 0 on a shared processor. Hence, prefer
the node id queried from VPHN, and fall back to the device-tree-based
node id if the VPHN query fails.
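The lookup order above can be sketched in user space. This is not the
kernel code: the vphn_nid[] and dt_nid[] arrays are invented stand-ins
for the results of vphn_get_nid() and of_node_to_nid_single().

```c
/*
 * User-space sketch, NOT kernel code: models "prefer VPHN, fall back
 * to the device tree" for resolving a cpu's node id.
 */
#include <assert.h>

#define NUMA_NO_NODE  (-1)
#define NR_CPUS_MODEL 3

/* invented per-cpu answers from the two lookup mechanisms */
static int vphn_nid[NR_CPUS_MODEL] = { 3, NUMA_NO_NODE, NUMA_NO_NODE };
static int dt_nid[NR_CPUS_MODEL]   = { 0, 2, NUMA_NO_NODE };

/* prefer the node id from VPHN; fall back to the device tree on failure */
static int resolve_nid(int cpu)
{
	int nid = vphn_nid[cpu];

	if (nid == NUMA_NO_NODE)
		nid = dt_nid[cpu];
	return nid;
}
```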

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
- Rebased to v5.7-rc3

 arch/powerpc/mm/numa.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index b3615b7fdbdf..281531340230 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -719,20 +719,20 @@ static int __init parse_numa_properties(void)
 	 */
 	for_each_present_cpu(i) {
 		struct device_node *cpu;
-		int nid;
-
-		cpu = of_get_cpu_node(i, NULL);
-		BUG_ON(!cpu);
-		nid = of_node_to_nid_single(cpu);
-		of_node_put(cpu);
+		int nid = vphn_get_nid(i);
 
 		/*
 		 * Don't fall back to default_nid yet -- we will plug
 		 * cpus into nodes once the memory scan has discovered
 		 * the topology.
 		 */
-		if (nid < 0)
-			continue;
+		if (nid == NUMA_NO_NODE) {
+			cpu = of_get_cpu_node(i, NULL);
+			if (cpu) {
+				nid = of_node_to_nid_single(cpu);
+				of_node_put(cpu);
+			}
+		}
 		node_set_online(nid);
 	}
 
-- 
2.20.1



* [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-04-28  9:38 [PATCH v2 0/3] Offline memoryless cpuless node 0 Srikar Dronamraju
  2020-04-28  9:38 ` [PATCH v2 1/3] powerpc/numa: Set numa_node for all possible cpus Srikar Dronamraju
  2020-04-28  9:38 ` [PATCH v2 2/3] powerpc/numa: Prefer node id queried from vphn Srikar Dronamraju
@ 2020-04-28  9:38 ` Srikar Dronamraju
  2020-04-28 23:59   ` Andrew Morton
  2 siblings, 1 reply; 17+ messages in thread
From: Srikar Dronamraju @ 2020-04-28  9:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Srikar Dronamraju, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds

Currently, a Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes marks node 0 as online at boot. However, in practice
there are systems where node 0 is memoryless and cpuless.

This can cause numa_balancing to be enabled on systems with only one
node that has memory and CPUs. The existence of this cpuless, memoryless
dummy node can confuse users/scripts looking at the output of
lscpu/numactl.

By marking N_ONLINE as NODE_MASK_NONE, let us stop assuming that node 0
is always online.
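The effect of the initializer change can be modeled with a plain bitmask.
This is an illustrative sketch only, not the kernel's nodemask_t; the
names are invented for the example.

```c
/*
 * Illustrative sketch: models N_ONLINE as a plain bitmask to show what
 * changing the static initializer does.  NOT kernel code.
 */
#include <assert.h>

static unsigned long n_online_old = 1UL << 0; /* node 0 pre-marked online */
static unsigned long n_online_new = 0UL;      /* nothing online at first  */

/* stand-in for node_set_online(), called once a node is actually detected */
static void node_set_online_model(unsigned long *mask, int nid)
{
	*mask |= 1UL << nid;
}
```

Booting the example machine (only node 2 detected) leaves the old mask
with the dummy node 0 still set, while the new mask ends up with only
the real node.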

v5.7-rc3
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
------------------
 /sys/devices/system/node/online:            0,2
 /proc/sys/kernel/numa_balancing:            1
 /sys/devices/system/node/has_cpu:           2
 /sys/devices/system/node/has_memory:        2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:          0-31

v5.7-rc3 + patch
------------------
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
------------------
/sys/devices/system/node/online:            2
/proc/sys/kernel/numa_balancing:            0
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

Note: On powerpc, cpu_to_node() of possible but not present cpus would
previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
queried from vphn"). Without those two commits, a powerpc system might crash.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
- Rebased to v5.7-rc3
- Updated the changelog.

 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 69827d4fa052..03b89592af04 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
  */
 nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 	[N_POSSIBLE] = NODE_MASK_ALL,
+#ifdef CONFIG_NUMA
+	[N_ONLINE] = NODE_MASK_NONE,
+#else
 	[N_ONLINE] = { { [0] = 1UL } },
-#ifndef CONFIG_NUMA
 	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
 #ifdef CONFIG_HIGHMEM
 	[N_HIGH_MEMORY] = { { [0] = 1UL } },
-- 
2.20.1



* Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-04-28  9:38 ` [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline Srikar Dronamraju
@ 2020-04-28 23:59   ` Andrew Morton
  2020-04-29  1:41     ` Srikar Dronamraju
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2020-04-28 23:59 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: linuxppc-dev, linux-mm, linux-kernel, Michal Hocko, Mel Gorman,
	Vlastimil Babka, Kirill A. Shutemov, Christopher Lameter,
	Michael Ellerman, Linus Torvalds

On Tue, 28 Apr 2020 15:08:36 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:

> Currently Linux kernel with CONFIG_NUMA on a system with multiple
> possible nodes, marks node 0 as online at boot.  However in practice,
> there are systems which have node 0 as memoryless and cpuless.
> 
> This can cause numa_balancing to be enabled on systems with only one node
> with memory and CPUs. The existence of this dummy node which is cpuless and
> memoryless node can confuse users/scripts looking at output of lscpu /
> numactl.
> 
> By marking N_ONLINE as NODE_MASK_NONE, let us stop assuming that node 0 is
> always online.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
>   */
>  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
>  	[N_POSSIBLE] = NODE_MASK_ALL,
> +#ifdef CONFIG_NUMA
> +	[N_ONLINE] = NODE_MASK_NONE,
> +#else
>  	[N_ONLINE] = { { [0] = 1UL } },
> -#ifndef CONFIG_NUMA
>  	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
>  #ifdef CONFIG_HIGHMEM
>  	[N_HIGH_MEMORY] = { { [0] = 1UL } },

So on all other NUMA machines, when does node 0 get marked online?

This change means that for some time during boot, such machines will
now be running with node 0 marked as offline.  What are the
implications of this?  Will something break?


* Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-04-28 23:59   ` Andrew Morton
@ 2020-04-29  1:41     ` Srikar Dronamraju
  2020-04-29 12:22       ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Srikar Dronamraju @ 2020-04-29  1:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linuxppc-dev, linux-mm, linux-kernel, Michal Hocko, Mel Gorman,
	Vlastimil Babka, Kirill A. Shutemov, Christopher Lameter,
	Michael Ellerman, Linus Torvalds

> > 
> > By marking N_ONLINE as NODE_MASK_NONE, let us stop assuming that node 0 is
> > always online.
> > 
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
> >   */
> >  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> >  	[N_POSSIBLE] = NODE_MASK_ALL,
> > +#ifdef CONFIG_NUMA
> > +	[N_ONLINE] = NODE_MASK_NONE,
> > +#else
> >  	[N_ONLINE] = { { [0] = 1UL } },
> > -#ifndef CONFIG_NUMA
> >  	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
> >  #ifdef CONFIG_HIGHMEM
> >  	[N_HIGH_MEMORY] = { { [0] = 1UL } },
> 
> So on all other NUMA machines, when does node 0 get marked online?
> 
> This change means that for some time during boot, such machines will
> now be running with node 0 marked as offline.  What are the
> implications of this?  Will something break?

Until the nodes are detected, marking node 0 as online tends to be
redundant, because the system doesn't know whether it is a NUMA or a
non-NUMA system. Once we detect the nodes, we online them immediately.
Hence I don't see any side effects or negative implications of this change.

However if I am missing anything, please do let me know.

For my part, I have tested this on:
1. Non-NUMA, single node, with CPUs and memory coming from node 0.
2. Non-NUMA, single node, with CPUs and memory coming from a non-zero node.
3. NUMA, multi-node, with CPUs and memory from node 0.
4. NUMA, multi-node, with no CPUs and memory from node 0.

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 2/3] powerpc/numa: Prefer node id queried from vphn
  2020-04-28  9:38 ` [PATCH v2 2/3] powerpc/numa: Prefer node id queried from vphn Srikar Dronamraju
@ 2020-04-29  6:52   ` Gautham R Shenoy
  2020-04-30  4:34     ` Srikar Dronamraju
  0 siblings, 1 reply; 17+ messages in thread
From: Gautham R Shenoy @ 2020-04-29  6:52 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds

Hello Srikar,

On Tue, Apr 28, 2020 at 03:08:35PM +0530, Srikar Dronamraju wrote:
> Node id queried from the static device tree may not
> be correct. For example: it may always show 0 on a shared processor.
> Hence prefer the node id queried from vphn and fallback on the device tree
> based node id if vphn query fails.
> 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> Changelog v1:->v2:
> - Rebased to v5.7-rc3
> 
>  arch/powerpc/mm/numa.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b3615b7fdbdf..281531340230 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -719,20 +719,20 @@ static int __init parse_numa_properties(void)
>  	 */
>  	for_each_present_cpu(i) {
>  		struct device_node *cpu;
> -		int nid;
> -
> -		cpu = of_get_cpu_node(i, NULL);
> -		BUG_ON(!cpu);
> -		nid = of_node_to_nid_single(cpu);
> -		of_node_put(cpu);
> +		int nid = vphn_get_nid(i);
> 
>  		/*
>  		 * Don't fall back to default_nid yet -- we will plug
>  		 * cpus into nodes once the memory scan has discovered
>  		 * the topology.
>  		 */
> -		if (nid < 0)
> -			continue;


> +		if (nid == NUMA_NO_NODE) {
> +			cpu = of_get_cpu_node(i, NULL);
> +			if (cpu) {

Why are we not retaining the BUG_ON(!cpu) assert here?

> +				nid = of_node_to_nid_single(cpu);
> +				of_node_put(cpu);
> +			}
> +		}

Is it possible at this point that both vphn_get_nid(i) and
of_node_to_nid_single(cpu) return NUMA_NO_NODE? If so,
should we still call node_set_online() below?


>  		node_set_online(nid);
>  	}
> 
> -- 
> 2.20.1
> 
--
Thanks and Regards
gautham.


* Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-04-29  1:41     ` Srikar Dronamraju
@ 2020-04-29 12:22       ` Michal Hocko
  2020-04-30  7:18         ` Srikar Dronamraju
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2020-04-29 12:22 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, linuxppc-dev, linux-mm, linux-kernel, Mel Gorman,
	Vlastimil Babka, Kirill A. Shutemov, Christopher Lameter,
	Michael Ellerman, Linus Torvalds

On Wed 29-04-20 07:11:45, Srikar Dronamraju wrote:
> > > 
> > > By marking N_ONLINE as NODE_MASK_NONE, let us stop assuming that node 0 is
> > > always online.
> > > 
> > > ...
> > >
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
> > >   */
> > >  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> > >  	[N_POSSIBLE] = NODE_MASK_ALL,
> > > +#ifdef CONFIG_NUMA
> > > +	[N_ONLINE] = NODE_MASK_NONE,
> > > +#else
> > >  	[N_ONLINE] = { { [0] = 1UL } },
> > > -#ifndef CONFIG_NUMA
> > >  	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
> > >  #ifdef CONFIG_HIGHMEM
> > >  	[N_HIGH_MEMORY] = { { [0] = 1UL } },
> > 
> > So on all other NUMA machines, when does node 0 get marked online?
> > 
> > This change means that for some time during boot, such machines will
> > now be running with node 0 marked as offline.  What are the
> > implications of this?  Will something break?
> 
> Till the nodes are detected, marking Node 0 as online tends to be redundant.
> Because the system doesn't know if its a NUMA or a non-NUMA system.
> Once we detect the nodes, we online them immediately. Hence I don't see any
> side-effects or negative implications of this change.
> 
> However if I am missing anything, please do let me know.
> 
> From my part, I have tested this on
> 1. Non-NUMA Single node but CPUs and memory coming from zero node.
> 2. Non-NUMA Single node but CPUs and memory coming from non-zero node.
> 3. NUMA Multi node but with CPUs and memory from node 0.
> 4. NUMA Multi node but with no CPUs and memory from node 0.

Have you tested on something other than ppc? Each arch does the NUMA
setup separately, and this is a big mess. E.g. x86 marks even memoryless
nodes (see init_memory_less_node) as online.

Honestly, I have a hard time evaluating the effect of this patch. It
makes some sense to assume all nodes are offline before they get
onlined, but this is land-mine territory.

I am also not sure what kind of problem this is going to address. You
have mentioned numa balancing without many details.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 2/3] powerpc/numa: Prefer node id queried from vphn
  2020-04-29  6:52   ` Gautham R Shenoy
@ 2020-04-30  4:34     ` Srikar Dronamraju
  0 siblings, 0 replies; 17+ messages in thread
From: Srikar Dronamraju @ 2020-04-30  4:34 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Andrew Morton, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds

* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-04-29 12:22:29]:

> Hello Srikar,
> 
> 
> > +		if (nid == NUMA_NO_NODE) {
> > +			cpu = of_get_cpu_node(i, NULL);
> > +			if (cpu) {
> 
> Why are we not retaining the BUG_ON(!cpu) assert here ?
> 
> > +				nid = of_node_to_nid_single(cpu);
> > +				of_node_put(cpu);
> > +			}
> > +		}
> 
> Is it possible at this point that both vphn_get_nid(i) and
> of_node_to_nid_single(cpu) returns NUMA_NO_NODE ? If so,
> should we still call node_set_online() below ?

Yeah, I think it makes sense to retain the BUG_ON and the if check.

I will incorporate both of them in the next version.

> 
> 
> >  		node_set_online(nid);
> >  	}
> > 
> > -- 
> > 2.20.1
> > 
> --
> Thanks and Regards
> gautham.

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-04-29 12:22       ` Michal Hocko
@ 2020-04-30  7:18         ` Srikar Dronamraju
  2020-05-04  9:37           ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Srikar Dronamraju @ 2020-04-30  7:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linuxppc-dev, linux-mm, linux-kernel, Mel Gorman,
	Vlastimil Babka, Kirill A. Shutemov, Christopher Lameter,
	Michael Ellerman, Linus Torvalds

* Michal Hocko <mhocko@kernel.org> [2020-04-29 14:22:11]:

> On Wed 29-04-20 07:11:45, Srikar Dronamraju wrote:
> > > > 
> > > > By marking N_ONLINE as NODE_MASK_NONE, let us stop assuming that node 0 is
> > > > always online.
> > > > 
> > > > ...
> > > >
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
> > > >   */
> > > >  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> > > >  	[N_POSSIBLE] = NODE_MASK_ALL,
> > > > +#ifdef CONFIG_NUMA
> > > > +	[N_ONLINE] = NODE_MASK_NONE,
> > > > +#else
> > > >  	[N_ONLINE] = { { [0] = 1UL } },
> > > > -#ifndef CONFIG_NUMA
> > > >  	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
> > > >  #ifdef CONFIG_HIGHMEM
> > > >  	[N_HIGH_MEMORY] = { { [0] = 1UL } },
> > > 
> > > So on all other NUMA machines, when does node 0 get marked online?
> > > 
> > > This change means that for some time during boot, such machines will
> > > now be running with node 0 marked as offline.  What are the
> > > implications of this?  Will something break?
> > 
> > Till the nodes are detected, marking Node 0 as online tends to be redundant.
> > Because the system doesn't know if its a NUMA or a non-NUMA system.
> > Once we detect the nodes, we online them immediately. Hence I don't see any
> > side-effects or negative implications of this change.
> > 
> > However if I am missing anything, please do let me know.
> > 
> > From my part, I have tested this on
> > 1. Non-NUMA Single node but CPUs and memory coming from zero node.
> > 2. Non-NUMA Single node but CPUs and memory coming from non-zero node.
> > 3. NUMA Multi node but with CPUs and memory from node 0.
> > 4. NUMA Multi node but with no CPUs and memory from node 0.
> 
> Have you tested on something else than ppc? Each arch does the NUMA
> setup separately and this is a big mess. E.g. x86 marks even memory less
> nodes (see init_memory_less_node) as online.
> 

While I have predominantly tested on ppc, I did test on x86 with
CONFIG_NUMA enabled/disabled on both single-node and multi-node machines.
However, I don't have a cpuless/memoryless x86 system.

> Honestly I have hard time to evaluate the effect of this patch. It makes
> some sense to assume all nodes offline before they get online but this
> is a land mine territory.
> 
> I am also not sure what kind of problem this is going to address. You
> have mentioned numa balancing without many details.

1. On a machine with just one node, whose node number is not 0, the
current setup ends up showing 2 online nodes. And when there is more
than one online node, numa_balancing gets enabled.

Without patch
$ grep numa /proc/vmstat
numa_hit 95179
numa_miss 0
numa_foreign 0
numa_interleave 3764
numa_local 95179
numa_other 0
numa_pte_updates 1206973                 <----------
numa_huge_pte_updates 4654                 <----------
numa_hint_faults 19560                 <----------
numa_hint_faults_local 19560                 <----------
numa_pages_migrated 0


With patch
$ grep numa /proc/vmstat 
numa_hit 322338756
numa_miss 0
numa_foreign 0
numa_interleave 3790
numa_local 322338756
numa_other 0
numa_pte_updates 0                 <----------
numa_huge_pte_updates 0                 <----------
numa_hint_faults 0                 <----------
numa_hint_faults_local 0                 <----------
numa_pages_migrated 0

So we have redundant page-hinting NUMA faults, which we can avoid.

2. A few people have complained about the existence of this dummy node
when parsing lscpu and numactl output. They start to think that the
tools are reporting incorrectly, or that the kernel is unable to
recognize the resources connected to the node.

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-04-30  7:18         ` Srikar Dronamraju
@ 2020-05-04  9:37           ` Michal Hocko
  2020-05-08 13:03             ` Srikar Dronamraju
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2020-05-04  9:37 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, linuxppc-dev, linux-mm, linux-kernel, Mel Gorman,
	Vlastimil Babka, Kirill A. Shutemov, Christopher Lameter,
	Michael Ellerman, Linus Torvalds

On Thu 30-04-20 12:48:20, Srikar Dronamraju wrote:
> * Michal Hocko <mhocko@kernel.org> [2020-04-29 14:22:11]:
> 
> > On Wed 29-04-20 07:11:45, Srikar Dronamraju wrote:
> > > > > 
> > > > > By marking N_ONLINE as NODE_MASK_NONE, let us stop assuming that node 0 is
> > > > > always online.
> > > > > 
> > > > > ...
> > > > >
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
> > > > >   */
> > > > >  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> > > > >  	[N_POSSIBLE] = NODE_MASK_ALL,
> > > > > +#ifdef CONFIG_NUMA
> > > > > +	[N_ONLINE] = NODE_MASK_NONE,
> > > > > +#else
> > > > >  	[N_ONLINE] = { { [0] = 1UL } },
> > > > > -#ifndef CONFIG_NUMA
> > > > >  	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
> > > > >  #ifdef CONFIG_HIGHMEM
> > > > >  	[N_HIGH_MEMORY] = { { [0] = 1UL } },
> > > > 
> > > > So on all other NUMA machines, when does node 0 get marked online?
> > > > 
> > > > This change means that for some time during boot, such machines will
> > > > now be running with node 0 marked as offline.  What are the
> > > > implications of this?  Will something break?
> > > 
> > > Till the nodes are detected, marking Node 0 as online tends to be redundant.
> > > Because the system doesn't know if its a NUMA or a non-NUMA system.
> > > Once we detect the nodes, we online them immediately. Hence I don't see any
> > > side-effects or negative implications of this change.
> > > 
> > > However if I am missing anything, please do let me know.
> > > 
> > > From my part, I have tested this on
> > > 1. Non-NUMA Single node but CPUs and memory coming from zero node.
> > > 2. Non-NUMA Single node but CPUs and memory coming from non-zero node.
> > > 3. NUMA Multi node but with CPUs and memory from node 0.
> > > 4. NUMA Multi node but with no CPUs and memory from node 0.
> > 
> > Have you tested on something else than ppc? Each arch does the NUMA
> > setup separately and this is a big mess. E.g. x86 marks even memory less
> > nodes (see init_memory_less_node) as online.
> > 
> 
> while I have predominantly tested on ppc, I did test on X86 with CONFIG_NUMA
> enabled/disabled on both single node and multi node machines.
> However, I dont have a cpuless/memoryless x86 system.

It should be possible to emulate this inside kvm, I believe.

> > Honestly I have hard time to evaluate the effect of this patch. It makes
> > some sense to assume all nodes offline before they get online but this
> > is a land mine territory.
> > 
> > I am also not sure what kind of problem this is going to address. You
> > have mentioned numa balancing without many details.
> 
> 1. On a machine with just one node with node number not being 0,
> the current setup will end up showing 2 online nodes. And when there are
> more than one online nodes, numa_balancing gets enabled.
> 
> Without patch
> $ grep numa /proc/vmstat
> numa_hit 95179
> numa_miss 0
> numa_foreign 0
> numa_interleave 3764
> numa_local 95179
> numa_other 0
> numa_pte_updates 1206973                 <----------
> numa_huge_pte_updates 4654                 <----------
> numa_hint_faults 19560                 <----------
> numa_hint_faults_local 19560                 <----------
> numa_pages_migrated 0
> 
> 
> With patch
> $ grep numa /proc/vmstat 
> numa_hit 322338756
> numa_miss 0
> numa_foreign 0
> numa_interleave 3790
> numa_local 322338756
> numa_other 0
> numa_pte_updates 0                 <----------
> numa_huge_pte_updates 0                 <----------
> numa_hint_faults 0                 <----------
> numa_hint_faults_local 0                 <----------
> numa_pages_migrated 0
> 
> So we have a redundant page hinting numa faults which we can avoid.

Interesting. Does this lead to any observable differences? Btw, it would
be really great to describe how the online state influences NUMA
balancing.

> 2. Few people have complained about existence of this dummy node when
> parsing lscpu and numactl o/p. They somehow start to think that the tools
> are reporting incorrectly or the kernel is not able to recognize resources
> connected to the node.

Please be more specific.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-05-04  9:37           ` Michal Hocko
@ 2020-05-08 13:03             ` Srikar Dronamraju
  2020-05-08 13:39               ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Srikar Dronamraju @ 2020-05-08 13:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linuxppc-dev, linux-mm, linux-kernel, Mel Gorman,
	Vlastimil Babka, Kirill A. Shutemov, Christopher Lameter,
	Michael Ellerman, Linus Torvalds

* Michal Hocko <mhocko@kernel.org> [2020-05-04 11:37:12]:

> > > 
> > > Have you tested on something else than ppc? Each arch does the NUMA
> > > setup separately and this is a big mess. E.g. x86 marks even memory less
> > > nodes (see init_memory_less_node) as online.
> > > 
> > 
> > while I have predominantly tested on ppc, I did test on X86 with CONFIG_NUMA
> > enabled/disabled on both single node and multi node machines.
> > However, I dont have a cpuless/memoryless x86 system.
> 
> This should be able to emulate inside kvm, I believe.
> 

I did try, but somehow I was not able to get a cpuless/memoryless node in an
x86 KVM guest.

Also, I am unable to see how to enable HAVE_MEMORYLESS_NODES on an x86 system.
# git grep -w HAVE_MEMORYLESS_NODES | cat
arch/ia64/Kconfig:config HAVE_MEMORYLESS_NODES
arch/powerpc/Kconfig:config HAVE_MEMORYLESS_NODES
#
I force-enabled it, but it got disabled during the kernel build.
Maybe I am missing something.

> > 
> > So we have a redundant page hinting numa faults which we can avoid.
> 
> interesting. Does this lead to any observable differences? Btw. it would
> be really great to describe how the online state influences the numa
> balancing.
> 

If numa_balancing is enabled, the kernel checks whether the number of online
nodes is 1. If it is, numa_balancing is disabled; otherwise it stays as is.
In this case, without the patch, both the actual node (node nr > 0) and
node 0 were marked online.

Here are 2 sample numa programs.

numa01.sh runs 2 processes, each with as many threads as there are CPUs;
each thread does 50 loops of operations on 3GB of process-shared memory.

numa02.sh runs a single process with as many threads as there are CPUs;
each thread does 800 loops of operations on 32MB of thread-local memory.

Without the patch:
Testcase         Time:  Min      Max      Avg      StdDev
./numa01.sh      Real:  149.62   149.66   149.64   0.02
./numa01.sh      Sys:   3.21     3.71     3.46     0.25
./numa01.sh      User:  4755.13  4758.15  4756.64  1.51
./numa02.sh      Real:  24.98    25.02    25.00    0.02
./numa02.sh      Sys:   0.51     0.59     0.55     0.04
./numa02.sh      User:  790.28   790.88   790.58   0.30

With the patch:
Testcase         Time:  Min      Max      Avg      StdDev  %Change
./numa01.sh      Real:  149.44   149.46   149.45   0.01    0.127133%
./numa01.sh      Sys:   0.71     0.89     0.80     0.09    332.5%
./numa01.sh      User:  4754.19  4754.48  4754.33  0.15    0.0485873%
./numa02.sh      Real:  24.97    24.98    24.98    0.00    0.0800641%
./numa02.sh      Sys:   0.26     0.41     0.33     0.08    66.6667%
./numa02.sh      User:  789.75   790.28   790.01   0.27    0.072151%

numa01.sh
param                   no_patch    with_patch  %Change
-----                   ----------  ----------  -------
numa_hint_faults        1131164     0           -100%
numa_hint_faults_local  1131164     0           -100%
numa_hit                213696      214244      0.256439%
numa_local              213696      214244      0.256439%
numa_pte_updates        1131294     0           -100%
pgfault                 1380845     241424      -82.5162%
pgmajfault              75          60          -20%

numa02.sh
param                   no_patch    with_patch  %Change
-----                   ----------  ----------  -------
numa_hint_faults        111878      0           -100%
numa_hint_faults_local  111878      0           -100%
numa_hit                41854       43220       3.26373%
numa_local              41854       43220       3.26373%
numa_pte_updates        113926      0           -100%
pgfault                 163662      51210       -68.7099%
pgmajfault              56          52          -7.14286%

Observations:
Real time and user time don't actually change much, but system time drops
noticeably. The reason is the number of NUMA hinting faults: with the patch
we no longer see any NUMA hinting faults.

> > 2. Few people have complained about existence of this dummy node when
> > parsing lscpu and numactl o/p. They somehow start to think that the tools
> > are reporting incorrectly or the kernel is not able to recognize resources
> > connected to the node.
> 
> Please be more specific.

Take the below numactl output as an example:
available: 2 nodes (0,7)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 7 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 7 size: 16238 MB
node 7 free: 15449 MB
node distances:
node   0   7 
  0:  10  20 
  7:  20  10 

We know node 0 can be special, but users may not feel the same.

When users parse numactl/lscpu output or the /sys directory, they find there
are 2 online nodes. They see that node 0 is online even though none of its
resources are available, while the other resourceless nodes (nodes 1-6) are
offline. So they tend to think the kernel has been unable to online some of
the resources, or that the resources have gone bad.
Please do note that on hypervisors like PowerVM, the admins have no control
over which nodes the resources are allocated on.

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-05-08 13:03             ` Srikar Dronamraju
@ 2020-05-08 13:39               ` David Hildenbrand
  2020-05-08 13:42                 ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2020-05-08 13:39 UTC (permalink / raw)
  To: Srikar Dronamraju, Michal Hocko
  Cc: Andrew Morton, linuxppc-dev, linux-mm, linux-kernel, Mel Gorman,
	Vlastimil Babka, Kirill A. Shutemov, Christopher Lameter,
	Michael Ellerman, Linus Torvalds

On 08.05.20 15:03, Srikar Dronamraju wrote:
> * Michal Hocko <mhocko@kernel.org> [2020-05-04 11:37:12]:
> 
>>>>
>>>> Have you tested on something else than ppc? Each arch does the NUMA
>>>> setup separately and this is a big mess. E.g. x86 marks even memory less
>>>> nodes (see init_memory_less_node) as online.
>>>>
>>>
>>> while I have predominantly tested on ppc, I did test on X86 with CONFIG_NUMA
>>> enabled/disabled on both single node and multi node machines.
>>> However, I dont have a cpuless/memoryless x86 system.
>>
>> This should be able to emulate inside kvm, I believe.
>>
> 
> I did try but somehow not able to get cpuless / memoryless node in a x86 kvm
> guest.

I use the following

#! /bin/bash
sudo x86_64-softmmu/qemu-system-x86_64 \
    --enable-kvm \
    -m 4G,maxmem=20G,slots=2 \
    -smp sockets=2,cores=2 \
    -numa node,nodeid=0,cpus=0-1,mem=4G -numa node,nodeid=1,cpus=2-3,mem=0G \
    -kernel /home/dhildenb/git/linux/arch/x86_64/boot/bzImage \
    -append "console=ttyS0 rd.shell rd.luks=0 rd.lvm=0 rd.md=0 rd.dm=0" \
    -initrd /boot/initramfs-5.2.8-200.fc30.x86_64.img \
    -machine pc,nvdimm \
    -nographic \
    -nodefaults \
    -chardev stdio,id=serial \
    -device isa-serial,chardev=serial \
    -chardev socket,id=monitor,path=/var/tmp/monitor,server,nowait \
    -mon chardev=monitor,mode=readline

to get a cpu-less and memory-less node 1. Never tried with node 0.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-05-08 13:39               ` David Hildenbrand
@ 2020-05-08 13:42                 ` David Hildenbrand
  2020-05-11 17:47                   ` Srikar Dronamraju
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2020-05-08 13:42 UTC (permalink / raw)
  To: Srikar Dronamraju, Michal Hocko
  Cc: Andrew Morton, linuxppc-dev, linux-mm, linux-kernel, Mel Gorman,
	Vlastimil Babka, Kirill A. Shutemov, Christopher Lameter,
	Michael Ellerman, Linus Torvalds

On 08.05.20 15:39, David Hildenbrand wrote:
> On 08.05.20 15:03, Srikar Dronamraju wrote:
>> * Michal Hocko <mhocko@kernel.org> [2020-05-04 11:37:12]:
>>
>>>>>
>>>>> Have you tested on something else than ppc? Each arch does the NUMA
>>>>> setup separately and this is a big mess. E.g. x86 marks even memory less
>>>>> nodes (see init_memory_less_node) as online.
>>>>>
>>>>
>>>> while I have predominantly tested on ppc, I did test on X86 with CONFIG_NUMA
>>>> enabled/disabled on both single node and multi node machines.
>>>> However, I dont have a cpuless/memoryless x86 system.
>>>
>>> This should be able to emulate inside kvm, I believe.
>>>
>>
>> I did try but somehow not able to get cpuless / memoryless node in a x86 kvm
>> guest.
> 
> I use the following
> 
> #! /bin/bash
> sudo x86_64-softmmu/qemu-system-x86_64 \
>     --enable-kvm \
>     -m 4G,maxmem=20G,slots=2 \
>     -smp sockets=2,cores=2 \
>     -numa node,nodeid=0,cpus=0-1,mem=4G -numa node,nodeid=1,cpus=2-3,mem=0G \

Sorry, this line has to be

-numa node,nodeid=0,cpus=0-3,mem=4G -numa node,nodeid=1,mem=0G \

>     -kernel /home/dhildenb/git/linux/arch/x86_64/boot/bzImage \
>     -append "console=ttyS0 rd.shell rd.luks=0 rd.lvm=0 rd.md=0 rd.dm=0" \
>     -initrd /boot/initramfs-5.2.8-200.fc30.x86_64.img \
>     -machine pc,nvdimm \
>     -nographic \
>     -nodefaults \
>     -chardev stdio,id=serial \
>     -device isa-serial,chardev=serial \
>     -chardev socket,id=monitor,path=/var/tmp/monitor,server,nowait \
>     -mon chardev=monitor,mode=readline
> 
> to get a cpu-less and memory-less node 1. Never tried with node 0.
> 


-- 
Thanks,

David / dhildenb



* Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-05-08 13:42                 ` David Hildenbrand
@ 2020-05-11 17:47                   ` Srikar Dronamraju
  2020-05-12  7:49                     ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Srikar Dronamraju @ 2020-05-11 17:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Andrew Morton, linuxppc-dev, linux-mm,
	linux-kernel, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds

* David Hildenbrand <david@redhat.com> [2020-05-08 15:42:12]:

Hi David,

Thanks for the steps to tryout.

> > 
> > #! /bin/bash
> > sudo x86_64-softmmu/qemu-system-x86_64 \
> >     --enable-kvm \
> >     -m 4G,maxmem=20G,slots=2 \
> >     -smp sockets=2,cores=2 \
> >     -numa node,nodeid=0,cpus=0-1,mem=4G -numa node,nodeid=1,cpus=2-3,mem=0G \
> 
> Sorry, this line has to be
> 
> -numa node,nodeid=0,cpus=0-3,mem=4G -numa node,nodeid=1,mem=0G \
> 
> >     -kernel /home/dhildenb/git/linux/arch/x86_64/boot/bzImage \
> >     -append "console=ttyS0 rd.shell rd.luks=0 rd.lvm=0 rd.md=0 rd.dm=0" \
> >     -initrd /boot/initramfs-5.2.8-200.fc30.x86_64.img \
> >     -machine pc,nvdimm \
> >     -nographic \
> >     -nodefaults \
> >     -chardev stdio,id=serial \
> >     -device isa-serial,chardev=serial \
> >     -chardev socket,id=monitor,path=/var/tmp/monitor,server,nowait \
> >     -mon chardev=monitor,mode=readline
> > 
> > to get a cpu-less and memory-less node 1. Never tried with node 0.
> > 

I tried 

qemu-system-x86_64 -enable-kvm -m 4G,maxmem=20G,slots=2 -smp sockets=2,cores=2 -cpu host -numa node,nodeid=0,cpus=0-3,mem=4G -numa node,nodeid=1,mem=0G -vga none -nographic -serial mon:stdio /home/srikar/fedora.qcow2

and the resulting guest was.

[root@localhost ~]# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 3927 MB
node 0 free: 3316 MB
node distances:
node   0
  0:  10

[root@localhost ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       40 bits physical, 48 bits virtual
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  2
Socket(s):           2
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               46
Model name:          Intel(R) Xeon(R) CPU           X7560  @ 2.27GHz
Stepping:            6
CPU MHz:             2260.986
BogoMIPS:            4521.97
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni vmx ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb tpr_shadow vnmi flexpriority ept vpid tsc_adjust arat umip arch_capabilities

[root@localhost ~]# cat /sys/devices/system/node/online
0
[root@localhost ~]# cat /sys/devices/system/node/possible
0-1

---------------------------------------------------------------------------------

I also tried

qemu-system-x86_64 -enable-kvm -m 4G,maxmem=20G,slots=2 -smp sockets=2,cores=2 -cpu host -numa node,nodeid=1,cpus=0-3,mem=4G -numa node,nodeid=0,mem=0G -vga none -nographic -serial mon:stdio /home/srikar/fedora.qcow2

and the resulting guest was.

[root@localhost ~]# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 3927 MB
node 0 free: 3316 MB
node distances:
node   0
  0:  10

[root@localhost ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       40 bits physical, 48 bits virtual
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  2
Socket(s):           2
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               46
Model name:          Intel(R) Xeon(R) CPU           X7560  @ 2.27GHz
Stepping:            6
CPU MHz:             2260.986
BogoMIPS:            4521.97
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni vmx ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb tpr_shadow vnmi flexpriority ept vpid tsc_adjust arat umip arch_capabilities

[root@localhost ~]# cat /sys/devices/system/node/online
0
[root@localhost ~]# cat /sys/devices/system/node/possible
0-1

Even without my patch, in both combinations I am still unable to see a
cpuless, memoryless node being online. And the interesting part is that even
if I mark node 0 as cpuless/memoryless and node 1 as the actual node, the
system somehow marks node 0 as the actual node.

> 
> David / dhildenb
> 

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-05-11 17:47                   ` Srikar Dronamraju
@ 2020-05-12  7:49                     ` David Hildenbrand
  2020-05-12 10:42                       ` Srikar Dronamraju
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2020-05-12  7:49 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Michal Hocko, Andrew Morton, linuxppc-dev, linux-mm,
	linux-kernel, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds

On 11.05.20 19:47, Srikar Dronamraju wrote:
> * David Hildenbrand <david@redhat.com> [2020-05-08 15:42:12]:
> 
> Hi David,
> 
> Thanks for the steps to tryout.
> 
>>>
>>> #! /bin/bash
>>> sudo x86_64-softmmu/qemu-system-x86_64 \
>>>     --enable-kvm \
>>>     -m 4G,maxmem=20G,slots=2 \
>>>     -smp sockets=2,cores=2 \
>>>     -numa node,nodeid=0,cpus=0-1,mem=4G -numa node,nodeid=1,cpus=2-3,mem=0G \
>>
>> Sorry, this line has to be
>>
>> -numa node,nodeid=0,cpus=0-3,mem=4G -numa node,nodeid=1,mem=0G \
>>
>>>     -kernel /home/dhildenb/git/linux/arch/x86_64/boot/bzImage \
>>>     -append "console=ttyS0 rd.shell rd.luks=0 rd.lvm=0 rd.md=0 rd.dm=0" \
>>>     -initrd /boot/initramfs-5.2.8-200.fc30.x86_64.img \
>>>     -machine pc,nvdimm \
>>>     -nographic \
>>>     -nodefaults \
>>>     -chardev stdio,id=serial \
>>>     -device isa-serial,chardev=serial \
>>>     -chardev socket,id=monitor,path=/var/tmp/monitor,server,nowait \
>>>     -mon chardev=monitor,mode=readline
>>>
>>> to get a cpu-less and memory-less node 1. Never tried with node 0.
>>>
> 
> I tried 
> 
> qemu-system-x86_64 -enable-kvm -m 4G,maxmem=20G,slots=2 -smp sockets=2,cores=2 -cpu host -numa node,nodeid=0,cpus=0-3,mem=4G -numa node,nodeid=1,mem=0G -vga none -nographic -serial mon:stdio /home/srikar/fedora.qcow2
> 
> and the resulting guest was.
> 
> [root@localhost ~]# numactl -H
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3
> node 0 size: 3927 MB
> node 0 free: 3316 MB
> node distances:
> node   0
>   0:  10
> 
> [root@localhost ~]# lscpu
> Architecture:        x86_64
> CPU op-mode(s):      32-bit, 64-bit
> Byte Order:          Little Endian
> Address sizes:       40 bits physical, 48 bits virtual
> CPU(s):              4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  1
> Core(s) per socket:  2
> Socket(s):           2
> NUMA node(s):        1
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               46
> Model name:          Intel(R) Xeon(R) CPU           X7560  @ 2.27GHz
> Stepping:            6
> CPU MHz:             2260.986
> BogoMIPS:            4521.97
> Virtualization:      VT-x
> Hypervisor vendor:   KVM
> Virtualization type: full
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            4096K
> L3 cache:            16384K
> NUMA node0 CPU(s):   0-3
> Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni vmx ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb tpr_shadow vnmi flexpriority ept vpid tsc_adjust arat umip arch_capabilities
> 
> [root@localhost ~]# cat /sys/devices/system/node/online
> 0
> [root@localhost ~]# cat /sys/devices/system/node/possible
> 0-1
> 
> ---------------------------------------------------------------------------------
> 
> I also tried
> 
> qemu-system-x86_64 -enable-kvm -m 4G,maxmem=20G,slots=2 -smp sockets=2,cores=2 -cpu host -numa node,nodeid=1,cpus=0-3,mem=4G -numa node,nodeid=0,mem=0G -vga none -nographic -serial mon:stdio /home/srikar/fedora.qcow2
> 
> and the resulting guest was.
> 
> [root@localhost ~]# numactl -H
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3
> node 0 size: 3927 MB
> node 0 free: 3316 MB
> node distances:
> node   0
>   0:  10
> 
> [root@localhost ~]# lscpu
> Architecture:        x86_64
> CPU op-mode(s):      32-bit, 64-bit
> Byte Order:          Little Endian
> Address sizes:       40 bits physical, 48 bits virtual
> CPU(s):              4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  1
> Core(s) per socket:  2
> Socket(s):           2
> NUMA node(s):        1
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               46
> Model name:          Intel(R) Xeon(R) CPU           X7560  @ 2.27GHz
> Stepping:            6
> CPU MHz:             2260.986
> BogoMIPS:            4521.97
> Virtualization:      VT-x
> Hypervisor vendor:   KVM
> Virtualization type: full
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            4096K
> L3 cache:            16384K
> NUMA node0 CPU(s):   0-3
> Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni vmx ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb tpr_shadow vnmi flexpriority ept vpid tsc_adjust arat umip arch_capabilities
> 
> [root@localhost ~]# cat /sys/devices/system/node/online
> 0
> [root@localhost ~]# cat /sys/devices/system/node/possible
> 0-1
> 
> Even without my patch, both the combinations, I am still unable to see a
> cpuless, memoryless node being online. And the interesting part being even

Yeah, I think on x86, all memory-less and cpu-less nodes are offline by
default. Especially when hot-unplugging CPUs/memory, we set them offline
as well.

But as Michal mentioned, the node handling code is complicated and
differs between various architectures.

> if I mark node 0 as cpuless,memoryless and node 1 as actual node, the system
> somewhere marks node 0 as the actual node.

Is the kernel maybe mapping PXM 1 to node 0 in that case, because it
always requires node 0 to be online/contain memory? Would be interesting
what happens if you hotplug a DIMM to (QEMU )node 0 - if PXM 0 will be
mapped to node 1 then as well.


-- 
Thanks,

David / dhildenb



* Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-05-12  7:49                     ` David Hildenbrand
@ 2020-05-12 10:42                       ` Srikar Dronamraju
  0 siblings, 0 replies; 17+ messages in thread
From: Srikar Dronamraju @ 2020-05-12 10:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Andrew Morton, linuxppc-dev, linux-mm,
	linux-kernel, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds,
	Satheesh Rajendran

* David Hildenbrand <david@redhat.com> [2020-05-12 09:49:05]:

> On 11.05.20 19:47, Srikar Dronamraju wrote:
> > * David Hildenbrand <david@redhat.com> [2020-05-08 15:42:12]:
> > 
> > 
> > [root@localhost ~]# cat /sys/devices/system/node/online
> > 0
> > [root@localhost ~]# cat /sys/devices/system/node/possible
> > 0-1
> > 
> > Even without my patch, both the combinations, I am still unable to see a
> > cpuless, memoryless node being online. And the interesting part being even
> 
> Yeah, I think on x86, all memory-less and cpu-less nodes are offline as
> default. Especially when hotunplugging cpus/memory, we set them offline
> as well.

I also came to the same conclusion: we may not be able to have a cpuless,
memoryless node on x86.

> 
> But as Michal mentioned, the node handling code is complicated and
> differs between various architectures.
> 

I do agree that the node handling code differs across architectures and is
quite complicated.

> > if I mark node 0 as cpuless,memoryless and node 1 as actual node, the system
> > somewhere marks node 0 as the actual node.
> 
> Is the kernel maybe mapping PXM 1 to node 0 in that case, because it
> always requires node 0 to be online/contain memory? Would be interesting
> what happens if you hotplug a DIMM to (QEMU )node 0 - if PXM 0 will be
> mapped to node 1 then as well.
> 

Satheesh Rajendran had tried CPU hotplug on a similar setup, and we found
that it crashes the x86 system.
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=202187

Even if we were able to hotplug a DIMM of memory into node 1, it would no
longer be a memoryless node.

-- 
Thanks and Regards
Srikar Dronamraju



Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-28  9:38 [PATCH v2 0/3] Offline memoryless cpuless node 0 Srikar Dronamraju
2020-04-28  9:38 ` [PATCH v2 1/3] powerpc/numa: Set numa_node for all possible cpus Srikar Dronamraju
2020-04-28  9:38 ` [PATCH v2 2/3] powerpc/numa: Prefer node id queried from vphn Srikar Dronamraju
2020-04-29  6:52   ` Gautham R Shenoy
2020-04-30  4:34     ` Srikar Dronamraju
2020-04-28  9:38 ` [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline Srikar Dronamraju
2020-04-28 23:59   ` Andrew Morton
2020-04-29  1:41     ` Srikar Dronamraju
2020-04-29 12:22       ` Michal Hocko
2020-04-30  7:18         ` Srikar Dronamraju
2020-05-04  9:37           ` Michal Hocko
2020-05-08 13:03             ` Srikar Dronamraju
2020-05-08 13:39               ` David Hildenbrand
2020-05-08 13:42                 ` David Hildenbrand
2020-05-11 17:47                   ` Srikar Dronamraju
2020-05-12  7:49                     ` David Hildenbrand
2020-05-12 10:42                       ` Srikar Dronamraju
