Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v5 0/3] Offline memoryless cpuless node 0
@ 2020-06-24  9:28 Srikar Dronamraju
  2020-06-24  9:28 ` [PATCH v5 1/3] powerpc/numa: Set numa_node for all possible cpus Srikar Dronamraju
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Srikar Dronamraju @ 2020-06-24  9:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Srikar Dronamraju, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds,
	Gautham R Shenoy, Satheesh Rajendran, David Hildenbrand

Changelog v4:->v5:
- rebased to v5.8-rc2
link v4: http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v3:->v4:
- Resolved comments from Christopher.
Link v3: http://lore.kernel.org/lkml/20200501031128.19584-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v2:->v3:
- Resolved comments from Gautham.
Link v2: https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3
- Updated the changelog.
Link v1: https://lore.kernel.org/linuxppc-dev/20200311110237.5731-1-srikar@linux.vnet.ibm.com/t/#u

Linux kernel configured with CONFIG_NUMA on a system with multiple
possible nodes, marks node 0 as online at boot. However in practice,
there are systems which have node 0 as memoryless and cpuless.

This can cause
1. numa_balancing to be enabled on systems with only one online node.
2. Existence of dummy (cpuless and memoryless) node which can confuse
users/scripts looking at output of lscpu / numactl.

This patchset wants to correct this anomaly.

This should only affect systems that have CONFIG_MEMORYLESS_NODES.
Currently there are only 2 architectures ia64 and powerpc that have this
config.

Note: Patch 3 in this patch series depends on patches 1 and 2.
Without patches 1 and 2, patch 3 might crash powerpc.

v5.8-rc2
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
------------------
 /sys/devices/system/node/online:            0,2
 /proc/sys/kernel/numa_balancing:            1
 /sys/devices/system/node/has_cpu:           2
 /sys/devices/system/node/has_memory:        2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:          0-31

v5.8-rc2 + patches
------------------
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
------------------
/sys/devices/system/node/online:            2
/proc/sys/kernel/numa_balancing:            0
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

1. User space applications like Numactl, lscpu, that parse the sysfs tend to
believe there is an extra online node. This tends to confuse users and
applications. Other user space applications start believing that system was
not able to use all the resources (i.e missing resources) or the system was
not setup correctly.

2. Also existence of dummy node also leads to inconsistent information. The
number of online nodes is inconsistent with the information in the
device-tree and resource-dump

3. When the dummy node is present, single node non-Numa systems end up showing
up as NUMA systems and numa_balancing gets enabled. This will mean we take
the hit from the unnecessary numa hinting faults.

On a machine with just one node with node number not being 0,
the current setup will end up showing 2 online nodes. And when there are
more than one online nodes, numa_balancing gets enabled.

Without patch
$ grep numa /proc/vmstat
numa_hit 95179
numa_miss 0
numa_foreign 0
numa_interleave 3764
numa_local 95179
numa_other 0
numa_pte_updates 1206973                 <----------
numa_huge_pte_updates 4654                 <----------
numa_hint_faults 19560                 <----------
numa_hint_faults_local 19560                 <----------
numa_pages_migrated 0

With patch
$ grep numa /proc/vmstat
numa_hit 322338756
numa_miss 0
numa_foreign 0
numa_interleave 3790
numa_local 322338756
numa_other 0
numa_pte_updates 0                 <----------
numa_huge_pte_updates 0                 <----------
numa_hint_faults 0                 <----------
numa_hint_faults_local 0                 <----------
numa_pages_migrated 0

Here are 2 sample numa programs.

numa01.sh is a set of 2 process each running threads as many as number of
cpus;
each thread doing 50 loops on 3GB process shared memory operations.

numa02.sh is a single process with threads as many as number of cpus;
each thread doing 800 loops on 32MB thread local memory operations.

Testcase         Time:  Min      Max      Avg      StdDev
./numa01.sh      Real:  149.62   149.66   149.64   0.02
./numa01.sh      Sys:   3.21     3.71     3.46     0.25
./numa01.sh      User:  4755.13  4758.15  4756.64  1.51
./numa02.sh      Real:  24.98    25.02    25.00    0.02
./numa02.sh      Sys:   0.51     0.59     0.55     0.04
./numa02.sh      User:  790.28   790.88   790.58   0.30

Testcase         Time:  Min      Max      Avg      StdDev  %Change
./numa01.sh      Real:  149.44   149.46   149.45   0.01    0.127133%
./numa01.sh      Sys:   0.71     0.89     0.80     0.09    332.5%
./numa01.sh      User:  4754.19  4754.48  4754.33  0.15    0.0485873%
./numa02.sh      Real:  24.97    24.98    24.98    0.00    0.0800641%
./numa02.sh      Sys:   0.26     0.41     0.33     0.08    66.6667%
./numa02.sh      User:  789.75   790.28   790.01   0.27    0.072151%

numa01.sh
param                   no_patch    with_patch  %Change
-----                   ----------  ----------  -------
numa_hint_faults        1131164     0           -100%
numa_hint_faults_local  1131164     0           -100%
numa_hit                213696      214244      0.256439%
numa_local              213696      214244      0.256439%
numa_pte_updates        1131294     0           -100%
pgfault                 1380845     241424      -82.5162%
pgmajfault              75          60          -20%

Here are 2 sample numa programs.

numa01.sh is a set of 2 process each running threads as many as number of
cpus;
each thread doing 50 loops on 3GB process shared memory operations.

numa02.sh is a single process with threads as many as number of cpus;
each thread doing 800 loops on 32MB thread local memory operations.

Without patch
-------------
Testcase         Time:  Min      Max      Avg      StdDev
./numa01.sh      Real:  149.62   149.66   149.64   0.02
./numa01.sh      Sys:   3.21     3.71     3.46     0.25
./numa01.sh      User:  4755.13  4758.15  4756.64  1.51
./numa02.sh      Real:  24.98    25.02    25.00    0.02
./numa02.sh      Sys:   0.51     0.59     0.55     0.04
./numa02.sh      User:  790.28   790.88   790.58   0.30

With patch
-----------
Testcase         Time:  Min      Max      Avg      StdDev  %Change
./numa01.sh      Real:  149.44   149.46   149.45   0.01    0.127133%
./numa01.sh      Sys:   0.71     0.89     0.80     0.09    332.5%
./numa01.sh      User:  4754.19  4754.48  4754.33  0.15    0.0485873%
./numa02.sh      Real:  24.97    24.98    24.98    0.00    0.0800641%
./numa02.sh      Sys:   0.26     0.41     0.33     0.08    66.6667%
./numa02.sh      User:  789.75   790.28   790.01   0.27    0.072151%

numa01.sh
param                   no_patch    with_patch  %Change
-----                   ----------  ----------  -------
numa_hint_faults        1131164     0           -100%
numa_hint_faults_local  1131164     0           -100%
numa_hit                213696      214244      0.256439%
numa_local              213696      214244      0.256439%
numa_pte_updates        1131294     0           -100%
pgfault                 1380845     241424      -82.5162%
pgmajfault              75          60          -20%

numa02.sh
param                   no_patch    with_patch  %Change
-----                   ----------  ----------  -------
numa_hint_faults        111878      0           -100%
numa_hint_faults_local  111878      0           -100%
numa_hit                41854       43220       3.26373%
numa_local              41854       43220       3.26373%
numa_pte_updates        113926      0           -100%
pgfault                 163662      51210       -68.7099%
pgmajfault              56          52          -7.14286%

Observations:
The real time and user time actually doesn't change much. However the system
time changes to some extent. The reason being the number of numa hinting
faults. With the patch we are not seeing the numa hinting faults.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>

Srikar Dronamraju (3):
  powerpc/numa: Set numa_node for all possible cpus
  powerpc/numa: Prefer node id queried from vphn
  mm/page_alloc: Keep memoryless cpuless node 0 offline

 arch/powerpc/mm/numa.c | 35 +++++++++++++++++++++++++----------
 mm/page_alloc.c        |  4 +++-
 2 files changed, 28 insertions(+), 11 deletions(-)

-- 
2.18.1



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v5 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-06-24  9:28 [PATCH v5 0/3] Offline memoryless cpuless node 0 Srikar Dronamraju
@ 2020-06-24  9:28 ` Srikar Dronamraju
  2020-06-24  9:48   ` Gautham R Shenoy
  2020-06-24  9:28 ` [PATCH v5 2/3] powerpc/numa: Prefer node id queried from vphn Srikar Dronamraju
  2020-06-24  9:28 ` [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline Srikar Dronamraju
  2 siblings, 1 reply; 26+ messages in thread
From: Srikar Dronamraju @ 2020-06-24  9:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Srikar Dronamraju, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds,
	Gautham R Shenoy, Satheesh Rajendran, David Hildenbrand

A Powerpc system with multiple possible nodes and with CONFIG_NUMA
enabled always used to have a node 0, even if node 0 does not any cpus
or memory attached to it. As per PAPR, node affinity of a cpu is only
available once its present / online. For all cpus that are possible but
not present, cpu_to_node() would point to node 0.

To ensure a cpuless, memoryless dummy node is not online, powerpc need
to make sure all possible but not present cpu_to_node are set to a
proper node.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v4:->v5:
- rebased to v5.8-rc2
link v4: http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v3:->v4:
- Resolved comments from Christopher.
Link v3: http://lore.kernel.org/lkml/20200501031128.19584-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3

 arch/powerpc/mm/numa.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 9fcf2d195830..5b7918c132f5 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -506,6 +506,11 @@ static int numa_setup_cpu(unsigned long lcpu)
 	int fcpu = cpu_first_thread_sibling(lcpu);
 	int nid = NUMA_NO_NODE;
 
+	if (!cpu_present(lcpu)) {
+		set_cpu_numa_node(lcpu, first_online_node);
+		return first_online_node;
+	}
+
 	/*
 	 * If a valid cpu-to-node mapping is already available, use it
 	 * directly instead of querying the firmware, since it represents
@@ -931,8 +936,17 @@ void __init mem_topology_setup(void)
 
 	reset_numa_cpu_lookup_table();
 
-	for_each_present_cpu(cpu)
+	for_each_possible_cpu(cpu) {
+		/*
+		 * Powerpc with CONFIG_NUMA always used to have a node 0,
+		 * even if it was memoryless or cpuless. For all cpus that
+		 * are possible but not present, cpu_to_node() would point
+		 * to node 0. To remove a cpuless, memoryless dummy node,
+		 * powerpc need to make sure all possible but not present
+		 * cpu_to_node are set to a proper node.
+		 */
 		numa_setup_cpu(cpu);
+	}
 }
 
 void __init initmem_init(void)
-- 
2.18.1



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v5 2/3] powerpc/numa: Prefer node id queried from vphn
  2020-06-24  9:28 [PATCH v5 0/3] Offline memoryless cpuless node 0 Srikar Dronamraju
  2020-06-24  9:28 ` [PATCH v5 1/3] powerpc/numa: Set numa_node for all possible cpus Srikar Dronamraju
@ 2020-06-24  9:28 ` Srikar Dronamraju
  2020-06-24 10:29   ` Gautham R Shenoy
  2020-06-24  9:28 ` [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline Srikar Dronamraju
  2 siblings, 1 reply; 26+ messages in thread
From: Srikar Dronamraju @ 2020-06-24  9:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Srikar Dronamraju, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds,
	Gautham R Shenoy, Satheesh Rajendran, David Hildenbrand

Node id queried from the static device tree may not
be correct. For example: it may always show 0 on a shared processor.
Hence prefer the node id queried from vphn and fallback on the device tree
based node id if vphn query fails.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v4:->v5:
- rebased to v5.8-rc2
link v4: http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v2:->v3:
- Resolved comments from Gautham.
Link v2: https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3

 arch/powerpc/mm/numa.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 5b7918c132f5..163f5dc8fd46 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -724,21 +724,22 @@ static int __init parse_numa_properties(void)
 	 */
 	for_each_present_cpu(i) {
 		struct device_node *cpu;
-		int nid;
-
-		cpu = of_get_cpu_node(i, NULL);
-		BUG_ON(!cpu);
-		nid = of_node_to_nid_single(cpu);
-		of_node_put(cpu);
+		int nid = vphn_get_nid(i);
 
 		/*
 		 * Don't fall back to default_nid yet -- we will plug
 		 * cpus into nodes once the memory scan has discovered
 		 * the topology.
 		 */
-		if (nid < 0)
-			continue;
-		node_set_online(nid);
+		if (nid == NUMA_NO_NODE) {
+			cpu = of_get_cpu_node(i, NULL);
+			BUG_ON(!cpu);
+			nid = of_node_to_nid_single(cpu);
+			of_node_put(cpu);
+		}
+
+		if (likely(nid > 0))
+			node_set_online(nid);
 	}
 
 	get_n_mem_cells(&n_mem_addr_cells, &n_mem_size_cells);
-- 
2.18.1



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-06-24  9:28 [PATCH v5 0/3] Offline memoryless cpuless node 0 Srikar Dronamraju
  2020-06-24  9:28 ` [PATCH v5 1/3] powerpc/numa: Set numa_node for all possible cpus Srikar Dronamraju
  2020-06-24  9:28 ` [PATCH v5 2/3] powerpc/numa: Prefer node id queried from vphn Srikar Dronamraju
@ 2020-06-24  9:28 ` Srikar Dronamraju
  2020-06-29 14:58   ` Christopher Lameter
  2020-07-01  8:42   ` Michal Hocko
  2 siblings, 2 replies; 26+ messages in thread
From: Srikar Dronamraju @ 2020-06-24  9:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Srikar Dronamraju, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds,
	Gautham R Shenoy, Satheesh Rajendran, David Hildenbrand

Currently Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes, marks node 0 as online at boot.  However in practice,
there are systems which have node 0 as memoryless and cpuless.

This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this dummy node which is cpuless and
memoryless node can confuse users/scripts looking at output of lscpu /
numactl.

By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
always online.

v5.8-rc2
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
------------------
 /sys/devices/system/node/online:            0,2
 /proc/sys/kernel/numa_balancing:            1
 /sys/devices/system/node/has_cpu:           2
 /sys/devices/system/node/has_memory:        2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:          0-31

v5.8-rc2 + patch
------------------
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
------------------
/sys/devices/system/node/online:            2
/proc/sys/kernel/numa_balancing:            0
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

Note: On Powerpc, cpu_to_node of possible but not present cpus would
previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
queried from vphn"). Without the 2 commits, Powerpc system might crash.

1. User space applications like Numactl, lscpu, that parse the sysfs tend to
believe there is an extra online node. This tends to confuse users and
applications. Other user space applications start believing that system was
not able to use all the resources (i.e missing resources) or the system was
not setup correctly.

2. Also existence of dummy node also leads to inconsistent information. The
number of online nodes is inconsistent with the information in the
device-tree and resource-dump

3. When the dummy node is present, single node non-Numa systems end up showing
up as NUMA systems and numa_balancing gets enabled. This will mean we take
the hit from the unnecessary numa hinting faults.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v4:->v5:
- rebased to v5.8-rc2
link v4: http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3
Link v2: https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/#u

 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..5187664558e1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -117,8 +117,10 @@ EXPORT_SYMBOL(latent_entropy);
  */
 nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 	[N_POSSIBLE] = NODE_MASK_ALL,
+#ifdef CONFIG_NUMA
+	[N_ONLINE] = NODE_MASK_NONE,
+#else
 	[N_ONLINE] = { { [0] = 1UL } },
-#ifndef CONFIG_NUMA
 	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
 #ifdef CONFIG_HIGHMEM
 	[N_HIGH_MEMORY] = { { [0] = 1UL } },
-- 
2.18.1



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/3] powerpc/numa: Set numa_node for all possible cpus
  2020-06-24  9:28 ` [PATCH v5 1/3] powerpc/numa: Set numa_node for all possible cpus Srikar Dronamraju
@ 2020-06-24  9:48   ` Gautham R Shenoy
  0 siblings, 0 replies; 26+ messages in thread
From: Gautham R Shenoy @ 2020-06-24  9:48 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds,
	Gautham R Shenoy, Satheesh Rajendran, David Hildenbrand

On Wed, Jun 24, 2020 at 02:58:44PM +0530, Srikar Dronamraju wrote:
> A Powerpc system with multiple possible nodes and with CONFIG_NUMA
> enabled always used to have a node 0, even if node 0 does not any cpus
> or memory attached to it. As per PAPR, node affinity of a cpu is only
> available once its present / online. For all cpus that are possible but
> not present, cpu_to_node() would point to node 0.
> 
> To ensure a cpuless, memoryless dummy node is not online, powerpc need
> to make sure all possible but not present cpu_to_node are set to a
> proper node.
> 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

This looks good to me.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>

--
Thanks and Regards
gautham.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] powerpc/numa: Prefer node id queried from vphn
  2020-06-24  9:28 ` [PATCH v5 2/3] powerpc/numa: Prefer node id queried from vphn Srikar Dronamraju
@ 2020-06-24 10:29   ` Gautham R Shenoy
  0 siblings, 0 replies; 26+ messages in thread
From: Gautham R Shenoy @ 2020-06-24 10:29 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds,
	Gautham R Shenoy, Satheesh Rajendran, David Hildenbrand

Hello Srikar,


On Wed, Jun 24, 2020 at 02:58:45PM +0530, Srikar Dronamraju wrote:
> Node id queried from the static device tree may not
> be correct. For example: it may always show 0 on a shared processor.
> Hence prefer the node id queried from vphn and fallback on the device tree
> based node id if vphn query fails.
> 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>


This patch looks good to me.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>


--
Thanks and Regards
gautham.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-06-24  9:28 ` [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline Srikar Dronamraju
@ 2020-06-29 14:58   ` Christopher Lameter
  2020-06-30  4:01     ` Srikar Dronamraju
  2020-07-01  8:42   ` Michal Hocko
  1 sibling, 1 reply; 26+ messages in thread
From: Christopher Lameter @ 2020-06-29 14:58 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Michael Ellerman, Linus Torvalds, Gautham R Shenoy,
	Satheesh Rajendran, David Hildenbrand

On Wed, 24 Jun 2020, Srikar Dronamraju wrote:

> Currently Linux kernel with CONFIG_NUMA on a system with multiple
> possible nodes, marks node 0 as online at boot.  However in practice,
> there are systems which have node 0 as memoryless and cpuless.

Maybe add something to explain why you are not simply mapping the
existing memory to NUMA node 0 which is after all just a numbering scheme
used by the kernel and can be used arbitrarily?

This could be seen more as a bug in the arch code during the setup of NUMA
nodes. The two nodes are created by the firmwware / bootstrap code after
all. Just do not do it?



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-06-29 14:58   ` Christopher Lameter
@ 2020-06-30  4:01     ` Srikar Dronamraju
  2020-07-01 12:23       ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Srikar Dronamraju @ 2020-06-30  4:01 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Andrew Morton, linuxppc-dev, linux-mm, linux-kernel,
	Michal Hocko, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Michael Ellerman, Linus Torvalds, Gautham R Shenoy,
	Satheesh Rajendran, David Hildenbrand

* Christopher Lameter <cl@linux.com> [2020-06-29 14:58:40]:

> On Wed, 24 Jun 2020, Srikar Dronamraju wrote:
> 
> > Currently Linux kernel with CONFIG_NUMA on a system with multiple
> > possible nodes, marks node 0 as online at boot.  However in practice,
> > there are systems which have node 0 as memoryless and cpuless.
> 
> Maybe add something to explain why you are not simply mapping the
> existing memory to NUMA node 0 which is after all just a numbering scheme
> used by the kernel and can be used arbitrarily?
> 

I thought Michal Hocko already gave a clear picture on why mapping is a bad
idea. https://lore.kernel.org/lkml/20200316085425.GB11482@dhcp22.suse.cz/t/#u
Are you suggesting that we add that as part of the changelog?

> This could be seen more as a bug in the arch code during the setup of NUMA
> nodes. The two nodes are created by the firmwware / bootstrap code after
> all. Just do not do it?
> 

- The arch/setup code in powerpc is not onlining these nodes. 
- Later on cpus/memory in node 0 can be onlined.
- Firmware in this case Phyp is an independent code by itself.

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-06-24  9:28 ` [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline Srikar Dronamraju
  2020-06-29 14:58   ` Christopher Lameter
@ 2020-07-01  8:42   ` Michal Hocko
  2020-07-01 10:04     ` Srikar Dronamraju
  1 sibling, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2020-07-01  8:42 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, linuxppc-dev, linux-mm, linux-kernel, Mel Gorman,
	Vlastimil Babka, Kirill A. Shutemov, Christopher Lameter,
	Michael Ellerman, Linus Torvalds, Gautham R Shenoy,
	Satheesh Rajendran, David Hildenbrand


On Wed 24-06-20 14:58:46, Srikar Dronamraju wrote:
> Currently Linux kernel with CONFIG_NUMA on a system with multiple
> possible nodes, marks node 0 as online at boot.  However in practice,
> there are systems which have node 0 as memoryless and cpuless.
> 
> This can cause numa_balancing to be enabled on systems with only one node
> with memory and CPUs. The existence of this dummy node which is cpuless and
> memoryless node can confuse users/scripts looking at output of lscpu /
> numactl.
> 
> By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
> always online.
> 
> v5.8-rc2
>  available: 2 nodes (0,2)
>  node 0 cpus:
>  node 0 size: 0 MB
>  node 0 free: 0 MB
>  node 2 cpus: 0 1 2 3 4 5 6 7
>  node 2 size: 32625 MB
>  node 2 free: 31490 MB
>  node distances:
>  node   0   2
>    0:  10  20
>    2:  20  10
> 
> proc and sys files
> ------------------
>  /sys/devices/system/node/online:            0,2
>  /proc/sys/kernel/numa_balancing:            1
>  /sys/devices/system/node/has_cpu:           2
>  /sys/devices/system/node/has_memory:        2
>  /sys/devices/system/node/has_normal_memory: 2
>  /sys/devices/system/node/possible:          0-31
> 
> v5.8-rc2 + patch
> ------------------
>  available: 1 nodes (2)
>  node 2 cpus: 0 1 2 3 4 5 6 7
>  node 2 size: 32625 MB
>  node 2 free: 31487 MB
>  node distances:
>  node   2
>    2:  10
> 
> proc and sys files
> ------------------
> /sys/devices/system/node/online:            2
> /proc/sys/kernel/numa_balancing:            0
> /sys/devices/system/node/has_cpu:           2
> /sys/devices/system/node/has_memory:        2
> /sys/devices/system/node/has_normal_memory: 2
> /sys/devices/system/node/possible:          0-31
> 
> Note: On Powerpc, cpu_to_node of possible but not present cpus would
> previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
> numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
> queried from vphn"). Without the 2 commits, Powerpc system might crash.
> 
> 1. User space applications like Numactl, lscpu, that parse the sysfs tend to
> believe there is an extra online node. This tends to confuse users and
> applications. Other user space applications start believing that system was
> not able to use all the resources (i.e missing resources) or the system was
> not setup correctly.
> 
> 2. Also existence of dummy node also leads to inconsistent information. The
> number of online nodes is inconsistent with the information in the
> device-tree and resource-dump
> 
> 3. When the dummy node is present, single node non-Numa systems end up showing
> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> the hit from the unnecessary numa hinting faults.

I have to say that I dislike the node online/offline state and directly
exporting that to the userspace. Users should only care whether the node
has memory/cpus. Numa nodes can be online without any memory. Just
offline all the present memory blocks but do not physically hot remove
them and you are in the same situation. If users are confused by an
output of tools like numactl -H then those could be updated and hide
nodes without any memory&cpus.

The autonuma problem sounds interesting but again this patch doesn't
really solve the underlying problem because I strongly suspect that the
problem is still there when a numa node gets all its memory offline as
mentioned above.

While I completely agree that making node 0 special is wrong, I have
still hard time to review this very simply looking patch because all the
numa initialization is so spread around that this might just blow up
at unexpected places. IIRC we have discussed testing in the previous
version and David has provided a way to emulate these configurations
on x86. Did you manage to use those instruction for additional testing
on other than ppc architectures?

> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> Changelog v4:->v5:
> - rebased to v5.8-rc2
> link v4: http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u
> 
> Changelog v1:->v2:
> - Rebased to v5.7-rc3
> Link v2: https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/#u
> 
>  mm/page_alloc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48eb0f1410d4..5187664558e1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -117,8 +117,10 @@ EXPORT_SYMBOL(latent_entropy);
>   */
>  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
>  	[N_POSSIBLE] = NODE_MASK_ALL,
> +#ifdef CONFIG_NUMA
> +	[N_ONLINE] = NODE_MASK_NONE,
> +#else
>  	[N_ONLINE] = { { [0] = 1UL } },
> -#ifndef CONFIG_NUMA
>  	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
>  #ifdef CONFIG_HIGHMEM
>  	[N_HIGH_MEMORY] = { { [0] = 1UL } },
> -- 
> 2.18.1
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-07-01  8:42   ` Michal Hocko
@ 2020-07-01 10:04     ` Srikar Dronamraju
  2020-07-01 10:15       ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Srikar Dronamraju @ 2020-07-01 10:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linuxppc-dev, linux-mm, linux-kernel, Mel Gorman,
	Vlastimil Babka, Kirill A. Shutemov, Christopher Lameter,
	Michael Ellerman, Linus Torvalds, Gautham R Shenoy,
	Satheesh Rajendran, David Hildenbrand

* Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]:

> 
> > 
> > 2. Also existence of dummy node also leads to inconsistent information. The
> > number of online nodes is inconsistent with the information in the
> > device-tree and resource-dump
> > 
> > 3. When the dummy node is present, single node non-Numa systems end up showing
> > up as NUMA systems and numa_balancing gets enabled. This will mean we take
> > the hit from the unnecessary numa hinting faults.
> 
> I have to say that I dislike the node online/offline state and directly
> exporting that to the userspace. Users should only care whether the node
> has memory/cpus. Numa nodes can be online without any memory. Just
> offline all the present memory blocks but do not physically hot remove
> them and you are in the same situation. If users are confused by an
> output of tools like numactl -H then those could be updated and hide
> nodes without any memory&cpus.
> 
> The autonuma problem sounds interesting but again this patch doesn't
> really solve the underlying problem because I strongly suspect that the
> problem is still there when a numa node gets all its memory offline as
> mentioned above.
> 
> While I completely agree that making node 0 special is wrong, I have
> still hard time to review this very simply looking patch because all the
> numa initialization is so spread around that this might just blow up
> at unexpected places. IIRC we have discussed testing in the previous
> version and David has provided a way to emulate these configurations
> on x86. Did you manage to use those instruction for additional testing
> on other than ppc architectures?
> 

I have tried all the steps that David mentioned and reported back at
https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u

As a summary, David's steps are still not creating a memoryless/cpuless on
x86 VM. I have tried booting with Numa/non-numa on all the x86 machines that
I could get to.

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-07-01 10:04     ` Srikar Dronamraju
@ 2020-07-01 10:15       ` David Hildenbrand
  2020-07-01 11:01         ` Srikar Dronamraju
  0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2020-07-01 10:15 UTC (permalink / raw)
  To: Srikar Dronamraju, Michal Hocko
  Cc: Andrew Morton, linuxppc-dev, linux-mm, linux-kernel, Mel Gorman,
	Vlastimil Babka, Kirill A. Shutemov, Christopher Lameter,
	Michael Ellerman, Linus Torvalds, Gautham R Shenoy,
	Satheesh Rajendran

On 01.07.20 12:04, Srikar Dronamraju wrote:
> * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]:
> 
>>
>>>
>>> 2. Also existence of dummy node also leads to inconsistent information. The
>>> number of online nodes is inconsistent with the information in the
>>> device-tree and resource-dump
>>>
>>> 3. When the dummy node is present, single node non-Numa systems end up showing
>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
>>> the hit from the unnecessary numa hinting faults.
>>
>> I have to say that I dislike the node online/offline state and directly
>> exporting that to the userspace. Users should only care whether the node
>> has memory/cpus. Numa nodes can be online without any memory. Just
>> offline all the present memory blocks but do not physically hot remove
>> them and you are in the same situation. If users are confused by an
>> output of tools like numactl -H then those could be updated and hide
>> nodes without any memory&cpus.
>>
>> The autonuma problem sounds interesting but again this patch doesn't
>> really solve the underlying problem because I strongly suspect that the
>> problem is still there when a numa node gets all its memory offline as
>> mentioned above.
>>
>> While I completely agree that making node 0 special is wrong, I have
>> still hard time to review this very simply looking patch because all the
>> numa initialization is so spread around that this might just blow up
>> at unexpected places. IIRC we have discussed testing in the previous
>> version and David has provided a way to emulate these configurations
>> on x86. Did you manage to use those instruction for additional testing
>> on other than ppc architectures?
>>
> 
> I have tried all the steps that David mentioned and reported back at
> https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u
> 
> As a summary, David's steps are still not creating a memoryless/cpuless on
> x86 VM.

Now, that is wrong. You get a memoryless/cpuless node, which is *not
online*. Once you hotplug some memory, it will switch online. Once you
remove memory, it will switch back offline.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-07-01 10:15       ` David Hildenbrand
@ 2020-07-01 11:01         ` Srikar Dronamraju
  2020-07-01 11:06           ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Srikar Dronamraju @ 2020-07-01 11:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Andrew Morton, linuxppc-dev, linux-mm,
	linux-kernel, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds,
	Gautham R Shenoy, Satheesh Rajendran

* David Hildenbrand <david@redhat.com> [2020-07-01 12:15:54]:

> On 01.07.20 12:04, Srikar Dronamraju wrote:
> > * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]:
> > 
> >>
> >>>
> >>> 2. Also existence of dummy node also leads to inconsistent information. The
> >>> number of online nodes is inconsistent with the information in the
> >>> device-tree and resource-dump
> >>>
> >>> 3. When the dummy node is present, single node non-Numa systems end up showing
> >>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> >>> the hit from the unnecessary numa hinting faults.
> >>
> >> I have to say that I dislike the node online/offline state and directly
> >> exporting that to the userspace. Users should only care whether the node
> >> has memory/cpus. Numa nodes can be online without any memory. Just
> >> offline all the present memory blocks but do not physically hot remove
> >> them and you are in the same situation. If users are confused by an
> >> output of tools like numactl -H then those could be updated and hide
> >> nodes without any memory&cpus.
> >>
> >> The autonuma problem sounds interesting but again this patch doesn't
> >> really solve the underlying problem because I strongly suspect that the
> >> problem is still there when a numa node gets all its memory offline as
> >> mentioned above.
> >>
> >> While I completely agree that making node 0 special is wrong, I have
> >> still hard time to review this very simply looking patch because all the
> >> numa initialization is so spread around that this might just blow up
> >> at unexpected places. IIRC we have discussed testing in the previous
> >> version and David has provided a way to emulate these configurations
> >> on x86. Did you manage to use those instruction for additional testing
> >> on other than ppc architectures?
> >>
> > 
> > I have tried all the steps that David mentioned and reported back at
> > https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u
> > 
> > As a summary, David's steps are still not creating a memoryless/cpuless on
> > x86 VM.
> 
> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> online*. Once you hotplug some memory, it will switch online. Once you
> remove memory, it will switch back offline.
> 

Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
boot.  The code in question tries to handle a cpuless/memoryless node 0 at
boot.

With the steps that you gave the node 0 was always populated, node 1 or
some other node would be memoryless/cpuless and offline. But that should
have no impact by patch.

I don't see how adding/hotplugging/removing memory to a node after boot is
going to affect the changes that I have made. Please do correct me if I have
misunderstood.


-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-07-01 11:01         ` Srikar Dronamraju
@ 2020-07-01 11:06           ` David Hildenbrand
  2020-07-01 11:30             ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2020-07-01 11:06 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Michal Hocko, Andrew Morton, linuxppc-dev, linux-mm,
	linux-kernel, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds,
	Gautham R Shenoy, Satheesh Rajendran

On 01.07.20 13:01, Srikar Dronamraju wrote:
> * David Hildenbrand <david@redhat.com> [2020-07-01 12:15:54]:
> 
>> On 01.07.20 12:04, Srikar Dronamraju wrote:
>>> * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]:
>>>
>>>>
>>>>>
>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
>>>>> number of online nodes is inconsistent with the information in the
>>>>> device-tree and resource-dump
>>>>>
>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
>>>>> the hit from the unnecessary numa hinting faults.
>>>>
>>>> I have to say that I dislike the node online/offline state and directly
>>>> exporting that to the userspace. Users should only care whether the node
>>>> has memory/cpus. Numa nodes can be online without any memory. Just
>>>> offline all the present memory blocks but do not physically hot remove
>>>> them and you are in the same situation. If users are confused by an
>>>> output of tools like numactl -H then those could be updated and hide
>>>> nodes without any memory&cpus.
>>>>
>>>> The autonuma problem sounds interesting but again this patch doesn't
>>>> really solve the underlying problem because I strongly suspect that the
>>>> problem is still there when a numa node gets all its memory offline as
>>>> mentioned above.
>>>>
>>>> While I completely agree that making node 0 special is wrong, I have
>>>> still hard time to review this very simply looking patch because all the
>>>> numa initialization is so spread around that this might just blow up
>>>> at unexpected places. IIRC we have discussed testing in the previous
>>>> version and David has provided a way to emulate these configurations
>>>> on x86. Did you manage to use those instruction for additional testing
>>>> on other than ppc architectures?
>>>>
>>>
>>> I have tried all the steps that David mentioned and reported back at
>>> https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u
>>>
>>> As a summary, David's steps are still not creating a memoryless/cpuless on
>>> x86 VM.
>>
>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
>> online*. Once you hotplug some memory, it will switch online. Once you
>> remove memory, it will switch back offline.
>>
> 
> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
> boot.  The code in question tries to handle a cpuless/memoryless node 0 at
> boot.

I was just correcting your statement, because it was wrong.

Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
have CPUs nor memory. That would imply that we can, in fact, never have
node 0 offline during boot.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-07-01 11:06           ` David Hildenbrand
@ 2020-07-01 11:30             ` David Hildenbrand
  2020-07-01 12:21               ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2020-07-01 11:30 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Michal Hocko, Andrew Morton, linuxppc-dev, linux-mm,
	linux-kernel, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds,
	Gautham R Shenoy, Satheesh Rajendran

On 01.07.20 13:06, David Hildenbrand wrote:
> On 01.07.20 13:01, Srikar Dronamraju wrote:
>> * David Hildenbrand <david@redhat.com> [2020-07-01 12:15:54]:
>>
>>> On 01.07.20 12:04, Srikar Dronamraju wrote:
>>>> * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]:
>>>>
>>>>>
>>>>>>
>>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
>>>>>> number of online nodes is inconsistent with the information in the
>>>>>> device-tree and resource-dump
>>>>>>
>>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
>>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
>>>>>> the hit from the unnecessary numa hinting faults.
>>>>>
>>>>> I have to say that I dislike the node online/offline state and directly
>>>>> exporting that to the userspace. Users should only care whether the node
>>>>> has memory/cpus. Numa nodes can be online without any memory. Just
>>>>> offline all the present memory blocks but do not physically hot remove
>>>>> them and you are in the same situation. If users are confused by an
>>>>> output of tools like numactl -H then those could be updated and hide
>>>>> nodes without any memory&cpus.
>>>>>
>>>>> The autonuma problem sounds interesting but again this patch doesn't
>>>>> really solve the underlying problem because I strongly suspect that the
>>>>> problem is still there when a numa node gets all its memory offline as
>>>>> mentioned above.
>>>>>
>>>>> While I completely agree that making node 0 special is wrong, I have
>>>>> still hard time to review this very simply looking patch because all the
>>>>> numa initialization is so spread around that this might just blow up
>>>>> at unexpected places. IIRC we have discussed testing in the previous
>>>>> version and David has provided a way to emulate these configurations
>>>>> on x86. Did you manage to use those instruction for additional testing
>>>>> on other than ppc architectures?
>>>>>
>>>>
>>>> I have tried all the steps that David mentioned and reported back at
>>>> https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u
>>>>
>>>> As a summary, David's steps are still not creating a memoryless/cpuless on
>>>> x86 VM.
>>>
>>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
>>> online*. Once you hotplug some memory, it will switch online. Once you
>>> remove memory, it will switch back offline.
>>>
>>
>> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
>> boot.  The code in question tries to handle a cpuless/memoryless node 0 at
>> boot.
> 
> I was just correcting your statement, because it was wrong.
> 
> Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
> have CPUs nor memory. That would imply that we can, in fact, never have
> node 0 offline during boot.
> 

Yep, looks like it.

[    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
[    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
[    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
[    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
[    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
[    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
[    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]



-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-07-01 11:30             ` David Hildenbrand
@ 2020-07-01 12:21               ` Michal Hocko
  2020-07-02  6:44                 ` Srikar Dronamraju
  2020-07-03  9:10                 ` Michal Suchánek
  0 siblings, 2 replies; 26+ messages in thread
From: Michal Hocko @ 2020-07-01 12:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Srikar Dronamraju, Andrew Morton, linuxppc-dev, linux-mm,
	linux-kernel, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds,
	Gautham R Shenoy, Satheesh Rajendran

On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> On 01.07.20 13:06, David Hildenbrand wrote:
> > On 01.07.20 13:01, Srikar Dronamraju wrote:
> >> * David Hildenbrand <david@redhat.com> [2020-07-01 12:15:54]:
> >>
> >>> On 01.07.20 12:04, Srikar Dronamraju wrote:
> >>>> * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]:
> >>>>
> >>>>>
> >>>>>>
> >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
> >>>>>> number of online nodes is inconsistent with the information in the
> >>>>>> device-tree and resource-dump
> >>>>>>
> >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
> >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> >>>>>> the hit from the unnecessary numa hinting faults.
> >>>>>
> >>>>> I have to say that I dislike the node online/offline state and directly
> >>>>> exporting that to the userspace. Users should only care whether the node
> >>>>> has memory/cpus. Numa nodes can be online without any memory. Just
> >>>>> offline all the present memory blocks but do not physically hot remove
> >>>>> them and you are in the same situation. If users are confused by an
> >>>>> output of tools like numactl -H then those could be updated and hide
> >>>>> nodes without any memory&cpus.
> >>>>>
> >>>>> The autonuma problem sounds interesting but again this patch doesn't
> >>>>> really solve the underlying problem because I strongly suspect that the
> >>>>> problem is still there when a numa node gets all its memory offline as
> >>>>> mentioned above.

I would really appreciate a feedback to these two as well.

> >>>>> While I completely agree that making node 0 special is wrong, I have
> >>>>> still hard time to review this very simply looking patch because all the
> >>>>> numa initialization is so spread around that this might just blow up
> >>>>> at unexpected places. IIRC we have discussed testing in the previous
> >>>>> version and David has provided a way to emulate these configurations
> >>>>> on x86. Did you manage to use those instruction for additional testing
> >>>>> on other than ppc architectures?
> >>>>>
> >>>>
> >>>> I have tried all the steps that David mentioned and reported back at
> >>>> https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u
> >>>>
> >>>> As a summary, David's steps are still not creating a memoryless/cpuless on
> >>>> x86 VM.
> >>>
> >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> >>> online*. Once you hotplug some memory, it will switch online. Once you
> >>> remove memory, it will switch back offline.
> >>>
> >>
> >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
> >> boot.  The code in question tries to handle a cpuless/memoryless node 0 at
> >> boot.
> > 
> > I was just correcting your statement, because it was wrong.
> > 
> > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
> > have CPUs nor memory. That would imply that we can, in fact, never have
> > node 0 offline during boot.
> > 
> 
> Yep, looks like it.
> 
> [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]

This begs a question whether ppc can do the same thing?

I would swear that we've had x86 system with node 0 but I cannot really
find it and it is possible that it was not x86 after all...
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-06-30  4:01     ` Srikar Dronamraju
@ 2020-07-01 12:23       ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2020-07-01 12:23 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Christopher Lameter, Andrew Morton, linuxppc-dev, linux-mm,
	linux-kernel, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Michael Ellerman, Linus Torvalds, Gautham R Shenoy,
	Satheesh Rajendran, David Hildenbrand

On Tue 30-06-20 09:31:25, Srikar Dronamraju wrote:
> * Christopher Lameter <cl@linux.com> [2020-06-29 14:58:40]:
> 
> > On Wed, 24 Jun 2020, Srikar Dronamraju wrote:
> > 
> > > Currently Linux kernel with CONFIG_NUMA on a system with multiple
> > > possible nodes, marks node 0 as online at boot.  However in practice,
> > > there are systems which have node 0 as memoryless and cpuless.
> > 
> > Maybe add something to explain why you are not simply mapping the
> > existing memory to NUMA node 0 which is after all just a numbering scheme
> > used by the kernel and can be used arbitrarily?
> > 
> 
> I thought Michal Hocko already gave a clear picture on why mapping is a bad
> idea. https://lore.kernel.org/lkml/20200316085425.GB11482@dhcp22.suse.cz/t/#u
> Are you suggesting that we add that as part of the changelog?

Well, I was not aware x86 already does renumber. So there is a certain
precendence. As I've said I do not really like that but this is what
already is happening. If renumbering is not an option then just handle
that in the ppc code explicitly. Generic solution would be preferable of
course but as I've said it is really hard to check for correctness and
potential subtle issues.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-07-01 12:21               ` Michal Hocko
@ 2020-07-02  6:44                 ` Srikar Dronamraju
  2020-07-02  8:41                   ` Michal Hocko
  2020-07-03  9:10                 ` Michal Suchánek
  1 sibling, 1 reply; 26+ messages in thread
From: Srikar Dronamraju @ 2020-07-02  6:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Andrew Morton, linuxppc-dev, linux-mm,
	linux-kernel, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds,
	Gautham R Shenoy, Satheesh Rajendran

* Michal Hocko <mhocko@kernel.org> [2020-07-01 14:21:10]:

> > >>>>>>
> > >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
> > >>>>>> number of online nodes is inconsistent with the information in the
> > >>>>>> device-tree and resource-dump
> > >>>>>>
> > >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
> > >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> > >>>>>> the hit from the unnecessary numa hinting faults.
> > >>>>>
> > >>>>> I have to say that I dislike the node online/offline state and directly
> > >>>>> exporting that to the userspace. Users should only care whether the node
> > >>>>> has memory/cpus. Numa nodes can be online without any memory. Just
> > >>>>> offline all the present memory blocks but do not physically hot remove
> > >>>>> them and you are in the same situation. If users are confused by an
> > >>>>> output of tools like numactl -H then those could be updated and hide
> > >>>>> nodes without any memory&cpus.
> > >>>>>
> > >>>>> The autonuma problem sounds interesting but again this patch doesn't
> > >>>>> really solve the underlying problem because I strongly suspect that the
> > >>>>> problem is still there when a numa node gets all its memory offline as
> > >>>>> mentioned above.
> 
> I would really appreciate a feedback to these two as well.

1. Its not just numactl that's to be fixed but all tools/utilities that
depend on /sys/devices/system/node/online. Are we saying to not rely/believe
in the output given by the kernel but do further verification?  

Also how would the user space differentiate between the case where the
Kernel missed marking a node as offline to the case where the memory was
offlined on a cpuless node but node wasn't offline?.

2. Regarding the autonuma, the case of offline memory is user/admin driven,
so if there is a performance hit, its something that's driven by his
user/admin actions. Also how often do we see users offline complete memory
of cpuless node on a 2 node system?

> 
> > [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> > [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> > [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> 
> This begs a question whether ppc can do the same thing?

Certainly ppc can be made to adapt to this situation but that would be a
workaround. Do we have a reason why we think node 0 is unique and special?
If yes can we document it so that in future also people know why we consider
node 0 to be special. I do understand the *fear of the unknown* but when we
are unable to theoretically or practically come up a case, then it may
probably be better we hit the situation to understand what that unknown is?

> I would swear that we've had x86 system with node 0 but I cannot really
> find it and it is possible that it was not x86 after all...

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-07-02  6:44                 ` Srikar Dronamraju
@ 2020-07-02  8:41                   ` Michal Hocko
  2020-07-02 14:32                     ` Srikar Dronamraju
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2020-07-02  8:41 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: David Hildenbrand, Andrew Morton, linuxppc-dev, linux-mm,
	linux-kernel, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds,
	Gautham R Shenoy, Satheesh Rajendran

On Thu 02-07-20 12:14:08, Srikar Dronamraju wrote:
> * Michal Hocko <mhocko@kernel.org> [2020-07-01 14:21:10]:
> 
> > > >>>>>>
> > > >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
> > > >>>>>> number of online nodes is inconsistent with the information in the
> > > >>>>>> device-tree and resource-dump
> > > >>>>>>
> > > >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
> > > >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> > > >>>>>> the hit from the unnecessary numa hinting faults.
> > > >>>>>
> > > >>>>> I have to say that I dislike the node online/offline state and directly
> > > >>>>> exporting that to the userspace. Users should only care whether the node
> > > >>>>> has memory/cpus. Numa nodes can be online without any memory. Just
> > > >>>>> offline all the present memory blocks but do not physically hot remove
> > > >>>>> them and you are in the same situation. If users are confused by an
> > > >>>>> output of tools like numactl -H then those could be updated and hide
> > > >>>>> nodes without any memory&cpus.
> > > >>>>>
> > > >>>>> The autonuma problem sounds interesting but again this patch doesn't
> > > >>>>> really solve the underlying problem because I strongly suspect that the
> > > >>>>> problem is still there when a numa node gets all its memory offline as
> > > >>>>> mentioned above.
> > 
> > I would really appreciate a feedback to these two as well.
> 
> 1. Its not just numactl that's to be fixed but all tools/utilities that
> depend on /sys/devices/system/node/online. Are we saying to not rely/believe
> in the output given by the kernel but do further verification?  

No, what we are saying is that even an online node might have zero
number of online pages/cpus. So the online status is not really
something that matters. If people are confused by that output then user
space tools can make their confusion go away. I really do not understand
why the kernel should do any logic there.

> Also how would the user space differentiate between the case where the
> Kernel missed marking a node as offline to the case where the memory was
> offlined on a cpuless node but node wasn't offline?.

What I am arguing is that those two shouldn't be any different. Really!

> 2. Regarding the autonuma, the case of offline memory is user/admin driven,
> so if there is a performance hit, its something that's driven by his
> user/admin actions. Also how often do we see users offline complete memory
> of cpuless node on a 2 node system?

How often do we see crippled HW configurations like that? Really if
autonuma should be made more clever for one case it should recognize the
other as well.

> > > [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> > > [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> > > [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> > 
> > This begs a question whether ppc can do the same thing?
> 
> Certainly ppc can be made to adapt to this situation but that would be a
> workaround. Do we have a reason why we think node 0 is unique and special?

It is not. As replied in other email in this thread. I would hope for
having less hacks in the numa initialization. Cleaning up the mess is
would be a lot of work and testing on all NUMA capable architectures.
This is a heritage from the past I am afraid. All that I am arguing here
is that your touch to the generic code with a very simple looking patch
might have side effects which are pretty much impossible to review.
Moreover it seems that nothing but ppc really needs this treatment.
So fixing it in ppc specific code sounds much more safe.

Normally I would really push for a generic solution but after getting
burned several times in this area I do not dare anymore. The problem is
not in the code complexity but in how spread it is in places where you
do not expect side effects.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-07-02  8:41                   ` Michal Hocko
@ 2020-07-02 14:32                     ` Srikar Dronamraju
  0 siblings, 0 replies; 26+ messages in thread
From: Srikar Dronamraju @ 2020-07-02 14:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Andrew Morton, linuxppc-dev, linux-mm,
	linux-kernel, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Christopher Lameter, Michael Ellerman, Linus Torvalds,
	Gautham R Shenoy, Satheesh Rajendran

* Michal Hocko <mhocko@kernel.org> [2020-07-02 10:41:23]:

> On Thu 02-07-20 12:14:08, Srikar Dronamraju wrote:
> > * Michal Hocko <mhocko@kernel.org> [2020-07-01 14:21:10]:
> > 
> > > > >>>>> The autonuma problem sounds interesting but again this patch doesn't
> > > > >>>>> really solve the underlying problem because I strongly suspect that the
> > > > >>>>> problem is still there when a numa node gets all its memory offline as
> > > > >>>>> mentioned above.
> > > 
> > > I would really appreciate a feedback to these two as well.
> > 
> > 1. Its not just numactl that's to be fixed but all tools/utilities that
> > depend on /sys/devices/system/node/online. Are we saying to not rely/believe
> > in the output given by the kernel but do further verification?  
> 
> No, what we are saying is that even an online node might have zero
> number of online pages/cpus. So the online status is not really
> something that matters. If people are confused by that output then user
> space tools can make their confusion go away. I really do not understand
> why the kernel should do any logic there.

The user facing teams are saying they are getting queries from the users who
are unable to understand from the tools/sysfs files why a node is online and
but has no attached resources. Its the amount of time that is being spent on
these issues that triggered the patch. Initially even I was skeptical that
this was a non-issue.

> 
> > Also how would the user space differentiate between the case where the
> > Kernel missed marking a node as offline to the case where the memory was
> > offlined on a cpuless node but node wasn't offline?.
> 
> What I am arguing is that those two shouldn't be any different. Really!
> 
> > 2. Regarding the autonuma, the case of offline memory is user/admin driven,
> > so if there is a performance hit, its something that's driven by his
> > user/admin actions. Also how often do we see users offline complete memory
> > of cpuless node on a 2 node system?
> 
> How often do we see crippled HW configurations like that? Really if
> autonuma should be made more clever for one case it should recognize the
> other as well.
> 

Lets take a 16 socket PowerVM system and assume that 32 lpars are created
on that socket, i.e 2 lpars for each socket. (PowerVM has the final say on
how the lpars are created.) In such a case, we can expect 30 out of the 32
lpars to face this problem, with the only 2 lpars that actually run on
socket 0 having the correct configuration.

> > > 
> > > This begs a question whether ppc can do the same thing?
> > 
> > Certainly ppc can be made to adapt to this situation but that would be a
> > workaround. Do we have a reason why we think node 0 is unique and special?
> 
> It is not. As replied in other email in this thread. I would hope for
> having less hacks in the numa initialization. Cleaning up the mess is
> would be a lot of work and testing on all NUMA capable architectures.
> This is a heritage from the past I am afraid. All that I am arguing here
> is that your touch to the generic code with a very simple looking patch
> might have side effects which are pretty much impossible to review.
> Moreover it seems that nothing but ppc really needs this treatment.
> So fixing it in ppc specific code sounds much more safe.
> 
> Normally I would really push for a generic solution but after getting
> burned several times in this area I do not dare anymore. The problem is
> not in the code complexity but in how spread it is in places where you
> do not expect side effects.
> 

I do understand and respect your viewpoint.

> -- 
> Michal Hocko
> SUSE Labs

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-07-01 12:21               ` Michal Hocko
  2020-07-02  6:44                 ` Srikar Dronamraju
@ 2020-07-03  9:10                 ` Michal Suchánek
  2020-07-03  9:24                   ` Michal Hocko
  1 sibling, 1 reply; 26+ messages in thread
From: Michal Suchánek @ 2020-07-03  9:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Gautham R Shenoy, Srikar Dronamraju,
	Linus Torvalds, linux-kernel, linux-mm, Satheesh Rajendran,
	Mel Gorman, Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Vlastimil Babka

On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> > On 01.07.20 13:06, David Hildenbrand wrote:
> > > On 01.07.20 13:01, Srikar Dronamraju wrote:
> > >> * David Hildenbrand <david@redhat.com> [2020-07-01 12:15:54]:
> > >>
> > >>> On 01.07.20 12:04, Srikar Dronamraju wrote:
> > >>>> * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]:
> > >>>>
> > >>>>>
> > >>>>>>
> > >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
> > >>>>>> number of online nodes is inconsistent with the information in the
> > >>>>>> device-tree and resource-dump
> > >>>>>>
> > >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
> > >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> > >>>>>> the hit from the unnecessary numa hinting faults.
> > >>>>>
> > >>>>> I have to say that I dislike the node online/offline state and directly
> > >>>>> exporting that to the userspace. Users should only care whether the node
> > >>>>> has memory/cpus. Numa nodes can be online without any memory. Just
> > >>>>> offline all the present memory blocks but do not physically hot remove
> > >>>>> them and you are in the same situation. If users are confused by an
> > >>>>> output of tools like numactl -H then those could be updated and hide
> > >>>>> nodes without any memory&cpus.
> > >>>>>
> > >>>>> The autonuma problem sounds interesting but again this patch doesn't
> > >>>>> really solve the underlying problem because I strongly suspect that the
> > >>>>> problem is still there when a numa node gets all its memory offline as
> > >>>>> mentioned above.
> 
> I would really appreciate a feedback to these two as well.
> 
> > >>>>> While I completely agree that making node 0 special is wrong, I have
> > >>>>> still hard time to review this very simply looking patch because all the
> > >>>>> numa initialization is so spread around that this might just blow up
> > >>>>> at unexpected places. IIRC we have discussed testing in the previous
> > >>>>> version and David has provided a way to emulate these configurations
> > >>>>> on x86. Did you manage to use those instruction for additional testing
> > >>>>> on other than ppc architectures?
> > >>>>>
> > >>>>
> > >>>> I have tried all the steps that David mentioned and reported back at
> > >>>> https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u
> > >>>>
> > >>>> As a summary, David's steps are still not creating a memoryless/cpuless on
> > >>>> x86 VM.
> > >>>
> > >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> > >>> online*. Once you hotplug some memory, it will switch online. Once you
> > >>> remove memory, it will switch back offline.
> > >>>
> > >>
> > >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
> > >> boot.  The code in question tries to handle a cpuless/memoryless node 0 at
> > >> boot.
> > > 
> > > I was just correcting your statement, because it was wrong.
> > > 
> > > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
> > > have CPUs nor memory. That would imply that we can, in fact, never have
> > > node 0 offline during boot.
> > > 
> > 
> > Yep, looks like it.
> > 
> > [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> > [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> > [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> 
> This begs a question whether ppc can do the same thing?
Or x86 stop doing it so that you can see on what node you are running?

What's the point of this indirection other than another way of avoiding
empty node 0?

Thanks

Michal


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-07-03  9:10                 ` Michal Suchánek
@ 2020-07-03  9:24                   ` Michal Hocko
  2020-07-03 10:59                     ` Michal Hocko
  2020-07-06 16:08                     ` Andi Kleen
  0 siblings, 2 replies; 26+ messages in thread
From: Michal Hocko @ 2020-07-03  9:24 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: David Hildenbrand, Gautham R Shenoy, Srikar Dronamraju,
	Linus Torvalds, linux-kernel, linux-mm, Satheesh Rajendran,
	Mel Gorman, Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Vlastimil Babka, Andi Kleen

[Cc Andi]

On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> > On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
[...]
> > > Yep, looks like it.
> > > 
> > > [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> > > [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> > > [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> > 
> > This begs a question whether ppc can do the same thing?
> Or x86 stop doing it so that you can see on what node you are running?
> 
> What's the point of this indirection other than another way of avoiding
> empty node 0?

Honestly, I do not have any idea. I've traced it down to
Author: Andi Kleen <ak@suse.de>
Date:   Tue Jan 11 15:35:48 2005 -0800

    [PATCH] x86_64: Fix ACPI SRAT NUMA parsing

    Fix fallout from the recent nodemask_t changes. The node ids assigned
    in the SRAT parser were off by one.

    I added a new first_unset_node() function to nodemask.h to allocate
    IDs sanely.

    Signed-off-by: Andi Kleen <ak@suse.de>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

which doesn't really tell all that much. The historical baggage and a
long term behavior which is not really trivial to fix I suspect.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-07-03  9:24                   ` Michal Hocko
@ 2020-07-03 10:59                     ` Michal Hocko
  2020-07-03 11:32                       ` David Hildenbrand
  2020-07-03 12:58                       ` Srikar Dronamraju
  2020-07-06 16:08                     ` Andi Kleen
  1 sibling, 2 replies; 26+ messages in thread
From: Michal Hocko @ 2020-07-03 10:59 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: David Hildenbrand, Gautham R Shenoy, Srikar Dronamraju,
	Linus Torvalds, linux-kernel, linux-mm, Satheesh Rajendran,
	Mel Gorman, Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Vlastimil Babka, Andi Kleen

On Fri 03-07-20 11:24:17, Michal Hocko wrote:
> [Cc Andi]
> 
> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> > On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> > > On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> [...]
> > > > Yep, looks like it.
> > > > 
> > > > [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > > [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > > [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > > [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > > [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> > > > [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> > > > [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> > > 
> > > This begs a question whether ppc can do the same thing?
> > Or x86 stop doing it so that you can see on what node you are running?
> > 
> > What's the point of this indirection other than another way of avoiding
> > empty node 0?
> 
> Honestly, I do not have any idea. I've traced it down to
> Author: Andi Kleen <ak@suse.de>
> Date:   Tue Jan 11 15:35:48 2005 -0800
> 
>     [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> 
>     Fix fallout from the recent nodemask_t changes. The node ids assigned
>     in the SRAT parser were off by one.
> 
>     I added a new first_unset_node() function to nodemask.h to allocate
>     IDs sanely.
> 
>     Signed-off-by: Andi Kleen <ak@suse.de>
>     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> 
> which doesn't really tell all that much. The historical baggage and a
> long term behavior which is not really trivial to fix I suspect.

Thinking about this some more, this logic makes some sense afterall.
Especially in the world without memory hotplug which was very likely the
case back then. It is much better to have compact node mask rather than
sparse one. After all node numbers shouldn't really matter as long as
you have a clear mapping to the HW. I am not sure we export that
information (except for the kernel ring buffer) though.

The memory hotplug changes that somehow because you can hotremove numa
nodes and therefore make the nodemask sparse but that is not a common
case. I am not sure what would happen if a completely new node was added
and its corresponding node was already used by the renumbered one
though. It would likely conflate the two I am afraid. But I am not sure
this is really possible with x86 and a lack of a bug report would
suggest that nobody is doing that at least.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-07-03 10:59                     ` Michal Hocko
@ 2020-07-03 11:32                       ` David Hildenbrand
  2020-07-03 11:46                         ` Michal Hocko
  2020-07-03 12:58                       ` Srikar Dronamraju
  1 sibling, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2020-07-03 11:32 UTC (permalink / raw)
  To: Michal Hocko, Michal Suchánek
  Cc: Gautham R Shenoy, Srikar Dronamraju, Linus Torvalds,
	linux-kernel, linux-mm, Satheesh Rajendran, Mel Gorman,
	Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Vlastimil Babka, Andi Kleen

On 03.07.20 12:59, Michal Hocko wrote:
> On Fri 03-07-20 11:24:17, Michal Hocko wrote:
>> [Cc Andi]
>>
>> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
>>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
>>>> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
>> [...]
>>>>> Yep, looks like it.
>>>>>
>>>>> [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
>>>>> [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
>>>>> [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
>>>>> [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
>>>>> [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
>>>>> [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
>>>>> [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
>>>>
>>>> This begs a question whether ppc can do the same thing?
>>> Or x86 stop doing it so that you can see on what node you are running?
>>>
>>> What's the point of this indirection other than another way of avoiding
>>> empty node 0?
>>
>> Honestly, I do not have any idea. I've traced it down to
>> Author: Andi Kleen <ak@suse.de>
>> Date:   Tue Jan 11 15:35:48 2005 -0800
>>
>>     [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
>>
>>     Fix fallout from the recent nodemask_t changes. The node ids assigned
>>     in the SRAT parser were off by one.
>>
>>     I added a new first_unset_node() function to nodemask.h to allocate
>>     IDs sanely.
>>
>>     Signed-off-by: Andi Kleen <ak@suse.de>
>>     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
>>
>> which doesn't really tell all that much. The historical baggage and a
>> long term behavior which is not really trivial to fix I suspect.
> 
> Thinking about this some more, this logic makes some sense afterall.
> Especially in the world without memory hotplug which was very likely the
> case back then. It is much better to have compact node mask rather than
> sparse one. After all node numbers shouldn't really matter as long as
> you have a clear mapping to the HW. I am not sure we export that
> information (except for the kernel ring buffer) though.
> 
> The memory hotplug changes that somehow because you can hotremove numa
> nodes and therefore make the nodemask sparse but that is not a common
> case. I am not sure what would happen if a completely new node was added
> and its corresponding node was already used by the renumbered one
> though. It would likely conflate the two I am afraid. But I am not sure
> this is really possible with x86 and a lack of a bug report would
> suggest that nobody is doing that at least.
> 

I think the ACPI code takes care of properly mapping PXM to nodes.

So if I start with PXM 0 empty and PXM 1 populated, I will get
PXM 1 == node 0 as described. Once I hotplug something to PXM 0 in QEMU

$ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U /var/tmp/monitor
$ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U /var/tmp/monitor

$ echo "info numa" | sudo nc -U /var/tmp/monitor
QEMU 5.0.50 monitor - type 'help' for more information
(qemu) info numa
2 nodes
node 0 cpus:
node 0 size: 1024 MB
node 0 plugged: 1024 MB
node 1 cpus: 0 1 2 3
node 1 size: 4096 MB
node 1 plugged: 0 MB

I get in the guest:

[   50.174435] ------------[ cut here ]------------
[   50.175436] node 1 was absent from the node_possible_map
[   50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 add_memory_resource+0x8c/0x290
[   50.176844] Modules linked in:
[   50.176845] CPU: 0 PID: 7 Comm: kworker/u8:0 Not tainted 5.8.0-rc2+ #4
[   50.176846] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
[   50.176846] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   50.176847] RIP: 0010:add_memory_resource+0x8c/0x290
[   50.176849] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 63 c5 48 89 04 24 48 0f a3 05 94 6c 1c 01 72 17 89 ee 48 c78
[   50.176849] RSP: 0018:ffffa7a1c0043d48 EFLAGS: 00010296
[   50.176850] RAX: 000000000000002c RBX: ffff8bc633e63b80 RCX: 0000000000000000
[   50.176851] RDX: ffff8bc63bc27060 RSI: ffff8bc63bc18d00 RDI: ffff8bc63bc18d00
[   50.176851] RBP: 0000000000000001 R08: 00000000000001e1 R09: ffffa7a1c0043bd8
[   50.176852] R10: 0000000000000005 R11: 0000000000000000 R12: 0000000140000000
[   50.176852] R13: 000000017fffffff R14: 0000000040000000 R15: 0000000180000000
[   50.176853] FS:  0000000000000000(0000) GS:ffff8bc63bc00000(0000) knlGS:0000000000000000
[   50.176853] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.176855] CR2: 000055dfcbfc5ee8 CR3: 00000000aca0a000 CR4: 00000000000006f0
[   50.176855] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   50.176856] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   50.176856] Call Trace:
[   50.176856]  __add_memory+0x33/0x70
[   50.176857]  acpi_memory_device_add+0x132/0x2f2
[   50.176857]  acpi_bus_attach+0xd2/0x200
[   50.176858]  acpi_bus_scan+0x33/0x70
[   50.176858]  acpi_device_hotplug+0x298/0x390
[   50.176858]  acpi_hotplug_work_fn+0x3d/0x50
[   50.176859]  process_one_work+0x1b4/0x370
[   50.176859]  worker_thread+0x53/0x3e0
[   50.176860]  ? process_one_work+0x370/0x370
[   50.176860]  kthread+0x119/0x140
[   50.176860]  ? __kthread_bind_mask+0x60/0x60
[   50.176861]  ret_from_fork+0x22/0x30
[   50.176861] ---[ end trace 9a2a837c1e0164f1 ]---
[   50.209816] acpi PNP0C80:00: add_memory failed
[   50.210510] acpi PNP0C80:00: acpi_memory_enable_device() error
[   50.211445] acpi PNP0C80:00: Enumeration failure


I remember that we added that check just recently (due to powerpc if I am not wrong).
Not sure why that triggers here.

But it properly maps PXM 0 to node 1.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-07-03 11:32                       ` David Hildenbrand
@ 2020-07-03 11:46                         ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2020-07-03 11:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Suchánek, Gautham R Shenoy, Srikar Dronamraju,
	Linus Torvalds, linux-kernel, linux-mm, Satheesh Rajendran,
	Mel Gorman, Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Vlastimil Babka, Andi Kleen

On Fri 03-07-20 13:32:21, David Hildenbrand wrote:
> On 03.07.20 12:59, Michal Hocko wrote:
> > On Fri 03-07-20 11:24:17, Michal Hocko wrote:
> >> [Cc Andi]
> >>
> >> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> >>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> >>>> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> >> [...]
> >>>>> Yep, looks like it.
> >>>>>
> >>>>> [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> >>>>> [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> >>>>> [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> >>>>> [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> >>>>> [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> >>>>> [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> >>>>> [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> >>>>
> >>>> This begs a question whether ppc can do the same thing?
> >>> Or x86 stop doing it so that you can see on what node you are running?
> >>>
> >>> What's the point of this indirection other than another way of avoiding
> >>> empty node 0?
> >>
> >> Honestly, I do not have any idea. I've traced it down to
> >> Author: Andi Kleen <ak@suse.de>
> >> Date:   Tue Jan 11 15:35:48 2005 -0800
> >>
> >>     [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> >>
> >>     Fix fallout from the recent nodemask_t changes. The node ids assigned
> >>     in the SRAT parser were off by one.
> >>
> >>     I added a new first_unset_node() function to nodemask.h to allocate
> >>     IDs sanely.
> >>
> >>     Signed-off-by: Andi Kleen <ak@suse.de>
> >>     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> >>
> >> which doesn't really tell all that much. The historical baggage and a
> >> long term behavior which is not really trivial to fix I suspect.
> > 
> > Thinking about this some more, this logic makes some sense afterall.
> > Especially in the world without memory hotplug which was very likely the
> > case back then. It is much better to have compact node mask rather than
> > sparse one. After all node numbers shouldn't really matter as long as
> > you have a clear mapping to the HW. I am not sure we export that
> > information (except for the kernel ring buffer) though.
> > 
> > The memory hotplug changes that somehow because you can hotremove numa
> > nodes and therefore make the nodemask sparse but that is not a common
> > case. I am not sure what would happen if a completely new node was added
> > and its corresponding node was already used by the renumbered one
> > though. It would likely conflate the two I am afraid. But I am not sure
> > this is really possible with x86 and a lack of a bug report would
> > suggest that nobody is doing that at least.
> > 
> 
> I think the ACPI code takes care of properly mapping PXM to nodes.
> 
> So if I start with PXM 0 empty and PXM 1 populated, I will get
> PXM 1 == node 0 as described. Once I hotplug something to PXM 0 in QEMU
> 
> $ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U /var/tmp/monitor
> $ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U /var/tmp/monitor
> 
> $ echo "info numa" | sudo nc -U /var/tmp/monitor
> QEMU 5.0.50 monitor - type 'help' for more information
> (qemu) info numa
> 2 nodes
> node 0 cpus:
> node 0 size: 1024 MB
> node 0 plugged: 1024 MB
> node 1 cpus: 0 1 2 3
> node 1 size: 4096 MB
> node 1 plugged: 0 MB

Thanks for double checking.

> I get in the guest:
> 
> [   50.174435] ------------[ cut here ]------------
> [   50.175436] node 1 was absent from the node_possible_map
> [   50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 add_memory_resource+0x8c/0x290

This would mean that the ACPI code or whoever does the remaping is not
adding the new node into possible nodes.

[...]
> I remember that we added that check just recently (due to powerpc if I am not wrong).
> Not sure why that triggers here.

This was a misbehaving Qemu IIRC providing a garbage map.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-07-03 10:59                     ` Michal Hocko
  2020-07-03 11:32                       ` David Hildenbrand
@ 2020-07-03 12:58                       ` Srikar Dronamraju
  1 sibling, 0 replies; 26+ messages in thread
From: Srikar Dronamraju @ 2020-07-03 12:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michal Such?nek, David Hildenbrand, Gautham R Shenoy,
	Linus Torvalds, linux-kernel, linux-mm, Satheesh Rajendran,
	Mel Gorman, Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Vlastimil Babka, Andi Kleen

* Michal Hocko <mhocko@kernel.org> [2020-07-03 12:59:44]:

> > Honestly, I do not have any idea. I've traced it down to
> > Author: Andi Kleen <ak@suse.de>
> > Date:   Tue Jan 11 15:35:48 2005 -0800
> > 
> >     [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> > 
> >     Fix fallout from the recent nodemask_t changes. The node ids assigned
> >     in the SRAT parser were off by one.
> > 
> >     I added a new first_unset_node() function to nodemask.h to allocate
> >     IDs sanely.
> > 
> >     Signed-off-by: Andi Kleen <ak@suse.de>
> >     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> > 
> > which doesn't really tell all that much. The historical baggage and a
> > long term behavior which is not really trivial to fix I suspect.
> 
> Thinking about this some more, this logic makes some sense afterall.
> Especially in the world without memory hotplug which was very likely the
> case back then. It is much better to have compact node mask rather than
> sparse one. After all node numbers shouldn't really matter as long as
> you have a clear mapping to the HW. I am not sure we export that
> information (except for the kernel ring buffer) though.
> 
> The memory hotplug changes that somehow because you can hotremove numa
> nodes and therefore make the nodemask sparse but that is not a common
> case. I am not sure what would happen if a completely new node was added
> and its corresponding node was already used by the renumbered one
> though. It would likely conflate the two I am afraid. But I am not sure
> this is really possible with x86 and a lack of a bug report would
> suggest that nobody is doing that at least.
> 

JFYI,
Satheesh copied in this mailchain had opened a bug a year on crash with vcpu
hotplug on memoryless node. 

https://bugzilla.kernel.org/show_bug.cgi?id=202187

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
  2020-07-03  9:24                   ` Michal Hocko
  2020-07-03 10:59                     ` Michal Hocko
@ 2020-07-06 16:08                     ` Andi Kleen
  1 sibling, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2020-07-06 16:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michal Suchánek, David Hildenbrand, Gautham R Shenoy,
	Srikar Dronamraju, Linus Torvalds, linux-kernel, linux-mm,
	Satheesh Rajendran, Mel Gorman, Kirill A. Shutemov,
	Andrew Morton, linuxppc-dev, Christopher Lameter,
	Vlastimil Babka

> > What's the point of this indirection other than another way of avoiding
> > empty node 0?
> 
> Honestly, I do not have any idea. I've traced it down to
> Author: Andi Kleen <ak@suse.de>
> Date:   Tue Jan 11 15:35:48 2005 -0800

I don't remember all the details, and I can't even find the commit
(is it in linux-historic?).

But AFAIK there's no guarantee PXMs are small and continuous, so it
seemed better to have a clean zero based space.

Back then we had a lot of problems with buggy SRAT tables in BIOS,
so we really tried to avoid trusting the BIOS as much as possible.

-Andi



^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, back to index

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-24  9:28 [PATCH v5 0/3] Offline memoryless cpuless node 0 Srikar Dronamraju
2020-06-24  9:28 ` [PATCH v5 1/3] powerpc/numa: Set numa_node for all possible cpus Srikar Dronamraju
2020-06-24  9:48   ` Gautham R Shenoy
2020-06-24  9:28 ` [PATCH v5 2/3] powerpc/numa: Prefer node id queried from vphn Srikar Dronamraju
2020-06-24 10:29   ` Gautham R Shenoy
2020-06-24  9:28 ` [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline Srikar Dronamraju
2020-06-29 14:58   ` Christopher Lameter
2020-06-30  4:01     ` Srikar Dronamraju
2020-07-01 12:23       ` Michal Hocko
2020-07-01  8:42   ` Michal Hocko
2020-07-01 10:04     ` Srikar Dronamraju
2020-07-01 10:15       ` David Hildenbrand
2020-07-01 11:01         ` Srikar Dronamraju
2020-07-01 11:06           ` David Hildenbrand
2020-07-01 11:30             ` David Hildenbrand
2020-07-01 12:21               ` Michal Hocko
2020-07-02  6:44                 ` Srikar Dronamraju
2020-07-02  8:41                   ` Michal Hocko
2020-07-02 14:32                     ` Srikar Dronamraju
2020-07-03  9:10                 ` Michal Suchánek
2020-07-03  9:24                   ` Michal Hocko
2020-07-03 10:59                     ` Michal Hocko
2020-07-03 11:32                       ` David Hildenbrand
2020-07-03 11:46                         ` Michal Hocko
2020-07-03 12:58                       ` Srikar Dronamraju
2020-07-06 16:08                     ` Andi Kleen

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git