* [PATCH 0/2] arm64/drivers: avoid alloc memory on offline node
@ 2018-05-31 12:14 ` Xie XiuQi
  0 siblings, 0 replies; 103+ messages in thread
From: Xie XiuQi @ 2018-05-31 12:14 UTC (permalink / raw)
  To: catalin.marinas, will.deacon, bhelgaas, gregkh, rafael.j.wysocki,
	jarkko.sakkinen
  Cc: linux-arm-kernel, linux-kernel, guohanjun, wanghuiqiang, tnowicki

A NUMA system may return a node that is not online.
For example, a NUMA node:
1) without memory
2) whose CPUs were not brought up because NR_CPUS is too small

In this situation, use NUMA_NO_NODE to avoid an oops.
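
Why an offline node oopses rather than merely failing the allocation:
a minimal sketch of the mechanism, inferred from the trace below (an
editorial illustration, not part of the posted patches):

#include <linux/gfp.h>
#include <linux/mmzone.h>

/*
 * arm64 keeps a per-node array of pg_data_t pointers that is only
 * populated for nodes initialized at boot.
 */
static struct zonelist *zonelist_of(int nid, gfp_t flags)
{
        /*
         * For a node that was never built, NODE_DATA(nid) is NULL, so
         * the member access inside node_zonelist() faults at a small
         * offset from NULL, consistent with the 00001988 fault address
         * in the trace.
         */
        return node_zonelist(nid, flags);
}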

[   25.732905] Unable to handle kernel NULL pointer dereference at virtual address 00001988
[   25.740982] Mem abort info:
[   25.743762]   ESR = 0x96000005
[   25.746803]   Exception class = DABT (current EL), IL = 32 bits
[   25.752711]   SET = 0, FnV = 0
[   25.755751]   EA = 0, S1PTW = 0
[   25.758878] Data abort info:
[   25.761745]   ISV = 0, ISS = 0x00000005
[   25.765568]   CM = 0, WnR = 0
[   25.768521] [0000000000001988] user address but active_mm is swapper
[   25.774861] Internal error: Oops: 96000005 [#1] SMP
[   25.779724] Modules linked in:
[   25.782768] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc6-mpam+ #115
[   25.789714] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI Nemo 2.0 RC0 - B305 05/28/2018
[   25.798831] pstate: 80c00009 (Nzcv daif +PAN +UAO)
[   25.803612] pc : __alloc_pages_nodemask+0xf0/0xe70
[   25.808389] lr : __alloc_pages_nodemask+0x184/0xe70
[   25.813252] sp : ffff00000996f660
[   25.816553] x29: ffff00000996f660 x28: 0000000000000000
[   25.821852] x27: 00000000014012c0 x26: 0000000000000000
[   25.827150] x25: 0000000000000003 x24: ffff000008099eac
[   25.832449] x23: 0000000000400000 x22: 0000000000000000
[   25.837747] x21: 0000000000000001 x20: 0000000000000000
[   25.843045] x19: 0000000000400000 x18: 0000000000010e00
[   25.848343] x17: 000000000437f790 x16: 0000000000000020
[   25.853641] x15: 0000000000000000 x14: 6549435020524541
[   25.858939] x13: 20454d502067756c x12: 0000000000000000
[   25.864237] x11: ffff00000996f6f0 x10: 0000000000000006
[   25.869536] x9 : 00000000000012a4 x8 : ffff8023c000ff90
[   25.874834] x7 : 0000000000000000 x6 : ffff000008d73c08
[   25.880132] x5 : 0000000000000000 x4 : 0000000000000081
[   25.885430] x3 : 0000000000000000 x2 : 0000000000000000
[   25.890728] x1 : 0000000000000001 x0 : 0000000000001980
[   25.896027] Process swapper/0 (pid: 1, stack limit = 0x        (ptrval))
[   25.902712] Call trace:
[   25.905146]  __alloc_pages_nodemask+0xf0/0xe70
[   25.909577]  allocate_slab+0x94/0x590
[   25.913225]  new_slab+0x68/0xc8
[   25.916353]  ___slab_alloc+0x444/0x4f8
[   25.920088]  __slab_alloc+0x50/0x68
[   25.923562]  kmem_cache_alloc_node_trace+0xe8/0x230
[   25.928426]  pci_acpi_scan_root+0x94/0x278
[   25.932510]  acpi_pci_root_add+0x228/0x4b0
[   25.936593]  acpi_bus_attach+0x10c/0x218
[   25.940501]  acpi_bus_attach+0xac/0x218
[   25.944323]  acpi_bus_attach+0xac/0x218
[   25.948144]  acpi_bus_scan+0x5c/0xc0
[   25.951708]  acpi_scan_init+0xf8/0x254
[   25.955443]  acpi_init+0x310/0x37c
[   25.958831]  do_one_initcall+0x54/0x208
[   25.962653]  kernel_init_freeable+0x244/0x340
[   25.966999]  kernel_init+0x18/0x118
[   25.970474]  ret_from_fork+0x10/0x1c
[   25.974036] Code: 7100047f 321902a4 1a950095 b5000602 (b9400803)
[   25.980162] ---[ end trace 64f0893eb21ec283 ]---
[   25.984765] Kernel panic - not syncing: Fatal exception

Xie XiuQi (2):
  arm64: avoid alloc memory on offline node
  drivers: check numa node's online status in dev_to_node

 arch/arm64/kernel/pci.c | 3 +++
 include/linux/device.h  | 7 ++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

-- 
1.8.3.1

* [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-05-31 12:14 ` Xie XiuQi
@ 2018-05-31 12:14   ` Xie XiuQi
  -1 siblings, 0 replies; 103+ messages in thread
From: Xie XiuQi @ 2018-05-31 12:14 UTC (permalink / raw)
  To: catalin.marinas, will.deacon, bhelgaas, gregkh, rafael.j.wysocki,
	jarkko.sakkinen
  Cc: linux-arm-kernel, linux-kernel, guohanjun, wanghuiqiang, tnowicki

A NUMA system may return a node that is not online.
For example, a NUMA node:
1) without memory
2) whose CPUs were not brought up because NR_CPUS is too small

In this situation, use NUMA_NO_NODE to avoid an oops.

[   25.732905] Unable to handle kernel NULL pointer dereference at virtual address 00001988
[   25.740982] Mem abort info:
[   25.743762]   ESR = 0x96000005
[   25.746803]   Exception class = DABT (current EL), IL = 32 bits
[   25.752711]   SET = 0, FnV = 0
[   25.755751]   EA = 0, S1PTW = 0
[   25.758878] Data abort info:
[   25.761745]   ISV = 0, ISS = 0x00000005
[   25.765568]   CM = 0, WnR = 0
[   25.768521] [0000000000001988] user address but active_mm is swapper
[   25.774861] Internal error: Oops: 96000005 [#1] SMP
[   25.779724] Modules linked in:
[   25.782768] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc6-mpam+ #115
[   25.789714] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI Nemo 2.0 RC0 - B305 05/28/2018
[   25.798831] pstate: 80c00009 (Nzcv daif +PAN +UAO)
[   25.803612] pc : __alloc_pages_nodemask+0xf0/0xe70
[   25.808389] lr : __alloc_pages_nodemask+0x184/0xe70
[   25.813252] sp : ffff00000996f660
[   25.816553] x29: ffff00000996f660 x28: 0000000000000000
[   25.821852] x27: 00000000014012c0 x26: 0000000000000000
[   25.827150] x25: 0000000000000003 x24: ffff000008099eac
[   25.832449] x23: 0000000000400000 x22: 0000000000000000
[   25.837747] x21: 0000000000000001 x20: 0000000000000000
[   25.843045] x19: 0000000000400000 x18: 0000000000010e00
[   25.848343] x17: 000000000437f790 x16: 0000000000000020
[   25.853641] x15: 0000000000000000 x14: 6549435020524541
[   25.858939] x13: 20454d502067756c x12: 0000000000000000
[   25.864237] x11: ffff00000996f6f0 x10: 0000000000000006
[   25.869536] x9 : 00000000000012a4 x8 : ffff8023c000ff90
[   25.874834] x7 : 0000000000000000 x6 : ffff000008d73c08
[   25.880132] x5 : 0000000000000000 x4 : 0000000000000081
[   25.885430] x3 : 0000000000000000 x2 : 0000000000000000
[   25.890728] x1 : 0000000000000001 x0 : 0000000000001980
[   25.896027] Process swapper/0 (pid: 1, stack limit = 0x        (ptrval))
[   25.902712] Call trace:
[   25.905146]  __alloc_pages_nodemask+0xf0/0xe70
[   25.909577]  allocate_slab+0x94/0x590
[   25.913225]  new_slab+0x68/0xc8
[   25.916353]  ___slab_alloc+0x444/0x4f8
[   25.920088]  __slab_alloc+0x50/0x68
[   25.923562]  kmem_cache_alloc_node_trace+0xe8/0x230
[   25.928426]  pci_acpi_scan_root+0x94/0x278
[   25.932510]  acpi_pci_root_add+0x228/0x4b0
[   25.936593]  acpi_bus_attach+0x10c/0x218
[   25.940501]  acpi_bus_attach+0xac/0x218
[   25.944323]  acpi_bus_attach+0xac/0x218
[   25.948144]  acpi_bus_scan+0x5c/0xc0
[   25.951708]  acpi_scan_init+0xf8/0x254
[   25.955443]  acpi_init+0x310/0x37c
[   25.958831]  do_one_initcall+0x54/0x208
[   25.962653]  kernel_init_freeable+0x244/0x340
[   25.966999]  kernel_init+0x18/0x118
[   25.970474]  ret_from_fork+0x10/0x1c
[   25.974036] Code: 7100047f 321902a4 1a950095 b5000602 (b9400803)
[   25.980162] ---[ end trace 64f0893eb21ec283 ]---
[   25.984765] Kernel panic - not syncing: Fatal exception

Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Huiqiang Wang <wanghuiqiang@huawei.com>
Cc: Hanjun Guo <hanjun.guo@linaro.org>
Cc: Tomasz Nowicki <Tomasz.Nowicki@caviumnetworks.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
---
 arch/arm64/kernel/pci.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
index 0e2ea1c..e17cc45 100644
--- a/arch/arm64/kernel/pci.c
+++ b/arch/arm64/kernel/pci.c
@@ -170,6 +170,9 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
 	struct pci_bus *bus, *child;
 	struct acpi_pci_root_ops *root_ops;
 
+	if (node != NUMA_NO_NODE && !node_online(node))
+		node = NUMA_NO_NODE;
+
 	ri = kzalloc_node(sizeof(*ri), GFP_KERNEL, node);
 	if (!ri)
 		return NULL;
-- 
1.8.3.1
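
For context, the node being guarded comes from the host bridge's ACPI
proximity domain. A condensed sketch of the surrounding function
(paraphrased; names and details may differ slightly from the tree this
patch is against):

#include <linux/acpi.h>
#include <linux/nodemask.h>
#include <linux/numa.h>
#include <linux/pci.h>
#include <linux/slab.h>

struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
{
        /* _PXM-derived node: firmware may name a node Linux never built */
        int node = acpi_get_node(root->device->handle);
        struct acpi_pci_generic_root_info *ri;

        /* the fix under review: sanitize before any node-affine allocation */
        if (node != NUMA_NO_NODE && !node_online(node))
                node = NUMA_NO_NODE;

        ri = kzalloc_node(sizeof(*ri), GFP_KERNEL, node);
        if (!ri)
                return NULL;

        /* ... the rest of the function (bus scan etc.) is unchanged ... */
        return NULL;
}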

* [PATCH 2/2] drivers: check numa node's online status in dev_to_node
  2018-05-31 12:14 ` Xie XiuQi
@ 2018-05-31 12:14   ` Xie XiuQi
  -1 siblings, 0 replies; 103+ messages in thread
From: Xie XiuQi @ 2018-05-31 12:14 UTC (permalink / raw)
  To: catalin.marinas, will.deacon, bhelgaas, gregkh, rafael.j.wysocki,
	jarkko.sakkinen
  Cc: linux-arm-kernel, linux-kernel, guohanjun, wanghuiqiang, tnowicki

If dev->numa_node is not available (or offline), dev_to_node()
should return NUMA_NO_NODE to prevent allocating memory on an
offline node, which can cause an oops on the allocation path
sketched below.

For example, a NUMA node:
1) without memory
2) whose CPUs were not brought up because NR_CPUS is too small
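
The device-managed allocation path in the trace below shows how an
offline dev->numa_node reaches the page allocator. A minimal sketch
(assuming the devm_kmalloc() path, simplified from the real helpers):

#include <linux/device.h>
#include <linux/slab.h>

static void *node_aware_alloc(struct device *dev, size_t size)
{
        /*
         * devm_kmalloc() ultimately hands dev_to_node(dev) to the slab
         * allocator.  Before this patch an offline dev->numa_node went
         * through unfiltered; with it, dev_to_node() degrades the node
         * to NUMA_NO_NODE and the allocator picks any online node.
         */
        return kmalloc_node(size, GFP_KERNEL, dev_to_node(dev));
}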

[   27.851041] Unable to handle kernel NULL pointer dereference at virtual address 00001988
[   27.859128] Mem abort info:
[   27.861908]   ESR = 0x96000005
[   27.864949]   Exception class = DABT (current EL), IL = 32 bits
[   27.870860]   SET = 0, FnV = 0
[   27.873900]   EA = 0, S1PTW = 0
[   27.877029] Data abort info:
[   27.879895]   ISV = 0, ISS = 0x00000005
[   27.883716]   CM = 0, WnR = 0
[   27.886673] [0000000000001988] user address but active_mm is swapper
[   27.893012] Internal error: Oops: 96000005 [#1] SMP
[   27.897876] Modules linked in:
[   27.900919] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc6-mpam+ #116
[   27.907865] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI Nemo 2.0 RC0 - B306 05/28/2018
[   27.916983] pstate: 80c00009 (Nzcv daif +PAN +UAO)
[   27.921763] pc : __alloc_pages_nodemask+0xf0/0xe70
[   27.926540] lr : __alloc_pages_nodemask+0x184/0xe70
[   27.931403] sp : ffff00000996f7e0
[   27.934704] x29: ffff00000996f7e0 x28: ffff000008cb10a0
[   27.940003] x27: 00000000014012c0 x26: 0000000000000000
[   27.945301] x25: 0000000000000003 x24: ffff0000085bbc14
[   27.950600] x23: 0000000000400000 x22: 0000000000000000
[   27.955898] x21: 0000000000000001 x20: 0000000000000000
[   27.961196] x19: 0000000000400000 x18: 0000000000000f00
[   27.966494] x17: 00000000003bff88 x16: 0000000000000020
[   27.971792] x15: 000000000000003b x14: ffffffffffffffff
[   27.977090] x13: ffffffffffff0000 x12: 0000000000000030
[   27.982388] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
[   27.987686] x9 : 2e64716e622e7364 x8 : 7f7f7f7f7f7f7f7f
[   27.992984] x7 : 0000000000000000 x6 : ffff000008d73c08
[   27.998282] x5 : 0000000000000000 x4 : 0000000000000081
[   28.003580] x3 : 0000000000000000 x2 : 0000000000000000
[   28.008878] x1 : 0000000000000001 x0 : 0000000000001980
[   28.014177] Process swapper/0 (pid: 1, stack limit = 0x        (ptrval))
[   28.020863] Call trace:
[   28.023296]  __alloc_pages_nodemask+0xf0/0xe70
[   28.027727]  allocate_slab+0x94/0x590
[   28.031374]  new_slab+0x68/0xc8
[   28.034502]  ___slab_alloc+0x444/0x4f8
[   28.038237]  __slab_alloc+0x50/0x68
[   28.041713]  __kmalloc_node_track_caller+0x100/0x320
[   28.046664]  devm_kmalloc+0x3c/0x90
[   28.050139]  pinctrl_bind_pins+0x4c/0x298
[   28.054135]  driver_probe_device+0xb4/0x4a0
[   28.058305]  __driver_attach+0x124/0x128
[   28.062213]  bus_for_each_dev+0x78/0xe0
[   28.066035]  driver_attach+0x30/0x40
[   28.069597]  bus_add_driver+0x248/0x2b8
[   28.073419]  driver_register+0x68/0x100
[   28.077242]  __pci_register_driver+0x64/0x78
[   28.081500]  pcie_portdrv_init+0x44/0x4c
[   28.085410]  do_one_initcall+0x54/0x208
[   28.089232]  kernel_init_freeable+0x244/0x340
[   28.093577]  kernel_init+0x18/0x118
[   28.097052]  ret_from_fork+0x10/0x1c
[   28.100614] Code: 7100047f 321902a4 1a950095 b5000602 (b9400803)
[   28.106740] ---[ end trace e32df44e6e1c3a4b ]---

Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Huiqiang Wang <wanghuiqiang@huawei.com>
Cc: Hanjun Guo <hanjun.guo@linaro.org>
Cc: Tomasz Nowicki <Tomasz.Nowicki@caviumnetworks.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
---
 include/linux/device.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 4779569..2a4fb08 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1017,7 +1017,12 @@ extern __printf(2, 3)
 #ifdef CONFIG_NUMA
 static inline int dev_to_node(struct device *dev)
 {
-	return dev->numa_node;
+	int node = dev->numa_node;
+
+	if (unlikely(node != NUMA_NO_NODE && !node_online(node)))
+		return NUMA_NO_NODE;
+
+	return node;
 }
 static inline void set_dev_node(struct device *dev, int node)
 {
-- 
1.8.3.1

* Re: [PATCH 0/2] arm64/drivers: avoid alloc memory on offline node
  2018-05-31 12:14 ` Xie XiuQi
@ 2018-05-31 14:00   ` Hanjun Guo
  -1 siblings, 0 replies; 103+ messages in thread
From: Hanjun Guo @ 2018-05-31 14:00 UTC (permalink / raw)
  To: Xie XiuQi, catalin.marinas, will.deacon, bhelgaas, gregkh,
	rafael.j.wysocki, jarkko.sakkinen
  Cc: linux-arm-kernel, linux-kernel, wanghuiqiang, tnowicki

Hi Xiuqi,

On 2018/5/31 20:14, Xie XiuQi wrote:
> A numa system may return node which is not online.
> For example, a numa node:
> 1) without memory
> 2) NR_CPUS is very small, and the cpus on the node are not brought up

I think adding more detail would make this easier to understand:
 - A NUMA node is built only if it has valid CPUs and/or memory;

 - But if we boot with a memory-less node and CONFIG_NR_CPUS smaller than
   the CPU count in the SRAT (for example, 64 CPUs across 4 NUMA nodes,
   16 CPUs per node, booted with CONFIG_NR_CPUS=48), then node 3 is never
   built. If devices sit on that node, allocating memory panics because
   NUMA node 3 is not a valid node.

> 
> In this situation, we use NUMA_NO_NODE to avoid oops.
[snip]
> 
> Xie XiuQi (2):
>   arm64: avoid alloc memory on offline node
>   drivers: check numa node's online status in dev_to_node

I think we are still missing devices like the SMMU and ITS, so how about
checking that the NUMA node is online in the core memory allocators,
such as kmalloc_node()?

Thanks
Hanjun
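
A rough sketch of the core-allocator alternative suggested above
(hypothetical: no such patch is posted in this thread, and
kmalloc_node() is only one possible hook):

#include <linux/nodemask.h>
#include <linux/numa.h>
#include <linux/slab.h>

/*
 * Sanitize the node once in a common wrapper, so the SMMU, the ITS and
 * every other caller is covered without per-call-site checks.
 */
static inline void *kmalloc_node_checked(size_t size, gfp_t flags, int node)
{
        if (node != NUMA_NO_NODE && unlikely(!node_online(node)))
                node = NUMA_NO_NODE;

        return kmalloc_node(size, flags, node);
}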

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-05-31 12:14   ` Xie XiuQi
@ 2018-06-06 15:45     ` Will Deacon
  -1 siblings, 0 replies; 103+ messages in thread
From: Will Deacon @ 2018-06-06 15:45 UTC (permalink / raw)
  To: Xie XiuQi
  Cc: catalin.marinas, bhelgaas, gregkh, rafael.j.wysocki,
	jarkko.sakkinen, linux-arm-kernel, linux-kernel, guohanjun,
	wanghuiqiang, tnowicki

On Thu, May 31, 2018 at 08:14:38PM +0800, Xie XiuQi wrote:
> A NUMA system may return a node that is not online.
> For example, a NUMA node:
> 1) without memory
> 2) whose CPUs were not brought up because NR_CPUS is too small
> 
> In this situation, use NUMA_NO_NODE to avoid an oops.
> 
> [   25.732905] Unable to handle kernel NULL pointer dereference at virtual address 00001988
> [   25.740982] Mem abort info:
> [   25.743762]   ESR = 0x96000005
> [   25.746803]   Exception class = DABT (current EL), IL = 32 bits
> [   25.752711]   SET = 0, FnV = 0
> [   25.755751]   EA = 0, S1PTW = 0
> [   25.758878] Data abort info:
> [   25.761745]   ISV = 0, ISS = 0x00000005
> [   25.765568]   CM = 0, WnR = 0
> [   25.768521] [0000000000001988] user address but active_mm is swapper
> [   25.774861] Internal error: Oops: 96000005 [#1] SMP
> [   25.779724] Modules linked in:
> [   25.782768] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc6-mpam+ #115
> [   25.789714] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI Nemo 2.0 RC0 - B305 05/28/2018
> [   25.798831] pstate: 80c00009 (Nzcv daif +PAN +UAO)
> [   25.803612] pc : __alloc_pages_nodemask+0xf0/0xe70
> [   25.808389] lr : __alloc_pages_nodemask+0x184/0xe70
> [   25.813252] sp : ffff00000996f660
> [   25.816553] x29: ffff00000996f660 x28: 0000000000000000
> [   25.821852] x27: 00000000014012c0 x26: 0000000000000000
> [   25.827150] x25: 0000000000000003 x24: ffff000008099eac
> [   25.832449] x23: 0000000000400000 x22: 0000000000000000
> [   25.837747] x21: 0000000000000001 x20: 0000000000000000
> [   25.843045] x19: 0000000000400000 x18: 0000000000010e00
> [   25.848343] x17: 000000000437f790 x16: 0000000000000020
> [   25.853641] x15: 0000000000000000 x14: 6549435020524541
> [   25.858939] x13: 20454d502067756c x12: 0000000000000000
> [   25.864237] x11: ffff00000996f6f0 x10: 0000000000000006
> [   25.869536] x9 : 00000000000012a4 x8 : ffff8023c000ff90
> [   25.874834] x7 : 0000000000000000 x6 : ffff000008d73c08
> [   25.880132] x5 : 0000000000000000 x4 : 0000000000000081
> [   25.885430] x3 : 0000000000000000 x2 : 0000000000000000
> [   25.890728] x1 : 0000000000000001 x0 : 0000000000001980
> [   25.896027] Process swapper/0 (pid: 1, stack limit = 0x        (ptrval))
> [   25.902712] Call trace:
> [   25.905146]  __alloc_pages_nodemask+0xf0/0xe70
> [   25.909577]  allocate_slab+0x94/0x590
> [   25.913225]  new_slab+0x68/0xc8
> [   25.916353]  ___slab_alloc+0x444/0x4f8
> [   25.920088]  __slab_alloc+0x50/0x68
> [   25.923562]  kmem_cache_alloc_node_trace+0xe8/0x230
> [   25.928426]  pci_acpi_scan_root+0x94/0x278
> [   25.932510]  acpi_pci_root_add+0x228/0x4b0
> [   25.936593]  acpi_bus_attach+0x10c/0x218
> [   25.940501]  acpi_bus_attach+0xac/0x218
> [   25.944323]  acpi_bus_attach+0xac/0x218
> [   25.948144]  acpi_bus_scan+0x5c/0xc0
> [   25.951708]  acpi_scan_init+0xf8/0x254
> [   25.955443]  acpi_init+0x310/0x37c
> [   25.958831]  do_one_initcall+0x54/0x208
> [   25.962653]  kernel_init_freeable+0x244/0x340
> [   25.966999]  kernel_init+0x18/0x118
> [   25.970474]  ret_from_fork+0x10/0x1c
> [   25.974036] Code: 7100047f 321902a4 1a950095 b5000602 (b9400803)
> [   25.980162] ---[ end trace 64f0893eb21ec283 ]---
> [   25.984765] Kernel panic - not syncing: Fatal exception
> 
> Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
> Tested-by: Huiqiang Wang <wanghuiqiang@huawei.com>
> Cc: Hanjun Guo <hanjun.guo@linaro.org>
> Cc: Tomasz Nowicki <Tomasz.Nowicki@caviumnetworks.com>
> Cc: Xishi Qiu <qiuxishi@huawei.com>
> ---
>  arch/arm64/kernel/pci.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
> index 0e2ea1c..e17cc45 100644
> --- a/arch/arm64/kernel/pci.c
> +++ b/arch/arm64/kernel/pci.c
> @@ -170,6 +170,9 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
>  	struct pci_bus *bus, *child;
>  	struct acpi_pci_root_ops *root_ops;
>  
> +	if (node != NUMA_NO_NODE && !node_online(node))
> +		node = NUMA_NO_NODE;
> +

This really feels like a bodge, but it does appear to be what other
architectures do, so:

Acked-by: Will Deacon <will.deacon@arm.com>

Will

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-06 15:45     ` Will Deacon
  (?)
@ 2018-06-06 20:39       ` Bjorn Helgaas
  -1 siblings, 0 replies; 103+ messages in thread
From: Bjorn Helgaas @ 2018-06-06 20:39 UTC (permalink / raw)
  To: Will Deacon
  Cc: xiexiuqi, Catalin Marinas, Greg Kroah-Hartman, Rafael J. Wysocki,
	Jarkko Sakkinen, linux-arm, Linux Kernel Mailing List,
	Hanjun Guo, wanghuiqiang, tnowicki, linux-pci, Andrew Morton,
	linux-mm

[+cc akpm, linux-mm, linux-pci]

On Wed, Jun 6, 2018 at 10:44 AM Will Deacon <will.deacon@arm.com> wrote:
>
> On Thu, May 31, 2018 at 08:14:38PM +0800, Xie XiuQi wrote:
> > A NUMA system may return a node that is not online.
> > For example, a NUMA node:
> > 1) without memory
> > 2) whose CPUs were not brought up because NR_CPUS is too small
> >
> > In this situation, use NUMA_NO_NODE to avoid an oops.
> >
> > [   25.732905] Unable to handle kernel NULL pointer dereference at virtual address 00001988
> > [   25.740982] Mem abort info:
> > [   25.743762]   ESR = 0x96000005
> > [   25.746803]   Exception class = DABT (current EL), IL = 32 bits
> > [   25.752711]   SET = 0, FnV = 0
> > [   25.755751]   EA = 0, S1PTW = 0
> > [   25.758878] Data abort info:
> > [   25.761745]   ISV = 0, ISS = 0x00000005
> > [   25.765568]   CM = 0, WnR = 0
> > [   25.768521] [0000000000001988] user address but active_mm is swapper
> > [   25.774861] Internal error: Oops: 96000005 [#1] SMP
> > [   25.779724] Modules linked in:
> > [   25.782768] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc6-mpam+ #115
> > [   25.789714] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI Nemo 2.0 RC0 - B305 05/28/2018
> > [   25.798831] pstate: 80c00009 (Nzcv daif +PAN +UAO)
> > [   25.803612] pc : __alloc_pages_nodemask+0xf0/0xe70
> > [   25.808389] lr : __alloc_pages_nodemask+0x184/0xe70
> > [   25.813252] sp : ffff00000996f660
> > [   25.816553] x29: ffff00000996f660 x28: 0000000000000000
> > [   25.821852] x27: 00000000014012c0 x26: 0000000000000000
> > [   25.827150] x25: 0000000000000003 x24: ffff000008099eac
> > [   25.832449] x23: 0000000000400000 x22: 0000000000000000
> > [   25.837747] x21: 0000000000000001 x20: 0000000000000000
> > [   25.843045] x19: 0000000000400000 x18: 0000000000010e00
> > [   25.848343] x17: 000000000437f790 x16: 0000000000000020
> > [   25.853641] x15: 0000000000000000 x14: 6549435020524541
> > [   25.858939] x13: 20454d502067756c x12: 0000000000000000
> > [   25.864237] x11: ffff00000996f6f0 x10: 0000000000000006
> > [   25.869536] x9 : 00000000000012a4 x8 : ffff8023c000ff90
> > [   25.874834] x7 : 0000000000000000 x6 : ffff000008d73c08
> > [   25.880132] x5 : 0000000000000000 x4 : 0000000000000081
> > [   25.885430] x3 : 0000000000000000 x2 : 0000000000000000
> > [   25.890728] x1 : 0000000000000001 x0 : 0000000000001980
> > [   25.896027] Process swapper/0 (pid: 1, stack limit = 0x        (ptrval))
> > [   25.902712] Call trace:
> > [   25.905146]  __alloc_pages_nodemask+0xf0/0xe70
> > [   25.909577]  allocate_slab+0x94/0x590
> > [   25.913225]  new_slab+0x68/0xc8
> > [   25.916353]  ___slab_alloc+0x444/0x4f8
> > [   25.920088]  __slab_alloc+0x50/0x68
> > [   25.923562]  kmem_cache_alloc_node_trace+0xe8/0x230
> > [   25.928426]  pci_acpi_scan_root+0x94/0x278
> > [   25.932510]  acpi_pci_root_add+0x228/0x4b0
> > [   25.936593]  acpi_bus_attach+0x10c/0x218
> > [   25.940501]  acpi_bus_attach+0xac/0x218
> > [   25.944323]  acpi_bus_attach+0xac/0x218
> > [   25.948144]  acpi_bus_scan+0x5c/0xc0
> > [   25.951708]  acpi_scan_init+0xf8/0x254
> > [   25.955443]  acpi_init+0x310/0x37c
> > [   25.958831]  do_one_initcall+0x54/0x208
> > [   25.962653]  kernel_init_freeable+0x244/0x340
> > [   25.966999]  kernel_init+0x18/0x118
> > [   25.970474]  ret_from_fork+0x10/0x1c
> > [   25.974036] Code: 7100047f 321902a4 1a950095 b5000602 (b9400803)
> > [   25.980162] ---[ end trace 64f0893eb21ec283 ]---
> > [   25.984765] Kernel panic - not syncing: Fatal exception
> >
> > Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
> > Tested-by: Huiqiang Wang <wanghuiqiang@huawei.com>
> > Cc: Hanjun Guo <hanjun.guo@linaro.org>
> > Cc: Tomasz Nowicki <Tomasz.Nowicki@caviumnetworks.com>
> > Cc: Xishi Qiu <qiuxishi@huawei.com>
> > ---
> >  arch/arm64/kernel/pci.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
> > index 0e2ea1c..e17cc45 100644
> > --- a/arch/arm64/kernel/pci.c
> > +++ b/arch/arm64/kernel/pci.c
> > @@ -170,6 +170,9 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
> >       struct pci_bus *bus, *child;
> >       struct acpi_pci_root_ops *root_ops;
> >
> > +     if (node != NUMA_NO_NODE && !node_online(node))
> > +             node = NUMA_NO_NODE;
> > +
>
> This really feels like a bodge, but it does appear to be what other
> architectures do, so:
>
> Acked-by: Will Deacon <will.deacon@arm.com>

I agree, this doesn't feel like something we should be avoiding in the
caller of kzalloc_node().

I would not expect kzalloc_node() to return memory that's offline, no
matter what node we told it to allocate from.  I could imagine it
returning failure, or returning memory from a node that *is* online,
but returning a pointer to offline memory seems broken.

Are we putting memory that's offline in the free list?  I don't know
where to look to figure this out.

Bjorn

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-06 20:39       ` Bjorn Helgaas
  (?)
@ 2018-06-07 10:55         ` Michal Hocko
  -1 siblings, 0 replies; 103+ messages in thread
From: Michal Hocko @ 2018-06-07 10:55 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Will Deacon, xiexiuqi, Catalin Marinas, Greg Kroah-Hartman,
	Rafael J. Wysocki, Jarkko Sakkinen, linux-arm,
	Linux Kernel Mailing List, Hanjun Guo, wanghuiqiang, tnowicki,
	linux-pci, Andrew Morton, linux-mm

On Wed 06-06-18 15:39:34, Bjorn Helgaas wrote:
> [+cc akpm, linux-mm, linux-pci]
> 
> On Wed, Jun 6, 2018 at 10:44 AM Will Deacon <will.deacon@arm.com> wrote:
> >
> > On Thu, May 31, 2018 at 08:14:38PM +0800, Xie XiuQi wrote:
> > > A NUMA system may return a node that is not online.
> > > For example, a NUMA node:
> > > 1) without memory
> > > 2) whose CPUs were not brought up because NR_CPUS is too small
> > >
> > > In this situation, use NUMA_NO_NODE to avoid an oops.
> > >
> > > [   25.732905] Unable to handle kernel NULL pointer dereference at virtual address 00001988
> > > [   25.740982] Mem abort info:
> > > [   25.743762]   ESR = 0x96000005
> > > [   25.746803]   Exception class = DABT (current EL), IL = 32 bits
> > > [   25.752711]   SET = 0, FnV = 0
> > > [   25.755751]   EA = 0, S1PTW = 0
> > > [   25.758878] Data abort info:
> > > [   25.761745]   ISV = 0, ISS = 0x00000005
> > > [   25.765568]   CM = 0, WnR = 0
> > > [   25.768521] [0000000000001988] user address but active_mm is swapper
> > > [   25.774861] Internal error: Oops: 96000005 [#1] SMP
> > > [   25.779724] Modules linked in:
> > > [   25.782768] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc6-mpam+ #115
> > > [   25.789714] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI Nemo 2.0 RC0 - B305 05/28/2018
> > > [   25.798831] pstate: 80c00009 (Nzcv daif +PAN +UAO)
> > > [   25.803612] pc : __alloc_pages_nodemask+0xf0/0xe70
> > > [   25.808389] lr : __alloc_pages_nodemask+0x184/0xe70
> > > [   25.813252] sp : ffff00000996f660
> > > [   25.816553] x29: ffff00000996f660 x28: 0000000000000000
> > > [   25.821852] x27: 00000000014012c0 x26: 0000000000000000
> > > [   25.827150] x25: 0000000000000003 x24: ffff000008099eac
> > > [   25.832449] x23: 0000000000400000 x22: 0000000000000000
> > > [   25.837747] x21: 0000000000000001 x20: 0000000000000000
> > > [   25.843045] x19: 0000000000400000 x18: 0000000000010e00
> > > [   25.848343] x17: 000000000437f790 x16: 0000000000000020
> > > [   25.853641] x15: 0000000000000000 x14: 6549435020524541
> > > [   25.858939] x13: 20454d502067756c x12: 0000000000000000
> > > [   25.864237] x11: ffff00000996f6f0 x10: 0000000000000006
> > > [   25.869536] x9 : 00000000000012a4 x8 : ffff8023c000ff90
> > > [   25.874834] x7 : 0000000000000000 x6 : ffff000008d73c08
> > > [   25.880132] x5 : 0000000000000000 x4 : 0000000000000081
> > > [   25.885430] x3 : 0000000000000000 x2 : 0000000000000000
> > > [   25.890728] x1 : 0000000000000001 x0 : 0000000000001980
> > > [   25.896027] Process swapper/0 (pid: 1, stack limit = 0x        (ptrval))
> > > [   25.902712] Call trace:
> > > [   25.905146]  __alloc_pages_nodemask+0xf0/0xe70
> > > [   25.909577]  allocate_slab+0x94/0x590
> > > [   25.913225]  new_slab+0x68/0xc8
> > > [   25.916353]  ___slab_alloc+0x444/0x4f8
> > > [   25.920088]  __slab_alloc+0x50/0x68
> > > [   25.923562]  kmem_cache_alloc_node_trace+0xe8/0x230
> > > [   25.928426]  pci_acpi_scan_root+0x94/0x278
> > > [   25.932510]  acpi_pci_root_add+0x228/0x4b0
> > > [   25.936593]  acpi_bus_attach+0x10c/0x218
> > > [   25.940501]  acpi_bus_attach+0xac/0x218
> > > [   25.944323]  acpi_bus_attach+0xac/0x218
> > > [   25.948144]  acpi_bus_scan+0x5c/0xc0
> > > [   25.951708]  acpi_scan_init+0xf8/0x254
> > > [   25.955443]  acpi_init+0x310/0x37c
> > > [   25.958831]  do_one_initcall+0x54/0x208
> > > [   25.962653]  kernel_init_freeable+0x244/0x340
> > > [   25.966999]  kernel_init+0x18/0x118
> > > [   25.970474]  ret_from_fork+0x10/0x1c
> > > [   25.974036] Code: 7100047f 321902a4 1a950095 b5000602 (b9400803)
> > > [   25.980162] ---[ end trace 64f0893eb21ec283 ]---
> > > [   25.984765] Kernel panic - not syncing: Fatal exception
> > >
> > > Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
> > > Tested-by: Huiqiang Wang <wanghuiqiang@huawei.com>
> > > Cc: Hanjun Guo <hanjun.guo@linaro.org>
> > > Cc: Tomasz Nowicki <Tomasz.Nowicki@caviumnetworks.com>
> > > Cc: Xishi Qiu <qiuxishi@huawei.com>
> > > ---
> > >  arch/arm64/kernel/pci.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
> > > index 0e2ea1c..e17cc45 100644
> > > --- a/arch/arm64/kernel/pci.c
> > > +++ b/arch/arm64/kernel/pci.c
> > > @@ -170,6 +170,9 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
> > >       struct pci_bus *bus, *child;
> > >       struct acpi_pci_root_ops *root_ops;
> > >
> > > +     if (node != NUMA_NO_NODE && !node_online(node))
> > > +             node = NUMA_NO_NODE;
> > > +
> >
> > This really feels like a bodge, but it does appear to be what other
> > architectures do, so:
> >
> > Acked-by: Will Deacon <will.deacon@arm.com>
> 
> I agree, this doesn't feel like something we should be avoiding in the
> caller of kzalloc_node().
> 
> I would not expect kzalloc_node() to return memory that's offline, no
> matter what node we told it to allocate from.  I could imagine it
> returning failure, or returning memory from a node that *is* online,
> but returning a pointer to offline memory seems broken.
> 
> Are we putting memory that's offline in the free list?  I don't know
> where to look to figure this out.

I am not sure I have the full context, but pci_acpi_scan_root calls
kzalloc_node(sizeof(*info), GFP_KERNEL, node)
and that should fall back to whatever node is online. An offline node
shouldn't keep any pages behind. So there must be something else going
on here, and the patch is not the right way to handle it. What does
faddr2line __alloc_pages_nodemask+0xf0 tell us on this kernel?
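
For reference, a minimal sketch of the caller-side guard being discussed
(it mirrors the pci_acpi_scan_root() hunk quoted above; the wrapper
function and its name are illustrative only, not actual kernel code):

#include <linux/slab.h>
#include <linux/nodemask.h>
#include <linux/numa.h>

/* Clamp a firmware-provided node before a node-affine allocation. */
static void *alloc_info_on_node(size_t size, int node)
{
	/* Firmware (SRAT/_PXM) may hand us a node that was never onlined. */
	if (node != NUMA_NO_NODE && !node_online(node))
		node = NUMA_NO_NODE;

	return kzalloc_node(size, GFP_KERNEL, node);
}

With node clamped to NUMA_NO_NODE, kzalloc_node() behaves like plain
kzalloc() and the allocator is free to pick any online node.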

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-07 10:55         ` Michal Hocko
@ 2018-06-07 11:55           ` Hanjun Guo
  -1 siblings, 0 replies; 103+ messages in thread
From: Hanjun Guo @ 2018-06-07 11:55 UTC (permalink / raw)
  To: Michal Hocko, Bjorn Helgaas
  Cc: Will Deacon, xiexiuqi, Catalin Marinas, Greg Kroah-Hartman,
	Rafael J. Wysocki, Jarkko Sakkinen, linux-arm,
	Linux Kernel Mailing List, wanghuiqiang, tnowicki, linux-pci,
	Andrew Morton, linux-mm

On 2018/6/7 18:55, Michal Hocko wrote:
> On Wed 06-06-18 15:39:34, Bjorn Helgaas wrote:
>> [+cc akpm, linux-mm, linux-pci]
>>
>> On Wed, Jun 6, 2018 at 10:44 AM Will Deacon <will.deacon@arm.com> wrote:
>>>
>>> On Thu, May 31, 2018 at 08:14:38PM +0800, Xie XiuQi wrote:
>>>> A numa system may return node which is not online.
>>>> For example, a numa node:
>>>> 1) without memory
>>>> 2) NR_CPUS is very small, and the cpus on the node are not brought up
>>>>
>>>> In this situation, we use NUMA_NO_NODE to avoid oops.
>>>>
>>>> [   25.732905] Unable to handle kernel NULL pointer dereference at virtual address 00001988
>>>> [   25.740982] Mem abort info:
>>>> [   25.743762]   ESR = 0x96000005
>>>> [   25.746803]   Exception class = DABT (current EL), IL = 32 bits
>>>> [   25.752711]   SET = 0, FnV = 0
>>>> [   25.755751]   EA = 0, S1PTW = 0
>>>> [   25.758878] Data abort info:
>>>> [   25.761745]   ISV = 0, ISS = 0x00000005
>>>> [   25.765568]   CM = 0, WnR = 0
>>>> [   25.768521] [0000000000001988] user address but active_mm is swapper
>>>> [   25.774861] Internal error: Oops: 96000005 [#1] SMP
>>>> [   25.779724] Modules linked in:
>>>> [   25.782768] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc6-mpam+ #115
>>>> [   25.789714] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI Nemo 2.0 RC0 - B305 05/28/2018
>>>> [   25.798831] pstate: 80c00009 (Nzcv daif +PAN +UAO)
>>>> [   25.803612] pc : __alloc_pages_nodemask+0xf0/0xe70
>>>> [   25.808389] lr : __alloc_pages_nodemask+0x184/0xe70
>>>> [   25.813252] sp : ffff00000996f660
>>>> [   25.816553] x29: ffff00000996f660 x28: 0000000000000000
>>>> [   25.821852] x27: 00000000014012c0 x26: 0000000000000000
>>>> [   25.827150] x25: 0000000000000003 x24: ffff000008099eac
>>>> [   25.832449] x23: 0000000000400000 x22: 0000000000000000
>>>> [   25.837747] x21: 0000000000000001 x20: 0000000000000000
>>>> [   25.843045] x19: 0000000000400000 x18: 0000000000010e00
>>>> [   25.848343] x17: 000000000437f790 x16: 0000000000000020
>>>> [   25.853641] x15: 0000000000000000 x14: 6549435020524541
>>>> [   25.858939] x13: 20454d502067756c x12: 0000000000000000
>>>> [   25.864237] x11: ffff00000996f6f0 x10: 0000000000000006
>>>> [   25.869536] x9 : 00000000000012a4 x8 : ffff8023c000ff90
>>>> [   25.874834] x7 : 0000000000000000 x6 : ffff000008d73c08
>>>> [   25.880132] x5 : 0000000000000000 x4 : 0000000000000081
>>>> [   25.885430] x3 : 0000000000000000 x2 : 0000000000000000
>>>> [   25.890728] x1 : 0000000000000001 x0 : 0000000000001980
>>>> [   25.896027] Process swapper/0 (pid: 1, stack limit = 0x        (ptrval))
>>>> [   25.902712] Call trace:
>>>> [   25.905146]  __alloc_pages_nodemask+0xf0/0xe70
>>>> [   25.909577]  allocate_slab+0x94/0x590
>>>> [   25.913225]  new_slab+0x68/0xc8
>>>> [   25.916353]  ___slab_alloc+0x444/0x4f8
>>>> [   25.920088]  __slab_alloc+0x50/0x68
>>>> [   25.923562]  kmem_cache_alloc_node_trace+0xe8/0x230
>>>> [   25.928426]  pci_acpi_scan_root+0x94/0x278
>>>> [   25.932510]  acpi_pci_root_add+0x228/0x4b0
>>>> [   25.936593]  acpi_bus_attach+0x10c/0x218
>>>> [   25.940501]  acpi_bus_attach+0xac/0x218
>>>> [   25.944323]  acpi_bus_attach+0xac/0x218
>>>> [   25.948144]  acpi_bus_scan+0x5c/0xc0
>>>> [   25.951708]  acpi_scan_init+0xf8/0x254
>>>> [   25.955443]  acpi_init+0x310/0x37c
>>>> [   25.958831]  do_one_initcall+0x54/0x208
>>>> [   25.962653]  kernel_init_freeable+0x244/0x340
>>>> [   25.966999]  kernel_init+0x18/0x118
>>>> [   25.970474]  ret_from_fork+0x10/0x1c
>>>> [   25.974036] Code: 7100047f 321902a4 1a950095 b5000602 (b9400803)
>>>> [   25.980162] ---[ end trace 64f0893eb21ec283 ]---
>>>> [   25.984765] Kernel panic - not syncing: Fatal exception
>>>>
>>>> Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
>>>> Tested-by: Huiqiang Wang <wanghuiqiang@huawei.com>
>>>> Cc: Hanjun Guo <hanjun.guo@linaro.org>
>>>> Cc: Tomasz Nowicki <Tomasz.Nowicki@caviumnetworks.com>
>>>> Cc: Xishi Qiu <qiuxishi@huawei.com>
>>>> ---
>>>>  arch/arm64/kernel/pci.c | 3 +++
>>>>  1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
>>>> index 0e2ea1c..e17cc45 100644
>>>> --- a/arch/arm64/kernel/pci.c
>>>> +++ b/arch/arm64/kernel/pci.c
>>>> @@ -170,6 +170,9 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
>>>>       struct pci_bus *bus, *child;
>>>>       struct acpi_pci_root_ops *root_ops;
>>>>
>>>> +     if (node != NUMA_NO_NODE && !node_online(node))
>>>> +             node = NUMA_NO_NODE;
>>>> +
>>>
>>> This really feels like a bodge, but it does appear to be what other
>>> architectures do, so:
>>>
>>> Acked-by: Will Deacon <will.deacon@arm.com>
>>
>> I agree, this doesn't feel like something we should be avoiding in the
>> caller of kzalloc_node().
>>
>> I would not expect kzalloc_node() to return memory that's offline, no
>> matter what node we told it to allocate from.  I could imagine it
>> returning failure, or returning memory from a node that *is* online,
>> but returning a pointer to offline memory seems broken.
>>
>> Are we putting memory that's offline in the free list?  I don't know
>> where to look to figure this out.
> 
> I am not sure I have the full context, but pci_acpi_scan_root calls
> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
> and that should fall back to whatever node is online. An offline node
> shouldn't keep any pages behind. So there must be something else going
> on here, and the patch is not the right way to handle it. What does
> faddr2line __alloc_pages_nodemask+0xf0 tell us on this kernel?

The whole context is:

The system is booted with a NUMA node that has no memory attached to it
(a memory-less NUMA node), and with NR_CPUS smaller than the number of
CPUs presented in the MADT, so the CPUs on this memory-less node are not
brought up and the NUMA node is never brought online (although the SRAT
presents this NUMA node);

Devices attached to this NUMA node, such as the PCI host bridge, still
return a valid NUMA node via _PXM, but that valid NUMA node is not
online, which leads to this issue.
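
A hedged sketch of that path (acpi_get_node() is the mainline helper
that translates _PXM to a node id; the check and message below are
illustrative only):

	int node = acpi_get_node(root->device->handle);

	/*
	 * SRAT declared the node, so the id is "valid"
	 * (0 <= node < MAX_NUMNODES), but nothing ever onlined it.
	 */
	if (node != NUMA_NO_NODE && !node_online(node))
		pr_warn("_PXM gave node %d, which is not online\n", node);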

Thanks
Hanjun

> 

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-07 11:55           ` Hanjun Guo
@ 2018-06-07 12:21             ` Michal Hocko
  -1 siblings, 0 replies; 103+ messages in thread
From: Michal Hocko @ 2018-06-07 12:21 UTC (permalink / raw)
  To: Hanjun Guo
  Cc: Bjorn Helgaas, Will Deacon, xiexiuqi, Catalin Marinas,
	Greg Kroah-Hartman, Rafael J. Wysocki, Jarkko Sakkinen,
	linux-arm, Linux Kernel Mailing List, wanghuiqiang, tnowicki,
	linux-pci, Andrew Morton, linux-mm

On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
> On 2018/6/7 18:55, Michal Hocko wrote:
[...]
> > I am not sure I have the full context, but pci_acpi_scan_root calls
> > kzalloc_node(sizeof(*info), GFP_KERNEL, node)
> > and that should fall back to whatever node is online. An offline node
> > shouldn't keep any pages behind. So there must be something else going
> > on here, and the patch is not the right way to handle it. What does
> > faddr2line __alloc_pages_nodemask+0xf0 tell us on this kernel?
> 
> The whole context is:
> 
> The system is booted with a NUMA node that has no memory attached to it
> (a memory-less NUMA node), and with NR_CPUS smaller than the number of
> CPUs presented in the MADT, so the CPUs on this memory-less node are not
> brought up and the NUMA node is never brought online (although the SRAT
> presents this NUMA node);
> 
> Devices attached to this NUMA node, such as the PCI host bridge, still
> return a valid NUMA node via _PXM, but that valid NUMA node is not
> online, which leads to this issue.

But we should have other NUMA nodes on the zonelists, so the allocator
should fall back to another node. If the zonelist is not initialized
properly, though, then this can indeed show up as a problem. Knowing
which exact place has blown up would help get a better picture...
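
One candidate, as a sketch (node_zonelist() below is the real
include/linux/gfp.h helper of this era; reading the oops as a NULL
pgdat dereference is an assumption, not something confirmed in this
thread yet):

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
	/*
	 * The zonelist is found by dereferencing the node's pgdat
	 * *before* any fallback logic runs; if NODE_DATA(nid) is NULL
	 * because nid was never initialized, this access faults at a
	 * small offset -- consistent with the 0x1988 fault address in
	 * the oops above.
	 */
	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}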

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-07 12:21             ` Michal Hocko
@ 2018-06-11  3:23               ` Xie XiuQi
  -1 siblings, 0 replies; 103+ messages in thread
From: Xie XiuQi @ 2018-06-11  3:23 UTC (permalink / raw)
  To: Michal Hocko, Hanjun Guo
  Cc: Bjorn Helgaas, Will Deacon, Catalin Marinas, Greg Kroah-Hartman,
	Rafael J. Wysocki, Jarkko Sakkinen, linux-arm,
	Linux Kernel Mailing List, wanghuiqiang, tnowicki, linux-pci,
	Andrew Morton, linux-mm, zhongjiang

Hi Michal,

On 2018/6/7 20:21, Michal Hocko wrote:
> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
>> On 2018/6/7 18:55, Michal Hocko wrote:
> [...]
>>> I am not sure I have the full context, but pci_acpi_scan_root calls
>>> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
>>> and that should fall back to whatever node is online. An offline node
>>> shouldn't keep any pages behind. So there must be something else going
>>> on here, and the patch is not the right way to handle it. What does
>>> faddr2line __alloc_pages_nodemask+0xf0 tell us on this kernel?
>>
>> The whole context is:
>>
>> The system is booted with a NUMA node that has no memory attached to it
>> (a memory-less NUMA node), and with NR_CPUS smaller than the number of
>> CPUs presented in the MADT, so the CPUs on this memory-less node are not
>> brought up and the NUMA node is never brought online (although the SRAT
>> presents this NUMA node);
>>
>> Devices attached to this NUMA node, such as the PCI host bridge, still
>> return a valid NUMA node via _PXM, but that valid NUMA node is not
>> online, which leads to this issue.
> 
> But we should have other NUMA nodes on the zonelists, so the allocator
> should fall back to another node. If the zonelist is not initialized
> properly, though, then this can indeed show up as a problem. Knowing
> which exact place has blown up would help get a better picture...
> 

I specified a non-existent node when allocating memory with kzalloc_node,
and got the following error message.

I found that there is only a VM_WARN_ON, which does not prevent the
memory allocation from continuing.

The nid is then used to access NODE_DATA(nid), so if nid is invalid,
it causes an oops here.

/*
 * Allocate pages, preferring the node given as nid. The node must be valid and
 * online. For more general interface, see alloc_pages_node().
 */
static inline struct page *
__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
{
	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
	VM_WARN_ON(!node_online(nid));

	return __alloc_pages(gfp_mask, order, nid);
}

(from include/linux/gfp.h)

(I wrote a kernel module to allocate memory on a non-existent node using kzalloc_node().)
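
A rough reconstruction of that module, for illustration (the module
name "bench" and the init function mpidr_init appear in the log below;
the allocation size and the node id 4 are assumptions):

#include <linux/module.h>
#include <linux/slab.h>

static int __init mpidr_init(void)
{
	/* node 4: assumed present in SRAT but never brought online */
	void *p = kzalloc_node(64, GFP_KERNEL, 4);

	kfree(p);
	return 0;
}

module_init(mpidr_init);
MODULE_LICENSE("GPL");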

[  120.061693] WARNING: CPU: 6 PID: 3966 at ./include/linux/gfp.h:467 allocate_slab+0x5fd/0x7e0
[  120.070095] Modules linked in: bench(OE+) nls_utf8 isofs loop xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter dm_mirror dm_region_hash dm_log dm_mod intel_rapl skx_edac nfit vfat libnvdimm fat x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass iTCO_wdt crct10dif_pclmul iTCO_vendor_support crc32_pclmul ghash_clmulni_intel ses pcbc enclosure aesni_intel scsi_transport_sas crypto_simd cryptd sg glue_helper ipmi_si joydev mei_me i2c_i801 ipmi_devintf ioatdma shpchp pcspkr ipmi_msghandler mei dca i2c_core lpc_ich acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables
[  120.140992]  ext4 mbcache jbd2 sd_mod crc32c_intel i40e ahci libahci megaraid_sas libata
[  120.149053] CPU: 6 PID: 3966 Comm: insmod Tainted: G           OE     4.17.0-rc2-RHEL74+ #5
[  120.157369] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.62 03/26/2018
[  120.164303] RIP: 0010:allocate_slab+0x5fd/0x7e0
[  120.168817] RSP: 0018:ffff881196947af0 EFLAGS: 00010246
[  120.174022] RAX: 0000000000000000 RBX: 00000000014012c0 RCX: ffffffffb4bc8173
[  120.181126] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8817aefa7868
[  120.188233] RBP: 00000000014000c0 R08: ffffed02f5df4f0e R09: ffffed02f5df4f0e
[  120.195338] R10: ffffed02f5df4f0d R11: ffff8817aefa786f R12: 0000000000000055
[  120.202444] R13: 0000000000000003 R14: ffff880107c0f800 R15: 0000000000000000
[  120.209550] FS:  00007f6935d8c740(0000) GS:ffff8817aef80000(0000) knlGS:0000000000000000
[  120.217606] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  120.223330] CR2: 0000000000c21b88 CR3: 0000001197fd0006 CR4: 00000000007606e0
[  120.230435] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  120.237541] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  120.244646] PKRU: 55555554
[  120.247346] Call Trace:
[  120.249791]  ? __kasan_slab_free+0xff/0x150
[  120.253960]  ? mpidr_init+0x20/0x30 [bench]
[  120.258129]  new_slab+0x3d/0x90
[  120.261262]  ___slab_alloc+0x371/0x640
[  120.265002]  ? __wake_up_common+0x8a/0x150
[  120.269085]  ? mpidr_init+0x20/0x30 [bench]
[  120.273254]  ? mpidr_init+0x20/0x30 [bench]
[  120.277423]  __slab_alloc+0x40/0x66
[  120.280901]  kmem_cache_alloc_node_trace+0xbc/0x270
[  120.285762]  ? mpidr_init+0x20/0x30 [bench]
[  120.289931]  ? 0xffffffffc0740000
[  120.293236]  mpidr_init+0x20/0x30 [bench]
[  120.297236]  do_one_initcall+0x4b/0x1f5
[  120.301062]  ? do_init_module+0x22/0x233
[  120.304972]  ? kmem_cache_alloc_trace+0xfe/0x220
[  120.309571]  ? do_init_module+0x22/0x233
[  120.313481]  do_init_module+0x77/0x233
[  120.317218]  load_module+0x21ea/0x2960
[  120.320955]  ? m_show+0x1d0/0x1d0
[  120.324264]  ? security_capable+0x39/0x50
[  120.328261]  __do_sys_finit_module+0x94/0xe0
[  120.332516]  do_syscall_64+0x55/0x180
[  120.336171]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  120.341203] RIP: 0033:0x7f69352627f9
[  120.344767] RSP: 002b:00007ffd7d73f718 EFLAGS: 00000206 ORIG_RAX: 0000000000000139
[  120.352305] RAX: ffffffffffffffda RBX: 0000000000c201d0 RCX: 00007f69352627f9
[  120.359411] RDX: 0000000000000000 RSI: 000000000041a2d8 RDI: 0000000000000003
[  120.366517] RBP: 000000000041a2d8 R08: 0000000000000000 R09: 00007ffd7d73f8b8
[  120.373622] R10: 0000000000000003 R11: 0000000000000206 R12: 0000000000000000
[  120.380727] R13: 0000000000c20130 R14: 0000000000000000 R15: 0000000000000000
[  120.387833] Code: 4b e8 ac 97 eb ff e9 e1 fc ff ff 89 de 89 ef e8 7a 35 ff ff 49 89 c7 4d 85 ff 74 71 0f 1f 44 00 00 e9 f1 fa ff ff e8 cf 54 00 00 <0f> 0b 90 e9 c4 fa ff ff 45 89 e8 b9 b1 05 00 00 48 c7 c2 10 79
[  120.406620] ---[ end trace 89f801c36550734e ]---
[  120.411234] BUG: unable to handle kernel paging request at 0000000000002088
[  120.418168] PGD 8000001197c75067 P4D 8000001197c75067 PUD 119858f067 PMD 0
[  120.425103] Oops: 0000 [#1] SMP KASAN PTI
[  120.429097] Modules linked in: bench(OE+) nls_utf8 isofs loop xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter dm_mirror dm_region_hash dm_log dm_mod intel_rapl skx_edac nfit vfat libnvdimm fat x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass iTCO_wdt crct10dif_pclmul iTCO_vendor_support crc32_pclmul ghash_clmulni_intel ses pcbc enclosure aesni_intel scsi_transport_sas crypto_simd cryptd sg glue_helper ipmi_si joydev mei_me i2c_i801 ipmi_devintf ioatdma shpchp pcspkr ipmi_msghandler mei dca i2c_core lpc_ich acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables
[  120.499986]  ext4 mbcache jbd2 sd_mod crc32c_intel i40e ahci libahci megaraid_sas libata
[  120.508045] CPU: 6 PID: 3966 Comm: insmod Tainted: G        W  OE     4.17.0-rc2-RHEL74+ #5
[  120.516359] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.62 03/26/2018
[  120.523296] RIP: 0010:__alloc_pages_nodemask+0x10d/0x2c0
[  120.528586] RSP: 0018:ffff881196947a90 EFLAGS: 00010246
[  120.533790] RAX: 0000000000000001 RBX: 00000000014012c0 RCX: 0000000000000000
[  120.540895] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[  120.548000] RBP: 00000000014012c0 R08: ffffed0233ccb8f4 R09: ffffed0233ccb8f4
[  120.555105] R10: ffffed0233ccb8f3 R11: ffff88119e65c79f R12: 0000000000000000
[  120.562210] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[  120.569316] FS:  00007f6935d8c740(0000) GS:ffff8817aef80000(0000) knlGS:0000000000000000
[  120.577374] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  120.583095] CR2: 0000000000002088 CR3: 0000001197fd0006 CR4: 00000000007606e0
[  120.590200] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  120.597307] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  120.604412] PKRU: 55555554
[  120.607111] Call Trace:
[  120.609554]  allocate_slab+0xd8/0x7e0
[  120.613205]  ? __kasan_slab_free+0xff/0x150
[  120.617376]  ? mpidr_init+0x20/0x30 [bench]
[  120.621545]  new_slab+0x3d/0x90
[  120.624678]  ___slab_alloc+0x371/0x640
[  120.628415]  ? __wake_up_common+0x8a/0x150
[  120.632498]  ? mpidr_init+0x20/0x30 [bench]
[  120.636667]  ? mpidr_init+0x20/0x30 [bench]
[  120.640836]  __slab_alloc+0x40/0x66
[  120.644315]  kmem_cache_alloc_node_trace+0xbc/0x270
[  120.649175]  ? mpidr_init+0x20/0x30 [bench]
[  120.653343]  ? 0xffffffffc0740000
[  120.656649]  mpidr_init+0x20/0x30 [bench]
[  120.660645]  do_one_initcall+0x4b/0x1f5
[  120.664469]  ? do_init_module+0x22/0x233
[  120.668379]  ? kmem_cache_alloc_trace+0xfe/0x220
[  120.672978]  ? do_init_module+0x22/0x233
[  120.676887]  do_init_module+0x77/0x233
[  120.680624]  load_module+0x21ea/0x2960
[  120.684360]  ? m_show+0x1d0/0x1d0
[  120.687667]  ? security_capable+0x39/0x50
[  120.691663]  __do_sys_finit_module+0x94/0xe0
[  120.695920]  do_syscall_64+0x55/0x180
[  120.699571]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  120.704603] RIP: 0033:0x7f69352627f9
[  120.708166] RSP: 002b:00007ffd7d73f718 EFLAGS: 00000206 ORIG_RAX: 0000000000000139
[  120.715704] RAX: ffffffffffffffda RBX: 0000000000c201d0 RCX: 00007f69352627f9
[  120.722808] RDX: 0000000000000000 RSI: 000000000041a2d8 RDI: 0000000000000003
[  120.729913] RBP: 000000000041a2d8 R08: 0000000000000000 R09: 00007ffd7d73f8b8
[  120.737019] R10: 0000000000000003 R11: 0000000000000206 R12: 0000000000000000
[  120.744123] R13: 0000000000c20130 R14: 0000000000000000 R15: 0000000000000000
[  120.751230] Code: 89 c6 74 0d e8 55 ab 5e 00 8b 74 24 1c 48 8b 3c 24 48 8b 54 24 08 89 d9 c1 e9 17 83 e1 01 48 85 d2 88 4c 24 20 0f 85 25 01 00 00 <3b> 77 08 0f 82 1c 01 00 00 48 89 f8 44 89 ea 48 89 e1 44 89 e6
[  120.770020] RIP: __alloc_pages_nodemask+0x10d/0x2c0 RSP: ffff881196947a90
[  120.776780] CR2: 0000000000002088
[  120.780116] ---[ end trace 89f801c36550734f ]---
[  120.978922] Kernel panic - not syncing: Fatal exception
[  120.984186] Kernel Offset: 0x33800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  121.209501] ---[ end Kernel panic - not syncing: Fatal exception ]---



-- 
Thanks,
Xie XiuQi

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
@ 2018-06-11  3:23               ` Xie XiuQi
  0 siblings, 0 replies; 103+ messages in thread
From: Xie XiuQi @ 2018-06-11  3:23 UTC (permalink / raw)
  To: Michal Hocko, Hanjun Guo
  Cc: tnowicki, linux-pci, Catalin Marinas, Rafael J. Wysocki,
	Will Deacon, Linux Kernel Mailing List, Jarkko Sakkinen,
	linux-mm, wanghuiqiang, Greg Kroah-Hartman, Bjorn Helgaas,
	Andrew Morton, zhongjiang, linux-arm

Hi Michal,

On 2018/6/7 20:21, Michal Hocko wrote:
> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
>> On 2018/6/7 18:55, Michal Hocko wrote:
> [...]
>>> I am not sure I have the full context but pci_acpi_scan_root calls
>>> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
>>> and that should fall back to whatever node is online. An offline node
>>> shouldn't keep any pages behind. So there must be something else going
>>> on here and the patch is not the right way to handle it. What does
>>> faddr2line __alloc_pages_nodemask+0xf0 tell on this kernel?
>>
>> The whole context is:
>>
>> The system is booted with a NUMA node that has no memory attached to it
>> (a memory-less NUMA node), and also with NR_CPUS less than the number of
>> CPUs present in the MADT, so the CPUs on this memory-less node are not
>> brought up and this NUMA node will not be online (but the SRAT presents
>> this NUMA node);
>>
>> Devices attached to this NUMA node, such as the PCI host bridge, still
>> return the valid NUMA node via _PXM, but that valid NUMA node is not
>> online, which leads to this issue.
> 
> But we should have other NUMA nodes on the zonelists so the allocator
> should fall back to another node. If the zonelist is not initialized
> properly, though, then this can indeed show up as a problem. Knowing
> which exact place has blown up would help get a better picture...
> 

I specified a non-existent node to allocate memory using kzalloc_node,
and got the following error message.

And I found out there is just a VM_WARN_ON, but it does not prevent the
memory allocation from continuing.

This nid is then used to access NODE_DATA(nid), so if nid is invalid, it
causes an oops here.

459 /*
460  * Allocate pages, preferring the node given as nid. The node must be valid and
461  * online. For more general interface, see alloc_pages_node().
462  */
463 static inline struct page *
464 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
465 {
466         VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
467         VM_WARN_ON(!node_online(nid));
468
469         return __alloc_pages(gfp_mask, order, nid);
470 }
471
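
For context, the crash that follows this warning comes one step later:
the allocator looks up the node's zonelist via node_zonelist(), which
dereferences NODE_DATA(nid). For a node that was never initialized,
NODE_DATA(nid) is NULL, so the access faults at a small offset from NULL
(consistent with the CR2 value 0000000000002088 in the oops below). The
helper, lightly paraphrased from the same 4.17-era include/linux/gfp.h:

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
        /* NODE_DATA(nid) is NULL for a never-initialized node; this
         * dereference is what actually faults. */
        return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}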

(I wrote a kernel module to allocate memory on a non-existent node using kzalloc_node().)
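
A minimal sketch of such a test module (illustrative only: the module
name, the function names, and the hard-coded node ID are assumptions,
not the actual "bench" module from the log below):

#include <linux/module.h>
#include <linux/slab.h>

/* Allocate on a node that the SRAT presents but that was never brought
 * online; node 3 is an assumed example. */
static int __init badnode_init(void)
{
        void *p = kzalloc_node(64, GFP_KERNEL, 3);

        kfree(p);
        return 0;
}

static void __exit badnode_exit(void)
{
}

module_init(badnode_init);
module_exit(badnode_exit);
MODULE_LICENSE("GPL");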

[  120.061693] WARNING: CPU: 6 PID: 3966 at ./include/linux/gfp.h:467 allocate_slab+0x5fd/0x7e0
[  120.070095] Modules linked in: bench(OE+) nls_utf8 isofs loop xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter dm_mirror dm_region_hash dm_log dm_mod intel_rapl skx_edac nfit vfat libnvdimm fat x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass iTCO_wdt crct10dif_pclmul iTCO_vendor_support crc32_pclmul ghash_clmulni_intel ses pcbc enclosure aesni_intel scsi_transport_sas crypto_simd cryptd sg glue_helper ipmi_si joydev mei_me i2c_i801 ipmi_devintf ioatdma shpchp pcspkr ipmi_msghandler mei dca i2c_core lpc_ich acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables
[  120.140992]  ext4 mbcache jbd2 sd_mod crc32c_intel i40e ahci libahci megaraid_sas libata
[  120.149053] CPU: 6 PID: 3966 Comm: insmod Tainted: G           OE     4.17.0-rc2-RHEL74+ #5
[  120.157369] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.62 03/26/2018
[  120.164303] RIP: 0010:allocate_slab+0x5fd/0x7e0
[  120.168817] RSP: 0018:ffff881196947af0 EFLAGS: 00010246
[  120.174022] RAX: 0000000000000000 RBX: 00000000014012c0 RCX: ffffffffb4bc8173
[  120.181126] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8817aefa7868
[  120.188233] RBP: 00000000014000c0 R08: ffffed02f5df4f0e R09: ffffed02f5df4f0e
[  120.195338] R10: ffffed02f5df4f0d R11: ffff8817aefa786f R12: 0000000000000055
[  120.202444] R13: 0000000000000003 R14: ffff880107c0f800 R15: 0000000000000000
[  120.209550] FS:  00007f6935d8c740(0000) GS:ffff8817aef80000(0000) knlGS:0000000000000000
[  120.217606] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  120.223330] CR2: 0000000000c21b88 CR3: 0000001197fd0006 CR4: 00000000007606e0
[  120.230435] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  120.237541] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  120.244646] PKRU: 55555554
[  120.247346] Call Trace:
[  120.249791]  ? __kasan_slab_free+0xff/0x150
[  120.253960]  ? mpidr_init+0x20/0x30 [bench]
[  120.258129]  new_slab+0x3d/0x90
[  120.261262]  ___slab_alloc+0x371/0x640
[  120.265002]  ? __wake_up_common+0x8a/0x150
[  120.269085]  ? mpidr_init+0x20/0x30 [bench]
[  120.273254]  ? mpidr_init+0x20/0x30 [bench]
[  120.277423]  __slab_alloc+0x40/0x66
[  120.280901]  kmem_cache_alloc_node_trace+0xbc/0x270
[  120.285762]  ? mpidr_init+0x20/0x30 [bench]
[  120.289931]  ? 0xffffffffc0740000
[  120.293236]  mpidr_init+0x20/0x30 [bench]
[  120.297236]  do_one_initcall+0x4b/0x1f5
[  120.301062]  ? do_init_module+0x22/0x233
[  120.304972]  ? kmem_cache_alloc_trace+0xfe/0x220
[  120.309571]  ? do_init_module+0x22/0x233
[  120.313481]  do_init_module+0x77/0x233
[  120.317218]  load_module+0x21ea/0x2960
[  120.320955]  ? m_show+0x1d0/0x1d0
[  120.324264]  ? security_capable+0x39/0x50
[  120.328261]  __do_sys_finit_module+0x94/0xe0
[  120.332516]  do_syscall_64+0x55/0x180
[  120.336171]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  120.341203] RIP: 0033:0x7f69352627f9
[  120.344767] RSP: 002b:00007ffd7d73f718 EFLAGS: 00000206 ORIG_RAX: 0000000000000139
[  120.352305] RAX: ffffffffffffffda RBX: 0000000000c201d0 RCX: 00007f69352627f9
[  120.359411] RDX: 0000000000000000 RSI: 000000000041a2d8 RDI: 0000000000000003
[  120.366517] RBP: 000000000041a2d8 R08: 0000000000000000 R09: 00007ffd7d73f8b8
[  120.373622] R10: 0000000000000003 R11: 0000000000000206 R12: 0000000000000000
[  120.380727] R13: 0000000000c20130 R14: 0000000000000000 R15: 0000000000000000
[  120.387833] Code: 4b e8 ac 97 eb ff e9 e1 fc ff ff 89 de 89 ef e8 7a 35 ff ff 49 89 c7 4d 85 ff 74 71 0f 1f 44 00 00 e9 f1 fa ff ff e8 cf 54 00 00 <0f> 0b 90 e9 c4 fa ff ff 45 89 e8 b9 b1 05 00 00 48 c7 c2 10 79
[  120.406620] ---[ end trace 89f801c36550734e ]---
[  120.411234] BUG: unable to handle kernel paging request at 0000000000002088
[  120.418168] PGD 8000001197c75067 P4D 8000001197c75067 PUD 119858f067 PMD 0
[  120.425103] Oops: 0000 [#1] SMP KASAN PTI
[  120.429097] Modules linked in: bench(OE+) nls_utf8 isofs loop xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter dm_mirror dm_region_hash dm_log dm_mod intel_rapl skx_edac nfit vfat libnvdimm fat x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass iTCO_wdt crct10dif_pclmul iTCO_vendor_support crc32_pclmul ghash_clmulni_intel ses pcbc enclosure aesni_intel scsi_transport_sas crypto_simd cryptd sg glue_helper ipmi_si joydev mei_me i2c_i801 ipmi_devintf ioatdma shpchp pcspkr ipmi_msghandler mei dca i2c_core lpc_ich acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables
[  120.499986]  ext4 mbcache jbd2 sd_mod crc32c_intel i40e ahci libahci megaraid_sas libata
[  120.508045] CPU: 6 PID: 3966 Comm: insmod Tainted: G        W  OE     4.17.0-rc2-RHEL74+ #5
[  120.516359] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.62 03/26/2018
[  120.523296] RIP: 0010:__alloc_pages_nodemask+0x10d/0x2c0
[  120.528586] RSP: 0018:ffff881196947a90 EFLAGS: 00010246
[  120.533790] RAX: 0000000000000001 RBX: 00000000014012c0 RCX: 0000000000000000
[  120.540895] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[  120.548000] RBP: 00000000014012c0 R08: ffffed0233ccb8f4 R09: ffffed0233ccb8f4
[  120.555105] R10: ffffed0233ccb8f3 R11: ffff88119e65c79f R12: 0000000000000000
[  120.562210] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[  120.569316] FS:  00007f6935d8c740(0000) GS:ffff8817aef80000(0000) knlGS:0000000000000000
[  120.577374] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  120.583095] CR2: 0000000000002088 CR3: 0000001197fd0006 CR4: 00000000007606e0
[  120.590200] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  120.597307] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  120.604412] PKRU: 55555554
[  120.607111] Call Trace:
[  120.609554]  allocate_slab+0xd8/0x7e0
[  120.613205]  ? __kasan_slab_free+0xff/0x150
[  120.617376]  ? mpidr_init+0x20/0x30 [bench]
[  120.621545]  new_slab+0x3d/0x90
[  120.624678]  ___slab_alloc+0x371/0x640
[  120.628415]  ? __wake_up_common+0x8a/0x150
[  120.632498]  ? mpidr_init+0x20/0x30 [bench]
[  120.636667]  ? mpidr_init+0x20/0x30 [bench]
[  120.640836]  __slab_alloc+0x40/0x66
[  120.644315]  kmem_cache_alloc_node_trace+0xbc/0x270
[  120.649175]  ? mpidr_init+0x20/0x30 [bench]
[  120.653343]  ? 0xffffffffc0740000
[  120.656649]  mpidr_init+0x20/0x30 [bench]
[  120.660645]  do_one_initcall+0x4b/0x1f5
[  120.664469]  ? do_init_module+0x22/0x233
[  120.668379]  ? kmem_cache_alloc_trace+0xfe/0x220
[  120.672978]  ? do_init_module+0x22/0x233
[  120.676887]  do_init_module+0x77/0x233
[  120.680624]  load_module+0x21ea/0x2960
[  120.684360]  ? m_show+0x1d0/0x1d0
[  120.687667]  ? security_capable+0x39/0x50
[  120.691663]  __do_sys_finit_module+0x94/0xe0
[  120.695920]  do_syscall_64+0x55/0x180
[  120.699571]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  120.704603] RIP: 0033:0x7f69352627f9
[  120.708166] RSP: 002b:00007ffd7d73f718 EFLAGS: 00000206 ORIG_RAX: 0000000000000139
[  120.715704] RAX: ffffffffffffffda RBX: 0000000000c201d0 RCX: 00007f69352627f9
[  120.722808] RDX: 0000000000000000 RSI: 000000000041a2d8 RDI: 0000000000000003
[  120.729913] RBP: 000000000041a2d8 R08: 0000000000000000 R09: 00007ffd7d73f8b8
[  120.737019] R10: 0000000000000003 R11: 0000000000000206 R12: 0000000000000000
[  120.744123] R13: 0000000000c20130 R14: 0000000000000000 R15: 0000000000000000
[  120.751230] Code: 89 c6 74 0d e8 55 ab 5e 00 8b 74 24 1c 48 8b 3c 24 48 8b 54 24 08 89 d9 c1 e9 17 83 e1 01 48 85 d2 88 4c 24 20 0f 85 25 01 00 00 <3b> 77 08 0f 82 1c 01 00 00 48 89 f8 44 89 ea 48 89 e1 44 89 e6
[  120.770020] RIP: __alloc_pages_nodemask+0x10d/0x2c0 RSP: ffff881196947a90
[  120.776780] CR2: 0000000000002088
[  120.780116] ---[ end trace 89f801c36550734f ]---
[  120.978922] Kernel panic - not syncing: Fatal exception
[  120.984186] Kernel Offset: 0x33800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  121.209501] ---[ end Kernel panic - not syncing: Fatal exception ]---



-- 
Thanks,
Xie XiuQi


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-11  3:23               ` Xie XiuQi
@ 2018-06-11  8:52                 ` Michal Hocko
  -1 siblings, 0 replies; 103+ messages in thread
From: Michal Hocko @ 2018-06-11  8:52 UTC (permalink / raw)
  To: Xie XiuQi
  Cc: Hanjun Guo, Bjorn Helgaas, Will Deacon, Catalin Marinas,
	Greg Kroah-Hartman, Rafael J. Wysocki, Jarkko Sakkinen,
	linux-arm, Linux Kernel Mailing List, wanghuiqiang, tnowicki,
	linux-pci, Andrew Morton, linux-mm, zhongjiang

On Mon 11-06-18 11:23:18, Xie XiuQi wrote:
> Hi Michal,
> 
> On 2018/6/7 20:21, Michal Hocko wrote:
> > On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
> >> On 2018/6/7 18:55, Michal Hocko wrote:
> > [...]
> >>> I am not sure I have the full context but pci_acpi_scan_root calls
> >>> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
> >>> and that should fall back to whatever node is online. An offline node
> >>> shouldn't keep any pages behind. So there must be something else going
> >>> on here and the patch is not the right way to handle it. What does
> >>> faddr2line __alloc_pages_nodemask+0xf0 tell on this kernel?
> >>
> >> The whole context is:
> >>
> >> The system is booted with a NUMA node that has no memory attached to it
> >> (a memory-less NUMA node), and also with NR_CPUS less than the number of
> >> CPUs present in the MADT, so the CPUs on this memory-less node are not
> >> brought up and this NUMA node will not be online (but the SRAT presents
> >> this NUMA node);
> >>
> >> Devices attached to this NUMA node, such as the PCI host bridge, still
> >> return the valid NUMA node via _PXM, but that valid NUMA node is not
> >> online, which leads to this issue.
> > 
> > But we should have other NUMA nodes on the zonelists so the allocator
> > should fall back to another node. If the zonelist is not initialized
> > properly, though, then this can indeed show up as a problem. Knowing
> > which exact place has blown up would help get a better picture...
> > 
> 
> I specified a non-existent node to allocate memory using kzalloc_node,
> and got the following error message.
>
> And I found out there is just a VM_WARN_ON, but it does not prevent the
> memory allocation from continuing.
>
> This nid is then used to access NODE_DATA(nid), so if nid is invalid, it
> causes an oops here.
> 
> 459 /*
> 460  * Allocate pages, preferring the node given as nid. The node must be valid and
> 461  * online. For more general interface, see alloc_pages_node().
> 462  */
> 463 static inline struct page *
> 464 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
> 465 {
> 466         VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> 467         VM_WARN_ON(!node_online(nid));
> 468
> 469         return __alloc_pages(gfp_mask, order, nid);
> 470 }
> 471
> 
> (I wrote a kernel module to allocate memory on a non-existent node using kzalloc_node().)

OK, so this is artificially broken code, right. You shouldn't get a
non-existent node via standard APIs AFAICS. The original report was
about an existing node which is offline AFAIU. That would be a different
case. If I am missing something and there are legitimate users that try
to allocate from non-existing nodes then we should handle that in
node_zonelist.
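
(A fallback there might look roughly like the sketch below; this is a
hypothetical illustration of the suggestion, not an actual kernel change:)

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
        /* Hypothetical guard: redirect a never-initialized node to the
         * caller's nearest node with memory instead of dereferencing a
         * NULL pg_data_t. */
        if (unlikely(!node_online(nid)))
                nid = numa_mem_id();
        return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}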

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-11  8:52                 ` Michal Hocko
@ 2018-06-11 12:32                   ` Xie XiuQi
  -1 siblings, 0 replies; 103+ messages in thread
From: Xie XiuQi @ 2018-06-11 12:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hanjun Guo, Bjorn Helgaas, Will Deacon, Catalin Marinas,
	Greg Kroah-Hartman, Rafael J. Wysocki, Jarkko Sakkinen,
	linux-arm, Linux Kernel Mailing List, wanghuiqiang, tnowicki,
	linux-pci, Andrew Morton, linux-mm, zhongjiang

Hi Michal,

On 2018/6/11 16:52, Michal Hocko wrote:
> On Mon 11-06-18 11:23:18, Xie XiuQi wrote:
>> Hi Michal,
>>
>> On 2018/6/7 20:21, Michal Hocko wrote:
>>> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
>>>> On 2018/6/7 18:55, Michal Hocko wrote:
>>> [...]
>>>>> I am not sure I have the full context but pci_acpi_scan_root calls
>>>>> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
>>>>> and that should fall back to whatever node is online. An offline node
>>>>> shouldn't keep any pages behind. So there must be something else going
>>>>> on here and the patch is not the right way to handle it. What does
>>>>> faddr2line __alloc_pages_nodemask+0xf0 tell on this kernel?
>>>>
>>>> The whole context is:
>>>>
>>>> The system is booted with a NUMA node that has no memory attached to it
>>>> (a memory-less NUMA node), and also with NR_CPUS less than the number of
>>>> CPUs present in the MADT, so the CPUs on this memory-less node are not
>>>> brought up and this NUMA node will not be online (but the SRAT presents
>>>> this NUMA node);
>>>>
>>>> Devices attached to this NUMA node, such as the PCI host bridge, still
>>>> return the valid NUMA node via _PXM, but that valid NUMA node is not
>>>> online, which leads to this issue.
>>>
>>> But we should have other NUMA nodes on the zonelists so the allocator
>>> should fall back to another node. If the zonelist is not initialized
>>> properly, though, then this can indeed show up as a problem. Knowing
>>> which exact place has blown up would help get a better picture...
>>>
>>
>> I specified a non-existent node to allocate memory using kzalloc_node,
>> and got the following error message.
>>
>> And I found out there is just a VM_WARN_ON, but it does not prevent the
>> memory allocation from continuing.
>>
>> This nid is then used to access NODE_DATA(nid), so if nid is invalid, it
>> causes an oops here.
>>
>> 459 /*
>> 460  * Allocate pages, preferring the node given as nid. The node must be valid and
>> 461  * online. For more general interface, see alloc_pages_node().
>> 462  */
>> 463 static inline struct page *
>> 464 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>> 465 {
>> 466         VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> 467         VM_WARN_ON(!node_online(nid));
>> 468
>> 469         return __alloc_pages(gfp_mask, order, nid);
>> 470 }
>> 471
>>
>> (I wrote a kernel module to allocate memory on a non-existent node using kzalloc_node().)
> 
> OK, so this is artificially broken code, right. You shouldn't get a
> non-existent node via standard APIs AFAICS. The original report was
> about an existing node which is offline AFAIU. That would be a different
> case. If I am missing something and there are legitimate users that try
> to allocate from non-existing nodes then we should handle that in
> node_zonelist.

I think Hanjun's comments may help to understand this question:
 - A NUMA node will be built if CPUs and/or memory are valid on this NUMA node;

 - But if we boot the system with a memory-less node and with CONFIG_NR_CPUS
   less than the number of CPUs in the SRAT (for example, 64 CPUs total across
   4 NUMA nodes, 16 CPUs each, booted with CONFIG_NR_CPUS=48), then we will not
   build a NUMA node for node 3; but devices on that NUMA node still exist, and
   allocating memory on them will panic because NUMA node 3 is not a valid node.

I triggered this BUG on the arm64 platform, and I found that a similar bug had
been fixed on the x86 platform, so I sent a similar patch for this bug.

Or could we consider fixing it in the mm subsystem?

From b755de8dfdfef97effaa91379ffafcb81f4d62a1 Mon Sep 17 00:00:00 2001
From: Yinghai Lu <Yinghai.Lu@Sun.COM>
Date: Wed, 20 Feb 2008 12:41:52 -0800
Subject: [PATCH] x86: make dev_to_node return online node

a numa system (with multi HT chains) may return node without ram. Aka it
is not online. Try to get an online node, otherwise return -1.

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/pci/acpi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
index d95de2f..ea8685f 100644
--- a/arch/x86/pci/acpi.c
+++ b/arch/x86/pci/acpi.c
@@ -172,6 +172,9 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_device *device, int do
 		set_mp_bus_to_node(busnum, node);
 	else
 		node = get_mp_bus_to_node(busnum);
+
+	if (node != -1 && !node_online(node))
+		node = -1;
 #endif

 	/* Allocate per-root-bus (not per bus) arch-specific data.
-- 
1.8.3.1
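
The arm64 patch in this series applies the same idea before the
kzalloc_node() call in pci_acpi_scan_root(); roughly (a sketch of the
approach, not the exact diff under review):

	int node = acpi_get_node(root->device->handle);

	/* Clamp an SRAT-present but never-onlined node to NUMA_NO_NODE so
	 * the allocator falls back to a valid node. */
	if (node != NUMA_NO_NODE && !node_online(node))
		node = NUMA_NO_NODE;

	info = kzalloc_node(sizeof(*info), GFP_KERNEL, node);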


> 
> [...]
> 

-- 
Thanks,
Xie XiuQi

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
@ 2018-06-11 12:32                   ` Xie XiuQi
  0 siblings, 0 replies; 103+ messages in thread
From: Xie XiuQi @ 2018-06-11 12:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hanjun Guo, tnowicki, linux-pci, Catalin Marinas,
	Rafael J. Wysocki, Will Deacon, Linux Kernel Mailing List,
	Jarkko Sakkinen, linux-mm, wanghuiqiang, Greg Kroah-Hartman,
	Bjorn Helgaas, Andrew Morton, zhongjiang, linux-arm

Hi Michal,

On 2018/6/11 16:52, Michal Hocko wrote:
> On Mon 11-06-18 11:23:18, Xie XiuQi wrote:
>> Hi Michal,
>>
>> On 2018/6/7 20:21, Michal Hocko wrote:
>>> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
>>>> On 2018/6/7 18:55, Michal Hocko wrote:
>>> [...]
>>>>> I am not sure I have the full context but pci_acpi_scan_root calls
>>>>> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
>>>>> and that should fall back to whatever node that is online. Offline node
>>>>> shouldn't keep any pages behind. So there must be something else going
>>>>> on here and the patch is not the right way to handle it. What does
>>>>> faddr2line __alloc_pages_nodemask+0xf0 tells on this kernel?
>>>>
>>>> The whole context is:
>>>>
>>>> The system is booted with a NUMA node has no memory attaching to it
>>>> (memory-less NUMA node), also with NR_CPUS less than CPUs presented
>>>> in MADT, so CPUs on this memory-less node are not brought up, and
>>>> this NUMA node will not be online (but SRAT presents this NUMA node);
>>>>
>>>> Devices attaching to this NUMA node such as PCI host bridge still
>>>> return the valid NUMA node via _PXM, but actually that valid NUMA node
>>>> is not online which lead to this issue.
>>>
>>> But we should have other numa nodes on the zonelists so the allocator
>>> should fall back to other node. If the zonelist is not intiailized
>>> properly, though, then this can indeed show up as a problem. Knowing
>>> which exact place has blown up would help get a better picture...
>>>
>>
>> I specific a non-exist node to allocate memory using kzalloc_node,
>> and got this following error message.
>>
>> And I found out there is just a VM_WARN, but it does not prevent the memory
>> allocation continue.
>>
>> This nid would be use to access NODE_DADA(nid), so if nid is invalid,
>> it would cause oops here.
>>
>> 459 /*
>> 460  * Allocate pages, preferring the node given as nid. The node must be valid and
>> 461  * online. For more general interface, see alloc_pages_node().
>> 462  */
>> 463 static inline struct page *
>> 464 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>> 465 {
>> 466         VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> 467         VM_WARN_ON(!node_online(nid));
>> 468
>> 469         return __alloc_pages(gfp_mask, order, nid);
>> 470 }
>> 471
>>
>> (I wrote a ko, to allocate memory on a non-exist node using kzalloc_node().)
> 
> OK, so this is an artificialy broken code, right. You shouldn't get a
> non-existent node via standard APIs AFAICS. The original report was
> about an existing node which is offline AFAIU. That would be a different
> case. If I am missing something and there are legitimate users that try
> to allocate from non-existing nodes then we should handle that in
> node_zonelist.

I think hanjun's comments may help to understood this question:
 - NUMA node will be built if CPUs and (or) memory are valid on this NUMA node;

 - But if we boot the system with memory-less node and also with CONFIG_NR_CPUS
   less than CPUs in SRAT, for example, 64 CPUs total with 4 NUMA nodes, 16 CPUs
   on each NUMA node, if we boot with CONFIG_NR_CPUS=48, then we will not built
   numa node for node 3, but with devices on that numa node, alloc memory will
   be panic because NUMA node 3 is not a valid node.

I triggered this BUG on arm64 platform, and I found a similar bug has been fixed
on x86 platform. So I sent a similar patch for this bug.

Or, could we consider to fix it in the mm subsystem?

>From b755de8dfdfef97effaa91379ffafcb81f4d62a1 Mon Sep 17 00:00:00 2001
From: Yinghai Lu <Yinghai.Lu@Sun.COM>
Date: Wed, 20 Feb 2008 12:41:52 -0800
Subject: [PATCH] x86: make dev_to_node return online node

a numa system (with multi HT chains) may return node without ram. Aka it
is not online. Try to get an online node, otherwise return -1.

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/pci/acpi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
index d95de2f..ea8685f 100644
--- a/arch/x86/pci/acpi.c
+++ b/arch/x86/pci/acpi.c
@@ -172,6 +172,9 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_device *device, int do
 		set_mp_bus_to_node(busnum, node);
 	else
 		node = get_mp_bus_to_node(busnum);
+
+	if (node != -1 && !node_online(node))
+		node = -1;
 #endif

 	/* Allocate per-root-bus (not per bus) arch-specific data.
-- 
1.8.3.1


> 
> [...]
> 

-- 
Thanks,
Xie XiuQi


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
@ 2018-06-11 12:32                   ` Xie XiuQi
  0 siblings, 0 replies; 103+ messages in thread
From: Xie XiuQi @ 2018-06-11 12:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hanjun Guo, Bjorn Helgaas, Will Deacon, Catalin Marinas,
	Greg Kroah-Hartman, Rafael J. Wysocki, Jarkko Sakkinen,
	linux-arm, Linux Kernel Mailing List, wanghuiqiang, tnowicki,
	linux-pci, Andrew Morton, linux-mm, zhongjiang

Hi Michal,

On 2018/6/11 16:52, Michal Hocko wrote:
> On Mon 11-06-18 11:23:18, Xie XiuQi wrote:
>> Hi Michal,
>>
>> On 2018/6/7 20:21, Michal Hocko wrote:
>>> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
>>>> On 2018/6/7 18:55, Michal Hocko wrote:
>>> [...]
>>>>> I am not sure I have the full context but pci_acpi_scan_root calls
>>>>> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
>>>>> and that should fall back to whatever node that is online. Offline node
>>>>> shouldn't keep any pages behind. So there must be something else going
>>>>> on here and the patch is not the right way to handle it. What does
>>>>> faddr2line __alloc_pages_nodemask+0xf0 tells on this kernel?
>>>>
>>>> The whole context is:
>>>>
>>>> The system is booted with a NUMA node has no memory attaching to it
>>>> (memory-less NUMA node), also with NR_CPUS less than CPUs presented
>>>> in MADT, so CPUs on this memory-less node are not brought up, and
>>>> this NUMA node will not be online (but SRAT presents this NUMA node);
>>>>
>>>> Devices attaching to this NUMA node such as PCI host bridge still
>>>> return the valid NUMA node via _PXM, but actually that valid NUMA node
>>>> is not online which lead to this issue.
>>>
>>> But we should have other numa nodes on the zonelists so the allocator
>>> should fall back to other node. If the zonelist is not intiailized
>>> properly, though, then this can indeed show up as a problem. Knowing
>>> which exact place has blown up would help get a better picture...
>>>
>>
>> I specific a non-exist node to allocate memory using kzalloc_node,
>> and got this following error message.
>>
>> And I found out there is just a VM_WARN, but it does not prevent the memory
>> allocation continue.
>>
>> This nid would be use to access NODE_DADA(nid), so if nid is invalid,
>> it would cause oops here.
>>
>> 459 /*
>> 460  * Allocate pages, preferring the node given as nid. The node must be valid and
>> 461  * online. For more general interface, see alloc_pages_node().
>> 462  */
>> 463 static inline struct page *
>> 464 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>> 465 {
>> 466         VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> 467         VM_WARN_ON(!node_online(nid));
>> 468
>> 469         return __alloc_pages(gfp_mask, order, nid);
>> 470 }
>> 471
>>
>> (I wrote a ko, to allocate memory on a non-exist node using kzalloc_node().)
> 
> OK, so this is artificially broken code, right. You shouldn't get a
> non-existent node via standard APIs AFAICS. The original report was
> about an existing node which is offline AFAIU. That would be a different
> case. If I am missing something and there are legitimate users that try
> to allocate from non-existing nodes then we should handle that in
> node_zonelist.

I think Hanjun's comments may help to understand this question:
 - A NUMA node will be built if CPUs and (or) memory are valid on that NUMA node;

 - But if we boot the system with a memory-less node and with CONFIG_NR_CPUS
   less than the number of CPUs in the SRAT (for example, 64 CPUs total across
   4 NUMA nodes, 16 CPUs on each node, booted with CONFIG_NR_CPUS=48), then no
   NUMA node is built for node 3; but with devices on that node, allocating
   memory will panic because NUMA node 3 is not a valid node.

I triggered this BUG on the arm64 platform, and I found that a similar bug had
been fixed on the x86 platform, so I sent a similar patch for this bug.

Or could we consider fixing it in the mm subsystem?
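
For illustration, a minimal test module along these lines is enough to hit
the VM_WARN_ON and the subsequent oops (a sketch, not the exact module used
in the report; the node number is a hypothetical parameter assumed to name a
node that was never built):

#include <linux/module.h>
#include <linux/nodemask.h>
#include <linux/slab.h>

static int nid = 3;	/* assumed invalid: node not built on this system */
module_param(nid, int, 0444);
MODULE_PARM_DESC(nid, "NUMA node to allocate from");

static int __init badnode_init(void)
{
	void *p;

	pr_info("kzalloc_node() on node %d, online=%d\n",
		nid, node_online(nid));

	/* With CONFIG_DEBUG_VM, VM_WARN_ON fires here; the allocation
	 * then dereferences NODE_DATA(nid) and oopses.
	 */
	p = kzalloc_node(64, GFP_KERNEL, nid);
	kfree(p);
	return 0;
}

static void __exit badnode_exit(void)
{
}

module_init(badnode_init);
module_exit(badnode_exit);
MODULE_LICENSE("GPL");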

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-11 12:32                   ` Xie XiuQi
  (?)
@ 2018-06-11 13:43                     ` Bjorn Helgaas
  -1 siblings, 0 replies; 103+ messages in thread
From: Bjorn Helgaas @ 2018-06-11 13:43 UTC (permalink / raw)
  To: Xie XiuQi
  Cc: Michal Hocko, Hanjun Guo, tnowicki, linux-pci, Catalin Marinas,
	Rafael J. Wysocki, Will Deacon, Linux Kernel Mailing List,
	Jarkko Sakkinen, linux-mm, wanghuiqiang, Greg Kroah-Hartman,
	Bjorn Helgaas, Andrew Morton, zhongjiang, linux-arm

On Mon, Jun 11, 2018 at 08:32:10PM +0800, Xie XiuQi wrote:
> Hi Michal,
> 
> On 2018/6/11 16:52, Michal Hocko wrote:
> > On Mon 11-06-18 11:23:18, Xie XiuQi wrote:
> >> Hi Michal,
> >>
> >> On 2018/6/7 20:21, Michal Hocko wrote:
> >>> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
> >>>> On 2018/6/7 18:55, Michal Hocko wrote:
> >>> [...]
> >>>>> I am not sure I have the full context but pci_acpi_scan_root calls
> >>>>> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
> >>>>> and that should fall back to whatever node is online. An offline node
> >>>>> shouldn't keep any pages behind. So there must be something else going
> >>>>> on here and the patch is not the right way to handle it. What does
> >>>>> faddr2line __alloc_pages_nodemask+0xf0 tell on this kernel?
> >>>>
> >>>> The whole context is:
> >>>>
> >>>> The system is booted with a NUMA node that has no memory attached to it
> >>>> (a memory-less NUMA node), and with NR_CPUS less than the number of CPUs
> >>>> presented in the MADT, so the CPUs on this memory-less node are not
> >>>> brought up, and this NUMA node will not be online (but the SRAT presents
> >>>> this NUMA node);
> >>>>
> >>>> Devices attached to this NUMA node, such as the PCI host bridge, still
> >>>> return a valid NUMA node via _PXM, but that valid NUMA node is not
> >>>> online, which leads to this issue.
> >>>
> >>> But we should have other NUMA nodes on the zonelists so the allocator
> >>> should fall back to another node. If the zonelist is not initialized
> >>> properly, though, then this can indeed show up as a problem. Knowing
> >>> which exact place has blown up would help get a better picture...
> >>>
> >>
> >> I specified a non-existent node when allocating memory using kzalloc_node,
> >> and got the following error message.
> >>
> >> I found out there is just a VM_WARN_ON, but it does not prevent the
> >> memory allocation from continuing.
> >>
> >> This nid is then used to access NODE_DATA(nid), so if nid is invalid,
> >> it causes an oops here.
> >>
> >> 459 /*
> >> 460  * Allocate pages, preferring the node given as nid. The node must be valid and
> >> 461  * online. For more general interface, see alloc_pages_node().
> >> 462  */
> >> 463 static inline struct page *
> >> 464 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
> >> 465 {
> >> 466         VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> >> 467         VM_WARN_ON(!node_online(nid));
> >> 468
> >> 469         return __alloc_pages(gfp_mask, order, nid);
> >> 470 }
> >> 471
> >>
> >> (I wrote a kernel module to allocate memory on a non-existent node using kzalloc_node().)
> > 
> > OK, so this is artificially broken code, right. You shouldn't get a
> > non-existent node via standard APIs AFAICS. The original report was
> > about an existing node which is offline AFAIU. That would be a different
> > case. If I am missing something and there are legitimate users that try
> > to allocate from non-existing nodes then we should handle that in
> > node_zonelist.
> 
> I think Hanjun's comments may help to understand this question:
>  - A NUMA node will be built if CPUs and (or) memory are valid on that
>  NUMA node;
> 
>  - But if we boot the system with a memory-less node and with
>  CONFIG_NR_CPUS less than the number of CPUs in the SRAT (for example,
>  64 CPUs total across 4 NUMA nodes, 16 CPUs on each node, booted with
>  CONFIG_NR_CPUS=48), then no NUMA node is built for node 3; but with
>  devices on that node, allocating memory will panic because NUMA node
>  3 is not a valid node.
> 
> I triggered this BUG on the arm64 platform, and I found that a similar
> bug had been fixed on the x86 platform, so I sent a similar patch.
> 
> Or could we consider fixing it in the mm subsystem?

The patch below (b755de8dfdfe) seems like totally the wrong direction.
I don't think we want every caller of kzalloc_node() to have to check
for node_online().

Why would memory on an off-line node even be in the allocation pool?
I wouldn't expect that memory to be put in the pool until the node
comes online and the memory is accessible, so this sounds like some
kind of setup issue.

But I'm definitely not an mm person.

> From b755de8dfdfef97effaa91379ffafcb81f4d62a1 Mon Sep 17 00:00:00 2001
> From: Yinghai Lu <Yinghai.Lu@Sun.COM>
> Date: Wed, 20 Feb 2008 12:41:52 -0800
> Subject: [PATCH] x86: make dev_to_node return online node
> 
> a numa system (with multi HT chains) may return node without ram. Aka it
> is not online. Try to get an online node, otherwise return -1.
> 
> Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/pci/acpi.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
> index d95de2f..ea8685f 100644
> --- a/arch/x86/pci/acpi.c
> +++ b/arch/x86/pci/acpi.c
> @@ -172,6 +172,9 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_device *device, int do
>  		set_mp_bus_to_node(busnum, node);
>  	else
>  		node = get_mp_bus_to_node(busnum);
> +
> +	if (node != -1 && !node_online(node))
> +		node = -1;
>  #endif
> 
>  	/* Allocate per-root-bus (not per bus) arch-specific data.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-11 13:43                     ` Bjorn Helgaas
@ 2018-06-11 14:53                       ` Michal Hocko
  -1 siblings, 0 replies; 103+ messages in thread
From: Michal Hocko @ 2018-06-11 14:53 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Xie XiuQi, Hanjun Guo, tnowicki, linux-pci, Catalin Marinas,
	Rafael J. Wysocki, Will Deacon, Linux Kernel Mailing List,
	Jarkko Sakkinen, linux-mm, wanghuiqiang, Greg Kroah-Hartman,
	Bjorn Helgaas, Andrew Morton, zhongjiang, linux-arm

On Mon 11-06-18 08:43:03, Bjorn Helgaas wrote:
> On Mon, Jun 11, 2018 at 08:32:10PM +0800, Xie XiuQi wrote:
> > Hi Michal,
> > 
> > On 2018/6/11 16:52, Michal Hocko wrote:
> > > On Mon 11-06-18 11:23:18, Xie XiuQi wrote:
> > >> Hi Michal,
> > >>
> > >> On 2018/6/7 20:21, Michal Hocko wrote:
> > >>> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
> > >>>> On 2018/6/7 18:55, Michal Hocko wrote:
> > >>> [...]
> > >>>>> I am not sure I have the full context but pci_acpi_scan_root calls
> > >>>>> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
> > >>>>> and that should fall back to whatever node is online. An offline node
> > >>>>> shouldn't keep any pages behind. So there must be something else going
> > >>>>> on here and the patch is not the right way to handle it. What does
> > >>>>> faddr2line __alloc_pages_nodemask+0xf0 tell on this kernel?
> > >>>>
> > >>>> The whole context is:
> > >>>>
> > >>>> The system is booted with a NUMA node that has no memory attached to it
> > >>>> (a memory-less NUMA node), and with NR_CPUS less than the number of CPUs
> > >>>> presented in the MADT, so the CPUs on this memory-less node are not
> > >>>> brought up, and this NUMA node will not be online (but the SRAT presents
> > >>>> this NUMA node);
> > >>>>
> > >>>> Devices attached to this NUMA node, such as the PCI host bridge, still
> > >>>> return a valid NUMA node via _PXM, but that valid NUMA node is not
> > >>>> online, which leads to this issue.
> > >>>
> > >>> But we should have other NUMA nodes on the zonelists so the allocator
> > >>> should fall back to another node. If the zonelist is not initialized
> > >>> properly, though, then this can indeed show up as a problem. Knowing
> > >>> which exact place has blown up would help get a better picture...
> > >>>
> > >>
> > >> I specified a non-existent node when allocating memory using kzalloc_node,
> > >> and got the following error message.
> > >>
> > >> I found out there is just a VM_WARN_ON, but it does not prevent the
> > >> memory allocation from continuing.
> > >>
> > >> This nid is then used to access NODE_DATA(nid), so if nid is invalid,
> > >> it causes an oops here.
> > >>
> > >> 459 /*
> > >> 460  * Allocate pages, preferring the node given as nid. The node must be valid and
> > >> 461  * online. For more general interface, see alloc_pages_node().
> > >> 462  */
> > >> 463 static inline struct page *
> > >> 464 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
> > >> 465 {
> > >> 466         VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> > >> 467         VM_WARN_ON(!node_online(nid));
> > >> 468
> > >> 469         return __alloc_pages(gfp_mask, order, nid);
> > >> 470 }
> > >> 471
> > >>
> > >> (I wrote a kernel module to allocate memory on a non-existent node using kzalloc_node().)
> > > 
> > > OK, so this is artificially broken code, right. You shouldn't get a
> > > non-existent node via standard APIs AFAICS. The original report was
> > > about an existing node which is offline AFAIU. That would be a different
> > > case. If I am missing something and there are legitimate users that try
> > > to allocate from non-existing nodes then we should handle that in
> > > node_zonelist.
> > 
> > I think Hanjun's comments may help to understand this question:
> >  - A NUMA node will be built if CPUs and (or) memory are valid on that
> >  NUMA node;
> > 
> >  - But if we boot the system with a memory-less node and with
> >  CONFIG_NR_CPUS less than the number of CPUs in the SRAT (for example,
> >  64 CPUs total across 4 NUMA nodes, 16 CPUs on each node, booted with
> >  CONFIG_NR_CPUS=48), then no NUMA node is built for node 3; but with
> >  devices on that node, allocating memory will panic because NUMA node
> >  3 is not a valid node.

Hmm, but this is not a memory-less node. It sounds like a misconfigured
kernel or broken initialization to me. Each CPU should have a fallback
NUMA node to be used.

> > I triggered this BUG on the arm64 platform, and I found that a similar
> > bug had been fixed on the x86 platform, so I sent a similar patch.
> > 
> > Or could we consider fixing it in the mm subsystem?
> 
> The patch below (b755de8dfdfe) seems like totally the wrong direction.
> I don't think we want every caller of kzalloc_node() to have to check
> for node_online().

absolutely.

> Why would memory on an off-line node even be in the allocation pool?
> I wouldn't expect that memory to be put in the pool until the node
> comes online and the memory is accessible, so this sounds like some
> kind of setup issue.
> 
> But I'm definitely not an mm person.

Well, the standard way to handle memory-less NUMA nodes is to simply
fall back to the closest NUMA node. We even have an API for that
(numa_mem_id).
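
For instance, a caller that wants node-local memory but must tolerate a
memory-less or invalid node could do something like this (a sketch of the
pattern only; "dev" and "info" are hypothetical, not code from this thread):

	int nid = dev_to_node(dev);	/* the device's _PXM-derived node */

	/* Fall back to the nearest node that actually has memory;
	 * numa_mem_id() returns that node for the current CPU when
	 * CONFIG_HAVE_MEMORYLESS_NODES is enabled.
	 */
	if (nid == NUMA_NO_NODE || !node_state(nid, N_MEMORY))
		nid = numa_mem_id();

	info = kzalloc_node(sizeof(*info), GFP_KERNEL, nid);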
 
> > From b755de8dfdfef97effaa91379ffafcb81f4d62a1 Mon Sep 17 00:00:00 2001
> > From: Yinghai Lu <Yinghai.Lu@Sun.COM>
> > Date: Wed, 20 Feb 2008 12:41:52 -0800
> > Subject: [PATCH] x86: make dev_to_node return online node
> > 
> > a numa system (with multi HT chains) may return node without ram. Aka it
> > is not online. Try to get an online node, otherwise return -1.
> > 
> > Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
> > Signed-off-by: Ingo Molnar <mingo@elte.hu>
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > ---
> >  arch/x86/pci/acpi.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
> > index d95de2f..ea8685f 100644
> > --- a/arch/x86/pci/acpi.c
> > +++ b/arch/x86/pci/acpi.c
> > @@ -172,6 +172,9 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_device *device, int do
> >  		set_mp_bus_to_node(busnum, node);
> >  	else
> >  		node = get_mp_bus_to_node(busnum);
> > +
> > +	if (node != -1 && !node_online(node))
> > +		node = -1;
> >  #endif
> > 
> >  	/* Allocate per-root-bus (not per bus) arch-specific data.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-11 14:53                       ` Michal Hocko
  (?)
@ 2018-06-12 15:08                         ` Punit Agrawal
  -1 siblings, 0 replies; 103+ messages in thread
From: Punit Agrawal @ 2018-06-12 15:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Bjorn Helgaas, Xie XiuQi, Hanjun Guo, tnowicki, linux-pci,
	Catalin Marinas, Rafael J. Wysocki, Will Deacon,
	Linux Kernel Mailing List, Jarkko Sakkinen, linux-mm,
	wanghuiqiang, Greg Kroah-Hartman, Bjorn Helgaas, Andrew Morton,
	zhongjiang, linux-arm

Michal Hocko <mhocko@kernel.org> writes:

> On Mon 11-06-18 08:43:03, Bjorn Helgaas wrote:
>> On Mon, Jun 11, 2018 at 08:32:10PM +0800, Xie XiuQi wrote:
>> > Hi Michal,
>> > 
>> > On 2018/6/11 16:52, Michal Hocko wrote:
>> > > On Mon 11-06-18 11:23:18, Xie XiuQi wrote:
>> > >> Hi Michal,
>> > >>
>> > >> On 2018/6/7 20:21, Michal Hocko wrote:
>> > >>> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
>> > >>>> On 2018/6/7 18:55, Michal Hocko wrote:
>> > >>> [...]
>> > >>>>> I am not sure I have the full context but pci_acpi_scan_root calls
>> > >>>>> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
>> > >>>>> and that should fall back to whatever node is online. An offline node
>> > >>>>> shouldn't keep any pages behind. So there must be something else going
>> > >>>>> on here and the patch is not the right way to handle it. What does
>> > >>>>> faddr2line __alloc_pages_nodemask+0xf0 tell on this kernel?
>> > >>>>
>> > >>>> The whole context is:
>> > >>>>
>> > >>>> The system is booted with a NUMA node that has no memory attached to it
>> > >>>> (a memory-less NUMA node), and with NR_CPUS less than the number of CPUs
>> > >>>> presented in the MADT, so the CPUs on this memory-less node are not
>> > >>>> brought up, and this NUMA node will not be online (but the SRAT presents
>> > >>>> this NUMA node);
>> > >>>>
>> > >>>> Devices attached to this NUMA node, such as the PCI host bridge, still
>> > >>>> return a valid NUMA node via _PXM, but that valid NUMA node is not
>> > >>>> online, which leads to this issue.
>> > >>>
>> > >>> But we should have other NUMA nodes on the zonelists so the allocator
>> > >>> should fall back to another node. If the zonelist is not initialized
>> > >>> properly, though, then this can indeed show up as a problem. Knowing
>> > >>> which exact place has blown up would help get a better picture...
>> > >>>
>> > >>
>> > >> I specified a non-existent node when allocating memory using kzalloc_node,
>> > >> and got the following error message.
>> > >>
>> > >> I found out there is just a VM_WARN_ON, but it does not prevent the
>> > >> memory allocation from continuing.
>> > >>
>> > >> This nid is then used to access NODE_DATA(nid), so if nid is invalid,
>> > >> it causes an oops here.
>> > >>
>> > >> 459 /*
>> > >> 460  * Allocate pages, preferring the node given as nid. The node must be valid and
>> > >> 461  * online. For more general interface, see alloc_pages_node().
>> > >> 462  */
>> > >> 463 static inline struct page *
>> > >> 464 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>> > >> 465 {
>> > >> 466         VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> > >> 467         VM_WARN_ON(!node_online(nid));
>> > >> 468
>> > >> 469         return __alloc_pages(gfp_mask, order, nid);
>> > >> 470 }
>> > >> 471
>> > >>
>> > >> (I wrote a kernel module to allocate memory on a non-existent node using kzalloc_node().)
>> > > 
>> > > OK, so this is artificially broken code, right. You shouldn't get a
>> > > non-existent node via standard APIs AFAICS. The original report was
>> > > about an existing node which is offline AFAIU. That would be a different
>> > > case. If I am missing something and there are legitimate users that try
>> > > to allocate from non-existing nodes then we should handle that in
>> > > node_zonelist.
>> > 
>> > I think Hanjun's comments may help to understand this question:
>> >  - A NUMA node will be built if CPUs and (or) memory are valid on that
>> >  NUMA node;
>> > 
>> >  - But if we boot the system with a memory-less node and with
>> >  CONFIG_NR_CPUS less than the number of CPUs in the SRAT (for example,
>> >  64 CPUs total across 4 NUMA nodes, 16 CPUs on each node, booted with
>> >  CONFIG_NR_CPUS=48), then no NUMA node is built for node 3; but with
>> >  devices on that node, allocating memory will panic because NUMA node
>> >  3 is not a valid node.
>
> Hmm, but this is not a memory-less node. It sounds like a misconfigured
> kernel or broken initialization to me. Each CPU should have a fallback
> NUMA node to be used.
>
>> > I triggered this BUG on the arm64 platform, and I found that a similar
>> > bug had been fixed on the x86 platform, so I sent a similar patch.
>> > 
>> > Or could we consider fixing it in the mm subsystem?
>> 
>> The patch below (b755de8dfdfe) seems like totally the wrong direction.
>> I don't think we want every caller of kzalloc_node() to have to check
>> for node_online().
>
> absolutely.
>
>> Why would memory on an off-line node even be in the allocation pool?
>> I wouldn't expect that memory to be put in the pool until the node
>> comes online and the memory is accessible, so this sounds like some
>> kind of setup issue.
>> 
>> But I'm definitely not an mm person.
>
> Well, the standard way to handle memory-less NUMA nodes is to simply
> fall back to the closest NUMA node. We even have an API for that
> (numa_mem_id).

CONFIG_HAVE_MEMORYLESS_NODES is not enabled on arm64, which means we end
up returning the original node in the fallback path.

Xie, does the below patch help? I can submit a proper patch if this
fixes the issue for you.

-- >8 --
Subject: [PATCH] arm64/numa: Enable memoryless numa nodes

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
---
 arch/arm64/Kconfig   | 4 ++++
 arch/arm64/mm/numa.c | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index eb2cf4938f6d..5317e9aa93ab 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -756,6 +756,10 @@ config USE_PERCPU_NUMA_NODE_ID
 	def_bool y
 	depends on NUMA
 
+config HAVE_MEMORYLESS_NODES
+       def_bool y
+       depends on NUMA
+
 config HAVE_SETUP_PER_CPU_AREA
 	def_bool y
 	depends on NUMA
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index dad128ba98bf..c699dcfe93de 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -73,6 +73,8 @@ EXPORT_SYMBOL(cpumask_of_node);
 static void map_cpu_to_node(unsigned int cpu, int nid)
 {
 	set_cpu_numa_node(cpu, nid);
+	set_numa_mem(local_memory_node(nid));
+
 	if (nid >= 0)
 		cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
 }
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-12 15:08                         ` Punit Agrawal
@ 2018-06-12 15:20                           ` Michal Hocko
  -1 siblings, 0 replies; 103+ messages in thread
From: Michal Hocko @ 2018-06-12 15:20 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: Bjorn Helgaas, Xie XiuQi, Hanjun Guo, tnowicki, linux-pci,
	Catalin Marinas, Rafael J. Wysocki, Will Deacon,
	Linux Kernel Mailing List, Jarkko Sakkinen, linux-mm,
	wanghuiqiang, Greg Kroah-Hartman, Bjorn Helgaas, Andrew Morton,
	zhongjiang, linux-arm

On Tue 12-06-18 16:08:03, Punit Agrawal wrote:
> Michal Hocko <mhocko@kernel.org> writes:
[...]
> > Well, the standard way to handle memory-less NUMA nodes is to simply
> > fall back to the closest NUMA node. We even have an API for that
> > (numa_mem_id).
> 
> CONFIG_HAVE_MEMORYLESS_NODES is not enabled on arm64, which means we end
> up returning the original node in the fallback path.

Yes this makes more sense.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-12 15:08                         ` Punit Agrawal
  (?)
@ 2018-06-13 17:39                           ` Punit Agrawal
  -1 siblings, 0 replies; 103+ messages in thread
From: Punit Agrawal @ 2018-06-13 17:39 UTC (permalink / raw)
  To: Xie XiuQi, Hanjun Guo
  Cc: Bjorn Helgaas, tnowicki, linux-pci, Catalin Marinas,
	Rafael J. Wysocki, Will Deacon, Linux Kernel Mailing List,
	Jarkko Sakkinen, linux-mm, wanghuiqiang, Greg Kroah-Hartman,
	Bjorn Helgaas, Andrew Morton, zhongjiang, linux-arm,
	Michal Hocko

Punit Agrawal <punit.agrawal@arm.com> writes:


[...]

>
> CONFIG_HAVE_MEMORYLESS_NODES is not enabled on arm64, which means we end
> up returning the original node in the fallback path.
>
> Xie, does the below patch help? I can submit a proper patch if this
> fixes the issue for you.
>
> -- >8 --
> Subject: [PATCH] arm64/numa: Enable memoryless numa nodes
>
> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> ---
>  arch/arm64/Kconfig   | 4 ++++
>  arch/arm64/mm/numa.c | 2 ++
>  2 files changed, 6 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index eb2cf4938f6d..5317e9aa93ab 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -756,6 +756,10 @@ config USE_PERCPU_NUMA_NODE_ID
>  	def_bool y
>  	depends on NUMA
>  
> +config HAVE_MEMORYLESS_NODES
> +       def_bool y
> +       depends on NUMA
> +
>  config HAVE_SETUP_PER_CPU_AREA
>  	def_bool y
>  	depends on NUMA
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> index dad128ba98bf..c699dcfe93de 100644
> --- a/arch/arm64/mm/numa.c
> +++ b/arch/arm64/mm/numa.c
> @@ -73,6 +73,8 @@ EXPORT_SYMBOL(cpumask_of_node);
>  static void map_cpu_to_node(unsigned int cpu, int nid)
>  {
>  	set_cpu_numa_node(cpu, nid);
> +	set_numa_mem(local_memory_node(nid));

Argh, this should be

        set_cpu_numa_mem(cpu, local_memory_node(nid));

There is no guarantee that map_cpu_to_node() will be called on the
local cpu.
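
For reference, the two helpers differ in whose per-cpu state they
update; this is condensed from include/linux/topology.h of this era (a
sketch with the CONFIG_HAVE_MEMORYLESS_NODES #ifdef guards omitted):

	/* Updates the nearest-memory-node hint of the *calling* CPU. */
	static inline void set_numa_mem(int node)
	{
		this_cpu_write(_numa_mem_, node);
		_node_numa_mem_[numa_node_id()] = node;
	}

	/*
	 * Updates the hint for an explicit cpu -- what is wanted here,
	 * since map_cpu_to_node() may run on a different CPU.
	 */
	static inline void set_cpu_numa_mem(int cpu, int node)
	{
		per_cpu(_numa_mem_, cpu) = node;
		_node_numa_mem_[cpu_to_node(cpu)] = node;
	}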

Hanjun, Xie - can you try with the update please?

Thanks,
Punit

> +
>  	if (nid >= 0)
>  		cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
>  }

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-13 17:39                           ` Punit Agrawal
  (?)
@ 2018-06-14  6:23                             ` Hanjun Guo
  -1 siblings, 0 replies; 103+ messages in thread
From: Hanjun Guo @ 2018-06-14  6:23 UTC (permalink / raw)
  To: Punit Agrawal, Xie XiuQi
  Cc: Bjorn Helgaas, tnowicki, linux-pci, Catalin Marinas,
	Rafael J. Wysocki, Will Deacon, Linux Kernel Mailing List,
	Jarkko Sakkinen, linux-mm, wanghuiqiang, Greg Kroah-Hartman,
	Bjorn Helgaas, Andrew Morton, zhongjiang, linux-arm,
	Michal Hocko

Hi Punit,

On 2018/6/14 1:39, Punit Agrawal wrote:
> Punit Agrawal <punit.agrawal@arm.com> writes:
> 
> 
> [...]
> 
>>
>> CONFIG_HAVE_MEMORYLESS_NODES is not enabled on arm64, which means we end
>> up returning the original node in the fallback path.
>>
>> Xie, does the below patch help? I can submit a proper patch if this
>> fixes the issue for you.
>>
>> -- >8 --
>> Subject: [PATCH] arm64/numa: Enable memoryless numa nodes
>>
>> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
>> ---
>>  arch/arm64/Kconfig   | 4 ++++
>>  arch/arm64/mm/numa.c | 2 ++
>>  2 files changed, 6 insertions(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index eb2cf4938f6d..5317e9aa93ab 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -756,6 +756,10 @@ config USE_PERCPU_NUMA_NODE_ID
>>  	def_bool y
>>  	depends on NUMA
>>  
>> +config HAVE_MEMORYLESS_NODES
>> +       def_bool y
>> +       depends on NUMA
>> +
>>  config HAVE_SETUP_PER_CPU_AREA
>>  	def_bool y
>>  	depends on NUMA
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index dad128ba98bf..c699dcfe93de 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -73,6 +73,8 @@ EXPORT_SYMBOL(cpumask_of_node);
>>  static void map_cpu_to_node(unsigned int cpu, int nid)
>>  {
>>  	set_cpu_numa_node(cpu, nid);
>> +	set_numa_mem(local_memory_node(nid));
> 
> Argh, this should be
> 
>         set_cpu_numa_mem(cpu, local_memory_node(nid));
> 
> There is no guarantee that map_cpu_to_node() will be called on the
> local cpu.
> 
> Hanjun, Xie - can you try with the update please?

Thanks for looking into this, we will try this tomorrow
(the hardware is occupied now) and update here.

Thanks
Hanjun


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-13 17:39                           ` Punit Agrawal
  (?)
@ 2018-06-19 12:03                             ` Xie XiuQi
  -1 siblings, 0 replies; 103+ messages in thread
From: Xie XiuQi @ 2018-06-19 12:03 UTC (permalink / raw)
  To: Punit Agrawal, Hanjun Guo
  Cc: Bjorn Helgaas, tnowicki, linux-pci, Catalin Marinas,
	Rafael J. Wysocki, Will Deacon, Linux Kernel Mailing List,
	Jarkko Sakkinen, linux-mm, wanghuiqiang, Greg Kroah-Hartman,
	Bjorn Helgaas, Andrew Morton, zhongjiang, linux-arm,
	Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 2942 bytes --]

Hi Punit,


On 2018/6/14 1:39, Punit Agrawal wrote:
> Punit Agrawal <punit.agrawal@arm.com> writes:
> 
> 
> [...]
> 
>>
>> CONFIG_HAVE_MEMORYLESS_NODES is not enabled on arm64, which means we end
>> up returning the original node in the fallback path.
>>
>> Xie, does the below patch help? I can submit a proper patch if this
>> fixes the issue for you.
>>
>> -- >8 --
>> Subject: [PATCH] arm64/numa: Enable memoryless numa nodes
>>
>> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
>> ---
>>  arch/arm64/Kconfig   | 4 ++++
>>  arch/arm64/mm/numa.c | 2 ++
>>  2 files changed, 6 insertions(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index eb2cf4938f6d..5317e9aa93ab 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -756,6 +756,10 @@ config USE_PERCPU_NUMA_NODE_ID
>>  	def_bool y
>>  	depends on NUMA
>>  
>> +config HAVE_MEMORYLESS_NODES
>> +       def_bool y
>> +       depends on NUMA
>> +
>>  config HAVE_SETUP_PER_CPU_AREA
>>  	def_bool y
>>  	depends on NUMA
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index dad128ba98bf..c699dcfe93de 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -73,6 +73,8 @@ EXPORT_SYMBOL(cpumask_of_node);
>>  static void map_cpu_to_node(unsigned int cpu, int nid)
>>  {
>>  	set_cpu_numa_node(cpu, nid);
>> +	set_numa_mem(local_memory_node(nid));
> 
> Argh, this should be
> 
>         set_cpu_numa_mem(cpu, local_memory_node(nid));
> 
> There is no guarantee that map_cpu_to_node() will be called on the
> local cpu.
> 
> Hanjun, Xie - can you try with the update please?

I've tested this patch, but it does not help.
The boot message is attached.

I tested on an arm64 board with 128 cores and 4 NUMA nodes, but I set CONFIG_NR_CPUS=72.
Node 3 is then not created, because it has no memory and no CPUs.
But some PCI devices may be related to node 3, as set in the ACPI table.

165 /* Interface called from ACPI code to setup PCI host controller */
166 struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
167 {
168         int node = acpi_get_node(root->device->handle);
169         struct acpi_pci_generic_root_info *ri;
170         struct pci_bus *bus, *child;
171         struct acpi_pci_root_ops *root_ops;
172
            // this node may not have been created
177         ri = kzalloc_node(sizeof(*ri), GFP_KERNEL, node);
178         if (!ri)
179                 return NULL;
180
181         root_ops = kzalloc_node(sizeof(*root_ops), GFP_KERNEL, node);
182         if (!root_ops) {
183                 kfree(ri);
184                 return NULL;
185         }
186
187         ri->cfg = pci_acpi_setup_ecam_mapping(root);
188         if (!ri->cfg) {
189                 kfree(ri);
190                 kfree(root_ops);
191                 return NULL;
192         }


> 
> Thanks,
> Punit
> 
>> +
>>  	if (nid >= 0)
>>  		cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
>>  }
> 
> .
> 

-- 
Thanks,
Xie XiuQi

[-- Attachment #2: boot message.txt --]
[-- Type: text/plain, Size: 16398 bytes --]

[    0.000000] Booting Linux on physical CPU 0x0000030000 [0x480fd010]
[    0.000000] Linux version 4.16.0-rc1-00491-g204a6cc-dirty (xiexiuqi@localhost.localdomain) (gcc version 6.3.1 20170404 (Linaro GCC 6.3-2017.05)) #17 SMP PREEMPT Tue Jun 19 16:33:32 CST 2018
[    0.000000] earlycon: pl11 at MMIO32 0x0000000094080000 (options '')
[    0.000000] bootconsole [pl11] enabled
[    0.000000] efi: Getting EFI parameters from FDT:
[    0.000000] efi: EFI v2.60 by EDK II
[    0.000000] efi:  SMBIOS 3.0=0x3eb60000  ACPI 2.0=0x39710000  MEMATTR=0x3b106418 
[    0.000000] ACPI: Early table checksum verification disabled
[    0.000000] ACPI: RSDP 0x0000000039710000 000024 (v02 HISI  )
[    0.000000] ACPI: XSDT 0x0000000039700000 000074 (v01 HISI   HIP08    00000000      01000013)
[    0.000000] ACPI: FACP 0x0000000039630000 000114 (v06 HISI   HIP08    00000000 INTL 20151124)
[    0.000000] ACPI: DSDT 0x00000000395C0000 006A1A (v02 HISI   HIP08    00000000 INTL 20170929)
[    0.000000] ACPI: GTDT 0x0000000039620000 000060 (v02 HISI   HIP08    00000000 INTL 20151124)
[    0.000000] ACPI: DBG2 0x0000000039610000 00005A (v00 HISI   HIP08    00000000 INTL 20151124)
[    0.000000] ACPI: MCFG 0x0000000039600000 00003C (v01 HISI   HIP08    00000000 INTL 20151124)
[    0.000000] ACPI: SLIT 0x00000000395F0000 00003C (v01 HISI   HIP07    00000000 INTL 20151124)
[    0.000000] ACPI: SRAT 0x00000000395E0000 0009C0 (v03 HISI   HIP08    00000000 INTL 20151124)
[    0.000000] ACPI: APIC 0x00000000395D0000 00286C (v04 HISI   HIP08    00000000 INTL 20151124)
[    0.000000] ACPI: IORT 0x00000000395B0000 00110C (v00 HISI   HIP08    00000000 INTL 20170929)
[    0.000000] ACPI: PPTT 0x00000000311F0000 0037D0 (v01 HISI   HIP08    00000000 INTL 20151124)
[    0.000000] ACPI: SPMI 0x00000000311E0000 000041 (v05 HISI   HIP08    00000000 INTL 20151124)
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30000 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30001 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30002 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30003 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30100 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30101 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30102 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30103 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30200 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30201 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30202 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30203 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30300 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30301 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30302 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30303 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30400 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30401 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30402 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30403 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30500 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30501 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30502 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30503 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30600 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30601 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30602 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30603 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30700 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30701 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30702 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30703 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10000 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10001 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10002 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10003 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10100 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10101 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10102 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10103 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10200 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10201 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10202 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10203 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10300 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10301 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10302 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10303 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10400 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10401 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10402 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10403 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10500 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10501 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10502 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10503 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10600 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10601 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10602 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10603 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10700 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10701 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10702 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10703 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70000 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70001 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70002 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70003 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70100 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70101 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70102 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70103 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: cpu_to_node_map[72] is too small, may not be able to use all cpus
[    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x2080000000-0x23ffffffff]
[    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
[    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x402000000000-0x4023ffffffff]
[    0.000000] NUMA: NODE_DATA [mem 0x23ffffe780-0x23ffffffff]
[    0.000000] NUMA: Initmem setup node 1 [<memory-less node>]
[    0.000000] NUMA: NODE_DATA [mem 0x4023fffed780-0x4023fffeefff]
[    0.000000] NUMA: NODE_DATA(1) on node 2
[    0.000000] NUMA: NODE_DATA [mem 0x4023fffebf00-0x4023fffed77f]
[    0.000000] Zone ranges:
[    0.000000]   DMA32    [mem 0x0000000000000000-0x00000000ffffffff]
[    0.000000]   Normal   [mem 0x0000000100000000-0x00004023ffffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000003942ffff]
[    0.000000]   node   0: [mem 0x0000000039430000-0x000000003956ffff]
[    0.000000]   node   0: [mem 0x0000000039570000-0x000000003963ffff]
[    0.000000]   node   0: [mem 0x0000000039640000-0x00000000396fffff]
[    0.000000]   node   0: [mem 0x0000000039700000-0x000000003971ffff]
[    0.000000]   node   0: [mem 0x0000000039720000-0x0000000039b6ffff]
[    0.000000]   node   0: [mem 0x0000000039b70000-0x000000003eb5ffff]
[    0.000000]   node   0: [mem 0x000000003eb60000-0x000000003eb8ffff]
[    0.000000]   node   0: [mem 0x000000003eb90000-0x000000003fbfffff]
[    0.000000]   node   0: [mem 0x0000002080000000-0x00000023ffffffff]
[    0.000000]   node   2: [mem 0x0000402000000000-0x00004023ffffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x00000023ffffffff]
[    0.000000] Could not find start_pfn for node 1
[    0.000000] Initmem setup node 1 [mem 0x0000000000000000-0x0000000000000000]
[    0.000000] Initmem setup node 2 [mem 0x0000402000000000-0x00004023ffffffff]
[    0.000000] psci: probing for conduit method from ACPI.
[    0.000000] psci: PSCIv1.0 detected in firmware.
[    0.000000] psci: Using standard PSCI v0.2 function IDs
[    0.000000] psci: MIGRATE_INFO_TYPE not supported.
[    0.000000] psci: SMC Calling Convention v1.0
[    0.000000] random: fast init done
[    0.000000] percpu: Embedded 24 pages/cpu @        (ptrval) s59432 r8192 d30680 u98304
[    0.000000] Detected VIPT I-cache on CPU0
[    0.000000] Built 3 zonelists, mobility grouping on.  Total pages: 7998480
[    0.000000] Policy zone: Normal
[    0.000000] Kernel command line: earlycon=pl011,mmio32,0x94080000 console=ttyAMA0,115200 initrd=minifs.cpio.gz acpi=force
[    0.000000] log_buf_len individual max cpu contribution: 4096 bytes
[    0.000000] log_buf_len total cpu_extra contributions: 290816 bytes
...
[    0.141182] Platform MSI: ITS@0x400202100000 domain created
[    0.146793] PCI/MSI: ITS@0x202100000 domain created
[    0.151690] PCI/MSI: ITS@0x400202100000 domain created
[    0.156906] Remapping and enabling EFI services.
[    0.161562]   EFI remap 0x0000000039430000 =>         (ptrval)
[    0.167431]   EFI remap 0x0000000039480000 =>         (ptrval)
[    0.173299]   EFI remap 0x00000000394d0000 =>         (ptrval)
[    0.179167]   EFI remap 0x0000000039520000 =>         (ptrval)
[    0.185037]   EFI remap 0x0000000039650000 =>         (ptrval)
[    0.190905]   EFI remap 0x00000000396b0000 =>         (ptrval)
[    0.196773]   EFI remap 0x0000000039720000 =>         (ptrval)
[    0.202639]   EFI remap 0x0000000039770000 =>         (ptrval)
[    0.208505]   EFI remap 0x00000000397c0000 =>         (ptrval)
[    0.214371]   EFI remap 0x0000000039810000 =>         (ptrval)
[    0.220236]   EFI remap 0x0000000039860000 =>         (ptrval)
[    0.226103]   EFI remap 0x00000000398b0000 =>         (ptrval)
[    0.231973]   EFI remap 0x0000000039900000 =>         (ptrval)
[    0.237840]   EFI remap 0x0000000039970000 =>         (ptrval)
[    0.243710]   EFI remap 0x00000000399c0000 =>         (ptrval)
[    0.249577]   EFI remap 0x0000000039a10000 =>         (ptrval)
[    0.255445]   EFI remap 0x0000000039a60000 =>         (ptrval)
[    0.261313]   EFI remap 0x0000000039ab0000 =>         (ptrval)
[    0.267184]   EFI remap 0x0000000039b00000 =>         (ptrval)
[    0.273047]   EFI remap 0x000000003eb60000 =>         (ptrval)
[    0.278907]   EFI remap 0x0000000080000000 =>         (ptrval)
[    0.284766]   EFI remap 0x0000000202020000 =>         (ptrval)
[    0.290626]   EFI remap 0x0000000204000000 =>         (ptrval)
[    0.296487]   EFI remap 0x0000000206200000 =>         (ptrval)
[    0.318359] smp: Bringing up secondary CPUs ...
[    0.422473] Detected VIPT I-cache on CPU1
[    0.422481] GICv3: CPU1: found redistributor 30001 region 1:0x00000000aa140000
[    0.422502] CPU1: using LPI pending table @0x00000023ee480000
[    0.422543] CPU1: Booted secondary processor 0x0000030001 [0x480fd010]
[    0.522109] Detected VIPT I-cache on CPU2
[    0.522115] GICv3: CPU2: found redistributor 30002 region 2:0x00000000aa180000
[    0.522135] CPU2: using LPI pending table @0x00000023ee4b0000
[    0.522175] CPU2: Booted secondary processor 0x0000030002 [0x480fd010]
[    0.621751] Detected VIPT I-cache on CPU3
...
[    7.328992] CPU70: Booted secondary processor 0x0000070102 [0x480fd010]
[    7.435880] Detected VIPT I-cache on CPU71
[    7.435931] GICv3: CPU71: found redistributor 70103 region 71:0x00004000aa2c0000
[    7.435958] CPU71: using LPI pending table @0x00000023ed340000
[    7.436037] CPU71: Booted secondary processor 0x0000070103 [0x480fd010]
[    7.436131] smp: Brought up 3 nodes, 72 CPUs
[    9.146250] SMP: Total of 72 processors activated.
[    9.151067] CPU features: detected feature: GIC system register CPU interface
[    9.158247] CPU features: detected feature: Privileged Access Never
...
[   11.535050] pci 0000:74:02.0: BAR 5: assigned [mem 0xa2000000-0xa2007fff]
[   11.541829] pci 0000:74:03.0: BAR 5: assigned [mem 0xa2008000-0xa2008fff]
[   11.548610] pci 0000:75:00.0: BAR 2: assigned [mem 0x144000000-0x1443fffff 64bit pref]
[   11.556518] pci 0000:75:00.0: BAR 9: assigned [mem 0x144400000-0x1447effff 64bit pref]
[   11.564426] pci 0000:74:00.0: PCI bridge to [bus 75]
[   11.569382] pci 0000:74:00.0:   bridge window [mem 0x144000000-0x1447fffff 64bit pref]
[   11.577332] ACPI: PCI Root Bridge [PCI6] (domain 0000 [bus 80-9f])
[   11.583505] acpi PNP0A08:06: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[   11.591850] acpi PNP0A08:06: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[   11.599851] CPU002: pci_acpi_scan_root: 175 node 3, online 0
[   11.605503] Unable to handle kernel paging request at virtual address 00001530
[   11.612712] Mem abort info:
[   11.615492]   ESR = 0x96000004
[   11.618534]   Exception class = DABT (current EL), IL = 32 bits
[   11.624440]   SET = 0, FnV = 0
[   11.627481]   EA = 0, S1PTW = 0
[   11.630608] Data abort info:
[   11.633476]   ISV = 0, ISS = 0x00000004
[   11.637299]   CM = 0, WnR = 0
[   11.640254] [0000000000001530] user address but active_mm is swapper
[   11.646594] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[   11.652154] Modules linked in:
[   11.655198] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.16.0-rc1-00491-g204a6cc-dirty #17
[   11.663361] Hardware name: Huawei, 06/02/2018
[   11.673174] pstate: 20c00089 (nzCv daIf +PAN +UAO)
[   11.677955] pc : ___slab_alloc+0x60/0x590
[   11.681952] lr : __slab_alloc.isra.24+0x58/0x70
[   11.686469] sp : ffff00000981b870
[   11.689771] x29: ffff00000981b870 x28: ffffc023eb8d8000 
[   11.695071] x27: ffff7e008fafd600 x26: ffff8023f0407c00 
[   11.700372] x25: ffff8023f0407c00 x24: ffff000008098da0 
[   11.705671] x23: 0000000000000003 x22: 00000000014080c0 
[   11.710971] x21: 0000000000000000 x20: 0000000000000003 
[   11.716271] x19: ffff8023f0b40870 x18: ffffffffffffffff 
[   11.721571] x17: 0000000000000000 x16: 000000000001c200 
[   11.726871] x15: ffff000009423b88 x14: ffff00008958c94f 
[   11.732171] x13: ffff00000958c95d x12: ffff00000943e858 
[   11.737471] x11: ffff00000943e000 x10: 0000000005f5e0ff 
[   11.742771] x9 : 00000000ffffffd0 x8 : 65646f6e20353731 
[   11.748071] x7 : 203a746f6f725f6e x6 : 00000000000002e8 
[   11.753371] x5 : 000000000000000a x4 : ffff8023f0b40870 
[   11.758670] x3 : ffff000008098da0 x2 : 0000000000000003 
[   11.763970] x1 : 0000000000000000 x0 : 0000000000000000 
[   11.769271] Process swapper/0 (pid: 1, stack limit = 0x00000000b21a776c)
[   11.775958] Call trace:
[   11.778392]  ___slab_alloc+0x60/0x590
[   11.782041]  __slab_alloc.isra.24+0x58/0x70
[   11.786212]  kmem_cache_alloc_node+0xe0/0x240
[   11.790556]  pci_acpi_scan_root+0xb0/0x270
[   11.794642]  acpi_pci_root_add+0x29c/0x498
[   11.798726]  acpi_bus_attach+0x104/0x210
[   11.802635]  acpi_bus_attach+0xa4/0x210
[   11.806458]  acpi_bus_attach+0xa4/0x210
[   11.810282]  acpi_bus_scan+0x4c/0xb0
[   11.813846]  acpi_scan_init+0xec/0x24c
[   11.817582]  acpi_init+0x300/0x36c
[   11.820972]  do_one_initcall+0x50/0x158
[   11.824797]  kernel_init_freeable+0x188/0x228
[   11.829141]  kernel_init+0x10/0x100
[   11.832617]  ret_from_fork+0x10/0x18
[   11.836181] Code: 90009081 93407e82 912ec021 f8627821 (f94a9821) 
[   11.842316] ---[ end trace 6b7476446f4b95ab ]---
[   11.846936] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[   11.846936] 
[   11.856079] SMP: stopping secondary CPUs
[   11.860027] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[   11.860027] 

^ permalink raw reply	[flat|nested] 103+ messages in thread

[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10200 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10201 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10202 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10203 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10300 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10301 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10302 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10303 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10400 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10401 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10402 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10403 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10500 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10501 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10502 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10503 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10600 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10601 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10602 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10603 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10700 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10701 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10702 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10703 -> Node 1
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70000 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70001 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70002 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70003 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70100 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70101 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70102 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70103 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: cpu_to_node_map[72] is too small, may not be able to use all cpus
[    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x2080000000-0x23ffffffff]
[    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
[    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x402000000000-0x4023ffffffff]
[    0.000000] NUMA: NODE_DATA [mem 0x23ffffe780-0x23ffffffff]
[    0.000000] NUMA: Initmem setup node 1 [<memory-less node>]
[    0.000000] NUMA: NODE_DATA [mem 0x4023fffed780-0x4023fffeefff]
[    0.000000] NUMA: NODE_DATA(1) on node 2
[    0.000000] NUMA: NODE_DATA [mem 0x4023fffebf00-0x4023fffed77f]
[    0.000000] Zone ranges:
[    0.000000]   DMA32    [mem 0x0000000000000000-0x00000000ffffffff]
[    0.000000]   Normal   [mem 0x0000000100000000-0x00004023ffffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000003942ffff]
[    0.000000]   node   0: [mem 0x0000000039430000-0x000000003956ffff]
[    0.000000]   node   0: [mem 0x0000000039570000-0x000000003963ffff]
[    0.000000]   node   0: [mem 0x0000000039640000-0x00000000396fffff]
[    0.000000]   node   0: [mem 0x0000000039700000-0x000000003971ffff]
[    0.000000]   node   0: [mem 0x0000000039720000-0x0000000039b6ffff]
[    0.000000]   node   0: [mem 0x0000000039b70000-0x000000003eb5ffff]
[    0.000000]   node   0: [mem 0x000000003eb60000-0x000000003eb8ffff]
[    0.000000]   node   0: [mem 0x000000003eb90000-0x000000003fbfffff]
[    0.000000]   node   0: [mem 0x0000002080000000-0x00000023ffffffff]
[    0.000000]   node   2: [mem 0x0000402000000000-0x00004023ffffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x00000023ffffffff]
[    0.000000] Could not find start_pfn for node 1
[    0.000000] Initmem setup node 1 [mem 0x0000000000000000-0x0000000000000000]
[    0.000000] Initmem setup node 2 [mem 0x0000402000000000-0x00004023ffffffff]
[    0.000000] psci: probing for conduit method from ACPI.
[    0.000000] psci: PSCIv1.0 detected in firmware.
[    0.000000] psci: Using standard PSCI v0.2 function IDs
[    0.000000] psci: MIGRATE_INFO_TYPE not supported.
[    0.000000] psci: SMC Calling Convention v1.0
[    0.000000] random: fast init done
[    0.000000] percpu: Embedded 24 pages/cpu @        (ptrval) s59432 r8192 d30680 u98304
[    0.000000] Detected VIPT I-cache on CPU0
[    0.000000] Built 3 zonelists, mobility grouping on.  Total pages: 7998480
[    0.000000] Policy zone: Normal
[    0.000000] Kernel command line: earlycon=pl011,mmio32,0x94080000 console=ttyAMA0,115200 initrd=minifs.cpio.gz acpi=force
[    0.000000] log_buf_len individual max cpu contribution: 4096 bytes
[    0.000000] log_buf_len total cpu_extra contributions: 290816 bytes
...
[    0.141182] Platform MSI: ITS at 0x400202100000 domain created
[    0.146793] PCI/MSI: ITS at 0x202100000 domain created
[    0.151690] PCI/MSI: ITS at 0x400202100000 domain created
[    0.156906] Remapping and enabling EFI services.
[    0.161562]   EFI remap 0x0000000039430000 =>         (ptrval)
[    0.167431]   EFI remap 0x0000000039480000 =>         (ptrval)
[    0.173299]   EFI remap 0x00000000394d0000 =>         (ptrval)
[    0.179167]   EFI remap 0x0000000039520000 =>         (ptrval)
[    0.185037]   EFI remap 0x0000000039650000 =>         (ptrval)
[    0.190905]   EFI remap 0x00000000396b0000 =>         (ptrval)
[    0.196773]   EFI remap 0x0000000039720000 =>         (ptrval)
[    0.202639]   EFI remap 0x0000000039770000 =>         (ptrval)
[    0.208505]   EFI remap 0x00000000397c0000 =>         (ptrval)
[    0.214371]   EFI remap 0x0000000039810000 =>         (ptrval)
[    0.220236]   EFI remap 0x0000000039860000 =>         (ptrval)
[    0.226103]   EFI remap 0x00000000398b0000 =>         (ptrval)
[    0.231973]   EFI remap 0x0000000039900000 =>         (ptrval)
[    0.237840]   EFI remap 0x0000000039970000 =>         (ptrval)
[    0.243710]   EFI remap 0x00000000399c0000 =>         (ptrval)
[    0.249577]   EFI remap 0x0000000039a10000 =>         (ptrval)
[    0.255445]   EFI remap 0x0000000039a60000 =>         (ptrval)
[    0.261313]   EFI remap 0x0000000039ab0000 =>         (ptrval)
[    0.267184]   EFI remap 0x0000000039b00000 =>         (ptrval)
[    0.273047]   EFI remap 0x000000003eb60000 =>         (ptrval)
[    0.278907]   EFI remap 0x0000000080000000 =>         (ptrval)
[    0.284766]   EFI remap 0x0000000202020000 =>         (ptrval)
[    0.290626]   EFI remap 0x0000000204000000 =>         (ptrval)
[    0.296487]   EFI remap 0x0000000206200000 =>         (ptrval)
[    0.318359] smp: Bringing up secondary CPUs ...
[    0.422473] Detected VIPT I-cache on CPU1
[    0.422481] GICv3: CPU1: found redistributor 30001 region 1:0x00000000aa140000
[    0.422502] CPU1: using LPI pending table @0x00000023ee480000
[    0.422543] CPU1: Booted secondary processor 0x0000030001 [0x480fd010]
[    0.522109] Detected VIPT I-cache on CPU2
[    0.522115] GICv3: CPU2: found redistributor 30002 region 2:0x00000000aa180000
[    0.522135] CPU2: using LPI pending table @0x00000023ee4b0000
[    0.522175] CPU2: Booted secondary processor 0x0000030002 [0x480fd010]
[    0.621751] Detected VIPT I-cache on CPU3
...
[    7.328992] CPU70: Booted secondary processor 0x0000070102 [0x480fd010]
[    7.435880] Detected VIPT I-cache on CPU71
[    7.435931] GICv3: CPU71: found redistributor 70103 region 71:0x00004000aa2c0000
[    7.435958] CPU71: using LPI pending table @0x00000023ed340000
[    7.436037] CPU71: Booted secondary processor 0x0000070103 [0x480fd010]
[    7.436131] smp: Brought up 3 nodes, 72 CPUs
[    9.146250] SMP: Total of 72 processors activated.
[    9.151067] CPU features: detected feature: GIC system register CPU interface
[    9.158247] CPU features: detected feature: Privileged Access Never
...
[   11.535050] pci 0000:74:02.0: BAR 5: assigned [mem 0xa2000000-0xa2007fff]
[   11.541829] pci 0000:74:03.0: BAR 5: assigned [mem 0xa2008000-0xa2008fff]
[   11.548610] pci 0000:75:00.0: BAR 2: assigned [mem 0x144000000-0x1443fffff 64bit pref]
[   11.556518] pci 0000:75:00.0: BAR 9: assigned [mem 0x144400000-0x1447effff 64bit pref]
[   11.564426] pci 0000:74:00.0: PCI bridge to [bus 75]
[   11.569382] pci 0000:74:00.0:   bridge window [mem 0x144000000-0x1447fffff 64bit pref]
[   11.577332] ACPI: PCI Root Bridge [PCI6] (domain 0000 [bus 80-9f])
[   11.583505] acpi PNP0A08:06: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[   11.591850] acpi PNP0A08:06: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[   11.599851] CPU002: pci_acpi_scan_root: 175 node 3, online 0
[   11.605503] Unable to handle kernel paging request at virtual address 00001530
[   11.612712] Mem abort info:
[   11.615492]   ESR = 0x96000004
[   11.618534]   Exception class = DABT (current EL), IL = 32 bits
[   11.624440]   SET = 0, FnV = 0
[   11.627481]   EA = 0, S1PTW = 0
[   11.630608] Data abort info:
[   11.633476]   ISV = 0, ISS = 0x00000004
[   11.637299]   CM = 0, WnR = 0
[   11.640254] [0000000000001530] user address but active_mm is swapper
[   11.646594] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[   11.652154] Modules linked in:
[   11.655198] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.16.0-rc1-00491-g204a6cc-dirty #17
[   11.663361] Hardware name: Huawei, 06/02/2018
[   11.673174] pstate: 20c00089 (nzCv daIf +PAN +UAO)
[   11.677955] pc : ___slab_alloc+0x60/0x590
[   11.681952] lr : __slab_alloc.isra.24+0x58/0x70
[   11.686469] sp : ffff00000981b870
[   11.689771] x29: ffff00000981b870 x28: ffffc023eb8d8000 
[   11.695071] x27: ffff7e008fafd600 x26: ffff8023f0407c00 
[   11.700372] x25: ffff8023f0407c00 x24: ffff000008098da0 
[   11.705671] x23: 0000000000000003 x22: 00000000014080c0 
[   11.710971] x21: 0000000000000000 x20: 0000000000000003 
[   11.716271] x19: ffff8023f0b40870 x18: ffffffffffffffff 
[   11.721571] x17: 0000000000000000 x16: 000000000001c200 
[   11.726871] x15: ffff000009423b88 x14: ffff00008958c94f 
[   11.732171] x13: ffff00000958c95d x12: ffff00000943e858 
[   11.737471] x11: ffff00000943e000 x10: 0000000005f5e0ff 
[   11.742771] x9 : 00000000ffffffd0 x8 : 65646f6e20353731 
[   11.748071] x7 : 203a746f6f725f6e x6 : 00000000000002e8 
[   11.753371] x5 : 000000000000000a x4 : ffff8023f0b40870 
[   11.758670] x3 : ffff000008098da0 x2 : 0000000000000003 
[   11.763970] x1 : 0000000000000000 x0 : 0000000000000000 
[   11.769271] Process swapper/0 (pid: 1, stack limit = 0x00000000b21a776c)
[   11.775958] Call trace:
[   11.778392]  ___slab_alloc+0x60/0x590
[   11.782041]  __slab_alloc.isra.24+0x58/0x70
[   11.786212]  kmem_cache_alloc_node+0xe0/0x240
[   11.790556]  pci_acpi_scan_root+0xb0/0x270
[   11.794642]  acpi_pci_root_add+0x29c/0x498
[   11.798726]  acpi_bus_attach+0x104/0x210
[   11.802635]  acpi_bus_attach+0xa4/0x210
[   11.806458]  acpi_bus_attach+0xa4/0x210
[   11.810282]  acpi_bus_scan+0x4c/0xb0
[   11.813846]  acpi_scan_init+0xec/0x24c
[   11.817582]  acpi_init+0x300/0x36c
[   11.820972]  do_one_initcall+0x50/0x158
[   11.824797]  kernel_init_freeable+0x188/0x228
[   11.829141]  kernel_init+0x10/0x100
[   11.832617]  ret_from_fork+0x10/0x18
[   11.836181] Code: 90009081 93407e82 912ec021 f8627821 (f94a9821) 
[   11.842316] ---[ end trace 6b7476446f4b95ab ]---
[   11.846936] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[   11.846936] 
[   11.856079] SMP: stopping secondary CPUs
[   11.860027] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[   11.860027] 

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-19 12:03                             ` Xie XiuQi
@ 2018-06-19 12:07                               ` Michal Hocko
  -1 siblings, 0 replies; 103+ messages in thread
From: Michal Hocko @ 2018-06-19 12:07 UTC (permalink / raw)
  To: Xie XiuQi
  Cc: Punit Agrawal, Hanjun Guo, Bjorn Helgaas, tnowicki, linux-pci,
	Catalin Marinas, Rafael J. Wysocki, Will Deacon,
	Linux Kernel Mailing List, Jarkko Sakkinen, linux-mm,
	wanghuiqiang, Greg Kroah-Hartman, Bjorn Helgaas, Andrew Morton,
	zhongjiang, linux-arm

On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
[...]
> I tested on an arm board with 128 cores and 4 numa nodes, but with CONFIG_NR_CPUS=72.
> Node 3 is then never created, because it has no memory and none of its cpus are brought up.
> But some pci devices may still be associated with node 3 via the ACPI tables.

Could you double check that zonelists for node 3 are generated
correctly?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-19 12:07                               ` Michal Hocko
@ 2018-06-19 12:40                                 ` Xie XiuQi
  -1 siblings, 0 replies; 103+ messages in thread
From: Xie XiuQi @ 2018-06-19 12:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Punit Agrawal, Hanjun Guo, Bjorn Helgaas, tnowicki, linux-pci,
	Catalin Marinas, Rafael J. Wysocki, Will Deacon,
	Linux Kernel Mailing List, Jarkko Sakkinen, linux-mm,
	wanghuiqiang, Greg Kroah-Hartman, Bjorn Helgaas, Andrew Morton,
	zhongjiang, linux-arm

Hi Michal,

On 2018/6/19 20:07, Michal Hocko wrote:
> On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
> [...]
>> I tested on an arm board with 128 cores and 4 numa nodes, but with CONFIG_NR_CPUS=72.
>> Node 3 is then never created, because it has no memory and none of its cpus are brought up.
>> But some pci devices may still be associated with node 3 via the ACPI tables.
> 
> Could you double check that zonelists for node 3 are generated
> correctly?
> 

The zonelists for node 3 are not created at all.

The kernel parses the SRAT table to create the node info, but in this case
the SRAT parse is incomplete: only the first 72 entries are processed.
Node 3 never appears among the parsed entries, because cpu_to_node_map[72] is too small.
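
The relevant guard in acpi_numa_gicc_affinity_init(), simplified (see the
diff quoted later in the thread for the exact code):

	if (cpus_in_srat >= NR_CPUS) {
		pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
			     NR_CPUS);
		return;		/* the pxm -> node mapping below is skipped */
	}

	pxm = pa->proximity_domain;
	node = acpi_map_pxm_to_node(pxm);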

[...]
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70000 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70001 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70002 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70003 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70100 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70101 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70102 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70103 -> Node 2
[    0.000000] ACPI: NUMA: SRAT: cpu_to_node_map[72] is too small, may not be able to use all cpus
[    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x2080000000-0x23ffffffff]
[    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
[    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x402000000000-0x4023ffffffff]
[    0.000000] NUMA: NODE_DATA [mem 0x23ffffe780-0x23ffffffff]
[    0.000000] NUMA: Initmem setup node 1 [<memory-less node>]
[    0.000000] NUMA: NODE_DATA [mem 0x4023fffed780-0x4023fffeefff]
[    0.000000] NUMA: NODE_DATA(1) on node 2
[    0.000000] NUMA: NODE_DATA [mem 0x4023fffebf00-0x4023fffed77f]
[    0.000000] Zone ranges:
[    0.000000]   DMA32    [mem 0x0000000000000000-0x00000000ffffffff]
[    0.000000]   Normal   [mem 0x0000000100000000-0x00004023ffffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000003942ffff]
[    0.000000]   node   0: [mem 0x0000000039430000-0x000000003956ffff]
[    0.000000]   node   0: [mem 0x0000000039570000-0x000000003963ffff]
[    0.000000]   node   0: [mem 0x0000000039640000-0x00000000396fffff]
[    0.000000]   node   0: [mem 0x0000000039700000-0x000000003971ffff]
[    0.000000]   node   0: [mem 0x0000000039720000-0x0000000039b6ffff]
[    0.000000]   node   0: [mem 0x0000000039b70000-0x000000003eb5ffff]
[    0.000000]   node   0: [mem 0x000000003eb60000-0x000000003eb8ffff]
[    0.000000]   node   0: [mem 0x000000003eb90000-0x000000003fbfffff]
[    0.000000]   node   0: [mem 0x0000002080000000-0x00000023ffffffff]
[    0.000000]   node   2: [mem 0x0000402000000000-0x00004023ffffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x00000023ffffffff]
[    0.000000] Could not find start_pfn for node 1
[    0.000000] Initmem setup node 1 [mem 0x0000000000000000-0x0000000000000000]
[    0.000000] Initmem setup node 2 [mem 0x0000402000000000-0x00004023ffffffff]



-- 
Thanks,
Xie XiuQi


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-19 12:07                               ` Michal Hocko
@ 2018-06-19 12:52                                 ` Punit Agrawal
  -1 siblings, 0 replies; 103+ messages in thread
From: Punit Agrawal @ 2018-06-19 12:52 UTC (permalink / raw)
  To: Michal Hocko, Xie XiuQi
  Cc: Hanjun Guo, Bjorn Helgaas, tnowicki, linux-pci, Catalin Marinas,
	Rafael J. Wysocki, Will Deacon, Linux Kernel Mailing List,
	Jarkko Sakkinen, linux-mm, wanghuiqiang, Greg Kroah-Hartman,
	Bjorn Helgaas, Andrew Morton, zhongjiang, linux-arm

Michal Hocko <mhocko@kernel.org> writes:

> On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
> [...]
>> I tested on an arm board with 128 cores and 4 numa nodes, but with CONFIG_NR_CPUS=72.
>> Node 3 is then never created, because it has no memory and none of its cpus are brought up.
>> But some pci devices may still be associated with node 3 via the ACPI tables.
>
> Could you double check that zonelists for node 3 are generated
> correctly?

The cpus in node 3 aren't onlined and there's no memory attached - I
suspect that no zonelists are built for this node.

We skip creating a node if the number of SRAT entries parsed exceeds
NR_CPUS [0]. This in turn prevents onlining the numa node, and so no
zonelists will be created for it.

I think the problem will go away if the cpus are restricted via the
kernel command line by setting nr_cpus.
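
(Concretely - an illustration, not something tested here: keep
CONFIG_NR_CPUS=128 in the build and boot with e.g.

	nr_cpus=72

on the kernel command line; the compile-time NR_CPUS guard in the SRAT
parser then never fires, so all four nodes are registered while only 72
cpus are brought up.)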

Xie, can you try the below patch on top of the one enabling memoryless
nodes? I'm not sure this is the right solution but at least it'll
confirm the problem.

Thanks,
Punit

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/acpi_numa.c?h=v4.18-rc1#n73

-- >8 --
diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
index d190a7b231bf..fea0f7164f1a 100644
--- a/arch/arm64/kernel/acpi_numa.c
+++ b/arch/arm64/kernel/acpi_numa.c
@@ -70,11 +70,9 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
        if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
                return;

-	if (cpus_in_srat >= NR_CPUS) {
+	if (cpus_in_srat >= NR_CPUS)
 		pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
 			     NR_CPUS);
-		return;
-	}

        pxm = pa->proximity_domain;
        node = acpi_map_pxm_to_node(pxm);

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-19 12:52                                 ` Punit Agrawal
@ 2018-06-19 14:08                                   ` Lorenzo Pieralisi
  -1 siblings, 0 replies; 103+ messages in thread
From: Lorenzo Pieralisi @ 2018-06-19 14:08 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: Michal Hocko, Xie XiuQi, Hanjun Guo, Bjorn Helgaas, tnowicki,
	linux-pci, Catalin Marinas, Rafael J. Wysocki, Will Deacon,
	Linux Kernel Mailing List, Jarkko Sakkinen, linux-mm,
	wanghuiqiang, Greg Kroah-Hartman, Bjorn Helgaas, Andrew Morton,
	zhongjiang, linux-arm

On Tue, Jun 19, 2018 at 01:52:16PM +0100, Punit Agrawal wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
> > [...]
> >> I tested on an arm board with 128 cores and 4 numa nodes, but with CONFIG_NR_CPUS=72.
> >> Node 3 is then never created, because it has no memory and none of its cpus are brought up.
> >> But some pci devices may still be associated with node 3 via the ACPI tables.
> >
> > Could you double check that zonelists for node 3 are generated
> > correctly?
> 
> The cpus in node 3 aren't onlined and there's no memory attached - I
> suspect that no zonelists are built for this node.
> 
> We skip creating a node if the number of SRAT entries parsed exceeds
> NR_CPUS [0]. This in turn prevents onlining the numa node, and so no
> zonelists will be created for it.
> 
> I think the problem will go away if the cpus are restricted via the
> kernel command line by setting nr_cpus.
> 
> Xie, can you try the below patch on top of the one enabling memoryless
> nodes? I'm not sure this is the right solution but at least it'll
> confirm the problem.

This issue looks familiar (or at least related):

git log d3bd058826aa

The reason why the NR_CPUS guard is there is to avoid overflowing
the early_node_cpu_hwid array. IA64 does something different in
that respect compared to x86; we will have to look into this.
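
For context, a rough sketch of that array and the guard (paraphrased from
the v4.17-era arch/arm64/kernel/acpi_numa.c; the field names here are
approximate):

	static struct __node_cpu_hwid {
		u32 node_id;	/* logical node for this cpu */
		u64 cpu_hwid;	/* MPIDR */
	} early_node_cpu_hwid[NR_CPUS];

	/* acpi_numa_gicc_affinity_init() returns early once cpus_in_srat
	 * reaches NR_CPUS, so the stores below cannot overflow the array */
	early_node_cpu_hwid[cpus_in_srat].node_id = node;
	early_node_cpu_hwid[cpus_in_srat].cpu_hwid = mpidr;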

Regardless, AFAICS the proximity-domain-to-node mappings should not
depend on CONFIG_NR_CPUS; it seems something is wrong there in the
arm64 ACPI SRAT parsing.

Lorenzo

> 
> Thanks,
> Punit
> 
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/acpi_numa.c?h=v4.18-rc1#n73
> 
> -- >8 --
> diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
> index d190a7b231bf..fea0f7164f1a 100644
> --- a/arch/arm64/kernel/acpi_numa.c
> +++ b/arch/arm64/kernel/acpi_numa.c
> @@ -70,11 +70,9 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
>         if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
>                 return;
> 
> -	if (cpus_in_srat >= NR_CPUS) {
> +	if (cpus_in_srat >= NR_CPUS)
>  		pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
>  			     NR_CPUS);
> -		return;
> -	}
> 
>         pxm = pa->proximity_domain;
>         node = acpi_map_pxm_to_node(pxm);

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-19 14:08                                   ` Lorenzo Pieralisi
@ 2018-06-19 14:54                                     ` Punit Agrawal
  -1 siblings, 0 replies; 103+ messages in thread
From: Punit Agrawal @ 2018-06-19 14:54 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Michal Hocko, Xie XiuQi, Hanjun Guo, Bjorn Helgaas, tnowicki,
	linux-pci, Catalin Marinas, Rafael J. Wysocki, Will Deacon,
	Linux Kernel Mailing List, Jarkko Sakkinen, linux-mm,
	wanghuiqiang, Greg Kroah-Hartman, Bjorn Helgaas, Andrew Morton,
	zhongjiang, linux-arm

Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> writes:

> On Tue, Jun 19, 2018 at 01:52:16PM +0100, Punit Agrawal wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
>> > [...]
>> >> I tested on an arm board with 128 cores and 4 numa nodes, but with CONFIG_NR_CPUS=72.
>> >> Node 3 is then never created, because it has no memory and none of its cpus are brought up.
>> >> But some pci devices may still be associated with node 3 via the ACPI tables.
>> >
>> > Could you double check that zonelists for node 3 are generated
>> > correctly?
>> 
>> The cpus in node 3 aren't onlined and there's no memory attached - I
>> suspect that no zonelists are built for this node.
>> 
>> We skip creating a node if the number of SRAT entries parsed exceeds
>> NR_CPUS [0]. This in turn prevents onlining the numa node, and so no
>> zonelists will be created for it.
>> 
>> I think the problem will go away if the cpus are restricted via the
>> kernel command line by setting nr_cpus.
>> 
>> Xie, can you try the below patch on top of the one enabling memoryless
>> nodes? I'm not sure this is the right solution but at least it'll
>> confirm the problem.
>
> This issue looks familiar (or at least related):
>
> git log d3bd058826aa

Indeed. Thanks for digging into this.

>
> The reason why the NR_CPUS guard is there is to avoid overflowing
> the early_node_cpu_hwid array.

Ah right... I missed that. The below patch is definitely not what we
want.

> IA64 does something different in
> that respect compared to x86; we will have to look into this.
>
> Regardless, AFAICS the proximity-domain-to-node mappings should not
> depend on CONFIG_NR_CPUS; it seems that something is wrong in the
> ARM64 ACPI SRAT parsing.

It is not only SRAT parsing; it looks like there is a similar restriction
while parsing the ACPI MADT in acpi_map_gic_cpu_interface().

When NR_CPUS is restricted, the incomplete parsing makes the result
depend on the SRAT and the MADT listing their entries in the same
order. We want to parse the entire table in both cases so that the
code is robust to reordering of entries.
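
As an illustration only (a sketch with a hypothetical register_early_cpu()
helper, not a tested patch), walking every SRAT GICC entry while capping
just the per-cpu registration removes that ordering dependency:

static void __init srat_parse_gicc(struct acpi_srat_gicc_affinity *pa)
{
	int node = acpi_map_pxm_to_node(pa->proximity_domain);

	/* Always record the node, even past the NR_CPUS cap. */
	node_set(node, numa_nodes_parsed);

	if (cpus_in_srat >= NR_CPUS)
		return;	/* cap only the cpu registration */

	register_early_cpu(pa, node);	/* hypothetical helper */
	cpus_in_srat++;
}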

In terms of $SUBJECT, I wonder if it's worth taking the original patch
as a temporary fix (it'll also be easier to backport) while we work on
fixing these other issues and enabling memoryless nodes.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-19 14:54                                     ` Punit Agrawal
@ 2018-06-19 15:14                                       ` Michal Hocko
  -1 siblings, 0 replies; 103+ messages in thread
From: Michal Hocko @ 2018-06-19 15:14 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: Lorenzo Pieralisi, Xie XiuQi, Hanjun Guo, Bjorn Helgaas,
	tnowicki, linux-pci, Catalin Marinas, Rafael J. Wysocki,
	Will Deacon, Linux Kernel Mailing List, Jarkko Sakkinen,
	linux-mm, wanghuiqiang, Greg Kroah-Hartman, Bjorn Helgaas,
	Andrew Morton, zhongjiang, linux-arm

On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
[...]
> In terms of $SUBJECT, I wonder if it's worth taking the original patch
> as a temporary fix (it'll also be easier to backport) while we work on
> fixing these other issues and enabling memoryless nodes.

Well, x86 already does that, but copying this antipattern is not really
nice. So it is good as a quick fix, but it would definitely be much
better to have a robust fix. Who knows how many other places might hit
this. You certainly do not want to add a hack like this all over...
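
The caller-side hack in question looks roughly like this (a sketch of
the pattern, not the proposed patch verbatim):

	/* Antipattern: every caller clamps the node before allocating. */
	int node = acpi_get_node(root->device->handle);

	if (node != NUMA_NO_NODE && !node_online(node))
		node = NUMA_NO_NODE;	/* fall back to any online node */

	info = kzalloc_node(sizeof(*info), GFP_KERNEL, node);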
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-19 15:14                                       ` Michal Hocko
@ 2018-06-19 15:35                                         ` Punit Agrawal
  -1 siblings, 0 replies; 103+ messages in thread
From: Punit Agrawal @ 2018-06-19 15:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Lorenzo Pieralisi, Xie XiuQi, Hanjun Guo, Bjorn Helgaas,
	tnowicki, linux-pci, Catalin Marinas, Rafael J. Wysocki,
	Will Deacon, Linux Kernel Mailing List, Jarkko Sakkinen,
	linux-mm, wanghuiqiang, Greg Kroah-Hartman, Bjorn Helgaas,
	Andrew Morton, zhongjiang, linux-arm

Michal Hocko <mhocko@kernel.org> writes:

> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> [...]
>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>> as a temporary fix (it'll also be easier to backport) while we work on
>> fixing these other issues and enabling memoryless nodes.
>
> Well, x86 already does that, but copying this antipattern is not really
> nice. So it is good as a quick fix, but it would definitely be much
> better to have a robust fix. Who knows how many other places might hit
> this. You certainly do not want to add a hack like this all over...

Completely agree! I was only suggesting it as a temporary measure,
especially as it looked like a proper fix might be invasive.

Another fix might be to change the node-specific allocation to a
node-agnostic one. It isn't clear why the allocation is being
requested from a specific node. I think Lorenzo suggested this in one of
the threads.
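
Concretely, a sketch of the alternative, assuming the pci_acpi_scan_root()
allocation from the original report:

	/* Node-specific: ties the allocation to a possibly offline node. */
	info = kzalloc_node(sizeof(*info), GFP_KERNEL, node);

	/* Node-agnostic: let the allocator pick any suitable node. */
	info = kzalloc(sizeof(*info), GFP_KERNEL);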

I've started putting together a set fixing the issues identified in this
thread. It should give a better idea of the best course of action.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-19 15:35                                         ` Punit Agrawal
@ 2018-06-19 16:32                                           ` Lorenzo Pieralisi
  -1 siblings, 0 replies; 103+ messages in thread
From: Lorenzo Pieralisi @ 2018-06-19 16:32 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: Michal Hocko, Xie XiuQi, Hanjun Guo, Bjorn Helgaas, tnowicki,
	linux-pci, Catalin Marinas, Rafael J. Wysocki, Will Deacon,
	Linux Kernel Mailing List, Jarkko Sakkinen, linux-mm,
	wanghuiqiang, Greg Kroah-Hartman, Bjorn Helgaas, Andrew Morton,
	zhongjiang, linux-arm

On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> > [...]
> >> In terms of $SUBJECT, I wonder if it's worth taking the original patch
> >> as a temporary fix (it'll also be easier to backport) while we work on
> >> fixing these other issues and enabling memoryless nodes.
> >
> > Well, x86 already does that, but copying this antipattern is not really
> > nice. So it is good as a quick fix, but it would definitely be much
> > better to have a robust fix. Who knows how many other places might hit
> > this. You certainly do not want to add a hack like this all over...
> 
> Completely agree! I was only suggesting it as a temporary measure,
> especially as it looked like a proper fix might be invasive.
> 
> Another fix might be to change the node-specific allocation to a
> node-agnostic one. It isn't clear why the allocation is being
> requested from a specific node. I think Lorenzo suggested this in one of
> the threads.

I think that code was just copy-pasted, but it is better to fix the
underlying issue.

> I've started putting together a set fixing the issues identified in this
> thread. It should give a better idea of the best course of action.

On ACPI ARM64, this diff should do it if I read the code correctly; it
should be (famous last words) just a matter of mapping PXMs to nodes for
every SRAT GICC entry. Feel free to pick it up if it works.

Yes, we can take the original patch just because it is safer for an -rc
cycle. That said, if the patch below does the job, delaying the fix for a
couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms) is
not a disaster.

Lorenzo

-- >8 --
diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
index d190a7b231bf..877b268ef9fa 100644
--- a/arch/arm64/kernel/acpi_numa.c
+++ b/arch/arm64/kernel/acpi_numa.c
@@ -70,12 +70,6 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
 	if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
 		return;
 
-	if (cpus_in_srat >= NR_CPUS) {
-		pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
-			     NR_CPUS);
-		return;
-	}
-
 	pxm = pa->proximity_domain;
 	node = acpi_map_pxm_to_node(pxm);
 
@@ -85,6 +79,14 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
 		return;
 	}
 
+	node_set(node, numa_nodes_parsed);
+
+	if (cpus_in_srat >= NR_CPUS) {
+		pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
+			     NR_CPUS);
+		return;
+	}
+
 	mpidr = acpi_map_madt_entry(pa->acpi_processor_uid);
 	if (mpidr == PHYS_CPUID_INVALID) {
 		pr_err("SRAT: PXM %d with ACPI ID %d has no valid MPIDR in MADT\n",
@@ -95,7 +97,6 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
 
 	early_node_cpu_hwid[cpus_in_srat].node_id = node;
 	early_node_cpu_hwid[cpus_in_srat].cpu_hwid =  mpidr;
-	node_set(node, numa_nodes_parsed);
 	cpus_in_srat++;
 	pr_info("SRAT: PXM %d -> MPIDR 0x%Lx -> Node %d\n",
 		pxm, mpidr, node);

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-19 16:32                                           ` Lorenzo Pieralisi
@ 2018-06-20  3:31                                             ` Xie XiuQi
  -1 siblings, 0 replies; 103+ messages in thread
From: Xie XiuQi @ 2018-06-20  3:31 UTC (permalink / raw)
  To: Lorenzo Pieralisi, Punit Agrawal
  Cc: Michal Hocko, Hanjun Guo, Bjorn Helgaas, tnowicki, linux-pci,
	Catalin Marinas, Rafael J. Wysocki, Will Deacon,
	Linux Kernel Mailing List, Jarkko Sakkinen, linux-mm,
	wanghuiqiang, Greg Kroah-Hartman, Bjorn Helgaas, Andrew Morton,
	zhongjiang, linux-arm

Hi Lorenzo, Punit,


On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>>
>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>>> [...]
>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>>>> as a temporary fix (it'll also be easier to backport) while we work on
>>>> fixing these other issues and enabling memoryless nodes.
>>>
>>> Well, x86 already does that, but copying this antipattern is not really
>>> nice. So it is good as a quick fix, but it would definitely be much
>>> better to have a robust fix. Who knows how many other places might hit
>>> this. You certainly do not want to add a hack like this all over...
>>
>> Completely agree! I was only suggesting it as a temporary measure,
>> especially as it looked like a proper fix might be invasive.
>>
>> Another fix might be to change the node-specific allocation to a
>> node-agnostic one. It isn't clear why the allocation is being
>> requested from a specific node. I think Lorenzo suggested this in one of
>> the threads.
> 
> I think that code was just copy-pasted, but it is better to fix the
> underlying issue.
> 
>> I've started putting together a set fixing the issues identified in this
>> thread. It should give a better idea of the best course of action.
> 
> On ACPI ARM64, this diff should do it if I read the code correctly; it
> should be (famous last words) just a matter of mapping PXMs to nodes for
> every SRAT GICC entry. Feel free to pick it up if it works.
> 
> Yes, we can take the original patch just because it is safer for an -rc
> cycle. That said, if the patch below does the job, delaying the fix for a
> couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms) is
> not a disaster.

I tested this patch on my arm board; it works.

-- 
Thanks,
Xie XiuQi

> 
> Lorenzo
> 
> -- >8 --
> diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
> index d190a7b231bf..877b268ef9fa 100644
> --- a/arch/arm64/kernel/acpi_numa.c
> +++ b/arch/arm64/kernel/acpi_numa.c
> @@ -70,12 +70,6 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
>  	if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
>  		return;
>  
> -	if (cpus_in_srat >= NR_CPUS) {
> -		pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
> -			     NR_CPUS);
> -		return;
> -	}
> -
>  	pxm = pa->proximity_domain;
>  	node = acpi_map_pxm_to_node(pxm);
>  
> @@ -85,6 +79,14 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
>  		return;
>  	}
>  
> +	node_set(node, numa_nodes_parsed);
> +
> +	if (cpus_in_srat >= NR_CPUS) {
> +		pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
> +			     NR_CPUS);
> +		return;
> +	}
> +
>  	mpidr = acpi_map_madt_entry(pa->acpi_processor_uid);
>  	if (mpidr == PHYS_CPUID_INVALID) {
>  		pr_err("SRAT: PXM %d with ACPI ID %d has no valid MPIDR in MADT\n",
> @@ -95,7 +97,6 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
>  
>  	early_node_cpu_hwid[cpus_in_srat].node_id = node;
>  	early_node_cpu_hwid[cpus_in_srat].cpu_hwid =  mpidr;
> -	node_set(node, numa_nodes_parsed);
>  	cpus_in_srat++;
>  	pr_info("SRAT: PXM %d -> MPIDR 0x%Lx -> Node %d\n",
>  		pxm, mpidr, node);
> 
> .
> 




^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-20  3:31                                             ` Xie XiuQi
@ 2018-06-20 11:51                                               ` Punit Agrawal
  -1 siblings, 0 replies; 103+ messages in thread
From: Punit Agrawal @ 2018-06-20 11:51 UTC (permalink / raw)
  To: Xie XiuQi
  Cc: Lorenzo Pieralisi, Michal Hocko, Hanjun Guo, Bjorn Helgaas,
	tnowicki, linux-pci, Catalin Marinas, Rafael J. Wysocki,
	Will Deacon, Linux Kernel Mailing List, Jarkko Sakkinen,
	linux-mm, wanghuiqiang, Greg Kroah-Hartman, Bjorn Helgaas,
	Andrew Morton, zhongjiang, linux-arm

Xie XiuQi <xiexiuqi@huawei.com> writes:

> Hi Lorenzo, Punit,
>
>
> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>>> Michal Hocko <mhocko@kernel.org> writes:
>>>
>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>>>> [...]
>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>>>>> as a temporary fix (it'll also be easier to backport) while we work on
>>>>> fixing these other issues and enabling memoryless nodes.
>>>>
>>>> Well, x86 already does that, but copying this antipattern is not really
>>>> nice. So it is good as a quick fix, but it would definitely be much
>>>> better to have a robust fix. Who knows how many other places might hit
>>>> this. You certainly do not want to add a hack like this all over...
>>>
>>> Completely agree! I was only suggesting it as a temporary measure,
>>> especially as it looked like a proper fix might be invasive.
>>>
>>> Another fix might be to change the node-specific allocation to a
>>> node-agnostic one. It isn't clear why the allocation is being
>>> requested from a specific node. I think Lorenzo suggested this in one of
>>> the threads.
>> 
>> I think that code was just copy-pasted, but it is better to fix the
>> underlying issue.
>> 
>>> I've started putting together a set fixing the issues identified in this
>>> thread. It should give a better idea of the best course of action.
>> 
>> On ACPI ARM64, this diff should do it if I read the code correctly; it
>> should be (famous last words) just a matter of mapping PXMs to nodes for
>> every SRAT GICC entry. Feel free to pick it up if it works.
>> 
>> Yes, we can take the original patch just because it is safer for an -rc
>> cycle. That said, if the patch below does the job, delaying the fix for a
>> couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms) is
>> not a disaster.
>
> I tested this patch on my arm board; it works.

I am assuming you tried the patch without enabling support for
memory-less nodes.

The patch de-couples the onlining of numa nodes (as parsed from SRAT)
from the NR_CPUS restriction. When it comes to building zonelists, the node
referenced by the PCI controller also has zonelists initialised.

So it looks like a fallback node is set up even if we don't have
memory-less nodes enabled. I need to stare some more at the code to see
why we need memory-less nodes at all then ...
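
To spell out what this relies on (a sketch, assuming node 3 is online
but memoryless): a node-targeted allocation no longer oopses, it simply
falls back through node 3's zonelist.

	/* Succeeds: served from another node via node 3's fallback list. */
	void *p = kzalloc_node(64, GFP_KERNEL, 3);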


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-20 11:51                                               ` Punit Agrawal
@ 2018-06-22  8:58                                                 ` Hanjun Guo
  -1 siblings, 0 replies; 103+ messages in thread
From: Hanjun Guo @ 2018-06-22  8:58 UTC (permalink / raw)
  To: Punit Agrawal, Xie XiuQi
  Cc: Lorenzo Pieralisi, Michal Hocko, Bjorn Helgaas, tnowicki,
	linux-pci, Catalin Marinas, Rafael J. Wysocki, Will Deacon,
	Linux Kernel Mailing List, Jarkko Sakkinen, linux-mm,
	wanghuiqiang, Greg Kroah-Hartman, Bjorn Helgaas, Andrew Morton,
	zhongjiang, linux-arm

On 2018/6/20 19:51, Punit Agrawal wrote:
> Xie XiuQi <xiexiuqi@huawei.com> writes:
> 
>> Hi Lorenzo, Punit,
>>
>>
>> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
>>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>>>> Michal Hocko <mhocko@kernel.org> writes:
>>>>
>>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>>>>> [...]
>>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>>>>>> as a temporary fix (it'll also be easier to backport) while we work on
>>>>>> fixing these other issues and enabling memoryless nodes.
>>>>>
>>>>> Well, x86 already does that, but copying this antipattern is not really
>>>>> nice. So it is good as a quick fix, but it would definitely be much
>>>>> better to have a robust fix. Who knows how many other places might hit
>>>>> this. You certainly do not want to add a hack like this all over...
>>>>
>>>> Completely agree! I was only suggesting it as a temporary measure,
>>>> especially as it looked like a proper fix might be invasive.
>>>>
>>>> Another fix might be to change the node-specific allocation to a
>>>> node-agnostic one. It isn't clear why the allocation is being
>>>> requested from a specific node. I think Lorenzo suggested this in one of
>>>> the threads.
>>>
>>> I think that code was just copy-pasted, but it is better to fix the
>>> underlying issue.
>>>
>>>> I've started putting together a set fixing the issues identified in this
>>>> thread. It should give a better idea of the best course of action.
>>>
>>> On ACPI ARM64, this diff should do it if I read the code correctly; it
>>> should be (famous last words) just a matter of mapping PXMs to nodes for
>>> every SRAT GICC entry. Feel free to pick it up if it works.
>>>
>>> Yes, we can take the original patch just because it is safer for an -rc
>>> cycle. That said, if the patch below does the job, delaying the fix for a
>>> couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms) is
>>> not a disaster.
>>
>> I tested this patch on my arm board; it works.
> 
> I am assuming you tried the patch without enabling support for
> memory-less nodes.
> 
> The patch de-couples the onlining of numa nodes (as parsed from SRAT)
> from the NR_CPUS restriction. When it comes to building zonelists, the node
> referenced by the PCI controller also has zonelists initialised.
> 
> So it looks like a fallback node is set up even if we don't have
> memory-less nodes enabled. I need to stare some more at the code to see
> why we need memory-less nodes at all then ...

Yes, please. From my limited MM knowledge, zonelists should not be
initialised if there is no CPU and no memory on this node; correct me
if I'm wrong.
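
A quick way to check (a debugging sketch, not a fix) would be to dump
which possible nodes are online and whether they carry memory:

	int nid;

	for_each_node_state(nid, N_POSSIBLE)
		pr_info("node %d: online=%d memory=%d\n", nid,
			node_online(nid), node_state(nid, N_MEMORY));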

Thanks
Hanjun


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-22  8:58                                                 ` Hanjun Guo
@ 2018-06-22  9:11                                                   ` Michal Hocko
  -1 siblings, 0 replies; 103+ messages in thread
From: Michal Hocko @ 2018-06-22  9:11 UTC (permalink / raw)
  To: Hanjun Guo
  Cc: Punit Agrawal, Xie XiuQi, Lorenzo Pieralisi, Bjorn Helgaas,
	tnowicki, linux-pci, Catalin Marinas, Rafael J. Wysocki,
	Will Deacon, Linux Kernel Mailing List, Jarkko Sakkinen,
	linux-mm, wanghuiqiang, Greg Kroah-Hartman, Bjorn Helgaas,
	Andrew Morton, zhongjiang, linux-arm

On Fri 22-06-18 16:58:05, Hanjun Guo wrote:
> On 2018/6/20 19:51, Punit Agrawal wrote:
> > Xie XiuQi <xiexiuqi@huawei.com> writes:
> > 
> >> Hi Lorenzo, Punit,
> >>
> >>
> >> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
> >>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
> >>>> Michal Hocko <mhocko@kernel.org> writes:
> >>>>
> >>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> >>>>> [...]
> >>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
> >>>>>> as a temporary fix (it'll also be easier to backport) while we work on
> >>>>>> fixing these other issues and enabling memoryless nodes.
> >>>>>
> >>>>> Well, x86 already does that but copying this antipattern is not really
> >>>>> nice. So it is good as a quick fix but it would definitely be much
> >>>>> better to have a robust fix. Who knows how many other places might hit
> >>>>> this. You certainly do not want to add a hack like this all over...
> >>>>
> >>>> Completely agree! I was only suggesting it as a temporary measure,
> >>>> especially as it looked like a proper fix might be invasive.
> >>>>
> >>>> Another fix might be to change the node specific allocation to node
> >>>> agnostic allocations. It isn't clear why the allocation is being
> >>>> requested from a specific node. I think Lorenzo suggested this in one of
> >>>> the threads.
> >>>
> >>> I think that code was just copy-pasted but it is better to fix the
> >>> underlying issue.
> >>>
> >>>> I've started putting together a set fixing the issues identified in this
> >>>> thread. It should give a better idea on the best course of action.
> >>>
> >>> On ACPI ARM64, this diff should do if I read the code correctly, it
> >>> should be (famous last words) just a matter of mapping PXMs to nodes for
> >>> every SRAT GICC entry, feel free to pick it up if it works.
> >>>
> >>> Yes, we can take the original patch just because it is safer for an -rc
> >>> cycle, even though, if the patch below does the job, delaying the fix
> >>> for a couple of -rcs (to get it tested across ACPI ARM64 NUMA
> >>> platforms) is not a disaster.
> >>
> >> I tested this patch on my arm board, it works.
> > 
> > I am assuming you tried the patch without enabling support for
> > memory-less nodes.
> > 
> > The patch de-couples the onlining of numa nodes (as parsed from SRAT)
> > from NR_CPUS restriction. When it comes to building zonelists, the node
> > referenced by the PCI controller also has zonelists initialised.
> > 
> > So it looks like a fallback node is set up even if we don't have
> > memory-less nodes enabled. I need to stare some more at the code to see
> > why we need memory-less nodes at all then ...
> 
> Yes, please. From my limited MM knowledge, zonelists should not be
> initialised if there is no CPU and no memory on this node; correct me
> if I'm wrong.

Well, as long as there is code which can explicitly ask for a specific
node, it is safer to have zonelists configured. Otherwise you just
force callers to add hacks and figure out the proper placement
themselves. Zonelists should be cheap to configure for all possible
nodes. It's not like we are talking about a huge amount of resources.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread
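
To make the fallback Michal describes concrete: with zonelists built for
every possible node, a node-targeted allocation against a memory-less
node just walks that node's zonelist and is satisfied from the nearest
node that has memory. A minimal sketch (hypothetical driver code, not
from any patch in this thread):

    #include <linux/slab.h>

    /*
     * 'nid' may name a node with no local memory. Once that node's
     * zonelists are built, the page allocator falls back along them
     * instead of dereferencing the uninitialised node's pgdat (the
     * oops at the top of this thread).
     */
    static void *alloc_on_node(size_t size, int nid)
    {
            return kzalloc_node(size, GFP_KERNEL, nid);
    }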

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-22  9:11                                                   ` Michal Hocko
@ 2018-06-22 10:24                                                     ` Punit Agrawal
  -1 siblings, 0 replies; 103+ messages in thread
From: Punit Agrawal @ 2018-06-22 10:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hanjun Guo, Xie XiuQi, Lorenzo Pieralisi, Bjorn Helgaas,
	tnowicki, linux-pci, Catalin Marinas, Rafael J. Wysocki,
	Will Deacon, Linux Kernel Mailing List, Jarkko Sakkinen,
	linux-mm, wanghuiqiang, Greg Kroah-Hartman, Bjorn Helgaas,
	Andrew Morton, zhongjiang, linux-arm

Michal Hocko <mhocko@kernel.org> writes:

> On Fri 22-06-18 16:58:05, Hanjun Guo wrote:
>> On 2018/6/20 19:51, Punit Agrawal wrote:
>> > Xie XiuQi <xiexiuqi@huawei.com> writes:
>> > 
>> >> Hi Lorenzo, Punit,
>> >>
>> >>
>> >> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
>> >>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>> >>>> Michal Hocko <mhocko@kernel.org> writes:
>> >>>>
>> >>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>> >>>>> [...]
>> >>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>> >>>>>> as a temporary fix (it'll also be easier to backport) while we work on
>> >>>>>> fixing these other issues and enabling memoryless nodes.
>> >>>>>
>> >>>>> Well, x86 already does that but copying this antipattern is not really
>> >>>>> nice. So it is good as a quick fix but it would definitely be much
>> >>>>> better to have a robust fix. Who knows how many other places might hit
>> >>>>> this. You certainly do not want to add a hack like this all over...
>> >>>>
>> >>>> Completely agree! I was only suggesting it as a temporary measure,
>> >>>> especially as it looked like a proper fix might be invasive.
>> >>>>
>> >>>> Another fix might be to change the node specific allocation to node
>> >>>> agnostic allocations. It isn't clear why the allocation is being
>> >>>> requested from a specific node. I think Lorenzo suggested this in one of
>> >>>> the threads.
>> >>>
>> >>> I think that code was just copy-pasted but it is better to fix the
>> >>> underlying issue.
>> >>>
>> >>>> I've started putting together a set fixing the issues identified in this
>> >>>> thread. It should give a better idea on the best course of action.
>> >>>
>> >>> On ACPI ARM64, this diff should do if I read the code correctly, it
>> >>> should be (famous last words) just a matter of mapping PXMs to nodes for
>> >>> every SRAT GICC entry, feel free to pick it up if it works.
>> >>>
>> >>> Yes, we can take the original patch just because it is safer for an -rc
>> >>> cycle, even though, if the patch below does the job, delaying the fix
>> >>> for a couple of -rcs (to get it tested across ACPI ARM64 NUMA
>> >>> platforms) is not a disaster.
>> >>
>> >> I tested this patch on my arm board, it works.
>> > 
>> > I am assuming you tried the patch without enabling support for
>> > memory-less nodes.
>> > 
>> > The patch de-couples the onlining of numa nodes (as parsed from SRAT)
>> > from NR_CPUS restriction. When it comes to building zonelists, the node
>> > referenced by the PCI controller also has zonelists initialised.
>> > 
>> > So it looks like a fallback node is set up even if we don't have
>> > memory-less nodes enabled. I need to stare some more at the code to see
>> > why we need memory-less nodes at all then ...
>> 
>> Yes, please. From my limited MM knowledge, zonelists should not be
>> initialised if there is no CPU and no memory on this node; correct me
>> if I'm wrong.
>
> Well, as long as there is code which can explicitly ask for a specific
> node, it is safer to have zonelists configured. Otherwise you just
> force callers to add hacks and figure out the proper placement
> themselves. Zonelists should be cheap to configure for all possible
> nodes. It's not like we are talking about a huge amount of resources.

I agree. The current problem stems from not configuring the zonelists
for nodes that have neither onlined CPUs nor memory. Lorenzo's patch
fixes the configuration of such nodes.

For allocation requests targeting memory-less nodes, the allocator will
take the slow path and fall back to one of the other nodes based on the
zonelists.

I'm not sure how common such allocations are but I'll work on enabling
CONFIG_HAVE_MEMORYLESS_NODES on top of Lorenzo's patch. AIUI, this
config improves the fallback mechanism by starting the search from a
nearby node with memory.

^ permalink raw reply	[flat|nested] 103+ messages in thread
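
For reference, the "original patch" pattern discussed above (the same
stop-gap x86 uses) amounts to guarding the node-targeted allocation; a
rough sketch of the idea only, with a hypothetical 'dev', not the
literal diff:

    #include <linux/device.h>
    #include <linux/nodemask.h>
    #include <linux/numa.h>
    #include <linux/slab.h>

    static void *alloc_near_device(struct device *dev, size_t size)
    {
            int nid = dev_to_node(dev);

            /* Guard against firmware handing us an offline node. */
            if (nid != NUMA_NO_NODE && !node_online(nid))
                    nid = NUMA_NO_NODE;

            return kzalloc_node(size, GFP_KERNEL, nid);
    }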

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-22 10:24                                                     ` Punit Agrawal
@ 2018-06-22 17:42                                                       ` Jonathan Cameron
  -1 siblings, 0 replies; 103+ messages in thread
From: Jonathan Cameron @ 2018-06-22 17:42 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: Michal Hocko, Lorenzo Pieralisi, Catalin Marinas, tnowicki,
	Xie XiuQi, linux-pci, Rafael J. Wysocki, Will Deacon,
	Linux Kernel Mailing List, Jarkko Sakkinen, linux-mm,
	Greg Kroah-Hartman, Bjorn Helgaas, linux-arm, Hanjun Guo,
	Bjorn Helgaas, Andrew Morton, zhongjiang, wanghuiqiang

On Fri, 22 Jun 2018 11:24:38 +0100
Punit Agrawal <punit.agrawal@arm.com> wrote:

> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Fri 22-06-18 16:58:05, Hanjun Guo wrote:  
> >> On 2018/6/20 19:51, Punit Agrawal wrote:  
> >> > Xie XiuQi <xiexiuqi@huawei.com> writes:
> >> >   
> >> >> Hi Lorenzo, Punit,
> >> >>
> >> >>
> >> >> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:  
> >> >>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:  
> >> >>>> Michal Hocko <mhocko@kernel.org> writes:
> >> >>>>  
> >> >>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> >> >>>>> [...]  
> >> >>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
> >> >>>>>> as a temporary fix (it'll also be easier to backport) while we work on
> >> >>>>>> fixing these other issues and enabling memoryless nodes.  
> >> >>>>>
> >> >>>>> Well, x86 already does that but copying this antipattern is not really
> >> >>>>> nice. So it is good as a quick fix but it would definitely be much
> >> >>>>> better to have a robust fix. Who knows how many other places might hit
> >> >>>>> this. You certainly do not want to add a hack like this all over...  
> >> >>>>
> >> >>>> Completely agree! I was only suggesting it as a temporary measure,
> >> >>>> especially as it looked like a proper fix might be invasive.
> >> >>>>
> >> >>>> Another fix might be to change the node specific allocation to node
> >> >>>> agnostic allocations. It isn't clear why the allocation is being
> >> >>>> requested from a specific node. I think Lorenzo suggested this in one of
> >> >>>> the threads.  
> >> >>>
> >> >>> I think that code was just copy-pasted but it is better to fix the
> >> >>> underlying issue.
> >> >>>  
> >> >>>> I've started putting together a set fixing the issues identified in this
> >> >>>> thread. It should give a better idea on the best course of action.  
> >> >>>
> >> >>> On ACPI ARM64, this diff should do if I read the code correctly, it
> >> >>> should be (famous last words) just a matter of mapping PXMs to nodes for
> >> >>> every SRAT GICC entry, feel free to pick it up if it works.
> >> >>>
> >> >>> Yes, we can take the original patch just because it is safer for an -rc
> >> >>> cycle, even though, if the patch below does the job, delaying the fix
> >> >>> for a couple of -rcs (to get it tested across ACPI ARM64 NUMA
> >> >>> platforms) is not a disaster.
> >> >>
> >> >> I tested this patch on my arm board, it works.  
> >> > 
> >> > I am assuming you tried the patch without enabling support for
> >> > memory-less nodes.
> >> > 
> >> > The patch de-couples the onlining of numa nodes (as parsed from SRAT)
> >> > from NR_CPUS restriction. When it comes to building zonelists, the node
> >> > referenced by the PCI controller also has zonelists initialised.
> >> > 
> >> > So it looks like a fallback node is set up even if we don't have
> >> > memory-less nodes enabled. I need to stare some more at the code to see
> >> > why we need memory-less nodes at all then ...  
> >> 
> >> Yes, please. From my limited MM knowledge, zonelists should not be
> >> initialised if there is no CPU and no memory on this node; correct me
> >> if I'm wrong.
> >
> > Well, as long as there is code which can explicitly ask for a specific
> > node, it is safer to have zonelists configured. Otherwise you just
> > force callers to add hacks and figure out the proper placement
> > themselves. Zonelists should be cheap to configure for all possible
> > nodes. It's not like we are talking about a huge amount of resources.
> 
> I agree. The current problem stems from not configuring the zonelists
> for nodes that have neither onlined CPUs nor memory. Lorenzo's patch
> fixes the configuration of such nodes.
> 
> For allocation requests targeting memory-less nodes, the allocator will
> take the slow path and fall back to one of the other nodes based on the
> zonelists.
> 
> I'm not sure how common such allocations are but I'll work on enabling
> CONFIG_HAVE_MEMORYLESS_NODES on top of Lorenzo's patch. AIUI, this
> config improves the fallback mechanism by starting the search from a
> nearby node with memory.

I'll test it when back in the office, but I had a similar issue with
memory-only nodes when I moved the SRAT listing for CPUs from the
4th node to the 3rd node to fake some memory I could hot-unplug.
This left a memory-only node as the last node on the system.

When I instead moved CPUs from the 3rd node to the 4th (so the node
with only memory was now in the middle), everything worked.

It was odd, and I'd been meaning to chase it down but hadn't gotten to
it yet. If I get time I'll put together some test firmwares to see if
there are any other nasty corner cases we aren't handling.

Jonathan

> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/2] arm64: avoid alloc memory on offline node
  2018-06-22 17:42                                                       ` Jonathan Cameron
@ 2018-06-26 17:27                                                         ` Punit Agrawal
  -1 siblings, 0 replies; 103+ messages in thread
From: Punit Agrawal @ 2018-06-26 17:27 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Michal Hocko, Lorenzo Pieralisi, Catalin Marinas, tnowicki,
	Xie XiuQi, linux-pci, Rafael J. Wysocki, Will Deacon,
	Linux Kernel Mailing List, Jarkko Sakkinen, linux-mm,
	Greg Kroah-Hartman, Bjorn Helgaas, linux-arm, Hanjun Guo,
	Bjorn Helgaas, Andrew Morton, zhongjiang, wanghuiqiang

Jonathan Cameron <jonathan.cameron@huawei.com> writes:

[...]

>
> I'll test it when back in the office, but I had a similar issue with
> memory only nodes when I moved the SRAT listing for cpus from the 4
> 4th mode to the 3rd node to fake some memory I could hot unplug.
> This gave a memory only node for the last node on the system.
>
> When I instead moved cpus from the 3rd node to the 4th (so the node
> with only memory was now in the middle, everything worked).
>
> Was odd, and I'd been meaning to chase it down but hadn't gotten to it
> yet.  If I get time I'll put together some test firmwares as see if there
> are any other nasty corner cases we aren't handling.

If you get a chance, it'd be really helpful to test reversing the
ordering of entries in the SRAT and booting with a restricted
NR_CPUS.

This issue was found through code inspection.

Please make sure to use the updated patch from Lorenzo for your
tests[0].

[0] https://marc.info/?l=linux-acpi&m=152998665713983&w=2

>
> Jonathan
>
>> 
>> _______________________________________________
>> linux-arm-kernel mailing list
>> linux-arm-kernel@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 103+ messages in thread
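
The shape of Lorenzo's fix, as described upthread, is to map a node for
every proximity domain referenced by an SRAT GICC entry, independent of
how many CPUs end up being brought up. A sketch against the 4.17-era
ACPI NUMA helpers (names borrowed from drivers/acpi/numa.c; any
deviation from the actual patch is ours):

    static int __init
    gicc_pxm_to_node(struct acpi_subtable_header *header,
                     const unsigned long end)
    {
            struct acpi_srat_gicc_affinity *pa =
                    (struct acpi_srat_gicc_affinity *)header;
            int node;

            if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
                    return 0;

            /*
             * Reserve a node id for this PXM even if no CPU on it is
             * ever onlined (e.g. because of a small NR_CPUS).
             */
            node = acpi_map_pxm_to_node(pa->proximity_domain);
            if (node != NUMA_NO_NODE)
                    node_set(node, numa_nodes_parsed);

            return 0;
    }

    /* hooked up via something like:
     * acpi_table_parse_srat(ACPI_SRAT_TYPE_GICC_AFFINITY,
     *                       gicc_pxm_to_node, 0);
     */

Booting with the CPU count clamped (e.g. nr_cpus=8 on the command line,
or a small CONFIG_NR_CPUS) then exercises exactly the case where
trailing SRAT GICC entries name CPUs that are never brought up.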

end of thread, other threads:[~2018-06-26 17:27 UTC | newest]

Thread overview: 103+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-31 12:14 [PATCH 0/2] arm64/drivers: avoid alloc memory on offline node Xie XiuQi
2018-05-31 12:14 ` Xie XiuQi
2018-05-31 12:14 ` [PATCH 1/2] arm64: " Xie XiuQi
2018-05-31 12:14   ` Xie XiuQi
2018-06-06 15:45   ` Will Deacon
2018-06-06 15:45     ` Will Deacon
2018-06-06 20:39     ` Bjorn Helgaas
2018-06-06 20:39       ` Bjorn Helgaas
2018-06-06 20:39       ` Bjorn Helgaas
2018-06-07 10:55       ` Michal Hocko
2018-06-07 10:55         ` Michal Hocko
2018-06-07 10:55         ` Michal Hocko
2018-06-07 11:55         ` Hanjun Guo
2018-06-07 11:55           ` Hanjun Guo
2018-06-07 11:55           ` Hanjun Guo
2018-06-07 11:55           ` Hanjun Guo
2018-06-07 12:21           ` Michal Hocko
2018-06-07 12:21             ` Michal Hocko
2018-06-07 12:21             ` Michal Hocko
2018-06-11  3:23             ` Xie XiuQi
2018-06-11  3:23               ` Xie XiuQi
2018-06-11  3:23               ` Xie XiuQi
2018-06-11  3:23               ` Xie XiuQi
2018-06-11  8:52               ` Michal Hocko
2018-06-11  8:52                 ` Michal Hocko
2018-06-11  8:52                 ` Michal Hocko
2018-06-11 12:32                 ` Xie XiuQi
2018-06-11 12:32                   ` Xie XiuQi
2018-06-11 12:32                   ` Xie XiuQi
2018-06-11 12:32                   ` Xie XiuQi
2018-06-11 13:43                   ` Bjorn Helgaas
2018-06-11 13:43                     ` Bjorn Helgaas
2018-06-11 13:43                     ` Bjorn Helgaas
2018-06-11 14:53                     ` Michal Hocko
2018-06-11 14:53                       ` Michal Hocko
2018-06-12 15:08                       ` Punit Agrawal
2018-06-12 15:08                         ` Punit Agrawal
2018-06-12 15:08                         ` Punit Agrawal
2018-06-12 15:20                         ` Michal Hocko
2018-06-12 15:20                           ` Michal Hocko
2018-06-13 17:39                         ` Punit Agrawal
2018-06-13 17:39                           ` Punit Agrawal
2018-06-13 17:39                           ` Punit Agrawal
2018-06-14  6:23                           ` Hanjun Guo
2018-06-14  6:23                             ` Hanjun Guo
2018-06-14  6:23                             ` Hanjun Guo
2018-06-19 12:03                           ` Xie XiuQi
2018-06-19 12:03                             ` Xie XiuQi
2018-06-19 12:03                             ` Xie XiuQi
2018-06-19 12:07                             ` Michal Hocko
2018-06-19 12:07                               ` Michal Hocko
2018-06-19 12:40                               ` Xie XiuQi
2018-06-19 12:40                                 ` Xie XiuQi
2018-06-19 12:40                                 ` Xie XiuQi
2018-06-19 12:52                               ` Punit Agrawal
2018-06-19 12:52                                 ` Punit Agrawal
2018-06-19 12:52                                 ` Punit Agrawal
2018-06-19 12:52                                 ` Punit Agrawal
2018-06-19 14:08                                 ` Lorenzo Pieralisi
2018-06-19 14:08                                   ` Lorenzo Pieralisi
2018-06-19 14:54                                   ` Punit Agrawal
2018-06-19 14:54                                     ` Punit Agrawal
2018-06-19 14:54                                     ` Punit Agrawal
2018-06-19 14:54                                     ` Punit Agrawal
2018-06-19 15:14                                     ` Michal Hocko
2018-06-19 15:14                                       ` Michal Hocko
2018-06-19 15:35                                       ` Punit Agrawal
2018-06-19 15:35                                         ` Punit Agrawal
2018-06-19 15:35                                         ` Punit Agrawal
2018-06-19 15:35                                         ` Punit Agrawal
2018-06-19 16:32                                         ` Lorenzo Pieralisi
2018-06-19 16:32                                           ` Lorenzo Pieralisi
2018-06-20  3:31                                           ` Xie XiuQi
2018-06-20  3:31                                             ` Xie XiuQi
2018-06-20  3:31                                             ` Xie XiuQi
2018-06-20 11:51                                             ` Punit Agrawal
2018-06-20 11:51                                               ` Punit Agrawal
2018-06-20 11:51                                               ` Punit Agrawal
2018-06-20 11:51                                               ` Punit Agrawal
2018-06-22  8:58                                               ` Hanjun Guo
2018-06-22  8:58                                                 ` Hanjun Guo
2018-06-22  8:58                                                 ` Hanjun Guo
2018-06-22  9:11                                                 ` Michal Hocko
2018-06-22  9:11                                                   ` Michal Hocko
2018-06-22 10:24                                                   ` Punit Agrawal
2018-06-22 10:24                                                     ` Punit Agrawal
2018-06-22 10:24                                                     ` Punit Agrawal
2018-06-22 10:24                                                     ` Punit Agrawal
2018-06-22 17:42                                                     ` Jonathan Cameron
2018-06-22 17:42                                                       ` Jonathan Cameron
2018-06-22 17:42                                                       ` Jonathan Cameron
2018-06-26 17:27                                                       ` Punit Agrawal
2018-06-26 17:27                                                         ` Punit Agrawal
2018-06-26 17:27                                                         ` Punit Agrawal
2018-06-26 17:27                                                         ` Punit Agrawal
2018-06-26 17:27                                                       ` Punit Agrawal
2018-06-26 17:27                                                         ` Punit Agrawal
2018-06-26 17:27                                                         ` Punit Agrawal
2018-06-26 17:27                                                         ` Punit Agrawal
2018-05-31 12:14 ` [PATCH 2/2] drivers: check numa node's online status in dev_to_node Xie XiuQi
2018-05-31 12:14   ` Xie XiuQi
2018-05-31 14:00 ` [PATCH 0/2] arm64/drivers: avoid alloc memory on offline node Hanjun Guo
2018-05-31 14:00   ` Hanjun Guo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.