linux-mm.kvack.org archive mirror
* [Patch V3 0/9] Enable memoryless node support for x86
@ 2015-08-17  3:18 Jiang Liu
  2015-08-17  3:18 ` [Patch V3 1/9] x86, NUMA, ACPI: Online node earlier when doing CPU hot-addition Jiang Liu
                   ` (10 more replies)
  0 siblings, 11 replies; 38+ messages in thread
From: Jiang Liu @ 2015-08-17  3:18 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
	Peter Zijlstra, Rafael J . Wysocki, Tang Chen, Tejun Heo
  Cc: Jiang Liu, Tony Luck, linux-mm, linux-hotplug, linux-kernel, x86

This is the third version of the patch set to enable memoryless node support
on x86 platforms. The previous version (https://lkml.org/lkml/2014/7/11/75)
blindly replaced numa_node_id()/cpu_to_node() with numa_mem_id()/
cpu_to_mem(). As Tejun and Peter pointed out, that's not the right
solution because:
1) We shouldn't shift the burden to normal slab users.
2) Details of memoryless nodes should be hidden in arch and mm code
   as much as possible.

After digging into more code and documentation, we found that the rules
for dealing with memoryless nodes should be:
1) Arch code should online corresponding NUMA node before onlining any
   CPU or memory, otherwise it may cause invalid memory access when
   accessing NODE_DATA(nid).
2) For normal memory allocations without __GFP_THISNODE setting in the
   gfp_flags, we should prefer numa_node_id()/cpu_to_node() instead of
   numa_mem_id()/cpu_to_mem() because the latter loses hardware topology
   information as pointed out by Tejun:
	   A - B - X - C - D
	Where X is the memless node.  numa_mem_id() on X would return
	either B or C, right?  If B or C can't satisfy the allocation,
	the allocator would fallback to A from B and D for C, both of
	which aren't optimal. It should first fall back to C or B
	respectively, which the allocator can't do anymore because the
	information is lost when the caller side performs numa_mem_id().
3) For memory allocations with __GFP_THISNODE set in gfp_flags,
   numa_node_id()/cpu_to_node() should be used if the caller only wants to
   allocate from local memory, otherwise numa_mem_id()/cpu_to_mem()
   should be used if the caller wants to allocate from the nearest node
   with memory (see the sketch after this list).
4) numa_mem_id()/cpu_to_mem() should be used if the caller wants to check
   whether a page was allocated from the nearest node with memory.
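
A minimal sketch of how rules 2) and 3) translate into code (the helper
below is hypothetical and only for illustration; numa_node_id(),
numa_mem_id() and kmalloc_node() are existing kernel APIs):

	static void *alloc_following_rules(size_t size, gfp_t flags)
	{
		int nid;

		if (flags & __GFP_THISNODE)
			/*
			 * Rule 3: allocation is restricted to one node, so ask
			 * for the nearest node with memory.  A caller that
			 * really wants local memory only (and can tolerate
			 * failure) would keep numa_node_id() here.
			 */
			nid = numa_mem_id();
		else
			/*
			 * Rule 2: no node restriction, so keep the real local
			 * node and let the zonelist fallback preserve the
			 * hardware topology described above.
			 */
			nid = numa_node_id();

		return kmalloc_node(size, flags, nid);
	}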

Based on the above rules, this patch set does the following:
1) Patch 1 is a bugfix that resolves a crash caused by socket hot-addition.
2) Patch 2 replaces numa_mem_id() with numa_node_id() where __GFP_THISNODE
   isn't set in gfp_flags.
3) Patches 3-6 replace numa_node_id()/cpu_to_node() with numa_mem_id()/
   cpu_to_mem() where the caller wants to allocate from the local node only.
4) Patches 7-9 enable support of memoryless nodes on x86.

With this patch set applied, on a system with two sockets enabled at boot,
one with memory and the other without, we got the following NUMA topology
after boot:
root@bkd04sdp:~# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
node 0 size: 15940 MB
node 0 free: 15397 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
node 1 size: 0 MB
node 1 free: 0 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

After hot-adding the third socket without memory, we got:
root@bkd04sdp:~# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
node 0 size: 15940 MB
node 0 free: 15142 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node distances:
node   0   1   2
  0:  10  21  21
  1:  21  10  21
  2:  21  21  10

Jiang Liu (9):
  x86, NUMA, ACPI: Online node earlier when doing CPU hot-addition
  kernel/profile.c: Replace cpu_to_mem() with cpu_to_node()
  sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless
    node
  openvswitch: Replace cpu_to_node() with cpu_to_mem() to support
    memoryless node
  i40e: Use numa_mem_id() to better support memoryless node
  i40evf: Use numa_mem_id() to better support memoryless node
  x86, numa: Kill useless code to improve code readability
  mm: Update _mem_id_[] for every possible CPU when memory
    configuration changes
  mm, x86: Enable memoryless node support to better support CPU/memory
    hotplug

 arch/x86/Kconfig                              |    3 ++
 arch/x86/kernel/acpi/boot.c                   |    9 +++-
 arch/x86/kernel/smpboot.c                     |    2 +
 arch/x86/mm/numa.c                            |   59 +++++++++++++++----------
 drivers/misc/sgi-xp/xpc_uv.c                  |    2 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |    2 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |    2 +-
 kernel/profile.c                              |    2 +-
 mm/page_alloc.c                               |   10 ++---
 net/openvswitch/flow.c                        |    2 +-
 10 files changed, 59 insertions(+), 34 deletions(-)

-- 
1.7.10.4


* [Patch V3 1/9] x86, NUMA, ACPI: Online node earlier when doing CPU hot-addition
  2015-08-17  3:18 [Patch V3 0/9] Enable memoryless node support for x86 Jiang Liu
@ 2015-08-17  3:18 ` Jiang Liu
  2015-08-17  3:18 ` [Patch V3 2/9] kernel/profile.c: Replace cpu_to_mem() with cpu_to_node() Jiang Liu
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 38+ messages in thread
From: Jiang Liu @ 2015-08-17  3:18 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
	Peter Zijlstra, Rafael J . Wysocki, Tang Chen, Tejun Heo,
	Rafael J. Wysocki, Len Brown, Pavel Machek, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86
  Cc: Jiang Liu, Tony Luck, linux-mm, linux-hotplug, linux-kernel, linux-pm

With the typical CPU hot-addition flow on x86, PCI host bridges embedded
in a physical processor are always associated with NUMA_NO_NODE, which
may cause sub-optimal performance.
1) Handle CPU hot-addition notification
	acpi_processor_add()
		acpi_processor_get_info()
			acpi_processor_hotadd_init()
				acpi_map_lsapic()
1.a)					acpi_map_cpu2node()

2) Handle PCI host bridge hot-addition notification
	acpi_pci_root_add()
		pci_acpi_scan_root()
2.a)			if (node != NUMA_NO_NODE && !node_online(node)) node = NUMA_NO_NODE;

3) Handle memory hot-addition notification
	acpi_memory_device_add()
		acpi_memory_enable_device()
			add_memory()
3.a)				node_set_online();

4) Online CPUs through sysfs interfaces
	cpu_subsys_online()
		cpu_up()
			try_online_node()
4.a)				node_set_online();

So at step 2.a the associated node is always offline, because it is not
onlined until step 3.a or 4.a.

We could improve performance by onlining the node at step 1.a. This change
also makes the code symmetric: nodes are always created when handling
CPU/memory hot-addition events rather than when handling user requests
from sysfs interfaces, and are destroyed when handling CPU/memory
hot-removal events.

It also closes a race window caused by kmalloc_node(cpu_to_node(cpu)),
which may cause a system panic like the one below.
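
The failure path can be sketched as follows (hypothetical helper that only
illustrates what callers such as alloc_fair_sched_group() effectively do;
cpu_to_node(), NODE_DATA() and kmalloc_node() are existing kernel APIs):

	static void *alloc_buf_for_cpu(int cpu, size_t size)
	{
		int nid = cpu_to_node(cpu);	/* may name a not-yet-onlined node */

		/*
		 * For an offline node, NODE_DATA(nid) has never been allocated,
		 * so the page allocator dereferences a near-NULL pgdat/zonelist
		 * pointer in __alloc_pages_nodemask(), consistent with the
		 * fault address in the oops below.
		 */
		return kmalloc_node(size, GFP_KERNEL, nid);
	}
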
[ 3663.324476] BUG: unable to handle kernel paging request at 0000000000001f08
[ 3663.332348] IP: [<ffffffff81172219>] __alloc_pages_nodemask+0xb9/0x2d0
[ 3663.339719] PGD 82fe10067 PUD 82ebef067 PMD 0
[ 3663.344773] Oops: 0000 [#1] SMP
[ 3663.348455] Modules linked in: shpchp gpio_ich x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd microcode joydev sb_edac edac_core lpc_ich ipmi_si tpm_tis ipmi_msghandler ioatdma wmi acpi_pad mac_hid lp parport ixgbe isci mpt2sas dca ahci ptp libsas libahci raid_class pps_core scsi_transport_sas mdio hid_generic usbhid hid
[ 3663.394393] CPU: 61 PID: 2416 Comm: cron Tainted: G        W    3.14.0-rc5+ #21
[ 3663.402643] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRIVTIN1.86B.0047.F03.1403031049 03/03/2014
[ 3663.414299] task: ffff88082fe54b00 ti: ffff880845fba000 task.ti: ffff880845fba000
[ 3663.422741] RIP: 0010:[<ffffffff81172219>]  [<ffffffff81172219>] __alloc_pages_nodemask+0xb9/0x2d0
[ 3663.432857] RSP: 0018:ffff880845fbbcd0  EFLAGS: 00010246
[ 3663.439265] RAX: 0000000000001f00 RBX: 0000000000000000 RCX: 0000000000000000
[ 3663.447291] RDX: 0000000000000000 RSI: 0000000000000a8d RDI: ffffffff81a8d950
[ 3663.455318] RBP: ffff880845fbbd58 R08: ffff880823293400 R09: 0000000000000001
[ 3663.463345] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000002052d0
[ 3663.471363] R13: ffff880854c07600 R14: 0000000000000002 R15: 0000000000000000
[ 3663.479389] FS:  00007f2e8b99e800(0000) GS:ffff88105a400000(0000) knlGS:0000000000000000
[ 3663.488514] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3663.495018] CR2: 0000000000001f08 CR3: 00000008237b1000 CR4: 00000000001407e0
[ 3663.503476] Stack:
[ 3663.505757]  ffffffff811bd74d ffff880854c01d98 ffff880854c01df0 ffff880854c01dd0
[ 3663.514167]  00000003208ca420 000000075a5d84d0 ffff88082fe54b00 ffffffff811bb35f
[ 3663.522567]  ffff880854c07600 0000000000000003 0000000000001f00 ffff880845fbbd48
[ 3663.530976] Call Trace:
[ 3663.533753]  [<ffffffff811bd74d>] ? deactivate_slab+0x41d/0x4f0
[ 3663.540421]  [<ffffffff811bb35f>] ? new_slab+0x3f/0x2d0
[ 3663.546307]  [<ffffffff811bb3c5>] new_slab+0xa5/0x2d0
[ 3663.552001]  [<ffffffff81768c97>] __slab_alloc+0x35d/0x54a
[ 3663.558185]  [<ffffffff810a4845>] ? local_clock+0x25/0x30
[ 3663.564686]  [<ffffffff8177a34c>] ? __do_page_fault+0x4ec/0x5e0
[ 3663.571356]  [<ffffffff810b0054>] ? alloc_fair_sched_group+0xc4/0x190
[ 3663.578609]  [<ffffffff810c77f1>] ? __raw_spin_lock_init+0x21/0x60
[ 3663.585570]  [<ffffffff811be476>] kmem_cache_alloc_node_trace+0xa6/0x1d0
[ 3663.593112]  [<ffffffff810b0054>] ? alloc_fair_sched_group+0xc4/0x190
[ 3663.600363]  [<ffffffff810b0054>] alloc_fair_sched_group+0xc4/0x190
[ 3663.607423]  [<ffffffff810a359f>] sched_create_group+0x3f/0x80
[ 3663.613994]  [<ffffffff810b611f>] sched_autogroup_create_attach+0x3f/0x1b0
[ 3663.621732]  [<ffffffff8108258a>] sys_setsid+0xea/0x110
[ 3663.628020]  [<ffffffff8177f42d>] system_call_fastpath+0x1a/0x1f
[ 3663.634780] Code: 00 44 89 e7 e8 b9 f8 f4 ff 41 f6 c4 10 74 18 31 d2 be 8d 0a 00 00 48 c7 c7 50 d9 a8 81 e8 70 6a f2 ff e8 db dd 5f 00 48 8b 45 c8 <48> 83 78 08 00 0f 84 b5 01 00 00 48 83 c0 08 44 89 75 c0 4d 89
[ 3663.657032] RIP  [<ffffffff81172219>] __alloc_pages_nodemask+0xb9/0x2d0
[ 3663.664491]  RSP <ffff880845fbbcd0>
[ 3663.668429] CR2: 0000000000001f08
[ 3663.672659] ---[ end trace df13f08ed9de18ad ]---

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
---
 arch/x86/kernel/acpi/boot.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index e49ee24da85e..07930e1d2fe9 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -704,6 +704,11 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 
 	nid = acpi_get_node(handle);
 	if (nid != -1) {
+		if (try_online_node(nid)) {
+			pr_warn("failed to online node%d for CPU%d, use node%d instead.\n",
+				nid, cpu, first_node(node_online_map));
+			nid = first_node(node_online_map);
+		}
 		set_apicid_to_node(physid, nid);
 		numa_set_node(cpu, nid);
 	}
-- 
1.7.10.4


* [Patch V3 2/9] kernel/profile.c: Replace cpu_to_mem() with cpu_to_node()
  2015-08-17  3:18 [Patch V3 0/9] Enable memoryless node support for x86 Jiang Liu
  2015-08-17  3:18 ` [Patch V3 1/9] x86, NUMA, ACPI: Online node earlier when doing CPU hot-addition Jiang Liu
@ 2015-08-17  3:18 ` Jiang Liu
  2015-08-18  0:31   ` David Rientjes
  2015-08-17  3:19 ` [Patch V3 3/9] sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless node Jiang Liu
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 38+ messages in thread
From: Jiang Liu @ 2015-08-17  3:18 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
	Peter Zijlstra, Rafael J . Wysocki, Tang Chen, Tejun Heo
  Cc: Jiang Liu, Tony Luck, linux-mm, linux-hotplug, linux-kernel, x86

Function profile_cpu_callback() allocates memory without specifying
__GFP_THISNODE flag, so replace cpu_to_mem() with cpu_to_node()
because cpu_to_mem() may cause suboptimal memory allocation if
there's no free memory on the node returned by cpu_to_mem().

It's safe to use cpu_to_mem() because build_all_zonelists() also
builds suitable fallback zonelist for memoryless node.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
---
 kernel/profile.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/profile.c b/kernel/profile.c
index a7bcd28d6e9f..d14805bdcc4c 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -336,7 +336,7 @@ static int profile_cpu_callback(struct notifier_block *info,
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		node = cpu_to_mem(cpu);
+		node = cpu_to_node(cpu);
 		per_cpu(cpu_profile_flip, cpu) = 0;
 		if (!per_cpu(cpu_profile_hits, cpu)[1]) {
 			page = alloc_pages_exact_node(node,
-- 
1.7.10.4


* [Patch V3 3/9] sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless node
  2015-08-17  3:18 [Patch V3 0/9] Enable memoryless node support for x86 Jiang Liu
  2015-08-17  3:18 ` [Patch V3 1/9] x86, NUMA, ACPI: Online node earlier when doing CPU hot-addition Jiang Liu
  2015-08-17  3:18 ` [Patch V3 2/9] kernel/profile.c: Replace cpu_to_mem() with cpu_to_node() Jiang Liu
@ 2015-08-17  3:19 ` Jiang Liu
  2015-08-18  0:25   ` David Rientjes
  2015-08-19 11:52   ` Robin Holt
  2015-08-17  3:19 ` [Patch V3 4/9] openvswitch: " Jiang Liu
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 38+ messages in thread
From: Jiang Liu @ 2015-08-17  3:19 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
	Peter Zijlstra, Rafael J . Wysocki, Tang Chen, Tejun Heo,
	Cliff Whickman, Robin Holt
  Cc: Jiang Liu, Tony Luck, linux-mm, linux-hotplug, linux-kernel, x86

Function xpc_create_gru_mq_uv() allocates memory with __GFP_THISNODE
flag set, which may cause permanent memory allocation failure on
memoryless node. So replace cpu_to_node() with cpu_to_mem() to better
support memoryless node. For node with memory, cpu_to_mem() is the same
as cpu_to_node().

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
---
 drivers/misc/sgi-xp/xpc_uv.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c
index 95c894482fdd..9210981c0d5b 100644
--- a/drivers/misc/sgi-xp/xpc_uv.c
+++ b/drivers/misc/sgi-xp/xpc_uv.c
@@ -238,7 +238,7 @@ xpc_create_gru_mq_uv(unsigned int mq_size, int cpu, char *irq_name,
 
 	mq->mmr_blade = uv_cpu_to_blade_id(cpu);
 
-	nid = cpu_to_node(cpu);
+	nid = cpu_to_mem(cpu);
 	page = alloc_pages_exact_node(nid,
 				      GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
 				      pg_order);
-- 
1.7.10.4


* [Patch V3 4/9] openvswitch: Replace cpu_to_node() with cpu_to_mem() to support memoryless node
  2015-08-17  3:18 [Patch V3 0/9] Enable memoryless node support for x86 Jiang Liu
                   ` (2 preceding siblings ...)
  2015-08-17  3:19 ` [Patch V3 3/9] sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless node Jiang Liu
@ 2015-08-17  3:19 ` Jiang Liu
  2015-08-18  0:14   ` Pravin Shelar
  2015-08-17  3:19 ` [Patch V3 5/9] i40e: Use numa_mem_id() to better " Jiang Liu
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 38+ messages in thread
From: Jiang Liu @ 2015-08-17  3:19 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
	Peter Zijlstra, Rafael J . Wysocki, Tang Chen, Tejun Heo,
	Pravin Shelar, David S. Miller
  Cc: Jiang Liu, Tony Luck, linux-mm, linux-hotplug, linux-kernel, x86,
	netdev, dev

Function ovs_flow_stats_update() allocates memory with __GFP_THISNODE
flag set, which may cause permanent memory allocation failure on
memoryless node. So replace cpu_to_node() with cpu_to_mem() to better
support memoryless node. For node with memory, cpu_to_mem() is the same
as cpu_to_node().

This change only affects performance and shouldn't affect functionality.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
---
 net/openvswitch/flow.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index bc7b0aba994a..e50a5681d0c2 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -69,7 +69,7 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
 			   const struct sk_buff *skb)
 {
 	struct flow_stats *stats;
-	int node = numa_node_id();
+	int node = numa_mem_id();
 	int len = skb->len + (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0);
 
 	stats = rcu_dereference(flow->stats[node]);
-- 
1.7.10.4


* [Patch V3 5/9] i40e: Use numa_mem_id() to better support memoryless node
  2015-08-17  3:18 [Patch V3 0/9] Enable memoryless node support for x86 Jiang Liu
                   ` (3 preceding siblings ...)
  2015-08-17  3:19 ` [Patch V3 4/9] openvswitch: " Jiang Liu
@ 2015-08-17  3:19 ` Jiang Liu
  2015-08-18  0:35   ` David Rientjes
  2015-08-19 22:38   ` [Intel-wired-lan] " Patil, Kiran
  2015-08-17  3:19 ` [Patch V3 6/9] i40evf: " Jiang Liu
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 38+ messages in thread
From: Jiang Liu @ 2015-08-17  3:19 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
	Peter Zijlstra, Rafael J . Wysocki, Tang Chen, Tejun Heo,
	Jeff Kirsher, Jesse Brandeburg, Shannon Nelson, Carolyn Wyborny,
	Don Skidmore, Matthew Vick, John Ronciak, Mitch Williams
  Cc: Jiang Liu, Tony Luck, linux-mm, linux-hotplug, linux-kernel, x86,
	intel-wired-lan, netdev

Function i40e_clean_rx_irq() tries to reuse memory pages allocated
from the nearest node. To better support memoryless node, use
numa_mem_id() instead of numa_node_id() to get the nearest node with
memory.

This change should only affect performance.
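
The reuse decision referred to above boils down to a node check; a minimal
sketch (hypothetical helper; the in-tree code performs an equivalent
page_to_nid() comparison inline):

	static bool rx_page_is_local(struct page *page, int current_node)
	{
		/*
		 * Only recycle a receive page into the RX ring when it sits on
		 * the local memory node, which is why current_node should name
		 * a node that actually has memory, i.e. numa_mem_id().
		 */
		return page_to_nid(page) == current_node;
	}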

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 9a4f2bc70cd2..a8f618cb8eb0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1516,7 +1516,7 @@ static int i40e_clean_rx_irq_ps(struct i40e_ring *rx_ring, int budget)
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 	u16 rx_packet_len, rx_header_len, rx_sph, rx_hbo;
 	u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
-	const int current_node = numa_node_id();
+	const int current_node = numa_mem_id();
 	struct i40e_vsi *vsi = rx_ring->vsi;
 	u16 i = rx_ring->next_to_clean;
 	union i40e_rx_desc *rx_desc;
-- 
1.7.10.4


* [Patch V3 6/9] i40evf: Use numa_mem_id() to better support memoryless node
  2015-08-17  3:18 [Patch V3 0/9] Enable memoryless node support for x86 Jiang Liu
                   ` (4 preceding siblings ...)
  2015-08-17  3:19 ` [Patch V3 5/9] i40e: Use numa_mem_id() to better " Jiang Liu
@ 2015-08-17  3:19 ` Jiang Liu
  2015-08-17 19:03   ` [Intel-wired-lan] " Patil, Kiran
  2015-08-17  3:19 ` [Patch V3 7/9] x86, numa: Kill useless code to improve code readability Jiang Liu
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 38+ messages in thread
From: Jiang Liu @ 2015-08-17  3:19 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
	Peter Zijlstra, Rafael J . Wysocki, Tang Chen, Tejun Heo,
	Jeff Kirsher, Jesse Brandeburg, Shannon Nelson, Carolyn Wyborny,
	Don Skidmore, Matthew Vick, John Ronciak, Mitch Williams
  Cc: Jiang Liu, Tony Luck, linux-mm, linux-hotplug, linux-kernel, x86,
	intel-wired-lan, netdev

Function i40e_clean_rx_irq() tries to reuse memory pages allocated
from the nearest node. To better support memoryless node, use
numa_mem_id() instead of numa_node_id() to get the nearest node with
memory.

This change should only affect performance.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
---
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 395f32f226c0..19ca96d8bd97 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -1003,7 +1003,7 @@ static int i40e_clean_rx_irq_ps(struct i40e_ring *rx_ring, int budget)
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 	u16 rx_packet_len, rx_header_len, rx_sph, rx_hbo;
 	u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
-	const int current_node = numa_node_id();
+	const int current_node = numa_mem_id();
 	struct i40e_vsi *vsi = rx_ring->vsi;
 	u16 i = rx_ring->next_to_clean;
 	union i40e_rx_desc *rx_desc;
-- 
1.7.10.4


* [Patch V3 7/9] x86, numa: Kill useless code to improve code readability
  2015-08-17  3:18 [Patch V3 0/9] Enable memoryless node support for x86 Jiang Liu
                   ` (5 preceding siblings ...)
  2015-08-17  3:19 ` [Patch V3 6/9] i40evf: " Jiang Liu
@ 2015-08-17  3:19 ` Jiang Liu
  2015-08-17  3:19 ` [Patch V3 8/9] mm: Update _mem_id_[] for every possible CPU when memory configuration changes Jiang Liu
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 38+ messages in thread
From: Jiang Liu @ 2015-08-17  3:19 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
	Peter Zijlstra, Rafael J . Wysocki, Tang Chen, Tejun Heo,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Xishi Qiu,
	Jiang Liu, Luiz Capitulino, Dave Young
  Cc: Tony Luck, linux-mm, linux-hotplug, linux-kernel

According to the x86 boot sequence, early_cpu_to_node() always returns
NUMA_NO_NODE when called from numa_init(), so kill the useless code to
improve readability.

The related code sequence is shown below. x86_cpu_to_node_map is not set
until step 2, so it still holds the default value (NUMA_NO_NODE) when
accessed at step 1:

start_kernel()
	setup_arch()
		initmem_init()
			x86_numa_init()
				numa_init()
					early_cpu_to_node()
1)						return early_per_cpu_ptr(x86_cpu_to_node_map)[cpu];
		acpi_boot_init();
		sfi_init()
		x86_dtb_init()
			generic_processor_info()
				early_per_cpu(x86_cpu_to_apicid, cpu) = apicid;
		init_cpu_to_node()
			numa_set_node(cpu, node);
2)				per_cpu(x86_cpu_to_node_map, cpu) = node;

	rest_init()
		kernel_init()
			smp_init()
				native_cpu_up()
					start_secondary()
						numa_set_node()
							per_cpu(x86_cpu_to_node_map, cpu) = node;

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
---
 arch/x86/mm/numa.c |   10 ----------
 1 file changed, 10 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4053bb58bf92..08860bdf5744 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -591,8 +591,6 @@ static void __init numa_init_array(void)
 
 	rr = first_node(node_online_map);
 	for (i = 0; i < nr_cpu_ids; i++) {
-		if (early_cpu_to_node(i) != NUMA_NO_NODE)
-			continue;
 		numa_set_node(i, rr);
 		rr = next_node(rr, node_online_map);
 		if (rr == MAX_NUMNODES)
@@ -644,14 +642,6 @@ static int __init numa_init(int (*init_func)(void))
 	if (ret < 0)
 		return ret;
 
-	for (i = 0; i < nr_cpu_ids; i++) {
-		int nid = early_cpu_to_node(i);
-
-		if (nid == NUMA_NO_NODE)
-			continue;
-		if (!node_online(nid))
-			numa_clear_node(i);
-	}
 	numa_init_array();
 
 	return 0;
-- 
1.7.10.4


* [Patch V3 8/9] mm: Update _mem_id_[] for every possible CPU when memory configuration changes
  2015-08-17  3:18 [Patch V3 0/9] Enable memoryless node support for x86 Jiang Liu
                   ` (6 preceding siblings ...)
  2015-08-17  3:19 ` [Patch V3 7/9] x86, numa: Kill useless code to improve code readability Jiang Liu
@ 2015-08-17  3:19 ` Jiang Liu
  2015-08-17  3:19 ` [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug Jiang Liu
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 38+ messages in thread
From: Jiang Liu @ 2015-08-17  3:19 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
	Peter Zijlstra, Rafael J . Wysocki, Tang Chen, Tejun Heo,
	Vlastimil Babka, Michal Hocko, Joonsoo Kim, Johannes Weiner,
	Alexander Duyck, Sasha Levin
  Cc: Jiang Liu, Tony Luck, linux-mm, linux-hotplug, linux-kernel, x86

The current kernel only updates _mem_id_[cpu] for online CPUs when the
memory configuration changes, so the kernel may allocate memory from a
remote node for a CPU that is still absent or offline even if the node
associated with that CPU has already been onlined. This patch tries to
improve performance by updating _mem_id_[cpu] for every possible CPU when
the memory configuration changes, so the kernel can always allocate from
the local node once the node is onlined.

We check node_online(cpu_to_node(cpu)) because:
1) local_memory_node(nid) needs to access NODE_DATA(nid), and
2) try_offline_node(nid) just zeroes out NODE_DATA(nid) instead of freeing it.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
---
 mm/page_alloc.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index beda41710802..bcfd66e66820 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4334,13 +4334,13 @@ static int __build_all_zonelists(void *data)
 		/*
 		 * We now know the "local memory node" for each node--
 		 * i.e., the node of the first zone in the generic zonelist.
-		 * Set up numa_mem percpu variable for on-line cpus.  During
-		 * boot, only the boot cpu should be on-line;  we'll init the
-		 * secondary cpus' numa_mem as they come on-line.  During
-		 * node/memory hotplug, we'll fixup all on-line cpus.
+		 * Set up numa_mem percpu variable for all possible cpus
+		 * if associated node has been onlined.
 		 */
-		if (cpu_online(cpu))
+		if (node_online(cpu_to_node(cpu)))
 			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
+		else
+			set_cpu_numa_mem(cpu, NUMA_NO_NODE);
 #endif
 	}
 
-- 
1.7.10.4


* [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug
  2015-08-17  3:18 [Patch V3 0/9] Enable memoryless node support for x86 Jiang Liu
                   ` (7 preceding siblings ...)
  2015-08-17  3:19 ` [Patch V3 8/9] mm: Update _mem_id_[] for every possible CPU when memory configuration changes Jiang Liu
@ 2015-08-17  3:19 ` Jiang Liu
  2015-08-18  6:11   ` Tang Chen
  2015-08-18  7:31   ` Ingo Molnar
  2015-08-17 21:35 ` [Patch V3 0/9] Enable memoryless node support for x86 Andrew Morton
  2015-08-18 10:02 ` Tang Chen
  10 siblings, 2 replies; 38+ messages in thread
From: Jiang Liu @ 2015-08-17  3:19 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
	Peter Zijlstra, Rafael J . Wysocki, Tang Chen, Tejun Heo,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Rafael J. Wysocki, Len Brown, Pavel Machek, Borislav Petkov,
	Andy Lutomirski, Boris Ostrovsky, Dave Hansen,
	Jan H. Schönherr, Igor Mammedov, Paul E. McKenney,
	Xishi Qiu, Jiang Liu, Luiz Capitulino, Dave Young
  Cc: Tony Luck, linux-mm, linux-hotplug, linux-kernel, Ingo Molnar, linux-pm

With the current implementation, all CPUs within a NUMA node will be
associated with another NUMA node if the node has no memory installed.

For example, on a four-node system, CPUs on nodes 2 and 3 are associated
with node 0 when there is no memory installed on nodes 2 and 3, which may
confuse users.
root@bkd01sdp:~# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 0 size: 15602 MB
node 0 free: 15014 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15985 MB
node 1 free: 15686 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Worse, the CPU affinity won't be fixed even after memory has been added
to those nodes. After memory hot-addition to node 2, CPUs on node 2 are
still associated with node 0, which may cause sub-optimal performance.
root@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 0 size: 15602 MB
node 0 free: 14743 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15985 MB
node 1 free: 15715 MB
node 2 cpus:
node 2 size: 128 MB
node 2 free: 128 MB
node distances:
node   0   1   2
  0:  10  21  21
  1:  21  10  21
  2:  21  21  10

With memoryless node support enabled, the kernel correctly reports the
hardware topology for nodes without memory installed.
root@bkd01sdp:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 15129 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15627 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10

With memoryless node support enabled, CPUs are correctly associated with
node 2 after memory hot-addition to node 2.
root@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 14872 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15641 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 128 MB
node 2 free: 127 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
---
 arch/x86/Kconfig            |    3 +++
 arch/x86/kernel/acpi/boot.c |    4 +++-
 arch/x86/kernel/smpboot.c   |    2 ++
 arch/x86/mm/numa.c          |   49 +++++++++++++++++++++++++++++++------------
 4 files changed, 44 insertions(+), 14 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b3a1a5d77d92..5d7ad70ace0d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2069,6 +2069,9 @@ config USE_PERCPU_NUMA_NODE_ID
 	def_bool y
 	depends on NUMA
 
+config HAVE_MEMORYLESS_NODES
+	def_bool NUMA
+
 config ARCH_ENABLE_SPLIT_PMD_PTLOCK
 	def_bool y
 	depends on X86_64 || X86_PAE
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 07930e1d2fe9..3403f1f0f28d 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -711,6 +711,7 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 		}
 		set_apicid_to_node(physid, nid);
 		numa_set_node(cpu, nid);
+		set_cpu_numa_mem(cpu, local_memory_node(nid));
 	}
 #endif
 }
@@ -743,9 +744,10 @@ int acpi_unmap_cpu(int cpu)
 {
 #ifdef CONFIG_ACPI_NUMA
 	set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
+	set_cpu_numa_mem(cpu, NUMA_NO_NODE);
 #endif
 
-	per_cpu(x86_cpu_to_apicid, cpu) = -1;
+	per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
 	set_cpu_present(cpu, false);
 	num_processors--;
 
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index b1f3ed9c7a9e..aeec91ac6fd4 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -162,6 +162,8 @@ static void smp_callin(void)
 	 */
 	phys_id = read_apic_id();
 
+	set_numa_mem(local_memory_node(cpu_to_node(cpuid)));
+
 	/*
 	 * the boot CPU has finished the init stage and is spinning
 	 * on callin_map until we finish. We are free to set up this
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 08860bdf5744..f2a4e23bd14d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -22,6 +22,7 @@
 
 int __initdata numa_off;
 nodemask_t numa_nodes_parsed __initdata;
+static nodemask_t numa_nodes_empty __initdata;
 
 struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
@@ -560,17 +561,16 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			end = max(mi->blk[i].end, end);
 		}
 
-		if (start >= end)
-			continue;
-
 		/*
 		 * Don't confuse VM with a node that doesn't have the
 		 * minimum amount of memory:
 		 */
-		if (end && (end - start) < NODE_MIN_SIZE)
-			continue;
-
-		alloc_node_data(nid);
+		if (start < end && (end - start) >= NODE_MIN_SIZE) {
+			alloc_node_data(nid);
+		} else if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
+			alloc_node_data(nid);
+			node_set(nid, numa_nodes_empty);
+		}
 	}
 
 	/* Dump memblock with node info and return. */
@@ -587,14 +587,18 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
  */
 static void __init numa_init_array(void)
 {
-	int rr, i;
+	int i, rr = MAX_NUMNODES;
 
-	rr = first_node(node_online_map);
 	for (i = 0; i < nr_cpu_ids; i++) {
+		/* Search for an onlined node with memory */
+		do {
+			if (rr != MAX_NUMNODES)
+				rr = next_node(rr, node_online_map);
+			if (rr == MAX_NUMNODES)
+				rr = first_node(node_online_map);
+		} while (node_isset(rr, numa_nodes_empty));
+
 		numa_set_node(i, rr);
-		rr = next_node(rr, node_online_map);
-		if (rr == MAX_NUMNODES)
-			rr = first_node(node_online_map);
 	}
 }
 
@@ -696,9 +700,12 @@ static __init int find_near_online_node(int node)
 {
 	int n, val;
 	int min_val = INT_MAX;
-	int best_node = -1;
+	int best_node = NUMA_NO_NODE;
 
 	for_each_online_node(n) {
+		if (node_isset(n, numa_nodes_empty))
+			continue;
+
 		val = node_distance(node, n);
 
 		if (val < min_val) {
@@ -739,6 +746,22 @@ void __init init_cpu_to_node(void)
 		if (!node_online(node))
 			node = find_near_online_node(node);
 		numa_set_node(cpu, node);
+		if (node_spanned_pages(node))
+			set_cpu_numa_mem(cpu, node);
+		if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES))
+			node_clear(node, numa_nodes_empty);
+	}
+
+	/* Destroy empty nodes */
+	if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
+		int nid;
+		const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+
+		for_each_node_mask(nid, numa_nodes_empty) {
+			node_set_offline(nid);
+			memblock_free(__pa(node_data[nid]), nd_size);
+			node_data[nid] = NULL;
+		}
 	}
 }
 
-- 
1.7.10.4


* RE: [Intel-wired-lan] [Patch V3 6/9] i40evf: Use numa_mem_id() to better support memoryless node
  2015-08-17  3:19 ` [Patch V3 6/9] i40evf: " Jiang Liu
@ 2015-08-17 19:03   ` Patil, Kiran
  2015-08-18 21:34     ` Jeff Kirsher
  0 siblings, 1 reply; 38+ messages in thread
From: Patil, Kiran @ 2015-08-17 19:03 UTC (permalink / raw)
  To: Jiang Liu, Andrew Morton, Mel Gorman, David Rientjes,
	Mike Galbraith, Peter Zijlstra, Wysocki, Rafael J, Tang Chen,
	Tejun Heo, Kirsher, Jeffrey T, Brandeburg, Jesse, Nelson,
	Shannon, Wyborny, Carolyn, Skidmore, Donald C, Vick, Matthew,
	Ronciak, John, Williams, Mitch A
  Cc: Luck, Tony, netdev, x86, linux-hotplug, linux-kernel, linux-mm,
	intel-wired-lan

ACK.

Thanks,
-- Kiran P.

-----Original Message-----
From: Intel-wired-lan [mailto:intel-wired-lan-bounces@lists.osuosl.org] On Behalf Of Jiang Liu
Sent: Sunday, August 16, 2015 8:19 PM
To: Andrew Morton; Mel Gorman; David Rientjes; Mike Galbraith; Peter Zijlstra; Wysocki, Rafael J; Tang Chen; Tejun Heo; Kirsher, Jeffrey T; Brandeburg, Jesse; Nelson, Shannon; Wyborny, Carolyn; Skidmore, Donald C; Vick, Matthew; Ronciak, John; Williams, Mitch A
Cc: Luck, Tony; netdev@vger.kernel.org; x86@kernel.org; linux-hotplug@vger.kernel.org; linux-kernel@vger.kernel.org; linux-mm@kvack.org; intel-wired-lan@lists.osuosl.org; Jiang Liu
Subject: [Intel-wired-lan] [Patch V3 6/9] i40evf: Use numa_mem_id() to better support memoryless node

Function i40e_clean_rx_irq() tries to reuse memory pages allocated from the nearest node. To better support memoryless node, use
numa_mem_id() instead of numa_node_id() to get the nearest node with memory.

This change should only affect performance.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
---
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 395f32f226c0..19ca96d8bd97 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -1003,7 +1003,7 @@ static int i40e_clean_rx_irq_ps(struct i40e_ring *rx_ring, int budget)
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 	u16 rx_packet_len, rx_header_len, rx_sph, rx_hbo;
 	u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
-	const int current_node = numa_node_id();
+	const int current_node = numa_mem_id();
 	struct i40e_vsi *vsi = rx_ring->vsi;
 	u16 i = rx_ring->next_to_clean;
 	union i40e_rx_desc *rx_desc;
--
1.7.10.4

_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@lists.osuosl.org
http://lists.osuosl.org/mailman/listinfo/intel-wired-lan


* Re: [Patch V3 0/9] Enable memoryless node support for x86
  2015-08-17  3:18 [Patch V3 0/9] Enable memoryless node support for x86 Jiang Liu
                   ` (8 preceding siblings ...)
  2015-08-17  3:19 ` [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug Jiang Liu
@ 2015-08-17 21:35 ` Andrew Morton
  2015-08-18 10:02 ` Tang Chen
  10 siblings, 0 replies; 38+ messages in thread
From: Andrew Morton @ 2015-08-17 21:35 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Mel Gorman, David Rientjes, Mike Galbraith, Peter Zijlstra,
	Rafael J . Wysocki, Tang Chen, Tejun Heo, Tony Luck, linux-mm,
	linux-hotplug, linux-kernel, x86

On Mon, 17 Aug 2015 11:18:57 +0800 Jiang Liu <jiang.liu@linux.intel.com> wrote:

> This is the third version to enable memoryless node support on x86
> platforms.

I'll grab this for inclusion in linux-next after the 4.2 release.

It's basically an x86 patch so if someone else was planning on looking
after it, please tell me off.


* Re: [Patch V3 4/9] openvswitch: Replace cpu_to_node() with cpu_to_mem() to support memoryless node
  2015-08-17  3:19 ` [Patch V3 4/9] openvswitch: " Jiang Liu
@ 2015-08-18  0:14   ` Pravin Shelar
  0 siblings, 0 replies; 38+ messages in thread
From: Pravin Shelar @ 2015-08-18  0:14 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
	Peter Zijlstra, Rafael J . Wysocki, Tang Chen, Tejun Heo,
	David S. Miller, Tony Luck, linux-mm, linux-hotplug, LKML, x86,
	netdev, dev

On Sun, Aug 16, 2015 at 8:19 PM, Jiang Liu <jiang.liu@linux.intel.com> wrote:
> Function ovs_flow_stats_update() allocates memory with __GFP_THISNODE
> flag set, which may cause permanent memory allocation failure on
> memoryless node. So replace cpu_to_node() with cpu_to_mem() to better
> support memoryless node. For node with memory, cpu_to_mem() is the same
> as cpu_to_node().
>
> This change only affects performance and shouldn't affect functionality.
>
> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>

Acked-by: Pravin B Shelar <pshelar@nicira.com>


* Re: [Patch V3 3/9] sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless node
  2015-08-17  3:19 ` [Patch V3 3/9] sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless node Jiang Liu
@ 2015-08-18  0:25   ` David Rientjes
  2015-08-19  8:20     ` Jiang Liu
  2015-08-19 11:52   ` Robin Holt
  1 sibling, 1 reply; 38+ messages in thread
From: David Rientjes @ 2015-08-18  0:25 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Andrew Morton, Mel Gorman, Mike Galbraith, Peter Zijlstra,
	Rafael J . Wysocki, Tang Chen, Tejun Heo, Cliff Whickman,
	Robin Holt, Tony Luck, linux-mm, linux-hotplug, linux-kernel,
	x86

On Mon, 17 Aug 2015, Jiang Liu wrote:

> Function xpc_create_gru_mq_uv() allocates memory with __GFP_THISNODE
> flag set, which may cause permanent memory allocation failure on
> memoryless node. So replace cpu_to_node() with cpu_to_mem() to better
> support memoryless node. For node with memory, cpu_to_mem() is the same
> as cpu_to_node().
> 
> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
> ---
>  drivers/misc/sgi-xp/xpc_uv.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c
> index 95c894482fdd..9210981c0d5b 100644
> --- a/drivers/misc/sgi-xp/xpc_uv.c
> +++ b/drivers/misc/sgi-xp/xpc_uv.c
> @@ -238,7 +238,7 @@ xpc_create_gru_mq_uv(unsigned int mq_size, int cpu, char *irq_name,
>  
>  	mq->mmr_blade = uv_cpu_to_blade_id(cpu);
>  
> -	nid = cpu_to_node(cpu);
> +	nid = cpu_to_mem(cpu);
>  	page = alloc_pages_exact_node(nid,
>  				      GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
>  				      pg_order);

Why not simply fix build_zonelists_node() so that the __GFP_THISNODE 
zonelists are set up to reference the zones of cpu_to_mem() for memoryless 
nodes?

It seems much better than checking and maintaining every __GFP_THISNODE 
user to determine if they are using a memoryless node or not.  I don't 
feel that this solution is maintainable in the long term.
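
A rough sketch of that alternative (the function below follows the shape of
the 4.2-era build_thisnode_zonelists() in mm/page_alloc.c; the redirect to
local_memory_node() is only an assumption of how the idea could look, not a
tested patch):

	static void build_thisnode_zonelists(pg_data_t *pgdat)
	{
		struct zonelist *zonelist = &pgdat->node_zonelists[1];
		pg_data_t *source = pgdat;
		int j;

		/*
		 * For a memoryless node, populate the __GFP_THISNODE zonelist
		 * with the zones of its nearest node that has memory, so that
		 * __GFP_THISNODE allocations do not fail permanently.
		 */
		if (!node_state(pgdat->node_id, N_MEMORY))
			source = NODE_DATA(local_memory_node(pgdat->node_id));

		j = build_zonelists_node(source, zonelist, 0);
		zonelist->_zonerefs[j].zone = NULL;
		zonelist->_zonerefs[j].zone_idx = 0;
	}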


* Re: [Patch V3 2/9] kernel/profile.c: Replace cpu_to_mem() with cpu_to_node()
  2015-08-17  3:18 ` [Patch V3 2/9] kernel/profile.c: Replace cpu_to_mem() with cpu_to_node() Jiang Liu
@ 2015-08-18  0:31   ` David Rientjes
  2015-08-19  7:18     ` Jiang Liu
  0 siblings, 1 reply; 38+ messages in thread
From: David Rientjes @ 2015-08-18  0:31 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Andrew Morton, Mel Gorman, Mike Galbraith, Peter Zijlstra,
	Rafael J . Wysocki, Tang Chen, Tejun Heo, Tony Luck, linux-mm,
	linux-hotplug, linux-kernel, x86

On Mon, 17 Aug 2015, Jiang Liu wrote:

> Function profile_cpu_callback() allocates memory without specifying
> __GFP_THISNODE flag, so replace cpu_to_mem() with cpu_to_node()
> because cpu_to_mem() may cause suboptimal memory allocation if
> there's no free memory on the node returned by cpu_to_mem().
> 

Why is cpu_to_node() better with regard to free memory and NUMA locality?

> It's safe to use cpu_to_mem() because build_all_zonelists() also
> builds suitable fallback zonelist for memoryless node.
> 

Why reference that cpu_to_mem() is safe if you're changing away from it?

> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
> ---
>  kernel/profile.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/profile.c b/kernel/profile.c
> index a7bcd28d6e9f..d14805bdcc4c 100644
> --- a/kernel/profile.c
> +++ b/kernel/profile.c
> @@ -336,7 +336,7 @@ static int profile_cpu_callback(struct notifier_block *info,
>  	switch (action) {
>  	case CPU_UP_PREPARE:
>  	case CPU_UP_PREPARE_FROZEN:
> -		node = cpu_to_mem(cpu);
> +		node = cpu_to_node(cpu);
>  		per_cpu(cpu_profile_flip, cpu) = 0;
>  		if (!per_cpu(cpu_profile_hits, cpu)[1]) {
>  			page = alloc_pages_exact_node(node,


* Re: [Patch V3 5/9] i40e: Use numa_mem_id() to better support memoryless node
  2015-08-17  3:19 ` [Patch V3 5/9] i40e: Use numa_mem_id() to better " Jiang Liu
@ 2015-08-18  0:35   ` David Rientjes
  2015-08-19 22:38   ` [Intel-wired-lan] " Patil, Kiran
  1 sibling, 0 replies; 38+ messages in thread
From: David Rientjes @ 2015-08-18  0:35 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Andrew Morton, Mel Gorman, Mike Galbraith, Peter Zijlstra,
	Rafael J . Wysocki, Tang Chen, Tejun Heo, Jeff Kirsher,
	Jesse Brandeburg, Shannon Nelson, Carolyn Wyborny, Don Skidmore,
	Matthew Vick, John Ronciak, Mitch Williams, Tony Luck, linux-mm,
	linux-hotplug, linux-kernel, x86, intel-wired-lan, netdev

On Mon, 17 Aug 2015, Jiang Liu wrote:

> Function i40e_clean_rx_irq() tries to reuse memory pages allocated

s/i40e_clean_rx_irq/i40e_clean_rx_irq_ps/

> from the nearest node. To better support memoryless node, use
> numa_mem_id() instead of numa_node_id() to get the nearest node with
> memory.
> 

Out of curiosity, what prevents the cpu from being preempted, so that
current_node no longer matches numa_mem_id()?

> This change should only affect performance.
> 
> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 9a4f2bc70cd2..a8f618cb8eb0 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -1516,7 +1516,7 @@ static int i40e_clean_rx_irq_ps(struct i40e_ring *rx_ring, int budget)
>  	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
>  	u16 rx_packet_len, rx_header_len, rx_sph, rx_hbo;
>  	u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
> -	const int current_node = numa_node_id();
> +	const int current_node = numa_mem_id();
>  	struct i40e_vsi *vsi = rx_ring->vsi;
>  	u16 i = rx_ring->next_to_clean;
>  	union i40e_rx_desc *rx_desc;


* Re: [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug
  2015-08-17  3:19 ` [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug Jiang Liu
@ 2015-08-18  6:11   ` Tang Chen
  2015-08-18  6:59     ` Jiang Liu
  2015-08-18  7:31   ` Ingo Molnar
  1 sibling, 1 reply; 38+ messages in thread
From: Tang Chen @ 2015-08-18  6:11 UTC (permalink / raw)
  To: Jiang Liu, Andrew Morton, Mel Gorman, David Rientjes,
	Mike Galbraith, Peter Zijlstra, Rafael J . Wysocki, Tejun Heo,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Rafael J. Wysocki, Len Brown, Pavel Machek, Borislav Petkov,
	Andy Lutomirski, Boris Ostrovsky, Dave Hansen,
	"Jan H. Schönherr",
	Igor Mammedov, Paul E. McKenney, Xishi Qiu, Luiz Capitulino,
	Dave Young
  Cc: Tony Luck, linux-mm, linux-hotplug, linux-kernel, Ingo Molnar,
	linux-pm, tangchen


Hi Liu,

On 08/17/2015 11:19 AM, Jiang Liu wrote:
> ......
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b3a1a5d77d92..5d7ad70ace0d 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2069,6 +2069,9 @@ config USE_PERCPU_NUMA_NODE_ID
>   	def_bool y
>   	depends on NUMA
>   
> +config HAVE_MEMORYLESS_NODES
> +	def_bool NUMA
> +
>   config ARCH_ENABLE_SPLIT_PMD_PTLOCK
>   	def_bool y
>   	depends on X86_64 || X86_PAE
> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index 07930e1d2fe9..3403f1f0f28d 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -711,6 +711,7 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
>   		}
>   		set_apicid_to_node(physid, nid);
>   		numa_set_node(cpu, nid);
> +		set_cpu_numa_mem(cpu, local_memory_node(nid));
>   	}
>   #endif
>   }
> @@ -743,9 +744,10 @@ int acpi_unmap_cpu(int cpu)
>   {
>   #ifdef CONFIG_ACPI_NUMA
>   	set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
> +	set_cpu_numa_mem(cpu, NUMA_NO_NODE);
>   #endif
>   
> -	per_cpu(x86_cpu_to_apicid, cpu) = -1;
> +	per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
>   	set_cpu_present(cpu, false);
>   	num_processors--;
>   
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index b1f3ed9c7a9e..aeec91ac6fd4 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -162,6 +162,8 @@ static void smp_callin(void)
>   	 */
>   	phys_id = read_apic_id();
>   
> +	set_numa_mem(local_memory_node(cpu_to_node(cpuid)));
> +
>   	/*
>   	 * the boot CPU has finished the init stage and is spinning
>   	 * on callin_map until we finish. We are free to set up this
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 08860bdf5744..f2a4e23bd14d 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -22,6 +22,7 @@
>   
>   int __initdata numa_off;
>   nodemask_t numa_nodes_parsed __initdata;
> +static nodemask_t numa_nodes_empty __initdata;
>   
>   struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
>   EXPORT_SYMBOL(node_data);
> @@ -560,17 +561,16 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>   			end = max(mi->blk[i].end, end);
>   		}
>   
> -		if (start >= end)
> -			continue;
> -
>   		/*
>   		 * Don't confuse VM with a node that doesn't have the
>   		 * minimum amount of memory:
>   		 */
> -		if (end && (end - start) < NODE_MIN_SIZE)
> -			continue;
> -
> -		alloc_node_data(nid);
> +		if (start < end && (end - start) >= NODE_MIN_SIZE) {
> +			alloc_node_data(nid);
> +		} else if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
> +			alloc_node_data(nid);
> +			node_set(nid, numa_nodes_empty);

From what I see here, numa_nodes_empty represents all memory-less nodes.
So, since we still have cpu-less nodes out there, shall we rename it to
numa_nodes_memoryless or something similar?

And BTW, does x86 support cpu-less nodes after these patches?

Since I don't have any memory-less or cpu-less node on my box, I cannot
tell for sure. A node is brought online when it has memory in the original
kernel, so I think it is supported.

> +		}
>   	}
>   
>   	/* Dump memblock with node info and return. */
> @@ -587,14 +587,18 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>    */
>   static void __init numa_init_array(void)
>   {
> -	int rr, i;
> +	int i, rr = MAX_NUMNODES;
>   
> -	rr = first_node(node_online_map);
>   	for (i = 0; i < nr_cpu_ids; i++) {
> +		/* Search for an onlined node with memory */
> +		do {
> +			if (rr != MAX_NUMNODES)
> +				rr = next_node(rr, node_online_map);
> +			if (rr == MAX_NUMNODES)
> +				rr = first_node(node_online_map);
> +		} while (node_isset(rr, numa_nodes_empty));
> +
>   		numa_set_node(i, rr);
> -		rr = next_node(rr, node_online_map);
> -		if (rr == MAX_NUMNODES)
> -			rr = first_node(node_online_map);
>   	}
>   }
>   
> @@ -696,9 +700,12 @@ static __init int find_near_online_node(int node)
>   {
>   	int n, val;
>   	int min_val = INT_MAX;
> -	int best_node = -1;
> +	int best_node = NUMA_NO_NODE;
>   
>   	for_each_online_node(n) {
> +		if (node_isset(n, numa_nodes_empty))
> +			continue;
> +
>   		val = node_distance(node, n);
>   
>   		if (val < min_val) {
> @@ -739,6 +746,22 @@ void __init init_cpu_to_node(void)
>   		if (!node_online(node))
>   			node = find_near_online_node(node);
>   		numa_set_node(cpu, node);

So, CPUs are still mapped to a nearby online node, right?

I was expecting CPUs on a memory-less node to be mapped to the node they
belong to. If so, the current memory allocator may fail because it assumes
each online node has memory. I was trying to do this in my patch:

https://lkml.org/lkml/2015/7/7/205

Of course, my patch is not meant to support memory-less nodes; I just ran
into this problem.

> +		if (node_spanned_pages(node))
> +			set_cpu_numa_mem(cpu, node);
> +		if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES))
> +			node_clear(node, numa_nodes_empty);

And since we are supporting memory-less nodes, it would be better to
provide a for_each_memoryless_node() wrapper.
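
An untested sketch of such a wrapper, iterating the numa_nodes_empty mask
introduced by this patch (for_each_node_mask() is an existing helper):

	#define for_each_memoryless_node(nid)	\
		for_each_node_mask((nid), numa_nodes_empty)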

> +	}
> +
> +	/* Destroy empty nodes */
> +	if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
> +		int nid;
> +		const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
> +
> +		for_each_node_mask(nid, numa_nodes_empty) {
> +			node_set_offline(nid);
> +			memblock_free(__pa(node_data[nid]), nd_size);
> +			node_data[nid] = NULL;

So, memory-less nodes are finally set offline. That's a little different
from what I expected.
I was expecting that both memory-less and cpu-less nodes could stay online
after this patch, which would be very helpful to me.

But actually, they only exist temporarily, used to set _numa_mem_ so that
cpu_to_mem() is able to work, right?

Thanks.

> +		}
>   	}
>   }
>   


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug
  2015-08-18  6:11   ` Tang Chen
@ 2015-08-18  6:59     ` Jiang Liu
  2015-08-18 11:28       ` Tang Chen
  0 siblings, 1 reply; 38+ messages in thread
From: Jiang Liu @ 2015-08-18  6:59 UTC (permalink / raw)
  To: Tang Chen, Andrew Morton, Mel Gorman, David Rientjes,
	Mike Galbraith, Peter Zijlstra, Rafael J . Wysocki, Tejun Heo,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Rafael J. Wysocki, Len Brown, Pavel Machek, Borislav Petkov,
	Andy Lutomirski, Boris Ostrovsky, Dave Hansen,
	Jan H. Schönherr, Igor Mammedov, Paul E. McKenney,
	Xishi Qiu, Luiz Capitulino, Dave Young
  Cc: Tony Luck, linux-mm, linux-hotplug, linux-kernel, Ingo Molnar, linux-pm

On 2015/8/18 14:11, Tang Chen wrote:
> 
> Hi Liu,
> 
> On 08/17/2015 11:19 AM, Jiang Liu wrote:
......
>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
>> index 08860bdf5744..f2a4e23bd14d 100644
>> --- a/arch/x86/mm/numa.c
>> +++ b/arch/x86/mm/numa.c
>> @@ -22,6 +22,7 @@
>>     int __initdata numa_off;
>>   nodemask_t numa_nodes_parsed __initdata;
>> +static nodemask_t numa_nodes_empty __initdata;
>>     struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
>>   EXPORT_SYMBOL(node_data);
>> @@ -560,17 +561,16 @@ static int __init numa_register_memblks(struct
>> numa_meminfo *mi)
>>               end = max(mi->blk[i].end, end);
>>           }
>>   -        if (start >= end)
>> -            continue;
>> -
>>           /*
>>            * Don't confuse VM with a node that doesn't have the
>>            * minimum amount of memory:
>>            */
>> -        if (end && (end - start) < NODE_MIN_SIZE)
>> -            continue;
>> -
>> -        alloc_node_data(nid);
>> +        if (start < end && (end - start) >= NODE_MIN_SIZE) {
>> +            alloc_node_data(nid);
>> +        } else if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
>> +            alloc_node_data(nid);
>> +            node_set(nid, numa_nodes_empty);
> 
> Seeing from here, I think numa_nodes_empty represents all memory-less
> nodes.
> So, since we still have cpu-less nodes out there, shall we rename it to
> numa_nodes_memoryless or something similar ?
> 
> And BTW, does x86 support cpu-less node after these patches ?
> 
> Since I don't have any memory-less or cpu-less node on my box, I cannot
> tell it clearly.
> A node is brought online when is has memory in original kernel. So I
> think it is supported.
Hi Chen,
	Thanks for the review. With current Intel processors there is no
hardware configuration with CPU-less NUMA nodes, but from the code itself
I think CPU-less nodes are supported. We can fake a CPU-less node with
the "maxcpus" kernel parameter. For example, when "maxcpus=2" is
specified on my system, we get the following NUMA topology, in which
node 2 is a CPU-less node with memory.

root@bkd04sdp:~# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: 15954 MB
node 0 free: 15686 MB
node 1 cpus:
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus:
node 2 size: 16113 MB
node 2 free: 16058 MB
node distances:
node   0   1   2
  0:  10  21  21
  1:  21  10  21
  2:  21  21  10


>> +        }
...
>>       }
>> @@ -739,6 +746,22 @@ void __init init_cpu_to_node(void)
>>           if (!node_online(node))
>>               node = find_near_online_node(node);
>>           numa_set_node(cpu, node);
> 
> So, CPUs are still mapped to online near node, right ?
> 
> I was expecting CPUs on a memory-less node are mapped to the node they
> belong to. If so, the current memory allocator may fail because they assume
> each online node has memory. I was trying to do this in my patch.
> 
> https://lkml.org/lkml/2015/7/7/205
> 
> Of course, my patch is not to support memory-less node, just run into
> this problem.
We have two sets of interfaces to figure out the NUMA node associated with
a CPU:
1) numa_node_id()/cpu_to_node() return the NUMA node associated with
   the CPU, no matter whether there's memory associated with the node.
2) numa_mem_id()/cpu_to_mem() return the NUMA node the CPU should
   allocate memory from.
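
To make the distinction concrete, a tiny debug helper could print both
values; the helper itself is made up here, only the two interfaces are
the real ones:

#include <linux/printk.h>
#include <linux/topology.h>

/* Hypothetical helper, just to show what each interface reports. */
static void report_cpu_nodes(int cpu)
{
	pr_info("cpu%d: cpu_to_node()=%d (topology), cpu_to_mem()=%d (nearest node with memory)\n",
		cpu, cpu_to_node(cpu), cpu_to_mem(cpu));
}

For a CPU on a memoryless node the first value is the CPU's own node id,
while the second is the id of the nearest node that has memory; for a
node with memory the two are identical.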

> 
>> +        if (node_spanned_pages(node))
>> +            set_cpu_numa_mem(cpu, node);
>> +        if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES))
>> +            node_clear(node, numa_nodes_empty);
> 
> And since we are supporting memory-less node, it's better to provide a
> for_each_memoryless_node() wrapper.
> 
>> +    }
>> +
>> +    /* Destroy empty nodes */
>> +    if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
>> +        int nid;
>> +        const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
>> +
>> +        for_each_node_mask(nid, numa_nodes_empty) {
>> +            node_set_offline(nid);
>> +            memblock_free(__pa(node_data[nid]), nd_size);
>> +            node_data[nid] = NULL;
> 
> So, memory-less nodes are set offline finally. It's a little different
> from what I thought.
> I was expecting that both memory-less and cpu-less nodes could also be
> online after
> this patch, which would be very helpful to me.
> 
> But actually, they are just exist temporarily, used to set _numa_mem_ so
> that cpu_to_mem()
> is able to work, right ?

No. We have already removed NUMA nodes with CPUs but without memory from
the numa_nodes_empty set, so here we only remove NUMA nodes that have
neither CPUs nor memory.
> +        if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES))
> +            node_clear(node, numa_nodes_empty);

Please refer to the example below, which has a memoryless node (node 1).
root@bkd04sdp:~# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 45 46 47 48 49 50 51 52
53 54 55 56 57 58 59
node 0 size: 15954 MB
node 0 free: 15584 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 60 61 62 63 64
65 66 67 68 69 70 71 72 73 74
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 75 76 77 78 79
80 81 82 83 84 85 86 87 88 89
node 2 size: 16113 MB
node 2 free: 15802 MB
node distances:
node   0   1   2
  0:  10  21  21
  1:  21  10  21
  2:  21  21  10
Thanks!
Gerry


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug
  2015-08-17  3:19 ` [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug Jiang Liu
  2015-08-18  6:11   ` Tang Chen
@ 2015-08-18  7:31   ` Ingo Molnar
  1 sibling, 0 replies; 38+ messages in thread
From: Ingo Molnar @ 2015-08-18  7:31 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
	Peter Zijlstra, Rafael J . Wysocki, Tang Chen, Tejun Heo,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Rafael J. Wysocki, Len Brown, Pavel Machek, Borislav Petkov,
	Andy Lutomirski, Boris Ostrovsky, Dave Hansen,
	Jan H. Schönherr, Igor Mammedov, Paul E. McKenney,
	Xishi Qiu, Luiz Capitulino, Dave Young, Tony Luck, linux-mm,
	linux-hotplug, linux-kernel, linux-pm


* Jiang Liu <jiang.liu@linux.intel.com> wrote:

> With current implementation, all CPUs within a NUMA node will be
> assocaited with another NUMA node if the node has no memory installed.

typo.

> 
> For example, on a four-node system, CPUs on nodes 2 and 3 are associated
> with node 0 when there is no memory installed on nodes 2 and 3, which may
> confuse users.
>
> root@bkd01sdp:~# numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
> node 0 size: 15602 MB
> node 0 free: 15014 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
> node 1 size: 15985 MB
> node 1 free: 15686 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
> 
> To make matters worse, the CPU affinity relationship won't get fixed even after
> memory has been added to those nodes. After memory hot-addition to
> node 2, CPUs on node 2 are still associated with node 0. This may cause
> sub-optimal performance.
> root@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
> available: 3 nodes (0-2)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
> node 0 size: 15602 MB
> node 0 free: 14743 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
> node 1 size: 15985 MB
> node 1 free: 15715 MB
> node 2 cpus:
> node 2 size: 128 MB
> node 2 free: 128 MB
> node distances:
> node   0   1   2
>   0:  10  21  21
>   1:  21  10  21
>   2:  21  21  10
> 
> With support of memoryless node enabled, it will correctly report system
> hardware topology for nodes without memory installed.
> root@bkd01sdp:~# numactl --hardware
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
> node 0 size: 15725 MB
> node 0 free: 15129 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
> node 1 size: 15862 MB
> node 1 free: 15627 MB
> node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
> node 2 size: 0 MB
> node 2 free: 0 MB
> node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
> node 3 size: 0 MB
> node 3 free: 0 MB
> node distances:
> node   0   1   2   3
>   0:  10  21  21  21
>   1:  21  10  21  21
>   2:  21  21  10  21
>   3:  21  21  21  10
> 
> With memoryless node enabled, CPUs are correctly associated with node 2
> after memory hot-addition to node 2.
> root@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
> node 0 size: 15725 MB
> node 0 free: 14872 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
> node 1 size: 15862 MB
> node 1 free: 15641 MB
> node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
> node 2 size: 128 MB
> node 2 free: 127 MB
> node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
> node 3 size: 0 MB
> node 3 free: 0 MB
> node distances:
> node   0   1   2   3
>   0:  10  21  21  21
>   1:  21  10  21  21
>   2:  21  21  10  21
>   3:  21  21  21  10
> 
> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
> ---
>  arch/x86/Kconfig            |    3 +++
>  arch/x86/kernel/acpi/boot.c |    4 +++-
>  arch/x86/kernel/smpboot.c   |    2 ++
>  arch/x86/mm/numa.c          |   49 +++++++++++++++++++++++++++++++------------
>  4 files changed, 44 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b3a1a5d77d92..5d7ad70ace0d 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2069,6 +2069,9 @@ config USE_PERCPU_NUMA_NODE_ID
>  	def_bool y
>  	depends on NUMA
>  
> +config HAVE_MEMORYLESS_NODES
> +	def_bool NUMA
> +
>  config ARCH_ENABLE_SPLIT_PMD_PTLOCK
>  	def_bool y
>  	depends on X86_64 || X86_PAE
> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index 07930e1d2fe9..3403f1f0f28d 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -711,6 +711,7 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
>  		}
>  		set_apicid_to_node(physid, nid);
>  		numa_set_node(cpu, nid);
> +		set_cpu_numa_mem(cpu, local_memory_node(nid));
>  	}
>  #endif
>  }
> @@ -743,9 +744,10 @@ int acpi_unmap_cpu(int cpu)
>  {
>  #ifdef CONFIG_ACPI_NUMA
>  	set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
> +	set_cpu_numa_mem(cpu, NUMA_NO_NODE);
>  #endif
>  
> -	per_cpu(x86_cpu_to_apicid, cpu) = -1;
> +	per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
>  	set_cpu_present(cpu, false);
>  	num_processors--;
>  
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index b1f3ed9c7a9e..aeec91ac6fd4 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -162,6 +162,8 @@ static void smp_callin(void)
>  	 */
>  	phys_id = read_apic_id();
>  
> +	set_numa_mem(local_memory_node(cpu_to_node(cpuid)));
> +
>  	/*
>  	 * the boot CPU has finished the init stage and is spinning
>  	 * on callin_map until we finish. We are free to set up this
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 08860bdf5744..f2a4e23bd14d 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -22,6 +22,7 @@
>  
>  int __initdata numa_off;
>  nodemask_t numa_nodes_parsed __initdata;
> +static nodemask_t numa_nodes_empty __initdata;
>  
>  struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
>  EXPORT_SYMBOL(node_data);
> @@ -560,17 +561,16 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>  			end = max(mi->blk[i].end, end);
>  		}
>  
> -		if (start >= end)
> -			continue;
> -
>  		/*
>  		 * Don't confuse VM with a node that doesn't have the
>  		 * minimum amount of memory:
>  		 */
> -		if (end && (end - start) < NODE_MIN_SIZE)
> -			continue;
> -
> -		alloc_node_data(nid);
> +		if (start < end && (end - start) >= NODE_MIN_SIZE) {
> +			alloc_node_data(nid);
> +		} else if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
> +			alloc_node_data(nid);
> +			node_set(nid, numa_nodes_empty);
> +		}
>  	}
>  
>  	/* Dump memblock with node info and return. */
> @@ -587,14 +587,18 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>   */
>  static void __init numa_init_array(void)
>  {
> -	int rr, i;
> +	int i, rr = MAX_NUMNODES;
>  
> -	rr = first_node(node_online_map);
>  	for (i = 0; i < nr_cpu_ids; i++) {
> +		/* Search for an onlined node with memory */
> +		do {
> +			if (rr != MAX_NUMNODES)
> +				rr = next_node(rr, node_online_map);
> +			if (rr == MAX_NUMNODES)
> +				rr = first_node(node_online_map);
> +		} while (node_isset(rr, numa_nodes_empty));
> +
>  		numa_set_node(i, rr);
> -		rr = next_node(rr, node_online_map);
> -		if (rr == MAX_NUMNODES)
> -			rr = first_node(node_online_map);
>  	}
>  }
>  
> @@ -696,9 +700,12 @@ static __init int find_near_online_node(int node)
>  {
>  	int n, val;
>  	int min_val = INT_MAX;
> -	int best_node = -1;
> +	int best_node = NUMA_NO_NODE;
>  
>  	for_each_online_node(n) {
> +		if (node_isset(n, numa_nodes_empty))
> +			continue;
> +
>  		val = node_distance(node, n);
>  
>  		if (val < min_val) {
> @@ -739,6 +746,22 @@ void __init init_cpu_to_node(void)
>  		if (!node_online(node))
>  			node = find_near_online_node(node);
>  		numa_set_node(cpu, node);
> +		if (node_spanned_pages(node))
> +			set_cpu_numa_mem(cpu, node);
> +		if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES))
> +			node_clear(node, numa_nodes_empty);
> +	}
> +
> +	/* Destroy empty nodes */
> +	if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
> +		int nid;
> +		const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
> +
> +		for_each_node_mask(nid, numa_nodes_empty) {
> +			node_set_offline(nid);
> +			memblock_free(__pa(node_data[nid]), nd_size);
> +			node_data[nid] = NULL;
> +		}
>  	}
>  }

So this patch makes messy code even messier.

I'd like to see the fixes, but this really needs to be done cleaner.

There are several problems:

1) the naming is not clear enough between VM and scheduling nodes and their masks. 
   For example we are mixing uses of 'numa_nodes_empty' (a memory space concept),
   'node_online_map' (a scheduling concept), which makes the code hard to read.

   To add insult to injury, 'numa_nodes_empty' is added with zero comments:

       > +static nodemask_t numa_nodes_empty __initdata;

   To resolve this the names should be clearer I think. Something like 
   numa_nomem_mask or so.

2) the existing code is (unfortunately) confusing to begin with. For example what 
   does find_near_online_node() do? It's not commented.

   init_cpu_to_node() has comments but it's mostly implementational gibberish that 
   does not answer the question of what the function's main, high level purpose 
   is. I'm uneasy about modifying code that is hard to read - it should be 
   improved first.

3)

   So I'm wondering about logic like this:

> +		if (node_spanned_pages(node))
> +			set_cpu_numa_mem(cpu, node);

   So first we link the node in the _numa_mem_ array if the node has memory (?).

> +		if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES))
> +			node_clear(node, numa_nodes_empty);

   But we unconditionally clear it in the numa_nodes_empty - i.e. it has memory?
   Shouldn't the node_clear() be inside the 'has memory' condition?

4)

    Bits like this are confusing:

> +	/* Destroy empty nodes */
> +	if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
> +		int nid;
> +		const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);

    Why do we 'destroy' them? What does 'destroy' mean here?

So I think this series should first make the whole code readable and 
understandable - then fix the bugs as gradually as possible: one bug one patch.

Thanks,

	Ingo


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Patch V3 0/9] Enable memoryless node support for x86
  2015-08-17  3:18 [Patch V3 0/9] Enable memoryless node support for x86 Jiang Liu
                   ` (9 preceding siblings ...)
  2015-08-17 21:35 ` [Patch V3 0/9] Enable memoryless node support for x86 Andrew Morton
@ 2015-08-18 10:02 ` Tang Chen
  2015-08-19  8:09   ` Jiang Liu
  10 siblings, 1 reply; 38+ messages in thread
From: Tang Chen @ 2015-08-18 10:02 UTC (permalink / raw)
  To: Jiang Liu, Andrew Morton, Mel Gorman, David Rientjes,
	Mike Galbraith, Peter Zijlstra, Rafael J . Wysocki, Tejun Heo
  Cc: Tony Luck, linux-mm, linux-hotplug, linux-kernel, x86, tangchen


On 08/17/2015 11:18 AM, Jiang Liu wrote:
> This is the third version to enable memoryless node support on x86
> platforms. The previous version (https://lkml.org/lkml/2014/7/11/75)
> blindly replaces numa_node_id()/cpu_to_node() with numa_mem_id()/
> cpu_to_mem(). That's not the right solution as pointed out by Tejun
> and Peter due to:
> 1) We shouldn't shift the burden to normal slab users.
> 2) Details of memoryless node should be hidden in arch and mm code
>     as much as possible.
>
> After digging into more code and documentation, we found the rules to
> deal with memoryless node should be:
> 1) Arch code should online corresponding NUMA node before onlining any
>     CPU or memory, otherwise it may cause invalid memory access when
>     accessing NODE_DATA(nid).
> 2) For normal memory allocations without __GFP_THISNODE setting in the
>     gfp_flags, we should prefer numa_node_id()/cpu_to_node() instead of
>     numa_mem_id()/cpu_to_mem() because the latter loses hardware topology
>     information as pointed out by Tejun:
> 	   A - B - X - C - D
> 	Where X is the memless node.  numa_mem_id() on X would return
> 	either B or C, right?  If B or C can't satisfy the allocation,
> 	the allocator would fallback to A from B and D for C, both of
> 	which aren't optimal. It should first fall back to C or B
> 	respectively, which the allocator can't do anymoe because the
> 	information is lost when the caller side performs numa_mem_id().

Hi Liu,

BTW, how is this A - B - X - C - D problem solved?
I don't quite follow this.

I cannot tell the difference between numa_node_id()/cpu_to_node() and
numa_mem_id()/cpu_to_mem() on this point. Even with hardware topology
info, how could it avoid this problem?

Isn't it still possible to fall back to A from B and D for C?

Thanks.

> 3) For memory allocation with __GFP_THISNODE setting in gfp_flags,
>     numa_node_id()/cpu_to_node() should be used if caller only wants to
>     allocate from local memory, otherwise numa_mem_id()/cpu_to_mem()
>     should be used if caller wants to allocate from the nearest node
>     with memory.
> 4) numa_mem_id()/cpu_to_mem() should be used if caller wants to check
>     whether a page is allocated from the nearest node.
>
> Based on above rules, this patch set
> 1) Patch 1 is a bugfix to resolve a crash caused by socket hot-addition
> 2) Patch 2 replaces numa_mem_id() with numa_node_id() when __GFP_THISNODE
>     isn't set in gfp_flags.
> 3) Patch 3-6 replaces numa_node_id()/cpu_to_node() with numa_mem_id()/
>     cpu_to_mem() if caller wants to allocate from local node only.
> 4) Patch 7-9 enables support of memoryless node on x86.
>
> With this patch set applied, on a system with two sockets enabled at boot,
> one with memory and the other without memory, we got following numa
> topology after boot:
> root@bkd04sdp:~# numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
> node 0 size: 15940 MB
> node 0 free: 15397 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
> node 1 size: 0 MB
> node 1 free: 0 MB
> node distances:
> node   0   1
>    0:  10  21
>    1:  21  10
>
> After hot-adding the third socket without memory, we got:
> root@bkd04sdp:~# numactl --hardware
> available: 3 nodes (0-2)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
> node 0 size: 15940 MB
> node 0 free: 15142 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
> node 1 size: 0 MB
> node 1 free: 0 MB
> node 2 cpus:
> node 2 size: 0 MB
> node 2 free: 0 MB
> node distances:
> node   0   1   2
>    0:  10  21  21
>    1:  21  10  21
>    2:  21  21  10
>
> Jiang Liu (9):
>    x86, NUMA, ACPI: Online node earlier when doing CPU hot-addition
>    kernel/profile.c: Replace cpu_to_mem() with cpu_to_node()
>    sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless
>      node
>    openvswitch: Replace cpu_to_node() with cpu_to_mem() to support
>      memoryless node
>    i40e: Use numa_mem_id() to better support memoryless node
>    i40evf: Use numa_mem_id() to better support memoryless node
>    x86, numa: Kill useless code to improve code readability
>    mm: Update _mem_id_[] for every possible CPU when memory
>      configuration changes
>    mm, x86: Enable memoryless node support to better support CPU/memory
>      hotplug
>
>   arch/x86/Kconfig                              |    3 ++
>   arch/x86/kernel/acpi/boot.c                   |    9 +++-
>   arch/x86/kernel/smpboot.c                     |    2 +
>   arch/x86/mm/numa.c                            |   59 +++++++++++++++----------
>   drivers/misc/sgi-xp/xpc_uv.c                  |    2 +-
>   drivers/net/ethernet/intel/i40e/i40e_txrx.c   |    2 +-
>   drivers/net/ethernet/intel/i40evf/i40e_txrx.c |    2 +-
>   kernel/profile.c                              |    2 +-
>   mm/page_alloc.c                               |   10 ++---
>   net/openvswitch/flow.c                        |    2 +-
>   10 files changed, 59 insertions(+), 34 deletions(-)
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug
  2015-08-18  6:59     ` Jiang Liu
@ 2015-08-18 11:28       ` Tang Chen
  0 siblings, 0 replies; 38+ messages in thread
From: Tang Chen @ 2015-08-18 11:28 UTC (permalink / raw)
  To: Jiang Liu, Andrew Morton, Mel Gorman, David Rientjes,
	Mike Galbraith, Peter Zijlstra, Rafael J . Wysocki, Tejun Heo,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Rafael J. Wysocki, Len Brown, Pavel Machek, Borislav Petkov,
	Andy Lutomirski, Boris Ostrovsky, Dave Hansen,
	"Jan H. Schönherr",
	Igor Mammedov, Paul E. McKenney, Xishi Qiu, Luiz Capitulino,
	Dave Young
  Cc: Tony Luck, linux-mm, linux-hotplug, linux-kernel, Ingo Molnar,
	linux-pm, tangchen


On 08/18/2015 02:59 PM, Jiang Liu wrote:
>
> ...
>>>        }
>>> @@ -739,6 +746,22 @@ void __init init_cpu_to_node(void)
>>>            if (!node_online(node))
>>>                node = find_near_online_node(node);

Hi Liu,

If cpu-less, memory-less and normal nodes will all be online anyway,
I think we don't need find_near_online_node() any more for
CPUs on offline nodes.

Or is there any other case?

Thanks.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Intel-wired-lan] [Patch V3 6/9] i40evf: Use numa_mem_id() to better support memoryless node
  2015-08-17 19:03   ` [Intel-wired-lan] " Patil, Kiran
@ 2015-08-18 21:34     ` Jeff Kirsher
  0 siblings, 0 replies; 38+ messages in thread
From: Jeff Kirsher @ 2015-08-18 21:34 UTC (permalink / raw)
  To: Patil, Kiran; +Cc: linux-kernel, linux-mm, intel-wired-lan

[-- Attachment #1: Type: text/plain, Size: 1486 bytes --]

On Mon, 2015-08-17 at 12:03 -0700, Patil, Kiran wrote:
> ACK.
> 

Just an FYI, top posting is frowned upon in the Linux public mailing
lists.  Also, if you really want your ACK to be added to the patch, you
need to reply with:

Acked-by: Kiran Patil <kiran.patil@intel.com>

> -----Original Message-----
> From: Intel-wired-lan
> [mailto:intel-wired-lan-bounces@lists.osuosl.org] On Behalf Of Jiang
> Liu
> Sent: Sunday, August 16, 2015 8:19 PM
> To: Andrew Morton; Mel Gorman; David Rientjes; Mike Galbraith; Peter
> Zijlstra; Wysocki, Rafael J; Tang Chen; Tejun Heo; Kirsher, Jeffrey T;
> Brandeburg, Jesse; Nelson, Shannon; Wyborny, Carolyn; Skidmore, Donald
> C; Vick, Matthew; Ronciak, John; Williams, Mitch A
> Cc: Luck, Tony; netdev@vger.kernel.org; x86@kernel.org;
> linux-hotplug@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; intel-wired-lan@lists.osuosl.org; Jiang Liu
> Subject: [Intel-wired-lan] [Patch V3 6/9] i40evf: Use numa_mem_id() to
> better support memoryless node
> 
> Function i40e_clean_rx_irq() tries to reuse memory pages allocated
> from the nearest node. To better support memoryless node, use
> numa_mem_id() instead of numa_node_id() to get the nearest node with
> memory.
> 
> This change should only affect performance.
> 
> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
> ---
>  drivers/net/ethernet/intel/i40evf/i40e_txrx.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Patch V3 2/9] kernel/profile.c: Replace cpu_to_mem() with cpu_to_node()
  2015-08-18  0:31   ` David Rientjes
@ 2015-08-19  7:18     ` Jiang Liu
  2015-08-20  0:00       ` David Rientjes
  0 siblings, 1 reply; 38+ messages in thread
From: Jiang Liu @ 2015-08-19  7:18 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Mike Galbraith, Peter Zijlstra,
	Rafael J . Wysocki, Tang Chen, Tejun Heo, Tony Luck, linux-mm,
	linux-hotplug, linux-kernel, x86

On 2015/8/18 8:31, David Rientjes wrote:
> On Mon, 17 Aug 2015, Jiang Liu wrote:
> 
>> Function profile_cpu_callback() allocates memory without specifying
>> __GFP_THISNODE flag, so replace cpu_to_mem() with cpu_to_node()
>> because cpu_to_mem() may cause suboptimal memory allocation if
>> there's no free memory on the node returned by cpu_to_mem().
>>
> 
> Why is cpu_to_node() better with regard to free memory and NUMA locality?
Hi David,
	Thanks for the review. This is a special case pointed out by Tejun.
For the imagined topology, A<->B<->X<->C<->D, where A, B, C and D have
memory and X is memoryless.
Possible fallback lists are:
B: [ B, A, C, D]
X: [ B, C, A, D]
C: [ C, D, B, A]

cpu_to_mem(X) will either return B or C. Let's assume it returns B.
Then we will use "B: [ B, A, C, D]" to allocate memory for X, which
is not the optimal fallback list for X. And cpu_to_node(X) returns
X, and "X: [ B, C, A, D]" is the optimal fallback list for X.
Thanks!
Gerry

> 
>> It's safe to use cpu_to_mem() because build_all_zonelists() also
>> builds suitable fallback zonelist for memoryless node.
>>
> 
> Why reference that cpu_to_mem() is safe if you're changing away from it?
Sorry, it should be cpu_to_node() instead of cpu_to_mem().

> 
>> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
>> ---
>>  kernel/profile.c |    2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/profile.c b/kernel/profile.c
>> index a7bcd28d6e9f..d14805bdcc4c 100644
>> --- a/kernel/profile.c
>> +++ b/kernel/profile.c
>> @@ -336,7 +336,7 @@ static int profile_cpu_callback(struct notifier_block *info,
>>  	switch (action) {
>>  	case CPU_UP_PREPARE:
>>  	case CPU_UP_PREPARE_FROZEN:
>> -		node = cpu_to_mem(cpu);
>> +		node = cpu_to_node(cpu);
>>  		per_cpu(cpu_profile_flip, cpu) = 0;
>>  		if (!per_cpu(cpu_profile_hits, cpu)[1]) {
>>  			page = alloc_pages_exact_node(node,


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Patch V3 0/9] Enable memoryless node support for x86
  2015-08-18 10:02 ` Tang Chen
@ 2015-08-19  8:09   ` Jiang Liu
  0 siblings, 0 replies; 38+ messages in thread
From: Jiang Liu @ 2015-08-19  8:09 UTC (permalink / raw)
  To: Tang Chen, Andrew Morton, Mel Gorman, David Rientjes,
	Mike Galbraith, Peter Zijlstra, Rafael J . Wysocki, Tejun Heo
  Cc: Tony Luck, linux-mm, linux-hotplug, linux-kernel, x86

On 2015/8/18 18:02, Tang Chen wrote:
> 
> On 08/17/2015 11:18 AM, Jiang Liu wrote:
>> This is the third version to enable memoryless node support on x86
>> platforms. The previous version (https://lkml.org/lkml/2014/7/11/75)
>> blindly replaces numa_node_id()/cpu_to_node() with numa_mem_id()/
>> cpu_to_mem(). That's not the right solution as pointed out by Tejun
>> and Peter due to:
>> 1) We shouldn't shift the burden to normal slab users.
>> 2) Details of memoryless node should be hidden in arch and mm code
>>     as much as possible.
>>
>> After digging into more code and documentation, we found the rules to
>> deal with memoryless node should be:
>> 1) Arch code should online corresponding NUMA node before onlining any
>>     CPU or memory, otherwise it may cause invalid memory access when
>>     accessing NODE_DATA(nid).
>> 2) For normal memory allocations without __GFP_THISNODE setting in the
>>     gfp_flags, we should prefer numa_node_id()/cpu_to_node() instead of
>>     numa_mem_id()/cpu_to_mem() because the latter loses hardware topology
>>     information as pointed out by Tejun:
>>        A - B - X - C - D
>>     Where X is the memless node.  numa_mem_id() on X would return
>>     either B or C, right?  If B or C can't satisfy the allocation,
>>     the allocator would fallback to A from B and D for C, both of
>>     which aren't optimal. It should first fall back to C or B
>>     respectively, which the allocator can't do anymoe because the
>>     information is lost when the caller side performs numa_mem_id().
> 
> Hi Liu,
> 
> BTW, how is this A - B - X - C - D problem solved ?
> I don't quite follow this.
> 
> I cannot tell the difference between numa_node_id()/cpu_to_node() and
> numa_mem_id()/cpu_to_mem() on this point. Even with hardware topology
> info, how could it avoid this problem ?
> 
> Isn't it still possible falling back to A from B and D for C ?
Hi Chen,
For the imagined topology, A<->B<->X<->C<->D, where A, B, C and D have
memory and X is memoryless.
Possible fallback lists are:
B: [ B, A, C, D]
X: [ B, C, A, D]
C: [ C, D, B, A]

cpu_to_mem(X) will either return B or C. Let's assume it returns B.
Then we will use "B: [ B, A, C, D]" to allocate memory for X, which
is not the optimal fallback list for X. And cpu_to_node(X) returns
X, and "X: [ B, C, A, D]" is the optimal fallback list for X.
Thanks!
Gerry


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Patch V3 3/9] sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless node
  2015-08-18  0:25   ` David Rientjes
@ 2015-08-19  8:20     ` Jiang Liu
  2015-08-20  0:02       ` David Rientjes
  0 siblings, 1 reply; 38+ messages in thread
From: Jiang Liu @ 2015-08-19  8:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Mike Galbraith, Peter Zijlstra,
	Rafael J . Wysocki, Tang Chen, Tejun Heo, Cliff Whickman,
	Robin Holt, Tony Luck, linux-mm, linux-hotplug, linux-kernel,
	x86

On 2015/8/18 8:25, David Rientjes wrote:
> On Mon, 17 Aug 2015, Jiang Liu wrote:
> 
>> Function xpc_create_gru_mq_uv() allocates memory with __GFP_THISNODE
>> flag set, which may cause permanent memory allocation failure on
>> memoryless node. So replace cpu_to_node() with cpu_to_mem() to better
>> support memoryless node. For node with memory, cpu_to_mem() is the same
>> as cpu_to_node().
>>
>> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
>> ---
>>  drivers/misc/sgi-xp/xpc_uv.c |    2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c
>> index 95c894482fdd..9210981c0d5b 100644
>> --- a/drivers/misc/sgi-xp/xpc_uv.c
>> +++ b/drivers/misc/sgi-xp/xpc_uv.c
>> @@ -238,7 +238,7 @@ xpc_create_gru_mq_uv(unsigned int mq_size, int cpu, char *irq_name,
>>  
>>  	mq->mmr_blade = uv_cpu_to_blade_id(cpu);
>>  
>> -	nid = cpu_to_node(cpu);
>> +	nid = cpu_to_mem(cpu);
>>  	page = alloc_pages_exact_node(nid,
>>  				      GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
>>  				      pg_order);
> 
> Why not simply fix build_zonelists_node() so that the __GFP_THISNODE 
> zonelists are set up to reference the zones of cpu_to_mem() for memoryless 
> nodes?
> 
> It seems much better than checking and maintaining every __GFP_THISNODE 
> user to determine if they are using a memoryless node or not.  I don't 
> feel that this solution is maintainable in the longterm.
Hi David,
	There are some usage cases, such as memory migration, that
expect the page allocator to reject memory allocation requests
if there is no memory on the local node. So we have:
1) alloc_pages_node(cpu_to_node(), __GFP_THISNODE) to only allocate
memory from local node.
2) alloc_pages_node(cpu_to_mem(), __GFP_THISNODE) to allocate memory
from local node or from nearest node if local node is memoryless.
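
To make the two cases concrete, roughly (a sketch only; the helper, the
gfp flags and the "strict" knob are made up for illustration):

#include <linux/gfp.h>
#include <linux/topology.h>
#include <linux/types.h>

/* Illustrative only: pick between the two __GFP_THISNODE patterns. */
static struct page *alloc_on_one_node(int cpu, unsigned int order, bool strict)
{
	if (strict)
		/* Case 1: strictly local.  On a memoryless node this can
		 * never succeed, which is what e.g. migration wants. */
		return alloc_pages_node(cpu_to_node(cpu),
					GFP_KERNEL | __GFP_THISNODE, order);

	/* Case 2: still a single node, but the nearest one that actually
	 * has memory, so a memoryless local node is tolerated. */
	return alloc_pages_node(cpu_to_mem(cpu),
				GFP_KERNEL | __GFP_THISNODE, order);
}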

Not sure whether we could consolidate all callers specifying the
__GFP_THISNODE flag into one case; that needs more investigation.
Thanks!
Gerry


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Patch V3 3/9] sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless node
  2015-08-17  3:19 ` [Patch V3 3/9] sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless node Jiang Liu
  2015-08-18  0:25   ` David Rientjes
@ 2015-08-19 11:52   ` Robin Holt
  2015-08-19 12:45     ` Jiang Liu
  1 sibling, 1 reply; 38+ messages in thread
From: Robin Holt @ 2015-08-19 11:52 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
	Peter Zijlstra, Rafael J . Wysocki, Tang Chen, Tejun Heo,
	Cliff Whickman, Tony Luck, linux-mm, linux-hotplug, LKML, x86

On Sun, Aug 16, 2015 at 10:19 PM, Jiang Liu <jiang.liu@linux.intel.com> wrote:
> Function xpc_create_gru_mq_uv() allocates memory with __GFP_THISNODE
> flag set, which may cause permanent memory allocation failure on
> memoryless node. So replace cpu_to_node() with cpu_to_mem() to better
> support memoryless node. For node with memory, cpu_to_mem() is the same
> as cpu_to_node().
>
> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
> ---
>  drivers/misc/sgi-xp/xpc_uv.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c
> index 95c894482fdd..9210981c0d5b 100644
> --- a/drivers/misc/sgi-xp/xpc_uv.c
> +++ b/drivers/misc/sgi-xp/xpc_uv.c
> @@ -238,7 +238,7 @@ xpc_create_gru_mq_uv(unsigned int mq_size, int cpu, char *irq_name,
>
>         mq->mmr_blade = uv_cpu_to_blade_id(cpu);
>
> -       nid = cpu_to_node(cpu);
> +       nid = cpu_to_mem(cpu);

I would recommend rejecting this.  First, SGI's UV system does not and
can not support memory-less nodes.  Additionally the hardware _REALLY_
wants the memory to be local to the CPU.  We will register this memory
region with the node firmware.  That will set the hardware up to watch
this memory block and raise an IRQ targeting the registered CPU when
anything is written into the memory block.  This is all part of how
cross-partition communications expects to work.

Additionally, the interrupt handler will read the memory region, so
having node-local memory is extremely helpful.

Thanks,
Robin


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Patch V3 3/9] sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless node
  2015-08-19 11:52   ` Robin Holt
@ 2015-08-19 12:45     ` Jiang Liu
  0 siblings, 0 replies; 38+ messages in thread
From: Jiang Liu @ 2015-08-19 12:45 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
	Peter Zijlstra, Rafael J . Wysocki, Tang Chen, Tejun Heo,
	Cliff Whickman, Tony Luck, linux-mm, linux-hotplug, LKML, x86

On 2015/8/19 19:52, Robin Holt wrote:
> On Sun, Aug 16, 2015 at 10:19 PM, Jiang Liu <jiang.liu@linux.intel.com> wrote:
>> Function xpc_create_gru_mq_uv() allocates memory with __GFP_THISNODE
>> flag set, which may cause permanent memory allocation failure on
>> memoryless node. So replace cpu_to_node() with cpu_to_mem() to better
>> support memoryless node. For node with memory, cpu_to_mem() is the same
>> as cpu_to_node().
>>
>> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
>> ---
>>  drivers/misc/sgi-xp/xpc_uv.c |    2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c
>> index 95c894482fdd..9210981c0d5b 100644
>> --- a/drivers/misc/sgi-xp/xpc_uv.c
>> +++ b/drivers/misc/sgi-xp/xpc_uv.c
>> @@ -238,7 +238,7 @@ xpc_create_gru_mq_uv(unsigned int mq_size, int cpu, char *irq_name,
>>
>>         mq->mmr_blade = uv_cpu_to_blade_id(cpu);
>>
>> -       nid = cpu_to_node(cpu);
>> +       nid = cpu_to_mem(cpu);
> 
> I would recommend rejecting this.  First, SGI's UV system does not and
> can not support memory-less nodes.  Additionally the hardware _REALLY_
> wants the memory to be local to the CPU.  We will register this memory
> region with the node firmware.  That will set the hardware up to watch
> this memory block and raise an IRQ targeting the registered CPU when
> anything is written into the memory block.  This is all part of how
> cross-partition communications expects to work.
> 
> Additionally, the interrupt handler will read the memory region, so
> having node-local memory is extremely helpful.
Hi Robin,
	Thanks for the review, I will drop this patch in the next version.
Actually, if SGI UV systems don't support memoryless nodes, cpu_to_mem()
is the same as cpu_to_node().
Thanks!
Gerry
> 
> Thanks,
> Robin
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [Intel-wired-lan] [Patch V3 5/9] i40e: Use numa_mem_id() to better support memoryless node
  2015-08-17  3:19 ` [Patch V3 5/9] i40e: Use numa_mem_id() to better " Jiang Liu
  2015-08-18  0:35   ` David Rientjes
@ 2015-08-19 22:38   ` Patil, Kiran
  2015-08-20  0:18     ` David Rientjes
  1 sibling, 1 reply; 38+ messages in thread
From: Patil, Kiran @ 2015-08-19 22:38 UTC (permalink / raw)
  To: Jiang Liu, Andrew Morton, Mel Gorman, David Rientjes,
	Mike Galbraith, Peter Zijlstra, Wysocki, Rafael J, Tang Chen,
	Tejun Heo, Kirsher, Jeffrey T, Brandeburg, Jesse, Nelson,
	Shannon, Wyborny, Carolyn, Skidmore, Donald C, Vick, Matthew,
	Ronciak, John, Williams, Mitch A
  Cc: Luck, Tony, netdev, x86, linux-hotplug, linux-kernel, linux-mm,
	intel-wired-lan

Acked-by: Kiran Patil <kiran.patil@intel.com>

-----Original Message-----
From: Intel-wired-lan [mailto:intel-wired-lan-bounces@lists.osuosl.org] On Behalf Of Jiang Liu
Sent: Sunday, August 16, 2015 8:19 PM
To: Andrew Morton; Mel Gorman; David Rientjes; Mike Galbraith; Peter Zijlstra; Wysocki, Rafael J; Tang Chen; Tejun Heo; Kirsher, Jeffrey T; Brandeburg, Jesse; Nelson, Shannon; Wyborny, Carolyn; Skidmore, Donald C; Vick, Matthew; Ronciak, John; Williams, Mitch A
Cc: Luck, Tony; netdev@vger.kernel.org; x86@kernel.org; linux-hotplug@vger.kernel.org; linux-kernel@vger.kernel.org; linux-mm@kvack.org; intel-wired-lan@lists.osuosl.org; Jiang Liu
Subject: [Intel-wired-lan] [Patch V3 5/9] i40e: Use numa_mem_id() to better support memoryless node

Function i40e_clean_rx_irq() tries to reuse memory pages allocated from the nearest node. To better support memoryless node, use
numa_mem_id() instead of numa_node_id() to get the nearest node with memory.

This change should only affect performance.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 9a4f2bc70cd2..a8f618cb8eb0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1516,7 +1516,7 @@ static int i40e_clean_rx_irq_ps(struct i40e_ring *rx_ring, int budget)
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 	u16 rx_packet_len, rx_header_len, rx_sph, rx_hbo;
 	u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
-	const int current_node = numa_node_id();
+	const int current_node = numa_mem_id();
 	struct i40e_vsi *vsi = rx_ring->vsi;
 	u16 i = rx_ring->next_to_clean;
 	union i40e_rx_desc *rx_desc;
--
1.7.10.4



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [Patch V3 2/9] kernel/profile.c: Replace cpu_to_mem() with cpu_to_node()
  2015-08-19  7:18     ` Jiang Liu
@ 2015-08-20  0:00       ` David Rientjes
  2015-10-09  2:35         ` Jiang Liu
  0 siblings, 1 reply; 38+ messages in thread
From: David Rientjes @ 2015-08-20  0:00 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Andrew Morton, Mel Gorman, Mike Galbraith, Peter Zijlstra,
	Rafael J . Wysocki, Tang Chen, Tejun Heo, Tony Luck, linux-mm,
	linux-hotplug, linux-kernel, x86

On Wed, 19 Aug 2015, Jiang Liu wrote:

> On 2015/8/18 8:31, David Rientjes wrote:
> > On Mon, 17 Aug 2015, Jiang Liu wrote:
> > 
> >> Function profile_cpu_callback() allocates memory without specifying
> >> __GFP_THISNODE flag, so replace cpu_to_mem() with cpu_to_node()
> >> because cpu_to_mem() may cause suboptimal memory allocation if
> >> there's no free memory on the node returned by cpu_to_mem().
> >>
> > 
> > Why is cpu_to_node() better with regard to free memory and NUMA locality?
> Hi David,
> 	Thanks for review. This is a special case pointed out by Tejun.
> For the imagined topology, A<->B<->X<->C<->D, where A, B, C, D has
> memory and X is memoryless.
> Possible fallback lists are:
> B: [ B, A, C, D]
> X: [ B, C, A, D]
> C: [ C, D, B, A]
> 
> cpu_to_mem(X) will either return B or C. Let's assume it returns B.
> Then we will use "B: [ B, A, C, D]" to allocate memory for X, which
> is not the optimal fallback list for X. And cpu_to_node(X) returns
> X, and "X: [ B, C, A, D]" is the optimal fallback list for X.

Ok, that makes sense, but I would prefer that this
alloc_pages_exact_node() be changed to alloc_pages_node() since, as you
mention in your commit message, __GFP_THISNODE is not set.

In the longterm, if we setup both zonelists correctly (no __GFP_THISNODE 
and with __GFP_THISNODE), then I'm not sure there's any reason to ever use 
cpu_to_mem() for alloc_pages().


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Patch V3 3/9] sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless node
  2015-08-19  8:20     ` Jiang Liu
@ 2015-08-20  0:02       ` David Rientjes
  2015-08-20  6:36         ` Jiang Liu
  0 siblings, 1 reply; 38+ messages in thread
From: David Rientjes @ 2015-08-20  0:02 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Andrew Morton, Mel Gorman, Mike Galbraith, Peter Zijlstra,
	Rafael J . Wysocki, Tang Chen, Tejun Heo, Cliff Whickman,
	Robin Holt, Tony Luck, linux-mm, linux-hotplug, linux-kernel,
	x86

On Wed, 19 Aug 2015, Jiang Liu wrote:

> > Why not simply fix build_zonelists_node() so that the __GFP_THISNODE 
> > zonelists are set up to reference the zones of cpu_to_mem() for memoryless 
> > nodes?
> > 
> > It seems much better than checking and maintaining every __GFP_THISNODE 
> > user to determine if they are using a memoryless node or not.  I don't 
> > feel that this solution is maintainable in the longterm.
> Hi David,
> 	There are some usage cases, such as memory migration,
> expect the page allocator rejecting memory allocation requests
> if there is no memory on local node. So we have:
> 1) alloc_pages_node(cpu_to_node(), __GFP_THISNODE) to only allocate
> memory from local node.
> 2) alloc_pages_node(cpu_to_mem(), __GFP_THISNODE) to allocate memory
> from local node or from nearest node if local node is memoryless.
> 

Right, so do you think it would be better to make the default zonelists be
set up so that cpu_to_node()->zonelists == cpu_to_mem()->zonelists, and then
have individual callers that want to fail for memoryless nodes check
populated_zone() themselves?
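
Roughly like the sketch below, I mean -- not code from this series, and it
checks only ZONE_NORMAL for brevity (a whole-node test such as
node_state(nid, N_MEMORY) would work as well):

#include <linux/gfp.h>
#include <linux/mmzone.h>
#include <linux/topology.h>

/* Caller that explicitly wants to fail on a memoryless local node. */
static struct page *alloc_strictly_local(int cpu, unsigned int order)
{
	int nid = cpu_to_node(cpu);

	if (!populated_zone(&NODE_DATA(nid)->node_zones[ZONE_NORMAL]))
		return NULL;	/* memoryless node: hard failure wanted */

	return alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, order);
}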


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [Intel-wired-lan] [Patch V3 5/9] i40e: Use numa_mem_id() to better support memoryless node
  2015-08-19 22:38   ` [Intel-wired-lan] " Patil, Kiran
@ 2015-08-20  0:18     ` David Rientjes
  2015-10-08 20:20       ` Andrew Morton
  0 siblings, 1 reply; 38+ messages in thread
From: David Rientjes @ 2015-08-20  0:18 UTC (permalink / raw)
  To: Patil, Kiran
  Cc: Jiang Liu, Andrew Morton, Mel Gorman, Mike Galbraith,
	Peter Zijlstra, Wysocki, Rafael J, Tang Chen, Tejun Heo, Kirsher,
	Jeffrey T, Brandeburg, Jesse, Nelson, Shannon, Wyborny, Carolyn,
	Skidmore, Donald C, Vick, Matthew, Ronciak, John, Williams,
	Mitch A, Luck, Tony, netdev, x86, linux-hotplug, linux-kernel,
	linux-mm, intel-wired-lan

On Wed, 19 Aug 2015, Patil, Kiran wrote:

> Acked-by: Kiran Patil <kiran.patil@intel.com>

Where's the call to preempt_disable() to prevent kernels with preemption 
from making numa_node_id() invalid during this iteration?


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Patch V3 3/9] sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless node
  2015-08-20  0:02       ` David Rientjes
@ 2015-08-20  6:36         ` Jiang Liu
  2015-10-09  5:04           ` Jiang Liu
  0 siblings, 1 reply; 38+ messages in thread
From: Jiang Liu @ 2015-08-20  6:36 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Mike Galbraith, Peter Zijlstra,
	Rafael J . Wysocki, Tang Chen, Tejun Heo, Cliff Whickman,
	Robin Holt, Tony Luck, linux-mm, linux-hotplug, linux-kernel,
	x86

On 2015/8/20 8:02, David Rientjes wrote:
> On Wed, 19 Aug 2015, Jiang Liu wrote:
> 
>>> Why not simply fix build_zonelists_node() so that the __GFP_THISNODE 
>>> zonelists are set up to reference the zones of cpu_to_mem() for memoryless 
>>> nodes?
>>>
>>> It seems much better than checking and maintaining every __GFP_THISNODE 
>>> user to determine if they are using a memoryless node or not.  I don't 
>>> feel that this solution is maintainable in the longterm.
>> Hi David,
>> 	There are some usage cases, such as memory migration,
>> expect the page allocator rejecting memory allocation requests
>> if there is no memory on local node. So we have:
>> 1) alloc_pages_node(cpu_to_node(), __GFP_THISNODE) to only allocate
>> memory from local node.
>> 2) alloc_pages_node(cpu_to_mem(), __GFP_THISNODE) to allocate memory
>> from local node or from nearest node if local node is memoryless.
>>
> 
> Right, so do you think it would be better to make the default zonelists be 
> setup so that cpu_to_node()->zonelists == cpu_to_mem()->zonelists and then 
> individual callers that want to fail for memoryless nodes check 
> populated_zone() themselves?
Hi David,
	Great idea :) I think that means we are going to kill the
concept of memoryless nodes, and we only need to specially handle
a few callers that really care about whether there is memory on
the local node.
	Then I need some time to audit all usages of __GFP_THISNODE
and update you on whether it's doable.
Thanks!
Gerry


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Intel-wired-lan] [Patch V3 5/9] i40e: Use numa_mem_id() to better support memoryless node
  2015-08-20  0:18     ` David Rientjes
@ 2015-10-08 20:20       ` Andrew Morton
  2015-10-09  5:52         ` Jiang Liu
  0 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2015-10-08 20:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: Patil, Kiran, Jiang Liu, Mel Gorman, Mike Galbraith,
	Peter Zijlstra, Wysocki, Rafael J, Tang Chen, Tejun Heo, Kirsher,
	Jeffrey T, Brandeburg, Jesse, Nelson, Shannon, Wyborny, Carolyn,
	Skidmore, Donald C, Vick, Matthew, Ronciak, John, Williams,
	Mitch A, Luck, Tony, netdev, x86, linux-hotplug, linux-kernel,
	linux-mm, intel-wired-lan

On Wed, 19 Aug 2015 17:18:15 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> On Wed, 19 Aug 2015, Patil, Kiran wrote:
> 
> > Acked-by: Kiran Patil <kiran.patil@intel.com>
> 
> Where's the call to preempt_disable() to prevent kernels with preemption 
> from making numa_node_id() invalid during this iteration?

David asked this question twice, received no answer and now the patch
is in the maintainer tree, destined for mainline.

If I was asked this question I would respond

  The use of numa_mem_id() is racy and best-effort.  If the unlikely
  race occurs, the memory allocation will occur on the wrong node, the
  overall result being very slightly suboptimal performance.  The
  existing use of numa_node_id() suffers from the same issue.

But I'm not the person proposing the patch.  Please don't just ignore
reviewer comments!
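
For reference, the preemption-safe pattern David is asking about would
look something like the sketch below (the helper name is made up; per the
reasoning above, the merged code instead treats the numa_mem_id() lookup
as best-effort):

#include <linux/mm.h>
#include <linux/preempt.h>
#include <linux/topology.h>

/* Sketch: keep numa_mem_id() stable while it is compared against. */
static bool page_is_near_local_memory(const struct page *page)
{
	bool local;

	preempt_disable();
	local = page_to_nid(page) == numa_mem_id();
	preempt_enable();

	return local;
}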


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Patch V3 2/9] kernel/profile.c: Replace cpu_to_mem() with cpu_to_node()
  2015-08-20  0:00       ` David Rientjes
@ 2015-10-09  2:35         ` Jiang Liu
  0 siblings, 0 replies; 38+ messages in thread
From: Jiang Liu @ 2015-10-09  2:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Mike Galbraith, Peter Zijlstra,
	Rafael J . Wysocki, Tang Chen, Tejun Heo, Tony Luck, linux-mm,
	linux-hotplug, linux-kernel, x86

On 2015/8/20 8:00, David Rientjes wrote:
> On Wed, 19 Aug 2015, Jiang Liu wrote:
> 
>> On 2015/8/18 8:31, David Rientjes wrote:
>>> On Mon, 17 Aug 2015, Jiang Liu wrote:
>>>
>>>> Function profile_cpu_callback() allocates memory without specifying
>>>> __GFP_THISNODE flag, so replace cpu_to_mem() with cpu_to_node()
>>>> because cpu_to_mem() may cause suboptimal memory allocation if
>>>> there's no free memory on the node returned by cpu_to_mem().
>>>>
>>>
>>> Why is cpu_to_node() better with regard to free memory and NUMA locality?
>> Hi David,
>> 	Thanks for review. This is a special case pointed out by Tejun.
>> For the imagined topology, A<->B<->X<->C<->D, where A, B, C, D has
>> memory and X is memoryless.
>> Possible fallback lists are:
>> B: [ B, A, C, D]
>> X: [ B, C, A, D]
>> C: [ C, D, B, A]
>>
>> cpu_to_mem(X) will either return B or C. Let's assume it returns B.
>> Then we will use "B: [ B, A, C, D]" to allocate memory for X, which
>> is not the optimal fallback list for X. And cpu_to_node(X) returns
>> X, and "X: [ B, C, A, D]" is the optimal fallback list for X.
> 
> Ok, that makes sense, but I would prefer that this 
> alloc_pages_exact_node() change to alloc_pages_node() since, as you 
> mention in your commit message, __GFP_THISNODE is not set.
Hi David,
	Sorry for the slow response, due to personal reasons!
	Function alloc_pages_exact_node() has been renamed to
__alloc_pages_node() by commit 96db800f5d73, and __alloc_pages_node()
is a slightly optimized version of alloc_pages_node() which doesn't
fall back to the current node in the nid == NUMA_NO_NODE case. So it would
be better to keep using __alloc_pages_node(), because cpu_to_node()
always returns a valid node id.
Thanks!
Gerry

> 
> In the longterm, if we setup both zonelists correctly (no __GFP_THISNODE 
> and with __GFP_THISNODE), then I'm not sure there's any reason to ever use 
> cpu_to_mem() for alloc_pages().

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Patch V3 3/9] sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless node
  2015-08-20  6:36         ` Jiang Liu
@ 2015-10-09  5:04           ` Jiang Liu
  0 siblings, 0 replies; 38+ messages in thread
From: Jiang Liu @ 2015-10-09  5:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Mike Galbraith, Peter Zijlstra,
	Rafael J . Wysocki, Tang Chen, Tejun Heo, Cliff Whickman,
	Robin Holt, Tony Luck, linux-mm, linux-hotplug, linux-kernel,
	x86

On 2015/8/20 14:36, Jiang Liu wrote:
> On 2015/8/20 8:02, David Rientjes wrote:
>> On Wed, 19 Aug 2015, Jiang Liu wrote:
>>
>>>> Why not simply fix build_zonelists_node() so that the __GFP_THISNODE 
>>>> zonelists are set up to reference the zones of cpu_to_mem() for memoryless 
>>>> nodes?
>>>>
>>>> It seems much better than checking and maintaining every __GFP_THISNODE 
>>>> user to determine if they are using a memoryless node or not.  I don't 
>>>> feel that this solution is maintainable in the longterm.
>>> Hi David,
>>> 	There are some usage cases, such as memory migration,
>>> expect the page allocator rejecting memory allocation requests
>>> if there is no memory on local node. So we have:
>>> 1) alloc_pages_node(cpu_to_node(), __GFP_THISNODE) to only allocate
>>> memory from local node.
>>> 2) alloc_pages_node(cpu_to_mem(), __GFP_THISNODE) to allocate memory
>>> from local node or from nearest node if local node is memoryless.
>>>
>>
>> Right, so do you think it would be better to make the default zonelists be 
>> setup so that cpu_to_node()->zonelists == cpu_to_mem()->zonelists and then 
>> individual callers that want to fail for memoryless nodes check 
>> populated_zone() themselves?
> Hi David,
> 	Great idea:) I think that means we are going to kill the
> concept of memoryless node, and we only need to specially handle
> a few callers who really care about whether there is memory on
> local node.
> 	Then I need some time to audit all usages of __GFP_THISNODE
> and update you whether it's doable.
Hi David,
	It seems that I was too optimistic:(. After auditing all usages
of __GFP_THISNODE and reading Documentation/vm/numa again, I feel it
would be better to keep cpu_to_mem()/numa_mem_id(). Things become
clearer if we follow these rules:
1) cpu_to_node()/numa_node_id() for the scheduler domain
2) cpu_to_mem()/numa_mem_id() for the memory management domain
3) alloc_pages_node(cpu_to_node(cpu), __GFP_THISNODE) for the special
   cases that must allocate from the local node only (see the sketch
   below).
   Using alloc_pages_node(cpu_to_node(cpu), __GFP_THISNODE) is also
easier to maintain than open-coded populated_zone() checks.
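
A minimal sketch of the two allocation idioms above (illustrative
kernel-style code, not a literal hunk from this series):

	/* Strictly local: fails if the CPU's own node has no memory. */
	page = alloc_pages_node(cpu_to_node(cpu),
				GFP_KERNEL | __GFP_THISNODE, 0);

	/* Nearest node that actually has memory, so memoryless nodes
	 * still get a valid target. */
	page = alloc_pages_node(cpu_to_mem(cpu),
				GFP_KERNEL | __GFP_THISNODE, 0);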
Thanks!
Gerry

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Intel-wired-lan] [Patch V3 5/9] i40e: Use numa_mem_id() to better support memoryless node
  2015-10-08 20:20       ` Andrew Morton
@ 2015-10-09  5:52         ` Jiang Liu
  2015-10-09  9:08           ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 38+ messages in thread
From: Jiang Liu @ 2015-10-09  5:52 UTC (permalink / raw)
  To: Andrew Morton, David Rientjes
  Cc: Patil, Kiran, Mel Gorman, Mike Galbraith, Peter Zijlstra,
	Wysocki, Rafael J, Tang Chen, Tejun Heo, Kirsher, Jeffrey T,
	Brandeburg, Jesse, Nelson, Shannon, Wyborny, Carolyn, Skidmore,
	Donald C, Vick, Matthew, Ronciak, John, Williams, Mitch A, Luck,
	Tony, netdev, x86, linux-hotplug, linux-kernel, linux-mm,
	intel-wired-lan

On 2015/10/9 4:20, Andrew Morton wrote:
> On Wed, 19 Aug 2015 17:18:15 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:
> 
>> On Wed, 19 Aug 2015, Patil, Kiran wrote:
>>
>>> Acked-by: Kiran Patil <kiran.patil@intel.com>
>>
>> Where's the call to preempt_disable() to prevent kernels with preemption 
>> from making numa_node_id() invalid during this iteration?
> 
> David asked this question twice, received no answer and now the patch
> is in the maintainer tree, destined for mainline.
> 
> If I was asked this question I would respond
> 
>   The use of numa_mem_id() is racy and best-effort.  If the unlikely
>   race occurs, the memory allocation will occur on the wrong node, the
>   overall result being very slightly suboptimal performance.  The
>   existing use of numa_node_id() suffers from the same issue.
> 
> But I'm not the person proposing the patch.  Please don't just ignore
> reviewer comments!
Hi Andrew,
	Apologies for the slow response due to personal reasons!
And thanks for answering David's question. To be honest, I didn't
know how to answer it before. Actually this question has puzzled me
for a long time when dealing with memory hot-removal. In the normal
case it only causes sub-optimal memory allocation if a scheduling
event happens between querying the NUMA node id and calling
alloc_pages_node(). But what happens if the system runs into the
following execution sequence (sketched just below)?
1) node = numa_mem_id();
2) a memory hot-removal event triggers
2.1) the affected memory is removed
2.2) the pgdat is reset to zero if the node becomes empty after removal
3) alloc_pages_node(), which may access the zeroed pgdat structure.
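
As a concrete sketch of that window (illustrative only, not code from
the series):

	int node = numa_mem_id();	/* step 1: nearest node with memory */

	/* step 2: a hot-remove event may empty that node right here,
	 * and (so I assumed) reset its pgdat */

	page = alloc_pages_node(node, GFP_KERNEL, 0);
					/* step 3: uses possibly stale data */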

I haven't found a mechanism that protects the system from the above
sequence, so it has puzzled me for a long time:(. Does stop_machine()
protect the system from such an execution sequence?
Thanks!
Gerry

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Intel-wired-lan] [Patch V3 5/9] i40e: Use numa_mem_id() to better support memoryless node
  2015-10-09  5:52         ` Jiang Liu
@ 2015-10-09  9:08           ` Kamezawa Hiroyuki
  2015-10-09  9:25             ` Jiang Liu
  0 siblings, 1 reply; 38+ messages in thread
From: Kamezawa Hiroyuki @ 2015-10-09  9:08 UTC (permalink / raw)
  To: Jiang Liu, Andrew Morton, David Rientjes
  Cc: Patil, Kiran, Mel Gorman, Mike Galbraith, Peter Zijlstra,
	Wysocki, Rafael J, Tang Chen, Tejun Heo, Kirsher, Jeffrey T,
	Brandeburg, Jesse, Nelson, Shannon, Wyborny, Carolyn, Skidmore,
	Donald C, Vick, Matthew, Ronciak, John, Williams, Mitch A, Luck,
	Tony, netdev, x86, linux-hotplug, linux-kernel, linux-mm,
	intel-wired-lan

On 2015/10/09 14:52, Jiang Liu wrote:
> On 2015/10/9 4:20, Andrew Morton wrote:
>> On Wed, 19 Aug 2015 17:18:15 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:
>>
>>> On Wed, 19 Aug 2015, Patil, Kiran wrote:
>>>
>>>> Acked-by: Kiran Patil <kiran.patil@intel.com>
>>>
>>> Where's the call to preempt_disable() to prevent kernels with preemption
>>> from making numa_node_id() invalid during this iteration?
>>
>> David asked this question twice, received no answer and now the patch
>> is in the maintainer tree, destined for mainline.
>>
>> If I was asked this question I would respond
>>
>>    The use of numa_mem_id() is racy and best-effort.  If the unlikely
>>    race occurs, the memory allocation will occur on the wrong node, the
>>    overall result being very slightly suboptimal performance.  The
>>    existing use of numa_node_id() suffers from the same issue.
>>
>> But I'm not the person proposing the patch.  Please don't just ignore
>> reviewer comments!
> Hi Andrew,
> 	Apologize for the slow response due to personal reasons!
> And thanks for answering the question from David. To be honest,
> I didn't know how to answer this question before. Actually this
> question has puzzled me for a long time when dealing with memory
> hot-removal. For normal cases, it only causes sub-optimal memory
> allocation if schedule event happens between querying NUMA node id
> and calling alloc_pages_node(). But what happens if system run into
> following execution sequence?
> 1) node = numa_mem_id();
> 2) memory hot-removal event triggers
> 2.1) remove affected memory
> 2.2) reset pgdat to zero if node becomes empty after memory removal

I'm sorry if I misunderstand something.
After commit b0dc3a342af36f95a68fe229b8f0f73552c5ca08, there is no memset().

> 3) alloc_pages_node(), which may access zero-ed pgdat structure.

?

>
> I haven't found a mechanism to protect system from above sequence yet,
> so puzzled for a long time already:(. Does stop_machine() protect
> system from such a execution sequence?

To access a pgdat, one of its zones has to be on a per-pgdat zonelist.
Now, __build_all_zonelists() is called under stop_machine(); that's why
you're asking what stop_machine() does. And, as you know, stop_machine()
is not protecting anything here: the caller may still fall back into a
removed zone.

Then, let's think.

First, please note that the "pgdat" itself is not removed (and cannot be
removed), so accessing the pgdat's memory will not cause a segmentation
fault.

Only its contents are a problem. At removal time, the zone's page-related
information and the pgdat's page-related information are cleared.

alloc_pages() uses the zonelist/zoneref/cache to walk the zones without
accessing the pgdat itself. I think accessing the zonelist is safe because
it is an array updated under stop_machine().

So the remaining question is whether alloc_pages() works correctly even
when a zone contains no pages, and I think it should work.
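
Roughly, that fallback walk looks like this (an illustrative sketch of
the idea, not the literal mm/page_alloc.c code):

	/* Walk the zoneref array built by build_all_zonelists() under
	 * stop_machine(); a foreign pgdat is never dereferenced directly,
	 * and an emptied zone just fails the watermark check so the walk
	 * moves on to the next fallback zone. */
	struct zoneref *z;
	struct zone *zone;

	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
		unsigned long mark = low_wmark_pages(zone);

		if (!zone_watermark_ok(zone, order, mark,
				       high_zoneidx, alloc_flags))
			continue;	/* empty/exhausted: try next zone */

		/* ... try to take pages from this zone's free lists ... */
	}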

(Note: the zones are embedded in the pgdat, so zeroing the pgdat would
also zero the zones and other structures; that would not work.)

So, what problem do you see now?
I'm sorry I can't chase the old discussions.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Intel-wired-lan] [Patch V3 5/9] i40e: Use numa_mem_id() to better support memoryless node
  2015-10-09  9:08           ` Kamezawa Hiroyuki
@ 2015-10-09  9:25             ` Jiang Liu
  0 siblings, 0 replies; 38+ messages in thread
From: Jiang Liu @ 2015-10-09  9:25 UTC (permalink / raw)
  To: Kamezawa Hiroyuki, Andrew Morton, David Rientjes
  Cc: Patil, Kiran, Mel Gorman, Mike Galbraith, Peter Zijlstra,
	Wysocki, Rafael J, Tang Chen, Tejun Heo, Kirsher, Jeffrey T,
	Brandeburg, Jesse, Nelson, Shannon, Wyborny, Carolyn, Skidmore,
	Donald C, Vick, Matthew, Ronciak, John, Williams, Mitch A, Luck,
	Tony, netdev, x86, linux-hotplug, linux-kernel, linux-mm,
	intel-wired-lan

On 2015/10/9 17:08, Kamezawa Hiroyuki wrote:
> On 2015/10/09 14:52, Jiang Liu wrote:
>> On 2015/10/9 4:20, Andrew Morton wrote:
>>> On Wed, 19 Aug 2015 17:18:15 -0700 (PDT) David Rientjes
>>> <rientjes@google.com> wrote:
>>>
>>>> On Wed, 19 Aug 2015, Patil, Kiran wrote:
>>>>
>>>>> Acked-by: Kiran Patil <kiran.patil@intel.com>
>>>>
>>>> Where's the call to preempt_disable() to prevent kernels with
>>>> preemption
>>>> from making numa_node_id() invalid during this iteration?
>>>
>>> David asked this question twice, received no answer and now the patch
>>> is in the maintainer tree, destined for mainline.
>>>
>>> If I was asked this question I would respond
>>>
>>>    The use of numa_mem_id() is racy and best-effort.  If the unlikely
>>>    race occurs, the memory allocation will occur on the wrong node, the
>>>    overall result being very slightly suboptimal performance.  The
>>>    existing use of numa_node_id() suffers from the same issue.
>>>
>>> But I'm not the person proposing the patch.  Please don't just ignore
>>> reviewer comments!
>> Hi Andrew,
>>     Apologize for the slow response due to personal reasons!
>> And thanks for answering the question from David. To be honest,
>> I didn't know how to answer this question before. Actually this
>> question has puzzled me for a long time when dealing with memory
>> hot-removal. For normal cases, it only causes sub-optimal memory
>> allocation if schedule event happens between querying NUMA node id
>> and calling alloc_pages_node(). But what happens if system run into
>> following execution sequence?
>> 1) node = numa_mem_id();
>> 2) memory hot-removal event triggers
>> 2.1) remove affected memory
>> 2.2) reset pgdat to zero if node becomes empty after memory removal
> 
> I'm sorry if I misunderstand something.
> After commit b0dc3a342af36f95a68fe229b8f0f73552c5ca08, there is no
> memset().
Hi Kamezawa,
	Thanks for the information. That commit solved the issue I
was puzzling over. With this change applied, things should work
as expected. It seems it would be better to enhance
__build_all_zonelists() to handle those offlined empty nodes too,
but that really doesn't make too much difference:)
	Thanks for the info again!
Thanks!
Gerry

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2015-10-09  9:27 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-17  3:18 [Patch V3 0/9] Enable memoryless node support for x86 Jiang Liu
2015-08-17  3:18 ` [Patch V3 1/9] x86, NUMA, ACPI: Online node earlier when doing CPU hot-addition Jiang Liu
2015-08-17  3:18 ` [Patch V3 2/9] kernel/profile.c: Replace cpu_to_mem() with cpu_to_node() Jiang Liu
2015-08-18  0:31   ` David Rientjes
2015-08-19  7:18     ` Jiang Liu
2015-08-20  0:00       ` David Rientjes
2015-10-09  2:35         ` Jiang Liu
2015-08-17  3:19 ` [Patch V3 3/9] sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless node Jiang Liu
2015-08-18  0:25   ` David Rientjes
2015-08-19  8:20     ` Jiang Liu
2015-08-20  0:02       ` David Rientjes
2015-08-20  6:36         ` Jiang Liu
2015-10-09  5:04           ` Jiang Liu
2015-08-19 11:52   ` Robin Holt
2015-08-19 12:45     ` Jiang Liu
2015-08-17  3:19 ` [Patch V3 4/9] openvswitch: " Jiang Liu
2015-08-18  0:14   ` Pravin Shelar
2015-08-17  3:19 ` [Patch V3 5/9] i40e: Use numa_mem_id() to better " Jiang Liu
2015-08-18  0:35   ` David Rientjes
2015-08-19 22:38   ` [Intel-wired-lan] " Patil, Kiran
2015-08-20  0:18     ` David Rientjes
2015-10-08 20:20       ` Andrew Morton
2015-10-09  5:52         ` Jiang Liu
2015-10-09  9:08           ` Kamezawa Hiroyuki
2015-10-09  9:25             ` Jiang Liu
2015-08-17  3:19 ` [Patch V3 6/9] i40evf: " Jiang Liu
2015-08-17 19:03   ` [Intel-wired-lan] " Patil, Kiran
2015-08-18 21:34     ` Jeff Kirsher
2015-08-17  3:19 ` [Patch V3 7/9] x86, numa: Kill useless code to improve code readability Jiang Liu
2015-08-17  3:19 ` [Patch V3 8/9] mm: Update _mem_id_[] for every possible CPU when memory configuration changes Jiang Liu
2015-08-17  3:19 ` [Patch V3 9/9] mm, x86: Enable memoryless node support to better support CPU/memory hotplug Jiang Liu
2015-08-18  6:11   ` Tang Chen
2015-08-18  6:59     ` Jiang Liu
2015-08-18 11:28       ` Tang Chen
2015-08-18  7:31   ` Ingo Molnar
2015-08-17 21:35 ` [Patch V3 0/9] Enable memoryless node support for x86 Andrew Morton
2015-08-18 10:02 ` Tang Chen
2015-08-19  8:09   ` Jiang Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).