From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f54.google.com (mail-pa0-f54.google.com [209.85.220.54])
        by kanga.kvack.org (Postfix) with ESMTP id 471166B0253
        for ; Tue, 18 Aug 2015 06:08:43 -0400 (EDT)
Received: by pabyb7 with SMTP id yb7so129668494pab.0
        for ; Tue, 18 Aug 2015 03:08:43 -0700 (PDT)
Received: from heian.cn.fujitsu.com ([59.151.112.132])
        by mx.google.com with ESMTP id a9si9410781pdn.109.2015.08.18.03.05.27
        for ; Tue, 18 Aug 2015 03:08:42 -0700 (PDT)
Message-ID: <55D302CA.9010703@cn.fujitsu.com>
Date: Tue, 18 Aug 2015 18:02:50 +0800
From: Tang Chen
MIME-Version: 1.0
Subject: Re: [Patch V3 0/9] Enable memoryless node support for x86
References: <1439781546-7217-1-git-send-email-jiang.liu@linux.intel.com>
In-Reply-To: <1439781546-7217-1-git-send-email-jiang.liu@linux.intel.com>
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jiang Liu, Andrew Morton, Mel Gorman, David Rientjes, Mike Galbraith,
        Peter Zijlstra, "Rafael J . Wysocki", Tejun Heo
Cc: Tony Luck, linux-mm@kvack.org, linux-hotplug@vger.kernel.org,
        linux-kernel@vger.kernel.org, x86@kernel.org, tangchen@cn.fujitsu.com

On 08/17/2015 11:18 AM, Jiang Liu wrote:
> This is the third version to enable memoryless node support on x86
> platforms. The previous version (https://lkml.org/lkml/2014/7/11/75)
> blindly replaces numa_node_id()/cpu_to_node() with numa_mem_id()/
> cpu_to_mem(). That's not the right solution as pointed out by Tejun
> and Peter due to:
> 1) We shouldn't shift the burden to normal slab users.
> 2) Details of memoryless node should be hidden in arch and mm code
>    as much as possible.
>
> After digging into more code and documentation, we found the rules to
> deal with memoryless node should be:
> 1) Arch code should online corresponding NUMA node before onlining any
>    CPU or memory, otherwise it may cause invalid memory access when
>    accessing NODE_DATA(nid).
> 2) For normal memory allocations without __GFP_THISNODE setting in the
>    gfp_flags, we should prefer numa_node_id()/cpu_to_node() instead of
>    numa_mem_id()/cpu_to_mem() because the latter loses hardware topology
>    information as pointed out by Tejun:
>            A - B - X - C - D
>       Where X is the memless node. numa_mem_id() on X would return
>       either B or C, right? If B or C can't satisfy the allocation,
>       the allocator would fallback to A from B and D for C, both of
>       which aren't optimal. It should first fall back to C or B
>       respectively, which the allocator can't do anymore because the
>       information is lost when the caller side performs numa_mem_id().

Hi Liu,

BTW, how is this A - B - X - C - D problem solved? I don't quite follow
this. I cannot tell the difference between numa_node_id()/cpu_to_node()
and numa_mem_id()/cpu_to_mem() on this point.

Even with hardware topology info, how could it avoid this problem? Isn't
it still possible to fall back to A from B, and to D from C?

Thanks.

> 3) For memory allocation with __GFP_THISNODE setting in gfp_flags,
>    numa_node_id()/cpu_to_node() should be used if caller only wants to
>    allocate from local memory, otherwise numa_mem_id()/cpu_to_mem()
>    should be used if caller wants to allocate from the nearest node
>    with memory.
> 4) numa_mem_id()/cpu_to_mem() should be used if caller wants to check
>    whether a page is allocated from the nearest node.
>
> Based on above rules, this patch set
> 1) Patch 1 is a bugfix to resolve a crash caused by socket hot-addition
> 2) Patch 2 replaces numa_mem_id() with numa_node_id() when __GFP_THISNODE
>    isn't set in gfp_flags.
> 3) Patch 3-6 replaces numa_node_id()/cpu_to_node() with numa_mem_id()/
>    cpu_to_mem() if caller wants to allocate from local node only.
> 4) Patch 7-9 enables support of memoryless node on x86.
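
To make the A - B - X - C - D example quoted under rule 2 concrete, here is
a toy model in Python. It is only a sketch of one reading of Tejun's
argument, not kernel code: the hop-count distance, the tie-breaking, and the
helper names fallback_order() and nearest_memory_node() are all assumptions
made for illustration. nearest_memory_node() merely stands in for what
numa_mem_id() might report for a CPU on X.

```python
# Toy model only: linear topology A - B - X - C - D, X has no memory.
# Distance is assumed to be the hop count along the line.
NODES = ["A", "B", "X", "C", "D"]
HAS_MEMORY = {"A": True, "B": True, "X": False, "C": True, "D": True}


def distance(a, b):
    """Hop count between two nodes on the line (assumed metric)."""
    return abs(NODES.index(a) - NODES.index(b))


def fallback_order(origin):
    """Nodes with memory, nearest to `origin` first (ties keep line order)."""
    candidates = [n for n in NODES if HAS_MEMORY[n]]
    return sorted(candidates, key=lambda n: distance(origin, n))


def nearest_memory_node(origin):
    """Hypothetical stand-in for numa_mem_id() on a CPU that sits on `origin`."""
    return fallback_order(origin)[0]


# Caller hands the allocator the real node X: fallbacks are ranked from X.
order_from_x = fallback_order("X")

# Caller hands over nearest_memory_node("X") instead: ranking starts at B,
# and the fact that the request came from X is no longer visible.
order_from_b = fallback_order(nearest_memory_node("X"))

print("hint = X:", order_from_x)
print("hint = B:", order_from_b)
```

In this model the two orders differ at the second choice: ranked from X it
is C (one hop from X), ranked from B it is A (two hops from X). Whether the
real zonelists behave the same way depends on the actual distance table,
which this sketch does not read.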
>
> With this patch set applied, on a system with two sockets enabled at boot,
> one with memory and the other without memory, we got following numa
> topology after boot:
> root@bkd04sdp:~# numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
> node 0 size: 15940 MB
> node 0 free: 15397 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
> node 1 size: 0 MB
> node 1 free: 0 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
>
> After hot-adding the third socket without memory, we got:
> root@bkd04sdp:~# numactl --hardware
> available: 3 nodes (0-2)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
> node 0 size: 15940 MB
> node 0 free: 15142 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
> node 1 size: 0 MB
> node 1 free: 0 MB
> node 2 cpus:
> node 2 size: 0 MB
> node 2 free: 0 MB
> node distances:
> node   0   1   2
>   0:  10  21  21
>   1:  21  10  21
>   2:  21  21  10
>
> Jiang Liu (9):
>   x86, NUMA, ACPI: Online node earlier when doing CPU hot-addition
>   kernel/profile.c: Replace cpu_to_mem() with cpu_to_node()
>   sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless
>     node
>   openvswitch: Replace cpu_to_node() with cpu_to_mem() to support
>     memoryless node
>   i40e: Use numa_mem_id() to better support memoryless node
>   i40evf: Use numa_mem_id() to better support memoryless node
>   x86, numa: Kill useless code to improve code readability
>   mm: Update _mem_id_[] for every possible CPU when memory
>     configuration changes
>   mm, x86: Enable memoryless node support to better support CPU/memory
>     hotplug
>
>  arch/x86/Kconfig                              |  3 ++
>  arch/x86/kernel/acpi/boot.c                   |  9 +++-
>  arch/x86/kernel/smpboot.c                     |  2 +
>  arch/x86/mm/numa.c                            | 59 +++++++++++++++----------
>  drivers/misc/sgi-xp/xpc_uv.c                  |  2 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  2 +-
>  drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  2 +-
>  kernel/profile.c                              |  2 +-
>  mm/page_alloc.c                               | 10 ++---
>  net/openvswitch/flow.c                        |  2 +-
>  10 files changed, 59 insertions(+), 34 deletions(-)