Subject: Re: [RFC PATCH] mm, memory_hotplug: do not clear numa_node association after hot_remove
To: Michal Hocko, linux-mm@kvack.org
Cc: Andrew Morton, Oscar Salvador, LKML, Miroslav Benes, Vlastimil Babka
From: Anshuman Khandual
Message-ID: <048c04ae-7394-d03f-813e-42acdc965dd2@arm.com>
Date: Fri, 9 Nov 2018 09:12:09 +0530
In-Reply-To: <20181108102917.GV27423@dhcp22.suse.cz>
References: <20181108100413.966-1-mhocko@kernel.org> <20181108102917.GV27423@dhcp22.suse.cz>

On 11/08/2018 03:59 PM, Michal Hocko wrote:
> [Removing Wen Congyang and Tang Chen from the CC list because their
> emails bounce. It seems that we will never learn about their motivation]
>
> On Thu 08-11-18 11:04:13, Michal Hocko wrote:
>> From: Michal Hocko
>>
>> Per-cpu numa_node provides a default node for each possible cpu. The
>> association gets initialized during boot when the architecture-specific
>> code explores cpu->NUMA affinity. When the whole NUMA node is removed,
>> though, we clear this association:
>>
>> 	try_offline_node
>> 	  check_and_unmap_cpu_on_node
>> 	    unmap_cpu_on_node
>> 	      numa_clear_node
>> 	        numa_set_node(cpu, NUMA_NO_NODE)
>>
>> This means that whoever calls cpu_to_node for a cpu associated with such
>> a node will get NUMA_NO_NODE. This is problematic for two reasons. First,
>> it is fragile because __alloc_pages_node would simply blow up on an
>> out-of-bound access. We have encountered this when loading the kvm module:
>>
>> BUG: unable to handle kernel paging request at 00000000000021c0
>> IP: [] __alloc_pages_nodemask+0x93/0xb70
>> PGD 800000ffe853e067 PUD 7336bbc067 PMD 0
>> Oops: 0000 [#1] SMP
>> [...]
>> CPU: 88 PID: 1223749 Comm: modprobe Tainted: G W 4.4.156-94.64-default #1
>> task: ffff88727eff1880 ti: ffff887354490000 task.ti: ffff887354490000
>> RIP: 0010:[] [] __alloc_pages_nodemask+0x93/0xb70
>> RSP: 0018:ffff887354493b40 EFLAGS: 00010202
>> RAX: 00000000000021c0 RBX: 0000000000000000 RCX: 0000000000000000
>> RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000014000c0
>> RBP: 00000000014000c0 R08: ffffffffffffffff R09: 0000000000000000
>> R10: ffff88fffc89e790 R11: 0000000000014000 R12: 0000000000000101
>> R13: ffffffffa0772cd4 R14: ffffffffa0769ac0 R15: 0000000000000000
>> FS: 00007fdf2f2f1700(0000) GS:ffff88fffc880000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00000000000021c0 CR3: 00000077205ee000 CR4: 0000000000360670
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Stack:
>>  0000000000000086 014000c014d20400 ffff887354493bb8 ffff882614d20f4c
>>  0000000000000000 0000000000000046 0000000000000046 ffffffff810ac0c9
>>  ffff88ffe78c0000 ffffffff0000009f ffffe8ffe82d3500 ffff88ff8ac55000
>> Call Trace:
>>  [] alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
>>  [] hardware_setup+0x781/0x849 [kvm_intel]
>>  [] kvm_arch_hardware_setup+0x28/0x190 [kvm]
>>  [] kvm_init+0x7c/0x2d0 [kvm]
>>  [] vmx_init+0x1e/0x32c [kvm_intel]
>>  [] do_one_initcall+0xca/0x1f0
>>  [] do_init_module+0x5a/0x1d7
>>  [] load_module+0x1393/0x1c90
>>  [] SYSC_finit_module+0x70/0xa0
>>  [] entry_SYSCALL_64_fastpath+0x1e/0xb7
>> DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7
>>
>> on an older kernel but the code is basically the same in the current
>> Linus tree as well. alloc_vmcs_cpu could use alloc_pages_nodemask which
>> would recognize NUMA_NO_NODE and use alloc_pages_node which would
>> translate it to numa_mem_id but that is wrong as well because it would
>> use a cpu affinity of the local CPU which might be quite far from the
>> original node.

But then the original node is being off-lined, or already has been, so
the allocation is going to come from a different node anyway.
alloc_pages_node() at least steers the allocation away from the
VM_BUG_ON() triggered by NUMA_NO_NODE, by replacing it with
numa_mem_id(). If node fallback order is important for this allocation,
could it not use __alloc_pages_nodemask() directly, giving preference to
its own zonelist node and nodemask? Just curious.

>> It is also reasonable to expect that cpu_to_node will provide a sane value
>> and there might be many more callers like that.

AFAICS there are two choices here: either mark all cpus of a node going
offline as NUMA_NO_NODE, or keep the existing mapping in case the node
comes back again.

>>
>> The second problem is that __register_one_node relies on cpu_to_node
>> to properly associate cpus back to the node when it is onlined. We do
>> not want to lose that link as there is no arch independent way to get it
>> from the early boot time AFAICS.

Retaining the links seems right, unless unmap_cpu_on_node() is meant to
be a sort of weak callback letting the arch decide.

>>
>> Drop the whole check_and_unmap_cpu_on_node machinery and keep the
>> association to fix both issues. The NODE_DATA(nid) is not deallocated

Whether to retain the link is a problem in itself, but the allocation
related crash could be solved on its own by exploring the
__alloc_pages_nodemask() options.

>> so it will stay in place and if anybody wants to allocate from that node
>> then a fallback node will be used.

Right, keeping NODE_DATA(nid) in place is an advantage of retaining the
link.
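
For illustration, a minimal sketch of the distinction discussed above
(the helper name and GFP flags are mine, not the actual kvm_intel code):
alloc_pages_node() quietly maps NUMA_NO_NODE to numa_mem_id(), whereas
__alloc_pages_node() assumes a valid node id and blows up on
NUMA_NO_NODE, as in the oops above.

	#include <linux/gfp.h>
	#include <linux/topology.h>

	/* Hypothetical helper, for illustration only. */
	static struct page *alloc_page_near_cpu(int cpu)
	{
		int nid = cpu_to_node(cpu);

		/*
		 * After a whole-node hot-remove, cpu_to_node() may return
		 * NUMA_NO_NODE. __alloc_pages_node() would trip its
		 * VM_BUG_ON() on that, so fall back the way
		 * alloc_pages_node() does internally.
		 */
		if (nid == NUMA_NO_NODE)
			nid = numa_mem_id();

		return __alloc_pages_node(nid, GFP_KERNEL, 0);
	}

The fallback keeps the allocation alive but lands near the calling CPU;
if the zonelist fallback order of the original node matters, passing the
preferred node down to __alloc_pages_nodemask() would be the alternative
raised above.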