From: Anshuman Khandual <anshuman.khandual@arm.com>
To: Michal Hocko <mhocko@kernel.org>, linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
Oscar Salvador <OSalvador@suse.com>,
LKML <linux-kernel@vger.kernel.org>,
Miroslav Benes <mbenes@suse.cz>, Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [RFC PATCH] mm, memory_hotplug: do not clear numa_node association after hot_remove
Date: Fri, 9 Nov 2018 09:12:09 +0530 [thread overview]
Message-ID: <048c04ae-7394-d03f-813e-42acdc965dd2@arm.com> (raw)
In-Reply-To: <20181108102917.GV27423@dhcp22.suse.cz>
On 11/08/2018 03:59 PM, Michal Hocko wrote:
> [Removing Wen Congyang and Tang Chen from the CC list because their
> emails bounce. It seems that we will never learn about their motivation]
>
> On Thu 08-11-18 11:04:13, Michal Hocko wrote:
>> From: Michal Hocko <mhocko@suse.com>
>>
>> Per-cpu numa_node provides a default node for each possible cpu. The
>> association gets initialized during the boot when the architecture
>> specific code explores cpu->NUMA affinity. When the whole NUMA node is
>> removed though we are clearing this association
>>
>> try_offline_node
>> check_and_unmap_cpu_on_node
>> unmap_cpu_on_node
>> numa_clear_node
>> numa_set_node(cpu, NUMA_NO_NODE)
>>
>> This means that whoever calls cpu_to_node for a cpu associated with such
>> a node will get NUMA_NO_NODE. This is problematic for two reasons. First
>> it is fragile because __alloc_pages_node would simply blow up on an
>> out-of-bound access. We have encountered this when loading kvm module
>> BUG: unable to handle kernel paging request at 00000000000021c0
>> IP: [<ffffffff8119ccb3>] __alloc_pages_nodemask+0x93/0xb70
>> PGD 800000ffe853e067 PUD 7336bbc067 PMD 0
>> Oops: 0000 [#1] SMP
>> [...]
>> CPU: 88 PID: 1223749 Comm: modprobe Tainted: G W 4.4.156-94.64-default #1
>> task: ffff88727eff1880 ti: ffff887354490000 task.ti: ffff887354490000
>> RIP: 0010:[<ffffffff8119ccb3>] [<ffffffff8119ccb3>] __alloc_pages_nodemask+0x93/0xb70
>> RSP: 0018:ffff887354493b40 EFLAGS: 00010202
>> RAX: 00000000000021c0 RBX: 0000000000000000 RCX: 0000000000000000
>> RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000014000c0
>> RBP: 00000000014000c0 R08: ffffffffffffffff R09: 0000000000000000
>> R10: ffff88fffc89e790 R11: 0000000000014000 R12: 0000000000000101
>> R13: ffffffffa0772cd4 R14: ffffffffa0769ac0 R15: 0000000000000000
>> FS: 00007fdf2f2f1700(0000) GS:ffff88fffc880000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00000000000021c0 CR3: 00000077205ee000 CR4: 0000000000360670
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Stack:
>> 0000000000000086 014000c014d20400 ffff887354493bb8 ffff882614d20f4c
>> 0000000000000000 0000000000000046 0000000000000046 ffffffff810ac0c9
>> ffff88ffe78c0000 ffffffff0000009f ffffe8ffe82d3500 ffff88ff8ac55000
>> Call Trace:
>> [<ffffffffa07476cd>] alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
>> [<ffffffffa0772c0c>] hardware_setup+0x781/0x849 [kvm_intel]
>> [<ffffffffa04a1c58>] kvm_arch_hardware_setup+0x28/0x190 [kvm]
>> [<ffffffffa04856fc>] kvm_init+0x7c/0x2d0 [kvm]
>> [<ffffffffa0772cf2>] vmx_init+0x1e/0x32c [kvm_intel]
>> [<ffffffff8100213a>] do_one_initcall+0xca/0x1f0
>> [<ffffffff81193886>] do_init_module+0x5a/0x1d7
>> [<ffffffff81112083>] load_module+0x1393/0x1c90
>> [<ffffffff81112b30>] SYSC_finit_module+0x70/0xa0
>> [<ffffffff8161cbc3>] entry_SYSCALL_64_fastpath+0x1e/0xb7
>> DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7
>>
>> on an older kernel but the code is basically the same in the current
>> Linus tree as well. alloc_vmcs_cpu could use alloc_pages_nodemask which
>> would recognize NUMA_NO_NODE and use alloc_pages_node which would translate
>> it to numa_mem_id but that is wrong as well because it would use a cpu
>> affinity of the local CPU which might be quite far from the original node.
But then the original node is getting/already off-lined. The allocation is
going to come from a different node. alloc_pages_node() at least steer the
allocation alway from VM_BUG_ON() because of NUMA_NO_NODE by replacing it
with numa_mem_id().
If node fallback order is important for this allocation then could not it
use __alloc_pages_nodemask() directly giving preference for its zonelist
node and nodemask. Just curious.
>> It is also reasonable to expect that cpu_to_node will provide a sane value
>> and there might be many more callers like that.
AFAICS there are two choices here. Either mark them NUMA_NO_NODE for all
cpus of a node going offline or keep the existing mapping in case the node
comes back again.
>>
>> The second problem is that __register_one_node relies on cpu_to_node
>> to properly associate cpus back to the node when it is onlined. We do
>> not want to lose that link as there is no arch independent way to get it
>> from the early boot time AFAICS.
Retaining the links seems to be right unless unmap_cpu_on_node() is sort
of a weak callback letting arch to decide.
>>
>> Drop the whole check_and_unmap_cpu_on_node machinery and keep the
>> association to fix both issues. The NODE_DATA(nid) is not deallocated
Though retaining the link is a problem in itself but the allocation related
crash could be solved by exploring __alloc_pages_nodemask() options.
>> so it will stay in place and if anybody wants to allocate from that node
>> then a fallback node will be used.
Right, NODE_DATA(nid) is an advantage of retaining the link.
next prev parent reply other threads:[~2018-11-09 3:42 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-11-08 10:04 [RFC PATCH] mm, memory_hotplug: do not clear numa_node association after hot_remove Michal Hocko
2018-11-08 10:29 ` Michal Hocko
2018-11-09 3:42 ` Anshuman Khandual [this message]
2018-11-09 7:59 ` Michal Hocko
2018-11-09 11:04 ` Anshuman Khandual
2018-11-09 11:07 ` Michal Hocko
2018-12-20 22:48 ` Andrew Morton
2018-11-14 7:14 ` Michal Hocko
2018-11-14 23:18 ` Andrew Morton
2018-11-15 7:34 ` Michal Hocko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=048c04ae-7394-d03f-813e-42acdc965dd2@arm.com \
--to=anshuman.khandual@arm.com \
--cc=OSalvador@suse.com \
--cc=akpm@linux-foundation.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mbenes@suse.cz \
--cc=mhocko@kernel.org \
--cc=vbabka@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).