All of lore.kernel.org
 help / color / mirror / Atom feed
From: Anshuman Khandual <anshuman.khandual@arm.com>
To: Michal Hocko <mhocko@kernel.org>, linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Oscar Salvador <OSalvador@suse.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Miroslav Benes <mbenes@suse.cz>, Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [RFC PATCH] mm, memory_hotplug: do not clear numa_node association after hot_remove
Date: Fri, 9 Nov 2018 09:12:09 +0530	[thread overview]
Message-ID: <048c04ae-7394-d03f-813e-42acdc965dd2@arm.com> (raw)
In-Reply-To: <20181108102917.GV27423@dhcp22.suse.cz>



On 11/08/2018 03:59 PM, Michal Hocko wrote:
> [Removing Wen Congyang and Tang Chen from the CC list because their
>  emails bounce. It seems that we will never learn about their motivation]
> 
> On Thu 08-11-18 11:04:13, Michal Hocko wrote:
>> From: Michal Hocko <mhocko@suse.com>
>>
>> Per-cpu numa_node provides a default node for each possible cpu. The
>> association gets initialized during the boot when the architecture
>> specific code explores cpu->NUMA affinity. When the whole NUMA node is
>> removed though we are clearing this association
>>
>> try_offline_node
>>   check_and_unmap_cpu_on_node
>>     unmap_cpu_on_node
>>       numa_clear_node
>>         numa_set_node(cpu, NUMA_NO_NODE)
>>
>> This means that whoever calls cpu_to_node for a cpu associated with such
>> a node will get NUMA_NO_NODE. This is problematic for two reasons. First
>> it is fragile because __alloc_pages_node would simply blow up on an
>> out-of-bound access. We have encountered this when loading kvm module
>> BUG: unable to handle kernel paging request at 00000000000021c0
>> IP: [<ffffffff8119ccb3>] __alloc_pages_nodemask+0x93/0xb70
>> PGD 800000ffe853e067 PUD 7336bbc067 PMD 0
>> Oops: 0000 [#1] SMP
>> [...]
>> CPU: 88 PID: 1223749 Comm: modprobe Tainted: G        W          4.4.156-94.64-default #1
>> task: ffff88727eff1880 ti: ffff887354490000 task.ti: ffff887354490000
>> RIP: 0010:[<ffffffff8119ccb3>]  [<ffffffff8119ccb3>] __alloc_pages_nodemask+0x93/0xb70
>> RSP: 0018:ffff887354493b40  EFLAGS: 00010202
>> RAX: 00000000000021c0 RBX: 0000000000000000 RCX: 0000000000000000
>> RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000014000c0
>> RBP: 00000000014000c0 R08: ffffffffffffffff R09: 0000000000000000
>> R10: ffff88fffc89e790 R11: 0000000000014000 R12: 0000000000000101
>> R13: ffffffffa0772cd4 R14: ffffffffa0769ac0 R15: 0000000000000000
>> FS:  00007fdf2f2f1700(0000) GS:ffff88fffc880000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00000000000021c0 CR3: 00000077205ee000 CR4: 0000000000360670
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Stack:
>>  0000000000000086 014000c014d20400 ffff887354493bb8 ffff882614d20f4c
>>  0000000000000000 0000000000000046 0000000000000046 ffffffff810ac0c9
>>  ffff88ffe78c0000 ffffffff0000009f ffffe8ffe82d3500 ffff88ff8ac55000
>> Call Trace:
>>  [<ffffffffa07476cd>] alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
>>  [<ffffffffa0772c0c>] hardware_setup+0x781/0x849 [kvm_intel]
>>  [<ffffffffa04a1c58>] kvm_arch_hardware_setup+0x28/0x190 [kvm]
>>  [<ffffffffa04856fc>] kvm_init+0x7c/0x2d0 [kvm]
>>  [<ffffffffa0772cf2>] vmx_init+0x1e/0x32c [kvm_intel]
>>  [<ffffffff8100213a>] do_one_initcall+0xca/0x1f0
>>  [<ffffffff81193886>] do_init_module+0x5a/0x1d7
>>  [<ffffffff81112083>] load_module+0x1393/0x1c90
>>  [<ffffffff81112b30>] SYSC_finit_module+0x70/0xa0
>>  [<ffffffff8161cbc3>] entry_SYSCALL_64_fastpath+0x1e/0xb7
>> DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7
>>
>> on an older kernel but the code is basically the same in the current
>> Linus tree as well. alloc_vmcs_cpu could use alloc_pages_nodemask which
>> would recognize NUMA_NO_NODE and use alloc_pages_node which would translate
>> it to numa_mem_id but that is wrong as well because it would use a cpu
>> affinity of the local CPU which might be quite far from the original node.

But then the original node is getting/already off-lined. The allocation is
going to come from a different node. alloc_pages_node() at least steer the
allocation alway from VM_BUG_ON() because of NUMA_NO_NODE by replacing it
with numa_mem_id().

If node fallback order is important for this allocation then could not it
use __alloc_pages_nodemask() directly giving preference for its zonelist
node and nodemask. Just curious.

>> It is also reasonable to expect that cpu_to_node will provide a sane value
>> and there might be many more callers like that.

AFAICS there are two choices here. Either mark them NUMA_NO_NODE for all
cpus of a node going offline or keep the existing mapping in case the node
comes back again.

>>
>> The second problem is that __register_one_node relies on cpu_to_node
>> to properly associate cpus back to the node when it is onlined. We do
>> not want to lose that link as there is no arch independent way to get it
>> from the early boot time AFAICS.

Retaining the links seems to be right unless unmap_cpu_on_node() is sort
of a weak callback letting arch to decide.

>>
>> Drop the whole check_and_unmap_cpu_on_node machinery and keep the
>> association to fix both issues. The NODE_DATA(nid) is not deallocated
Though retaining the link is a problem in itself but the allocation related
crash could be solved by exploring __alloc_pages_nodemask() options.

>> so it will stay in place and if anybody wants to allocate from that node
>> then a fallback node will be used.

Right, NODE_DATA(nid) is an advantage of retaining the link.

  reply	other threads:[~2018-11-09  3:42 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-11-08 10:04 [RFC PATCH] mm, memory_hotplug: do not clear numa_node association after hot_remove Michal Hocko
2018-11-08 10:04 ` Michal Hocko
2018-11-08 10:29 ` Michal Hocko
2018-11-09  3:42   ` Anshuman Khandual [this message]
2018-11-09  7:59     ` Michal Hocko
2018-11-09 11:04       ` Anshuman Khandual
2018-11-09 11:07         ` Michal Hocko
2018-12-20 22:48         ` Andrew Morton
2018-11-14  7:14 ` Michal Hocko
2018-11-14 23:18   ` Andrew Morton
2018-11-15  7:34     ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=048c04ae-7394-d03f-813e-42acdc965dd2@arm.com \
    --to=anshuman.khandual@arm.com \
    --cc=OSalvador@suse.com \
    --cc=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mbenes@suse.cz \
    --cc=mhocko@kernel.org \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.