Linux-mm Archive on lore.kernel.org
From: David Hildenbrand <david@redhat.com>
To: "Michal Hocko" <mhocko@kernel.org>,
	"Michal Suchánek" <msuchanek@suse.de>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>,
	Mel Gorman <mgorman@suse.de>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Andrew Morton <akpm@linux-foundation.org>,
	linuxppc-dev@lists.ozlabs.org, Christopher Lameter <cl@linux.com>,
	Vlastimil Babka <vbabka@suse.cz>, Andi Kleen <ak@linux.intel.com>
Subject: Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
Date: Fri, 3 Jul 2020 13:32:21 +0200
Message-ID: <3f926058-cabc-94d0-0f92-4e966ea4cdc3@redhat.com>
In-Reply-To: <20200703105944.GS18446@dhcp22.suse.cz>

On 03.07.20 12:59, Michal Hocko wrote:
> On Fri 03-07-20 11:24:17, Michal Hocko wrote:
>> [Cc Andi]
>>
>> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
>>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
>>>> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
>> [...]
>>>>> Yep, looks like it.
>>>>>
>>>>> [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
>>>>> [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
>>>>> [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
>>>>> [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
>>>>> [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
>>>>> [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
>>>>> [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
>>>>
>>>> This raises the question of whether ppc can do the same thing.
>>> Or could x86 stop doing it, so that you can see which node you are running on?
>>>
>>> What's the point of this indirection other than another way of avoiding
>>> empty node 0?
>>
>> Honestly, I do not have any idea. I've traced it down to
>> Author: Andi Kleen <ak@suse.de>
>> Date:   Tue Jan 11 15:35:48 2005 -0800
>>
>>     [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
>>
>>     Fix fallout from the recent nodemask_t changes. The node ids assigned
>>     in the SRAT parser were off by one.
>>
>>     I added a new first_unset_node() function to nodemask.h to allocate
>>     IDs sanely.
>>
>>     Signed-off-by: Andi Kleen <ak@suse.de>
>>     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
>>
>> which doesn't really tell us much. I suspect this is historical baggage
>> and long-standing behavior that is not trivial to fix.
> 
> Thinking about this some more, this logic makes some sense after all,
> especially in a world without memory hotplug, which was very likely the
> case back then. It is much better to have a compact node mask than a
> sparse one. After all, node numbers shouldn't really matter as long as
> you have a clear mapping to the HW. I am not sure we export that
> information (except in the kernel ring buffer), though.
> 
> Memory hotplug changes that somewhat, because you can hot-remove NUMA
> nodes and thereby make the nodemask sparse, but that is not a common
> case. I am not sure what would happen if a completely new node were added
> and its corresponding node ID were already used by the renumbered one,
> though. It would likely conflate the two, I am afraid. But I am not sure
> this is really possible on x86, and the lack of a bug report would
> suggest that nobody is doing that, at least.
> 

I think the ACPI code takes care of properly mapping PXM to nodes.
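
As a rough illustration of that mapping, here is a minimal userspace model.
It assumes node IDs are handed out the way the quoted 2005 commit describes:
each PXM seen for the first time gets the lowest node ID not yet in use.
PxmMap, map_pxm and max_nodes are hypothetical names for this sketch, not
kernel API.

```python
# Userspace model only, not kernel code. All names are hypothetical.
NO_NODE = -1


class PxmMap:
    """Hands out compact node IDs to proximity domains (PXMs) on first use."""

    def __init__(self, max_nodes=4):
        self.max_nodes = max_nodes
        self.pxm_to_node = {}     # PXM -> node ID
        self.nodes_found = set()  # node IDs already handed out

    def first_unset_node(self):
        # Lowest node ID that has not been handed out yet.
        for nid in range(self.max_nodes):
            if nid not in self.nodes_found:
                return nid
        return NO_NODE

    def map_pxm(self, pxm):
        # Reuse an existing mapping; otherwise take the lowest free node ID.
        if pxm in self.pxm_to_node:
            return self.pxm_to_node[pxm]
        nid = self.first_unset_node()
        if nid == NO_NODE:
            return NO_NODE
        self.pxm_to_node[pxm] = nid
        self.nodes_found.add(nid)
        return nid


m = PxmMap()
boot_node = m.map_pxm(1)     # only PXM 1 is populated at boot
hotplug_node = m.map_pxm(0)  # PXM 0 shows up later via hotplug
print(boot_node, hotplug_node)
```

In this model the PXM that is seen first always ends up as node 0, and a
PXM that appears later gets the next free ID regardless of its own number.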

So if I start with PXM 0 empty and PXM 1 populated, I get
PXM 1 == node 0, as described. Once I hotplug memory to PXM 0 in QEMU

$ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U /var/tmp/monitor
$ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U /var/tmp/monitor

$ echo "info numa" | sudo nc -U /var/tmp/monitor
QEMU 5.0.50 monitor - type 'help' for more information
(qemu) info numa
2 nodes
node 0 cpus:
node 0 size: 1024 MB
node 0 plugged: 1024 MB
node 1 cpus: 0 1 2 3
node 1 size: 4096 MB
node 1 plugged: 0 MB

I get in the guest:

[   50.174435] ------------[ cut here ]------------
[   50.175436] node 1 was absent from the node_possible_map
[   50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 add_memory_resource+0x8c/0x290
[   50.176844] Modules linked in:
[   50.176845] CPU: 0 PID: 7 Comm: kworker/u8:0 Not tainted 5.8.0-rc2+ #4
[   50.176846] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
[   50.176846] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   50.176847] RIP: 0010:add_memory_resource+0x8c/0x290
[   50.176849] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 63 c5 48 89 04 24 48 0f a3 05 94 6c 1c 01 72 17 89 ee 48 c78
[   50.176849] RSP: 0018:ffffa7a1c0043d48 EFLAGS: 00010296
[   50.176850] RAX: 000000000000002c RBX: ffff8bc633e63b80 RCX: 0000000000000000
[   50.176851] RDX: ffff8bc63bc27060 RSI: ffff8bc63bc18d00 RDI: ffff8bc63bc18d00
[   50.176851] RBP: 0000000000000001 R08: 00000000000001e1 R09: ffffa7a1c0043bd8
[   50.176852] R10: 0000000000000005 R11: 0000000000000000 R12: 0000000140000000
[   50.176852] R13: 000000017fffffff R14: 0000000040000000 R15: 0000000180000000
[   50.176853] FS:  0000000000000000(0000) GS:ffff8bc63bc00000(0000) knlGS:0000000000000000
[   50.176853] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.176855] CR2: 000055dfcbfc5ee8 CR3: 00000000aca0a000 CR4: 00000000000006f0
[   50.176855] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   50.176856] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   50.176856] Call Trace:
[   50.176856]  __add_memory+0x33/0x70
[   50.176857]  acpi_memory_device_add+0x132/0x2f2
[   50.176857]  acpi_bus_attach+0xd2/0x200
[   50.176858]  acpi_bus_scan+0x33/0x70
[   50.176858]  acpi_device_hotplug+0x298/0x390
[   50.176858]  acpi_hotplug_work_fn+0x3d/0x50
[   50.176859]  process_one_work+0x1b4/0x370
[   50.176859]  worker_thread+0x53/0x3e0
[   50.176860]  ? process_one_work+0x370/0x370
[   50.176860]  kthread+0x119/0x140
[   50.176860]  ? __kthread_bind_mask+0x60/0x60
[   50.176861]  ret_from_fork+0x22/0x30
[   50.176861] ---[ end trace 9a2a837c1e0164f1 ]---
[   50.209816] acpi PNP0C80:00: add_memory failed
[   50.210510] acpi PNP0C80:00: acpi_memory_enable_device() error
[   50.211445] acpi PNP0C80:00: Enumeration failure


I remember that we added that check just recently (due to powerpc, if I am not mistaken).
I am not sure why it triggers here.

But it properly maps PXM 0 to node 1.
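
The failure path in the log above (the warning, followed by "add_memory
failed") is what one would expect if the hotplug path rejects any node ID
that is not set in node_possible_map. A minimal userspace model of just
that check follows. It assumes the possible map contains only node 0,
which is what the warning implies but not something established here, and
it does not explain why node 1 is missing. The Python function only
borrows the kernel symbol's name for readability.

```python
import errno


def add_memory_resource(nid, node_possible_map):
    """Model of the sanity check only: refuse nodes outside the possible map."""
    if nid not in node_possible_map:
        print(f"node {nid} was absent from the node_possible_map")
        return -errno.EINVAL
    return 0


# Assumption for this sketch: only node 0 was marked possible at boot.
node_possible_map = {0}

rc_existing = add_memory_resource(0, node_possible_map)  # boot-time node
rc_hotplug = add_memory_resource(1, node_possible_map)   # node for hotplugged PXM 0
```

Under that assumption the PXM-to-node mapping itself is fine (PXM 0 becomes
node 1), and the hotplug request is refused later because node 1 was never
marked possible.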

-- 
Thanks,

David / dhildenb



