From: Michal Hocko <mhocko@suse.com>
To: Oscar Salvador <osalvador@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	Rafael Aquini <raquini@redhat.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Wei Yang <richard.weiyang@gmail.com>,
	Dennis Zhou <dennis@kernel.org>,
	Alexey Makhalov <amakhalov@vmware.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/1] arch/x86/mm/numa: Do not initialize nodes twice
Date: Mon, 21 Feb 2022 10:20:02 +0100	[thread overview]
Message-ID: <YhNZQgGSZglGQvcg@dhcp22.suse.cz> (raw)
In-Reply-To: <20220218224302.5282-2-osalvador@suse.de>

On Fri 18-02-22 23:43:02, Oscar Salvador wrote:
> On x86, prior to ("mm: handle uninitialized numa nodes gracefully"),
> NUMA nodes could be allocated at three different places.
> 
> - numa_register_memblks
> - init_cpu_to_node
> - init_gi_nodes
> 
> All these calls happen at setup_arch, and have the following order:
> 
> setup_arch
>   ...
>   x86_numa_init
>    numa_init
>     numa_register_memblks
>   ...
>   init_cpu_to_node
>    init_memory_less_node
>     alloc_node_data
>     free_area_init_memoryless_node
>   init_gi_nodes
>    init_memory_less_node
>     alloc_node_data
>     free_area_init_memoryless_node
> 
> numa_register_memblks() is only interested in those nodes which have memory,
> so it skips over any memoryless node it finds.
> Later on, once we have read ACPI's SRAT table, we call init_cpu_to_node()
> and init_gi_nodes(), which initialize any memoryless node we might have
> that has either CPU or Initiator affinity, meaning we allocate a pg_data_t
> struct for them and mark them as ONLINE.
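
IIUC, what those two callers effectively did for such a node boils down to
something like this (simplified sketch of the flow described above, not the
exact code; see the init_memory_less_node() helper removed further down):

	/* memoryless node that turns out to have a CPU or Generic Initiator */
	if (!node_online(nid)) {
		alloc_node_data(nid);			/* allocate pg_data_t, set node online */
		free_area_init_memoryless_node(nid);	/* initialize its zones */
	}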
> 
> So far so good, but the thing is that after ("mm: handle uninitialized numa
> nodes gracefully"), we allocate all possible NUMA nodes in free_area_init(),
> meaning we have a picture like the following:
> 
> setup_arch
>   x86_numa_init
>    numa_init
>     numa_register_memblks  <-- allocate non-memoryless node
>   x86_init.paging.pagetable_init
>    ...
>     free_area_init
>      free_area_init_memoryless <-- allocate memoryless node
>   init_cpu_to_node
>    alloc_node_data             <-- allocate memoryless node with CPU
>    free_area_init_memoryless_node
>   init_gi_nodes
>    alloc_node_data             <-- allocate memoryless node with Initiator
>    free_area_init_memoryless_node

Thanks for diving into this and double-checking after me. I misread the
ordering and thought free_area_init was the last one to be called.

> free_area_init() already allocates all possible NUMA nodes, but
> init_cpu_to_node() and init_gi_nodes() are clueless about that,
> so they go ahead and allocate a new pg_data_t struct without
> checking anything, meaning we end up allocating twice.
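
And just to spell the leak out (simplified, not the exact code): by this
point free_area_init() has already set up NODE_DATA(nid) for every possible
node, so the unconditional re-allocation throws that pg_data_t away:

	/*
	 * in init_cpu_to_node()/init_gi_nodes(), before this patch:
	 * alloc_node_data() installs a second pg_data_t for nid, and the
	 * one free_area_init() had already created is leaked.
	 */
	if (!node_online(nid))
		init_memory_less_node(nid);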
> 
> It should be made clear that this only happens in the case where a
> memoryless NUMA node happens to have CPU/Initiator affinity.
> 
> So get rid of init_memory_less_node() and just set the node online.
> 
> Note that setting the node online is needed, otherwise we choke further
> down the chain when bringup_nonboot_cpus() ends up calling
> __try_online_node()->register_one_node()->... and we blow up in
> bus_add_device(), as can be seen here:
> 
> =========
> [    0.585060] BUG: kernel NULL pointer dereference, address: 0000000000000060
> [    0.586091] #PF: supervisor read access in kernel mode
> [    0.586831] #PF: error_code(0x0000) - not-present page
> [    0.586930] PGD 0 P4D 0
> [    0.586930] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
> [    0.586930] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45
> [    0.586930] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4
> [    0.586930] RIP: 0010:bus_add_device+0x5a/0x140
> [    0.586930] Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004
> [    0.586930] RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246
> [    0.586930] RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19
> [    0.586930] RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400
> [    0.586930] RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98
> [    0.586930] R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0
> [    0.586930] R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000
> [    0.586930] FS:  0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000
> [    0.586930] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.586930] CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0
> [    0.586930] Call Trace:
> [    0.586930]  <TASK>
> [    0.586930]  device_add+0x4c0/0x910
> [    0.586930]  __register_one_node+0x97/0x2d0
> [    0.586930]  __try_online_node+0x85/0xc0
> [    0.586930]  try_online_node+0x25/0x40
> [    0.586930]  cpu_up+0x4f/0x100
> [    0.586930]  bringup_nonboot_cpus+0x4f/0x60
> [    0.586930]  smp_init+0x26/0x79
> [    0.586930]  kernel_init_freeable+0x130/0x2f1
> [    0.586930]  ? rest_init+0x100/0x100
> [    0.586930]  kernel_init+0x17/0x150
> [    0.586930]  ? rest_init+0x100/0x100
> [    0.586930]  ret_from_fork+0x22/0x30
> [    0.586930]  </TASK>
> [    0.586930] Modules linked in:
> [    0.586930] CR2: 0000000000000060
> [    0.586930] ---[ end trace 0000000000000000 ]---
> =========
> 
> The reason is simple: by the time bringup_nonboot_cpus() gets called,
> we have not registered the node_subsys bus yet, so we crash when
> bus_add_device() tries to dereference bus->p.
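
For completeness, the crash site boils down to something like this (heavily
trimmed sketch, not the exact code): bus_add_device() pokes at dev->bus->p,
the subsys_private that bus_register() allocates, and node_subsys has not
been registered at that point:

	/* sketch: dev->bus is node_subsys when the node device is added */
	struct subsys_private *priv = dev->bus->p;	/* still NULL here */

	klist_add_tail(&dev->p->knode_bus, &priv->klist_devices);	/* oops */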
> 
> The following shows the order of the calls:
> 
> kernel_init_freeable
>  smp_init
>   bringup_nonboot_cpus
>    ...
>      bus_add_device()      <- we did not register node_subsys yet
>  do_basic_setup
>   do_initcalls
>    postcore_initcall(register_node_type);
>     register_node_type
>      subsys_system_register
>       subsys_register
>        bus_register         <- register node_subsys bus
> 
> Why does setting the node online save us then? Simply because
> __try_online_node() backs off when the node is online, meaning we do
> not end up calling register_one_node() in the first place.

This is really a mess and a house built on sand. Thanks for looking into
it and hopefully this can get cleaned up to a saner state.
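
FWIW, the only thing that keeps us afloat here is that early bail-out in
__try_online_node(), roughly (trimmed sketch):

	/* the relevant part of __try_online_node(), heavily trimmed */
	if (node_online(nid))
		return 0;	/* already online: no register_one_node(),
				 * hence no bus_add_device() */

	/* otherwise: allocate pgdat, set the node online, register_one_node(), ... */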

> This is subtle and broken, and deserves deeper analysis and some thought
> about how to put it into shape, but for now let us have this easy fix
> for the memory leak.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully")

This sha1 is very likely from linux-next, so it won't be persistent.
Please drop it.

Other than that
Acked-by: Michal Hocko <mhocko@suse.com>

One nit below

Thanks!

> ---
>  arch/x86/mm/numa.c | 15 ++-------------
>  include/linux/mm.h |  1 -
>  mm/page_alloc.c    |  2 +-
>  3 files changed, 3 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index c6b1213086d6..37039a6af8ae 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -738,17 +738,6 @@ void __init x86_numa_init(void)
>  	numa_init(dummy_numa_init);
>  }
>  
> -static void __init init_memory_less_node(int nid)
> -{
> -	/* Allocate and initialize node data. Memory-less node is now online.*/
> -	alloc_node_data(nid);
> -	free_area_init_memoryless_node(nid);
> -
> -	/*
> -	 * All zonelists will be built later in start_kernel() after per cpu
> -	 * areas are initialized.
> -	 */
> -}
>  
>  /*
>   * A node may exist which has one or more Generic Initiators but no CPUs and no
> @@ -768,7 +757,7 @@ void __init init_gi_nodes(void)
>  
>  	for_each_node_state(nid, N_GENERIC_INITIATOR)
>  		if (!node_online(nid))
> -			init_memory_less_node(nid);
> +			node_set_online(nid);

I would stick a TODO here.
			/*
			 * Exclude this node from 
			 * bringup_nonboot_cpus
			 *  cpu_up
			 *    __try_online_node
			 *      register_one_node
			 * because node_subsys is not initialized yet
			 * TODO remove dependency on node_online()
			 */
>  }
>  
>  /*
> @@ -799,7 +788,7 @@ void __init init_cpu_to_node(void)
>  			continue;
>  
>  		if (!node_online(node))
> -			init_memory_less_node(node);
> +			node_set_online(node);

and here as well.
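
i.e. something along these lines, just to make it explicit:

		/*
		 * Same as in init_gi_nodes(): keep the node out of
		 * bringup_nonboot_cpus
		 *  cpu_up
		 *   __try_online_node
		 *    register_one_node
		 * because node_subsys is not initialized yet.
		 * TODO remove dependency on node_online()
		 */
		if (!node_online(node))
			node_set_online(node);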
-- 
Michal Hocko
SUSE Labs

Thread overview: 5+ messages
2022-02-18 22:43 [PATCH 0/1] Fix allocating nodes twice on x86 Oscar Salvador
2022-02-18 22:43 ` [PATCH 1/1] arch/x86/mm/numa: Do not initialize nodes twice Oscar Salvador
2022-02-21  9:20   ` Michal Hocko [this message]
2022-02-21  9:47     ` Oscar Salvador
2022-02-21 13:32       ` Michal Hocko
