All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Michal Hocko <mhocko@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>,
	Alexey Makhalov <amakhalov@vmware.com>,
	Dennis Zhou <dennis@kernel.org>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	Oscar Salvador <osalvador@suse.de>, Tejun Heo <tj@kernel.org>,
	Christoph Lameter <cl@linux.com>, Nico Pache <npache@redhat.com>,
	Wei Yang <richard.weiyang@gmail.com>,
	Rafael Aquini <raquini@redhat.com>,
	Michal Hocko <mhocko@suse.com>
Subject: Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
Date: Thu, 27 Jan 2022 13:41:16 +0100	[thread overview]
Message-ID: <bd0a9f4e-d204-6a06-ba3f-11acb6ac16c0@redhat.com> (raw)
In-Reply-To: <20220127085305.20890-3-mhocko@kernel.org>

On 27.01.22 09:53, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> We have had several reports [1][2][3] that page allocator blows up when
> an allocation from a possible node is requested. The underlying reason
> is that NODE_DATA for the specific node is not allocated.
> 
> NUMA specific initialization is arch specific and it can vary a lot.
> E.g. x86 tries to initialize all nodes that have some cpu affinity (see
> init_cpu_to_node) but this can be insufficient because the node might be
> cpuless for example.
> 
> One way to address this problem would be to check for !node_online nodes
> when trying to get a zonelist and silently fall back to another node.
> That is unfortunately adding a branch into allocator hot path and it
> doesn't handle any other potential NODE_DATA users.
> 
> This patch takes a different approach (following a lead of [3]) and it
> pre allocates pgdat for all possible nodes in an arch indipendent code
> - free_area_init. All uninitialized nodes are treated as memoryless
> nodes. node_state of the node is not changed because that would lead to
> other side effects - e.g. sysfs representation of such a node and from
> past discussions [4] it is known that some tools might have problems
> digesting that.
> 
> Newly allocated pgdat only gets a minimal initialization and the rest of
> the work is expected to be done by the memory hotplug - hotadd_new_pgdat
> (renamed to hotadd_init_pgdat).
> 
> generic_alloc_nodedata is changed to use the memblock allocator because
> neither page nor slab allocators are available at the stage when all
> pgdats are allocated. Hotplug doesn't allocate pgdat anymore so we can
> use the early boot allocator. The only arch specific implementation is
> ia64 and that is changed to use the early allocator as well.
> 
> Reported-by: Alexey Makhalov <amakhalov@vmware.com>
> Tested-by: Alexey Makhalov <amakhalov@vmware.com>
> Reported-by: Nico Pache <npache@redhat.com>
> Acked-by: Rafael Aquini <raquini@redhat.com>
> Tested-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> [1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com
> [2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com
> [3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org
> [4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com
> ---
>  arch/ia64/mm/discontig.c       |  4 ++--
>  include/linux/memory_hotplug.h |  2 +-
>  mm/memory_hotplug.c            | 21 +++++++++------------
>  mm/page_alloc.c                | 34 +++++++++++++++++++++++++++++++---
>  4 files changed, 43 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
> index 8dc8a554f774..dd0cf4834eaa 100644
> --- a/arch/ia64/mm/discontig.c
> +++ b/arch/ia64/mm/discontig.c
> @@ -608,11 +608,11 @@ void __init paging_init(void)
>  	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
>  }
>  
> -pg_data_t *arch_alloc_nodedata(int nid)
> +pg_data_t * __init arch_alloc_nodedata(int nid)
>  {
>  	unsigned long size = compute_pernodesize(nid);
>  
> -	return kzalloc(size, GFP_KERNEL);
> +	return memblock_alloc(size, SMP_CACHE_BYTES);

I feel like we should have

long arch_pgdat_size(void) instead and have a generic allocation function.

But we can clean that up in the future.

>  }
>  
>  void arch_free_nodedata(pg_data_t *pgdat)
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 4355983b364d..cdd66bfdf855 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -44,7 +44,7 @@ extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
>   */
>  #define generic_alloc_nodedata(nid)				\
>  ({								\
> -	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
> +	memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);	\
>  })
>  /*
>   * This definition is just for error path in node hotadd.
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 2a9627dc784c..fc991831d296 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1162,19 +1162,21 @@ static void reset_node_present_pages(pg_data_t *pgdat)
>  }
>  
>  /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
> -static pg_data_t __ref *hotadd_new_pgdat(int nid)
> +static pg_data_t __ref *hotadd_init_pgdat(int nid)
>  {
>  	struct pglist_data *pgdat;
>  
>  	pgdat = NODE_DATA(nid);
> -	if (!pgdat) {
> -		pgdat = arch_alloc_nodedata(nid);
> -		if (!pgdat)
> -			return NULL;
>  
> +	/*
> +	 * NODE_DATA is preallocated (free_area_init) but its internal
> +	 * state is not allocated completely. Add missing pieces.
> +	 * Completely offline nodes stay around and they just need
> +	 * reintialization.
> +	 */
> +	if (pgdat->per_cpu_nodestats == &boot_nodestats) {
>  		pgdat->per_cpu_nodestats =
>  			alloc_percpu(struct per_cpu_nodestat);
> -		arch_refresh_nodedata(nid, pgdat);
>  	} else {
>  		int cpu;
>  		/*
> @@ -1193,8 +1195,6 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid)
>  		}
>  	}
>  
> -	/* we can use NODE_DATA(nid) from here */
> -	pgdat->node_id = nid;
>  	pgdat->node_start_pfn = 0;
>  
>  	/* init node's zones as empty zones, we don't have any present pages.*/
> @@ -1246,7 +1246,7 @@ static int __try_online_node(int nid, bool set_node_online)
>  	if (node_online(nid))
>  		return 0;
>  
> -	pgdat = hotadd_new_pgdat(nid);
> +	pgdat = hotadd_init_pgdat(nid);
>  	if (!pgdat) {
>  		pr_err("Cannot online node %d due to NULL pgdat\n", nid);
>  		ret = -ENOMEM;
> @@ -1445,9 +1445,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  
>  	return ret;
>  error:
> -	/* rollback pgdat allocation and others */
> -	if (new_node)
> -		rollback_node_hotadd(nid);

As static rollback_node_hotadd() is unused in this patch, doesn't this
trigger a warning? IOW, maybe merge at least the rollback_node_hotadd()
removal into this patch. The arch_free_nodedata() removal can stay separate.

>  	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
>  		memblock_remove(start, size);
>  error_mem_hotplug_end:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589febc6d31..1a05669044d3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6402,7 +6402,11 @@ static void __build_all_zonelists(void *data)
>  	if (self && !node_online(self->node_id)) {
>  		build_zonelists(self);
>  	} else {
> -		for_each_online_node(nid) {
> +		/*
> +		 * All possible nodes have pgdat preallocated

... in free_area_init() ?

> +		 * free_area_init
> +		 */
> +		for_each_node(nid) {
>  			pg_data_t *pgdat = NODE_DATA(nid);
>  
>  			build_zonelists(pgdat);
> @@ -8096,8 +8100,32 @@ void __init free_area_init(unsigned long *max_zone_pfn)
>  	/* Initialise every node */
>  	mminit_verify_pageflags_layout();
>  	setup_nr_node_ids();
> -	for_each_online_node(nid) {
> -		pg_data_t *pgdat = NODE_DATA(nid);
> +	for_each_node(nid) {
> +		pg_data_t *pgdat;
> +
> +		if (!node_online(nid)) {
> +			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);
> +
> +			/* Allocator not initialized yet */
> +			pgdat = arch_alloc_nodedata(nid);
> +			if (!pgdat) {
> +				pr_err("Cannot allocate %zuB for node %d.\n",
> +						sizeof(*pgdat), nid);
> +				continue;
> +			}
> +			arch_refresh_nodedata(nid, pgdat);

We could get rid of arch_refresh_nodedata() now and simply merge that
into arch_alloc_nodedata(). But depends on how we want to proceed with
arch_alloc_nodedata() eventually.


Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


  reply	other threads:[~2022-01-27 12:41 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-27  8:52 [PATCH 0/6] mm, memory_hotplug: handle unitialized numa node gracefully Michal Hocko
2022-01-27  8:53 ` [PATCH 1/6] mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG Michal Hocko
2022-01-27 12:27   ` David Hildenbrand
2022-01-27 13:36   ` Mike Rapoport
2022-01-28  6:18   ` Oscar Salvador
2022-02-01  2:13   ` Wei Yang
2022-01-27  8:53 ` [PATCH 2/6] mm: handle uninitialized numa nodes gracefully Michal Hocko
2022-01-27 12:41   ` David Hildenbrand [this message]
2022-01-27 14:50     ` Michal Hocko
2022-01-27 13:37   ` Mike Rapoport
2022-01-27 14:47     ` Michal Hocko
2022-01-27 15:04       ` Mike Rapoport
2022-02-01  2:41       ` Wei Yang
2022-02-01  9:54         ` Michal Hocko
2022-02-03  0:21           ` Wei Yang
2022-02-03  7:23             ` Mike Rapoport
2022-02-03  8:27               ` David Hildenbrand
2022-02-03  9:08                 ` Mike Rapoport
2022-02-03  9:11                   ` David Hildenbrand
2022-02-03  9:19                     ` Michal Hocko
2022-02-03  9:21                       ` David Hildenbrand
2022-02-03  8:39               ` Michal Hocko
2022-02-04 22:54                 ` Andrew Morton
2022-01-28  6:27   ` Oscar Salvador
2022-01-31 10:34   ` Michal Hocko
2022-01-27  8:53 ` [PATCH 3/6] mm, memory_hotplug: drop arch_free_nodedata Michal Hocko
2022-01-27 12:42   ` David Hildenbrand
2022-01-27 13:37   ` Mike Rapoport
2022-01-28  6:29   ` Oscar Salvador
2022-02-01  2:41   ` Wei Yang
2022-01-27  8:53 ` [PATCH 4/6] mm, memory_hotplug: reorganize new pgdat initialization Michal Hocko
2022-01-27 12:46   ` David Hildenbrand
2022-01-27 14:44     ` Michal Hocko
2022-02-17 10:40       ` Oscar Salvador
2022-01-27 13:39   ` Mike Rapoport
2022-01-27 14:45     ` Michal Hocko
2022-01-28 10:51   ` Oscar Salvador
2022-02-01  2:42   ` Wei Yang
2022-01-27  8:53 ` [PATCH 5/6] mm: make free_area_init_node aware of memory less nodes Michal Hocko
2022-01-27 12:47   ` David Hildenbrand
2022-01-27 13:34   ` Mike Rapoport
2022-01-28 10:59   ` Oscar Salvador
2022-02-01  2:43   ` Wei Yang
2022-01-27  8:53 ` [PATCH 6/6] memcg: do not tweak node in alloc_mem_cgroup_per_node_info Michal Hocko
2022-01-27 12:50   ` David Hildenbrand
2022-01-28 11:01   ` Oscar Salvador
2022-02-01  2:45   ` Wei Yang
2022-02-01  9:51     ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bd0a9f4e-d204-6a06-ba3f-11acb6ac16c0@redhat.com \
    --to=david@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=amakhalov@vmware.com \
    --cc=cl@linux.com \
    --cc=dennis@kernel.org \
    --cc=eric.dumazet@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=mhocko@suse.com \
    --cc=npache@redhat.com \
    --cc=osalvador@suse.de \
    --cc=raquini@redhat.com \
    --cc=richard.weiyang@gmail.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.