linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH 1/6] mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG
       [not found] ` <20220127085305.20890-2-mhocko@kernel.org>
@ 2022-01-27 12:27   ` David Hildenbrand
  2022-01-27 13:36   ` Mike Rapoport
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2022-01-27 12:27 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: linux-mm, LKML, Alexey Makhalov, Dennis Zhou, Eric Dumazet,
	Oscar Salvador, Tejun Heo, Christoph Lameter, Nico Pache,
	Wei Yang, Rafael Aquini, Michal Hocko

On 27.01.22 09:53, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> This is a preparatory patch and it doesn't introduce any functional
> change. It merely pulls out arch_alloc_nodedata (and co) outside of
> CONFIG_MEMORY_HOTPLUG because the following patch will need to call this
> from the generic MM code.
> 

I would have hoped we'd have dropped support for ia64 by now, to get rid
of this special case ...

> Acked-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  arch/ia64/mm/discontig.c       |   2 -
>  include/linux/memory_hotplug.h | 119 ++++++++++++++++-----------------
>  2 files changed, 59 insertions(+), 62 deletions(-)

Acked-by: David Hildenbrand <david@redhat.com>


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
       [not found] ` <20220127085305.20890-3-mhocko@kernel.org>
@ 2022-01-27 12:41   ` David Hildenbrand
  2022-01-27 14:50     ` Michal Hocko
  2022-01-27 13:37   ` Mike Rapoport
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2022-01-27 12:41 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: linux-mm, LKML, Alexey Makhalov, Dennis Zhou, Eric Dumazet,
	Oscar Salvador, Tejun Heo, Christoph Lameter, Nico Pache,
	Wei Yang, Rafael Aquini, Michal Hocko

On 27.01.22 09:53, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> We have had several reports [1][2][3] that page allocator blows up when
> an allocation from a possible node is requested. The underlying reason
> is that NODE_DATA for the specific node is not allocated.
> 
> NUMA specific initialization is arch specific and it can vary a lot.
> E.g. x86 tries to initialize all nodes that have some cpu affinity (see
> init_cpu_to_node) but this can be insufficient because the node might be
> cpuless for example.
> 
> One way to address this problem would be to check for !node_online nodes
> when trying to get a zonelist and silently fall back to another node.
> That is unfortunately adding a branch into allocator hot path and it
> doesn't handle any other potential NODE_DATA users.
> 
> This patch takes a different approach (following a lead of [3]) and it
> pre allocates pgdat for all possible nodes in an arch indipendent code
> - free_area_init. All uninitialized nodes are treated as memoryless
> nodes. node_state of the node is not changed because that would lead to
> other side effects - e.g. sysfs representation of such a node and from
> past discussions [4] it is known that some tools might have problems
> digesting that.
> 
> Newly allocated pgdat only gets a minimal initialization and the rest of
> the work is expected to be done by the memory hotplug - hotadd_new_pgdat
> (renamed to hotadd_init_pgdat).
> 
> generic_alloc_nodedata is changed to use the memblock allocator because
> neither page nor slab allocators are available at the stage when all
> pgdats are allocated. Hotplug doesn't allocate pgdat anymore so we can
> use the early boot allocator. The only arch specific implementation is
> ia64 and that is changed to use the early allocator as well.
> 
> Reported-by: Alexey Makhalov <amakhalov@vmware.com>
> Tested-by: Alexey Makhalov <amakhalov@vmware.com>
> Reported-by: Nico Pache <npache@redhat.com>
> Acked-by: Rafael Aquini <raquini@redhat.com>
> Tested-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> [1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com
> [2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com
> [3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org
> [4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com
> ---
>  arch/ia64/mm/discontig.c       |  4 ++--
>  include/linux/memory_hotplug.h |  2 +-
>  mm/memory_hotplug.c            | 21 +++++++++------------
>  mm/page_alloc.c                | 34 +++++++++++++++++++++++++++++++---
>  4 files changed, 43 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
> index 8dc8a554f774..dd0cf4834eaa 100644
> --- a/arch/ia64/mm/discontig.c
> +++ b/arch/ia64/mm/discontig.c
> @@ -608,11 +608,11 @@ void __init paging_init(void)
>  	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
>  }
>  
> -pg_data_t *arch_alloc_nodedata(int nid)
> +pg_data_t * __init arch_alloc_nodedata(int nid)
>  {
>  	unsigned long size = compute_pernodesize(nid);
>  
> -	return kzalloc(size, GFP_KERNEL);
> +	return memblock_alloc(size, SMP_CACHE_BYTES);

I feel like we should have

long arch_pgdat_size(void) instead and have a generic allocation function.

But we can clean that up in the future.

>  }
>  
>  void arch_free_nodedata(pg_data_t *pgdat)
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 4355983b364d..cdd66bfdf855 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -44,7 +44,7 @@ extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
>   */
>  #define generic_alloc_nodedata(nid)				\
>  ({								\
> -	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
> +	memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);	\
>  })
>  /*
>   * This definition is just for error path in node hotadd.
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 2a9627dc784c..fc991831d296 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1162,19 +1162,21 @@ static void reset_node_present_pages(pg_data_t *pgdat)
>  }
>  
>  /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
> -static pg_data_t __ref *hotadd_new_pgdat(int nid)
> +static pg_data_t __ref *hotadd_init_pgdat(int nid)
>  {
>  	struct pglist_data *pgdat;
>  
>  	pgdat = NODE_DATA(nid);
> -	if (!pgdat) {
> -		pgdat = arch_alloc_nodedata(nid);
> -		if (!pgdat)
> -			return NULL;
>  
> +	/*
> +	 * NODE_DATA is preallocated (free_area_init) but its internal
> +	 * state is not allocated completely. Add missing pieces.
> +	 * Completely offline nodes stay around and they just need
> +	 * reintialization.
> +	 */
> +	if (pgdat->per_cpu_nodestats == &boot_nodestats) {
>  		pgdat->per_cpu_nodestats =
>  			alloc_percpu(struct per_cpu_nodestat);
> -		arch_refresh_nodedata(nid, pgdat);
>  	} else {
>  		int cpu;
>  		/*
> @@ -1193,8 +1195,6 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid)
>  		}
>  	}
>  
> -	/* we can use NODE_DATA(nid) from here */
> -	pgdat->node_id = nid;
>  	pgdat->node_start_pfn = 0;
>  
>  	/* init node's zones as empty zones, we don't have any present pages.*/
> @@ -1246,7 +1246,7 @@ static int __try_online_node(int nid, bool set_node_online)
>  	if (node_online(nid))
>  		return 0;
>  
> -	pgdat = hotadd_new_pgdat(nid);
> +	pgdat = hotadd_init_pgdat(nid);
>  	if (!pgdat) {
>  		pr_err("Cannot online node %d due to NULL pgdat\n", nid);
>  		ret = -ENOMEM;
> @@ -1445,9 +1445,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  
>  	return ret;
>  error:
> -	/* rollback pgdat allocation and others */
> -	if (new_node)
> -		rollback_node_hotadd(nid);

As static rollback_node_hotadd() is unused in this patch, doesn't this
trigger a warning? IOW, maybe merge at least the rollback_node_hotadd()
removal into this patch. The arch_free_nodedata() removal can stay separate.

>  	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
>  		memblock_remove(start, size);
>  error_mem_hotplug_end:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589febc6d31..1a05669044d3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6402,7 +6402,11 @@ static void __build_all_zonelists(void *data)
>  	if (self && !node_online(self->node_id)) {
>  		build_zonelists(self);
>  	} else {
> -		for_each_online_node(nid) {
> +		/*
> +		 * All possible nodes have pgdat preallocated

... in free_area_init() ?

> +		 * free_area_init
> +		 */
> +		for_each_node(nid) {
>  			pg_data_t *pgdat = NODE_DATA(nid);
>  
>  			build_zonelists(pgdat);
> @@ -8096,8 +8100,32 @@ void __init free_area_init(unsigned long *max_zone_pfn)
>  	/* Initialise every node */
>  	mminit_verify_pageflags_layout();
>  	setup_nr_node_ids();
> -	for_each_online_node(nid) {
> -		pg_data_t *pgdat = NODE_DATA(nid);
> +	for_each_node(nid) {
> +		pg_data_t *pgdat;
> +
> +		if (!node_online(nid)) {
> +			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);
> +
> +			/* Allocator not initialized yet */
> +			pgdat = arch_alloc_nodedata(nid);
> +			if (!pgdat) {
> +				pr_err("Cannot allocate %zuB for node %d.\n",
> +						sizeof(*pgdat), nid);
> +				continue;
> +			}
> +			arch_refresh_nodedata(nid, pgdat);

We could get rid of arch_refresh_nodedata() now and simply merge that
into arch_alloc_nodedata(). But depends on how we want to proceed with
arch_alloc_nodedata() eventually.


Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/6] mm, memory_hotplug: drop arch_free_nodedata
       [not found] ` <20220127085305.20890-4-mhocko@kernel.org>
@ 2022-01-27 12:42   ` David Hildenbrand
  2022-01-27 13:37   ` Mike Rapoport
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2022-01-27 12:42 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: linux-mm, LKML, Alexey Makhalov, Dennis Zhou, Eric Dumazet,
	Oscar Salvador, Tejun Heo, Christoph Lameter, Nico Pache,
	Wei Yang, Rafael Aquini, Michal Hocko

On 27.01.22 09:53, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Prior to "mm: handle uninitialized numa nodes gracefully" memory hotplug
> used to allocate pgdat when memory has been added to a node
> (hotadd_init_pgdat) arch_free_nodedata has been only used in the
> failure path because once the pgdat is exported (to be visible
> by NODA_DATA(nid)) it cannot really be freed because there is no
> synchronization available for that.
> 
> pgdat is allocated for each possible nodes now so the memory hotplug
> doesn't need to do the ever use arch_free_nodedata so drop it.
> 
> This patch doesn't introduce any functional change.
> 
> Acked-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  arch/ia64/mm/discontig.c       |  5 -----
>  include/linux/memory_hotplug.h |  3 ---
>  mm/memory_hotplug.c            | 10 ----------
>  3 files changed, 18 deletions(-)
> 
> diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
> index dd0cf4834eaa..73d0db36edb6 100644
> --- a/arch/ia64/mm/discontig.c
> +++ b/arch/ia64/mm/discontig.c
> @@ -615,11 +615,6 @@ pg_data_t * __init arch_alloc_nodedata(int nid)
>  	return memblock_alloc(size, SMP_CACHE_BYTES);
>  }
>  
> -void arch_free_nodedata(pg_data_t *pgdat)
> -{
> -	kfree(pgdat);
> -}
> -
>  void arch_refresh_nodedata(int update_node, pg_data_t *update_pgdat)
>  {
>  	pgdat_list[update_node] = update_pgdat;
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index cdd66bfdf855..60f09d3ebb3d 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -24,17 +24,14 @@ struct vmem_altmap;
>   * node_data[nid] = kzalloc() works well. But it depends on the architecture.
>   *
>   * In general, generic_alloc_nodedata() is used.
> - * Now, arch_free_nodedata() is just defined for error path of node_hot_add.
>   *
>   */
>  extern pg_data_t *arch_alloc_nodedata(int nid);
> -extern void arch_free_nodedata(pg_data_t *pgdat);
>  extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
>  
>  #else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
>  
>  #define arch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
> -#define arch_free_nodedata(pgdat)	generic_free_nodedata(pgdat)
>  
>  #ifdef CONFIG_NUMA
>  /*
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index fc991831d296..875cdc7ffa58 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1217,16 +1217,6 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
>  	return pgdat;
>  }
>  
> -static void rollback_node_hotadd(int nid)
> -{
> -	pg_data_t *pgdat = NODE_DATA(nid);
> -
> -	arch_refresh_nodedata(nid, NULL);
> -	free_percpu(pgdat->per_cpu_nodestats);
> -	arch_free_nodedata(pgdat);
> -}

As mentioned, maybe we want to rip out rollback_node_hotadd() in the
previous patch.

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] mm, memory_hotplug: reorganize new pgdat initialization
       [not found] ` <20220127085305.20890-5-mhocko@kernel.org>
@ 2022-01-27 12:46   ` David Hildenbrand
  2022-01-27 14:44     ` Michal Hocko
  2022-01-27 13:39   ` Mike Rapoport
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2022-01-27 12:46 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: linux-mm, LKML, Alexey Makhalov, Dennis Zhou, Eric Dumazet,
	Oscar Salvador, Tejun Heo, Christoph Lameter, Nico Pache,
	Wei Yang, Rafael Aquini, Michal Hocko

On 27.01.22 09:53, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> When a !node_online node is brought up it needs a hotplug specific
> initialization because the node could be either uninitialized yet or it
> could have been recycled after previous hotremove. hotadd_init_pgdat is
> responsible for that.
> 
> Internal pgdat state is initialized at two places currently
> 	- hotadd_init_pgdat
> 	- free_area_init_core_hotplug
> There is no real clear cut what should go where but this patch's chosen to
> move the whole internal state initialization into free_area_init_core_hotplug.
> hotadd_init_pgdat is still responsible to pull all the parts together -
> most notably to initialize zonelists because those depend on the overall topology.
> 
> This patch doesn't introduce any functional change.
> 
> Acked-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/memory_hotplug.h |  2 +-
>  mm/memory_hotplug.c            | 28 +++-------------------------
>  mm/page_alloc.c                | 25 +++++++++++++++++++++++--
>  3 files changed, 27 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 60f09d3ebb3d..76bf2de86def 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -319,7 +319,7 @@ extern void set_zone_contiguous(struct zone *zone);
>  extern void clear_zone_contiguous(struct zone *zone);
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -extern void __ref free_area_init_core_hotplug(int nid);
> +extern void __ref free_area_init_core_hotplug(struct pglist_data *pgdat);
>  extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
>  extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
>  extern int add_memory_resource(int nid, struct resource *resource,
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 875cdc7ffa58..ddc62f8b591f 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1166,39 +1166,16 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
>  {
>  	struct pglist_data *pgdat;
>  
> -	pgdat = NODE_DATA(nid);
> -
>  	/*
>  	 * NODE_DATA is preallocated (free_area_init) but its internal
>  	 * state is not allocated completely. Add missing pieces.
>  	 * Completely offline nodes stay around and they just need
>  	 * reintialization.
>  	 */
> -	if (pgdat->per_cpu_nodestats == &boot_nodestats) {
> -		pgdat->per_cpu_nodestats =
> -			alloc_percpu(struct per_cpu_nodestat);
> -	} else {
> -		int cpu;
> -		/*
> -		 * Reset the nr_zones, order and highest_zoneidx before reuse.
> -		 * Note that kswapd will init kswapd_highest_zoneidx properly
> -		 * when it starts in the near future.
> -		 */
> -		pgdat->nr_zones = 0;
> -		pgdat->kswapd_order = 0;
> -		pgdat->kswapd_highest_zoneidx = 0;
> -		for_each_online_cpu(cpu) {
> -			struct per_cpu_nodestat *p;
> -
> -			p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
> -			memset(p, 0, sizeof(*p));
> -		}
> -	}
> -
> -	pgdat->node_start_pfn = 0;
> +	pgdat = NODE_DATA(nid);
>  
>  	/* init node's zones as empty zones, we don't have any present pages.*/
> -	free_area_init_core_hotplug(nid);
> +	free_area_init_core_hotplug(pgdat);
>  
>  	/*
>  	 * The node we allocated has no zone fallback lists. For avoiding
> @@ -1210,6 +1187,7 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
>  	 * When memory is hot-added, all the memory is in offline state. So
>  	 * clear all zones' present_pages because they will be updated in
>  	 * online_pages() and offline_pages().
> +	 * TODO: should be in free_area_init_core_hotplug?
>  	 */
>  	reset_node_managed_pages(pgdat);
>  	reset_node_present_pages(pgdat);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1a05669044d3..32d0189de4c5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7506,12 +7506,33 @@ static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx,
>   * NOTE: this function is only called during memory hotplug
>   */
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -void __ref free_area_init_core_hotplug(int nid)
> +void __ref free_area_init_core_hotplug(struct pglist_data *pgdat)
>  {
> +	int nid = pgdat->node_id;
>  	enum zone_type z;
> -	pg_data_t *pgdat = NODE_DATA(nid);
> +	int cpu;
>  
>  	pgdat_init_internals(pgdat);
> +
> +	if (pgdat->per_cpu_nodestats == &boot_nodestats)
> +		pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
> +
> +	/*
> +	 * Reset the nr_zones, order and highest_zoneidx before reuse.
> +	 * Note that kswapd will init kswapd_highest_zoneidx properly
> +	 * when it starts in the near future.
> +	 */
> +	pgdat->nr_zones = 0;
> +	pgdat->kswapd_order = 0;
> +	pgdat->kswapd_highest_zoneidx = 0;
> +	pgdat->node_start_pfn = 0;
> +	for_each_online_cpu(cpu) {
> +		struct per_cpu_nodestat *p;
> +
> +		p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
> +		memset(p, 0, sizeof(*p));
> +	}
> +
>  	for (z = 0; z < MAX_NR_ZONES; z++)
>  		zone_init_internals(&pgdat->node_zones[z], z, nid, 0);
>  }

I feel like we should be initializing all of this only once, just after
allocating the node. There should be no difference between a node we're
reusing and a "fresh" node. IOW, memory offlining should be resetting
all state accordingly when the last memory goes away.

But I might be wrong and this patch looks like an improvement, as you
say,  without functional change

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/6] mm: make free_area_init_node aware of memory less nodes
       [not found] ` <20220127085305.20890-6-mhocko@kernel.org>
@ 2022-01-27 12:47   ` David Hildenbrand
  2022-01-27 13:34   ` Mike Rapoport
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2022-01-27 12:47 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: linux-mm, LKML, Alexey Makhalov, Dennis Zhou, Eric Dumazet,
	Oscar Salvador, Tejun Heo, Christoph Lameter, Nico Pache,
	Wei Yang, Rafael Aquini, Michal Hocko

On 27.01.22 09:53, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> free_area_init_node is also called from memory less node initialization
> path (free_area_init_memoryless_node). It doesn't really make much sense
> to display the physical memory range for those nodes:
> Initmem setup node XX [mem 0x0000000000000000-0x0000000000000000]
> 
> Instead be explicit that the node is memoryless:
> Initmem setup node XX as memoryless
> 
> Acked-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 32d0189de4c5..83da2279be72 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7682,9 +7682,14 @@ static void __init free_area_init_node(int nid)
>  	pgdat->node_start_pfn = start_pfn;
>  	pgdat->per_cpu_nodestats = NULL;
>  
> -	pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
> -		(u64)start_pfn << PAGE_SHIFT,
> -		end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
> +	if (start_pfn != end_pfn) {
> +		pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
> +			(u64)start_pfn << PAGE_SHIFT,
> +			end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
> +	} else {
> +		pr_info("Initmem setup node %d as memoryless\n", nid);
> +	}
> +
>  	calculate_node_totalpages(pgdat, start_pfn, end_pfn);
>  
>  	alloc_node_mem_map(pgdat);

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] memcg: do not tweak node in alloc_mem_cgroup_per_node_info
       [not found] ` <20220127085305.20890-7-mhocko@kernel.org>
@ 2022-01-27 12:50   ` David Hildenbrand
  2022-01-28 11:01   ` Oscar Salvador
  2022-02-01  2:45   ` Wei Yang
  2 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2022-01-27 12:50 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: linux-mm, LKML, Alexey Makhalov, Dennis Zhou, Eric Dumazet,
	Oscar Salvador, Tejun Heo, Christoph Lameter, Nico Pache,
	Wei Yang, Rafael Aquini, Michal Hocko

On 27.01.22 09:53, Michal Hocko wrote:
> From: Wei Yang <richard.weiyang@gmail.com>
> 
> alloc_mem_cgroup_per_node_info is allocated for each possible node and
> this used to be a problem because not !node_online nodes didn't have
> appropriate data structure allocated. This has changed by "mm: handle
> uninitialized numa nodes gracefully" so we can drop the special casing
> here.
> 
> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/memcontrol.c | 14 ++------------
>  1 file changed, 2 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 09d342c7cbd0..8b3f9eef4c65 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5067,18 +5067,8 @@ struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
>  static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>  {
>  	struct mem_cgroup_per_node *pn;
> -	int tmp = node;
> -	/*
> -	 * This routine is called against possible nodes.
> -	 * But it's BUG to call kmalloc() against offline node.
> -	 *
> -	 * TODO: this routine can waste much memory for nodes which will
> -	 *       never be onlined. It's better to use memory hotplug callback
> -	 *       function.
> -	 */
> -	if (!node_state(node, N_NORMAL_MEMORY))
> -		tmp = -1;
> -	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, tmp);
> +
> +	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, node);
>  	if (!pn)
>  		return 1;
>  

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/6] mm: make free_area_init_node aware of memory less nodes
       [not found] ` <20220127085305.20890-6-mhocko@kernel.org>
  2022-01-27 12:47   ` [PATCH 5/6] mm: make free_area_init_node aware of memory less nodes David Hildenbrand
@ 2022-01-27 13:34   ` Mike Rapoport
  2022-01-28 10:59   ` Oscar Salvador
  2022-02-01  2:43   ` Wei Yang
  3 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2022-01-27 13:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Wei Yang,
	Rafael Aquini, Michal Hocko

On Thu, Jan 27, 2022 at 09:53:04AM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> free_area_init_node is also called from memory less node initialization
> path (free_area_init_memoryless_node). It doesn't really make much sense
> to display the physical memory range for those nodes:
> Initmem setup node XX [mem 0x0000000000000000-0x0000000000000000]
> 
> Instead be explicit that the node is memoryless:
> Initmem setup node XX as memoryless
> 
> Acked-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  mm/page_alloc.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 32d0189de4c5..83da2279be72 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7682,9 +7682,14 @@ static void __init free_area_init_node(int nid)
>  	pgdat->node_start_pfn = start_pfn;
>  	pgdat->per_cpu_nodestats = NULL;
>  
> -	pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
> -		(u64)start_pfn << PAGE_SHIFT,
> -		end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
> +	if (start_pfn != end_pfn) {
> +		pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
> +			(u64)start_pfn << PAGE_SHIFT,
> +			end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
> +	} else {
> +		pr_info("Initmem setup node %d as memoryless\n", nid);
> +	}
> +
>  	calculate_node_totalpages(pgdat, start_pfn, end_pfn);
>  
>  	alloc_node_mem_map(pgdat);
> -- 
> 2.30.2
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG
       [not found] ` <20220127085305.20890-2-mhocko@kernel.org>
  2022-01-27 12:27   ` [PATCH 1/6] mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG David Hildenbrand
@ 2022-01-27 13:36   ` Mike Rapoport
  2022-01-28  6:18   ` Oscar Salvador
  2022-02-01  2:13   ` Wei Yang
  3 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2022-01-27 13:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Wei Yang,
	Rafael Aquini, Michal Hocko

On Thu, Jan 27, 2022 at 09:53:00AM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> This is a preparatory patch and it doesn't introduce any functional
> change. It merely pulls out arch_alloc_nodedata (and co) outside of
> CONFIG_MEMORY_HOTPLUG because the following patch will need to call this
> from the generic MM code.
> 
> Acked-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  arch/ia64/mm/discontig.c       |   2 -
>  include/linux/memory_hotplug.h | 119 ++++++++++++++++-----------------
>  2 files changed, 59 insertions(+), 62 deletions(-)
> 
> diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
> index 791d4176e4a6..8dc8a554f774 100644
> --- a/arch/ia64/mm/discontig.c
> +++ b/arch/ia64/mm/discontig.c
> @@ -608,7 +608,6 @@ void __init paging_init(void)
>  	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
>  }
>  
> -#ifdef CONFIG_MEMORY_HOTPLUG
>  pg_data_t *arch_alloc_nodedata(int nid)
>  {
>  	unsigned long size = compute_pernodesize(nid);
> @@ -626,7 +625,6 @@ void arch_refresh_nodedata(int update_node, pg_data_t *update_pgdat)
>  	pgdat_list[update_node] = update_pgdat;
>  	scatter_node_data();
>  }
> -#endif
>  
>  #ifdef CONFIG_SPARSEMEM_VMEMMAP
>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index be48e003a518..4355983b364d 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -16,6 +16,65 @@ struct memory_group;
>  struct resource;
>  struct vmem_altmap;
>  
> +#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
> +/*
> + * For supporting node-hotadd, we have to allocate a new pgdat.
> + *
> + * If an arch has generic style NODE_DATA(),
> + * node_data[nid] = kzalloc() works well. But it depends on the architecture.
> + *
> + * In general, generic_alloc_nodedata() is used.
> + * Now, arch_free_nodedata() is just defined for error path of node_hot_add.
> + *
> + */
> +extern pg_data_t *arch_alloc_nodedata(int nid);
> +extern void arch_free_nodedata(pg_data_t *pgdat);
> +extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
> +
> +#else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
> +
> +#define arch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
> +#define arch_free_nodedata(pgdat)	generic_free_nodedata(pgdat)
> +
> +#ifdef CONFIG_NUMA
> +/*
> + * XXX: node aware allocation can't work well to get new node's memory at this time.
> + *	Because, pgdat for the new node is not allocated/initialized yet itself.
> + *	To use new node's memory, more consideration will be necessary.
> + */
> +#define generic_alloc_nodedata(nid)				\
> +({								\
> +	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
> +})
> +/*
> + * This definition is just for error path in node hotadd.
> + * For node hotremove, we have to replace this.
> + */
> +#define generic_free_nodedata(pgdat)	kfree(pgdat)
> +
> +extern pg_data_t *node_data[];
> +static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
> +{
> +	node_data[nid] = pgdat;
> +}
> +
> +#else /* !CONFIG_NUMA */
> +
> +/* never called */
> +static inline pg_data_t *generic_alloc_nodedata(int nid)
> +{
> +	BUG();
> +	return NULL;
> +}
> +static inline void generic_free_nodedata(pg_data_t *pgdat)
> +{
> +}
> +static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
> +{
> +}
> +#endif /* CONFIG_NUMA */
> +#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
> +
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  struct page *pfn_to_online_page(unsigned long pfn);
>  
> @@ -154,66 +213,6 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
>  	      struct mhp_params *params);
>  #endif /* ARCH_HAS_ADD_PAGES */
>  
> -#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
> -/*
> - * For supporting node-hotadd, we have to allocate a new pgdat.
> - *
> - * If an arch has generic style NODE_DATA(),
> - * node_data[nid] = kzalloc() works well. But it depends on the architecture.
> - *
> - * In general, generic_alloc_nodedata() is used.
> - * Now, arch_free_nodedata() is just defined for error path of node_hot_add.
> - *
> - */
> -extern pg_data_t *arch_alloc_nodedata(int nid);
> -extern void arch_free_nodedata(pg_data_t *pgdat);
> -extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
> -
> -#else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
> -
> -#define arch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
> -#define arch_free_nodedata(pgdat)	generic_free_nodedata(pgdat)
> -
> -#ifdef CONFIG_NUMA
> -/*
> - * If ARCH_HAS_NODEDATA_EXTENSION=n, this func is used to allocate pgdat.
> - * XXX: kmalloc_node() can't work well to get new node's memory at this time.
> - *	Because, pgdat for the new node is not allocated/initialized yet itself.
> - *	To use new node's memory, more consideration will be necessary.
> - */
> -#define generic_alloc_nodedata(nid)				\
> -({								\
> -	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
> -})
> -/*
> - * This definition is just for error path in node hotadd.
> - * For node hotremove, we have to replace this.
> - */
> -#define generic_free_nodedata(pgdat)	kfree(pgdat)
> -
> -extern pg_data_t *node_data[];
> -static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
> -{
> -	node_data[nid] = pgdat;
> -}
> -
> -#else /* !CONFIG_NUMA */
> -
> -/* never called */
> -static inline pg_data_t *generic_alloc_nodedata(int nid)
> -{
> -	BUG();
> -	return NULL;
> -}
> -static inline void generic_free_nodedata(pg_data_t *pgdat)
> -{
> -}
> -static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
> -{
> -}
> -#endif /* CONFIG_NUMA */
> -#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
> -
>  void get_online_mems(void);
>  void put_online_mems(void);
>  
> -- 
> 2.30.2
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
       [not found] ` <20220127085305.20890-3-mhocko@kernel.org>
  2022-01-27 12:41   ` [PATCH 2/6] mm: handle uninitialized numa nodes gracefully David Hildenbrand
@ 2022-01-27 13:37   ` Mike Rapoport
  2022-01-27 14:47     ` Michal Hocko
  2022-01-28  6:27   ` Oscar Salvador
  2022-01-31 10:34   ` Michal Hocko
  3 siblings, 1 reply; 41+ messages in thread
From: Mike Rapoport @ 2022-01-27 13:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Wei Yang,
	Rafael Aquini, Michal Hocko

On Thu, Jan 27, 2022 at 09:53:01AM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> We have had several reports [1][2][3] that page allocator blows up when
> an allocation from a possible node is requested. The underlying reason
> is that NODE_DATA for the specific node is not allocated.
> 
> NUMA specific initialization is arch specific and it can vary a lot.
> E.g. x86 tries to initialize all nodes that have some cpu affinity (see
> init_cpu_to_node) but this can be insufficient because the node might be
> cpuless for example.
> 
> One way to address this problem would be to check for !node_online nodes
> when trying to get a zonelist and silently fall back to another node.
> That is unfortunately adding a branch into allocator hot path and it
> doesn't handle any other potential NODE_DATA users.
> 
> This patch takes a different approach (following a lead of [3]) and it
> pre allocates pgdat for all possible nodes in an arch indipendent code
> - free_area_init. All uninitialized nodes are treated as memoryless
> nodes. node_state of the node is not changed because that would lead to
> other side effects - e.g. sysfs representation of such a node and from
> past discussions [4] it is known that some tools might have problems
> digesting that.
> 
> Newly allocated pgdat only gets a minimal initialization and the rest of
> the work is expected to be done by the memory hotplug - hotadd_new_pgdat
> (renamed to hotadd_init_pgdat).
> 
> generic_alloc_nodedata is changed to use the memblock allocator because
> neither page nor slab allocators are available at the stage when all
> pgdats are allocated. Hotplug doesn't allocate pgdat anymore so we can
> use the early boot allocator. The only arch specific implementation is
> ia64 and that is changed to use the early allocator as well.
> 
> Reported-by: Alexey Makhalov <amakhalov@vmware.com>
> Tested-by: Alexey Makhalov <amakhalov@vmware.com>
> Reported-by: Nico Pache <npache@redhat.com>
> Acked-by: Rafael Aquini <raquini@redhat.com>
> Tested-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Some minor nits below, other than that

Acked-by: Mike Rapoport <rppt@linux.ibm.com>

> 
> [1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com
> [2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com
> [3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org
> [4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com
> ---
>  arch/ia64/mm/discontig.c       |  4 ++--
>  include/linux/memory_hotplug.h |  2 +-
>  mm/memory_hotplug.c            | 21 +++++++++------------
>  mm/page_alloc.c                | 34 +++++++++++++++++++++++++++++++---
>  4 files changed, 43 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
> index 8dc8a554f774..dd0cf4834eaa 100644
> --- a/arch/ia64/mm/discontig.c
> +++ b/arch/ia64/mm/discontig.c
> @@ -608,11 +608,11 @@ void __init paging_init(void)
>  	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
>  }
>  
> -pg_data_t *arch_alloc_nodedata(int nid)
> +pg_data_t * __init arch_alloc_nodedata(int nid)
>  {
>  	unsigned long size = compute_pernodesize(nid);
>  
> -	return kzalloc(size, GFP_KERNEL);
> +	return memblock_alloc(size, SMP_CACHE_BYTES);
>  }
>  
>  void arch_free_nodedata(pg_data_t *pgdat)
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 4355983b364d..cdd66bfdf855 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -44,7 +44,7 @@ extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
>   */
>  #define generic_alloc_nodedata(nid)				\
>  ({								\
> -	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
> +	memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);	\
>  })
>  /*
>   * This definition is just for error path in node hotadd.
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 2a9627dc784c..fc991831d296 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1162,19 +1162,21 @@ static void reset_node_present_pages(pg_data_t *pgdat)
>  }
>  
>  /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
> -static pg_data_t __ref *hotadd_new_pgdat(int nid)
> +static pg_data_t __ref *hotadd_init_pgdat(int nid)
>  {
>  	struct pglist_data *pgdat;
>  
>  	pgdat = NODE_DATA(nid);
> -	if (!pgdat) {
> -		pgdat = arch_alloc_nodedata(nid);
> -		if (!pgdat)
> -			return NULL;
>  
> +	/*
> +	 * NODE_DATA is preallocated (free_area_init) but its internal
> +	 * state is not allocated completely. Add missing pieces.
> +	 * Completely offline nodes stay around and they just need
> +	 * reintialization.
> +	 */
> +	if (pgdat->per_cpu_nodestats == &boot_nodestats) {
>  		pgdat->per_cpu_nodestats =
>  			alloc_percpu(struct per_cpu_nodestat);
> -		arch_refresh_nodedata(nid, pgdat);
>  	} else {
>  		int cpu;
>  		/*
> @@ -1193,8 +1195,6 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid)
>  		}
>  	}
>  
> -	/* we can use NODE_DATA(nid) from here */
> -	pgdat->node_id = nid;
>  	pgdat->node_start_pfn = 0;
>  
>  	/* init node's zones as empty zones, we don't have any present pages.*/
> @@ -1246,7 +1246,7 @@ static int __try_online_node(int nid, bool set_node_online)
>  	if (node_online(nid))
>  		return 0;
>  
> -	pgdat = hotadd_new_pgdat(nid);
> +	pgdat = hotadd_init_pgdat(nid);
>  	if (!pgdat) {
>  		pr_err("Cannot online node %d due to NULL pgdat\n", nid);
>  		ret = -ENOMEM;
> @@ -1445,9 +1445,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  
>  	return ret;
>  error:
> -	/* rollback pgdat allocation and others */
> -	if (new_node)
> -		rollback_node_hotadd(nid);
>  	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
>  		memblock_remove(start, size);
>  error_mem_hotplug_end:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589febc6d31..1a05669044d3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6402,7 +6402,11 @@ static void __build_all_zonelists(void *data)
>  	if (self && !node_online(self->node_id)) {
>  		build_zonelists(self);
>  	} else {
> -		for_each_online_node(nid) {
> +		/*
> +		 * All possible nodes have pgdat preallocated
> +		 * free_area_init

Nit: "by free_area_init"

> +		 */
> +		for_each_node(nid) {
>  			pg_data_t *pgdat = NODE_DATA(nid);
>  
>  			build_zonelists(pgdat);
> @@ -8096,8 +8100,32 @@ void __init free_area_init(unsigned long *max_zone_pfn)
>  	/* Initialise every node */
>  	mminit_verify_pageflags_layout();
>  	setup_nr_node_ids();
> -	for_each_online_node(nid) {
> -		pg_data_t *pgdat = NODE_DATA(nid);
> +	for_each_node(nid) {
> +		pg_data_t *pgdat;
> +
> +		if (!node_online(nid)) {
> +			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);

Do we assume that platform code must allocate node data for all nodes in
the system? Because if we don't this warning is misleading.

> +
> +			/* Allocator not initialized yet */
> +			pgdat = arch_alloc_nodedata(nid);
> +			if (!pgdat) {
> +				pr_err("Cannot allocate %zuB for node %d.\n",
> +						sizeof(*pgdat), nid);
> +				continue;
> +			}
> +			arch_refresh_nodedata(nid, pgdat);
> +			free_area_init_memoryless_node(nid);
> +			/*
> +			 * not marking this node online because we do not want to
> +			 * confuse userspace by sysfs files/directories for node
> +			 * without any memory attached to it (see topology_init)
> +			 * The pgdat will get fully initialized when a memory is
> +			 * hotpluged into it by hotadd_init_pgdat
> +			 */
> +			continue;

This can be made slightly more concise if we fall through after
arch_refresh_nodedata(), e.g. something like

			...

			arch_refresh_nodedata(nid, pgdat);
		}

		pgdat = NODE_DATA(nid);
		free_area_init_node(nid);

		/*
		 * Do not mark memoryless node online because we do not want to
		 * confuse userspace by sysfs files/directories for node
		 * without any memory attached to it (see topology_init)
		 * The pgdat will get fully initialized when a memory is
		 * hotpluged into it by hotadd_init_pgdat
		 */
		if (!pgdat->node_present_pages)
			continue;

but I don't feel strongly about it.

> +		}
> +
> +		pgdat = NODE_DATA(nid);
>  		free_area_init_node(nid);
>  
>  		/* Any memory on that node */
> -- 
> 2.30.2
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/6] mm, memory_hotplug: drop arch_free_nodedata
       [not found] ` <20220127085305.20890-4-mhocko@kernel.org>
  2022-01-27 12:42   ` [PATCH 3/6] mm, memory_hotplug: drop arch_free_nodedata David Hildenbrand
@ 2022-01-27 13:37   ` Mike Rapoport
  2022-01-28  6:29   ` Oscar Salvador
  2022-02-01  2:41   ` Wei Yang
  3 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2022-01-27 13:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Wei Yang,
	Rafael Aquini, Michal Hocko

On Thu, Jan 27, 2022 at 09:53:02AM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Prior to "mm: handle uninitialized numa nodes gracefully" memory hotplug
> used to allocate pgdat when memory has been added to a node
> (hotadd_init_pgdat) arch_free_nodedata has been only used in the
> failure path because once the pgdat is exported (to be visible
> by NODA_DATA(nid)) it cannot really be freed because there is no
> synchronization available for that.
> 
> pgdat is allocated for each possible nodes now so the memory hotplug
> doesn't need to do the ever use arch_free_nodedata so drop it.
> 
> This patch doesn't introduce any functional change.
> 
> Acked-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  arch/ia64/mm/discontig.c       |  5 -----
>  include/linux/memory_hotplug.h |  3 ---
>  mm/memory_hotplug.c            | 10 ----------
>  3 files changed, 18 deletions(-)
> 
> diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
> index dd0cf4834eaa..73d0db36edb6 100644
> --- a/arch/ia64/mm/discontig.c
> +++ b/arch/ia64/mm/discontig.c
> @@ -615,11 +615,6 @@ pg_data_t * __init arch_alloc_nodedata(int nid)
>  	return memblock_alloc(size, SMP_CACHE_BYTES);
>  }
>  
> -void arch_free_nodedata(pg_data_t *pgdat)
> -{
> -	kfree(pgdat);
> -}
> -
>  void arch_refresh_nodedata(int update_node, pg_data_t *update_pgdat)
>  {
>  	pgdat_list[update_node] = update_pgdat;
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index cdd66bfdf855..60f09d3ebb3d 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -24,17 +24,14 @@ struct vmem_altmap;
>   * node_data[nid] = kzalloc() works well. But it depends on the architecture.
>   *
>   * In general, generic_alloc_nodedata() is used.
> - * Now, arch_free_nodedata() is just defined for error path of node_hot_add.
>   *
>   */
>  extern pg_data_t *arch_alloc_nodedata(int nid);
> -extern void arch_free_nodedata(pg_data_t *pgdat);
>  extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
>  
>  #else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
>  
>  #define arch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
> -#define arch_free_nodedata(pgdat)	generic_free_nodedata(pgdat)
>  
>  #ifdef CONFIG_NUMA
>  /*
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index fc991831d296..875cdc7ffa58 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1217,16 +1217,6 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
>  	return pgdat;
>  }
>  
> -static void rollback_node_hotadd(int nid)
> -{
> -	pg_data_t *pgdat = NODE_DATA(nid);
> -
> -	arch_refresh_nodedata(nid, NULL);
> -	free_percpu(pgdat->per_cpu_nodestats);
> -	arch_free_nodedata(pgdat);
> -}
> -
> -
>  /*
>   * __try_online_node - online a node if offlined
>   * @nid: the node ID
> -- 
> 2.30.2
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] mm, memory_hotplug: reorganize new pgdat initialization
       [not found] ` <20220127085305.20890-5-mhocko@kernel.org>
  2022-01-27 12:46   ` [PATCH 4/6] mm, memory_hotplug: reorganize new pgdat initialization David Hildenbrand
@ 2022-01-27 13:39   ` Mike Rapoport
  2022-01-27 14:45     ` Michal Hocko
  2022-01-28 10:51   ` Oscar Salvador
  2022-02-01  2:42   ` Wei Yang
  3 siblings, 1 reply; 41+ messages in thread
From: Mike Rapoport @ 2022-01-27 13:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Wei Yang,
	Rafael Aquini, Michal Hocko

On Thu, Jan 27, 2022 at 09:53:03AM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> When a !node_online node is brought up it needs a hotplug specific
> initialization because the node could be either uninitialized yet or it
> could have been recycled after previous hotremove. hotadd_init_pgdat is
> responsible for that.
> 
> Internal pgdat state is initialized at two places currently
> 	- hotadd_init_pgdat
> 	- free_area_init_core_hotplug
> There is no real clear cut what should go where but this patch's chosen to
> move the whole internal state initialization into free_area_init_core_hotplug.
> hotadd_init_pgdat is still responsible to pull all the parts together -
> most notably to initialize zonelists because those depend on the overall topology.
> 
> This patch doesn't introduce any functional change.
> 
> Acked-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/memory_hotplug.h |  2 +-
>  mm/memory_hotplug.c            | 28 +++-------------------------
>  mm/page_alloc.c                | 25 +++++++++++++++++++++++--
>  3 files changed, 27 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 60f09d3ebb3d..76bf2de86def 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -319,7 +319,7 @@ extern void set_zone_contiguous(struct zone *zone);
>  extern void clear_zone_contiguous(struct zone *zone);
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -extern void __ref free_area_init_core_hotplug(int nid);
> +extern void __ref free_area_init_core_hotplug(struct pglist_data *pgdat);
>  extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
>  extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
>  extern int add_memory_resource(int nid, struct resource *resource,
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 875cdc7ffa58..ddc62f8b591f 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1166,39 +1166,16 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
>  {
>  	struct pglist_data *pgdat;
>  
> -	pgdat = NODE_DATA(nid);
> -
>  	/*
>  	 * NODE_DATA is preallocated (free_area_init) but its internal
>  	 * state is not allocated completely. Add missing pieces.
>  	 * Completely offline nodes stay around and they just need
>  	 * reintialization.
>  	 */
> -	if (pgdat->per_cpu_nodestats == &boot_nodestats) {
> -		pgdat->per_cpu_nodestats =
> -			alloc_percpu(struct per_cpu_nodestat);
> -	} else {
> -		int cpu;
> -		/*
> -		 * Reset the nr_zones, order and highest_zoneidx before reuse.
> -		 * Note that kswapd will init kswapd_highest_zoneidx properly
> -		 * when it starts in the near future.
> -		 */
> -		pgdat->nr_zones = 0;
> -		pgdat->kswapd_order = 0;
> -		pgdat->kswapd_highest_zoneidx = 0;
> -		for_each_online_cpu(cpu) {
> -			struct per_cpu_nodestat *p;
> -
> -			p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
> -			memset(p, 0, sizeof(*p));
> -		}
> -	}
> -
> -	pgdat->node_start_pfn = 0;
> +	pgdat = NODE_DATA(nid);
>  
>  	/* init node's zones as empty zones, we don't have any present pages.*/
> -	free_area_init_core_hotplug(nid);
> +	free_area_init_core_hotplug(pgdat);
>  
>  	/*
>  	 * The node we allocated has no zone fallback lists. For avoiding
> @@ -1210,6 +1187,7 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
>  	 * When memory is hot-added, all the memory is in offline state. So
>  	 * clear all zones' present_pages because they will be updated in
>  	 * online_pages() and offline_pages().
> +	 * TODO: should be in free_area_init_core_hotplug?
>  	 */
>  	reset_node_managed_pages(pgdat);
>  	reset_node_present_pages(pgdat);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1a05669044d3..32d0189de4c5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7506,12 +7506,33 @@ static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx,
>   * NOTE: this function is only called during memory hotplug
>   */
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -void __ref free_area_init_core_hotplug(int nid)
> +void __ref free_area_init_core_hotplug(struct pglist_data *pgdat)
>  {
> +	int nid = pgdat->node_id;
>  	enum zone_type z;
> -	pg_data_t *pgdat = NODE_DATA(nid);
> +	int cpu;
>  
>  	pgdat_init_internals(pgdat);
> +
> +	if (pgdat->per_cpu_nodestats == &boot_nodestats)
> +		pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
> +
> +	/*
> +	 * Reset the nr_zones, order and highest_zoneidx before reuse.
> +	 * Note that kswapd will init kswapd_highest_zoneidx properly
> +	 * when it starts in the near future.
> +	 */
> +	pgdat->nr_zones = 0;
> +	pgdat->kswapd_order = 0;
> +	pgdat->kswapd_highest_zoneidx = 0;
> +	pgdat->node_start_pfn = 0;
> +	for_each_online_cpu(cpu) {
> +		struct per_cpu_nodestat *p;
> +
> +		p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
> +		memset(p, 0, sizeof(*p));
> +	}

It seems to me that most of this is already done by free_area_init_node()
at boot time anyway. Do I miss something?

> +
>  	for (z = 0; z < MAX_NR_ZONES; z++)
>  		zone_init_internals(&pgdat->node_zones[z], z, nid, 0);
>  }
> -- 
> 2.30.2
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] mm, memory_hotplug: reorganize new pgdat initialization
  2022-01-27 12:46   ` [PATCH 4/6] mm, memory_hotplug: reorganize new pgdat initialization David Hildenbrand
@ 2022-01-27 14:44     ` Michal Hocko
  2022-02-17 10:40       ` Oscar Salvador
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2022-01-27 14:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, linux-mm, LKML, Alexey Makhalov, Dennis Zhou,
	Eric Dumazet, Oscar Salvador, Tejun Heo, Christoph Lameter,
	Nico Pache, Wei Yang, Rafael Aquini

On Thu 27-01-22 13:46:53, David Hildenbrand wrote:
> On 27.01.22 09:53, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > When a !node_online node is brought up it needs a hotplug specific
> > initialization because the node could be either uninitialized yet or it
> > could have been recycled after previous hotremove. hotadd_init_pgdat is
> > responsible for that.
> > 
> > Internal pgdat state is initialized at two places currently
> > 	- hotadd_init_pgdat
> > 	- free_area_init_core_hotplug
> > There is no real clear cut what should go where but this patch's chosen to
> > move the whole internal state initialization into free_area_init_core_hotplug.
> > hotadd_init_pgdat is still responsible to pull all the parts together -
> > most notably to initialize zonelists because those depend on the overall topology.
> > 
> > This patch doesn't introduce any functional change.
> > 
> > Acked-by: Rafael Aquini <raquini@redhat.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  include/linux/memory_hotplug.h |  2 +-
> >  mm/memory_hotplug.c            | 28 +++-------------------------
> >  mm/page_alloc.c                | 25 +++++++++++++++++++++++--
> >  3 files changed, 27 insertions(+), 28 deletions(-)
> > 
> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index 60f09d3ebb3d..76bf2de86def 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -319,7 +319,7 @@ extern void set_zone_contiguous(struct zone *zone);
> >  extern void clear_zone_contiguous(struct zone *zone);
> >  
> >  #ifdef CONFIG_MEMORY_HOTPLUG
> > -extern void __ref free_area_init_core_hotplug(int nid);
> > +extern void __ref free_area_init_core_hotplug(struct pglist_data *pgdat);
> >  extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
> >  extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
> >  extern int add_memory_resource(int nid, struct resource *resource,
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 875cdc7ffa58..ddc62f8b591f 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1166,39 +1166,16 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
> >  {
> >  	struct pglist_data *pgdat;
> >  
> > -	pgdat = NODE_DATA(nid);
> > -
> >  	/*
> >  	 * NODE_DATA is preallocated (free_area_init) but its internal
> >  	 * state is not allocated completely. Add missing pieces.
> >  	 * Completely offline nodes stay around and they just need
> >  	 * reintialization.
> >  	 */
> > -	if (pgdat->per_cpu_nodestats == &boot_nodestats) {
> > -		pgdat->per_cpu_nodestats =
> > -			alloc_percpu(struct per_cpu_nodestat);
> > -	} else {
> > -		int cpu;
> > -		/*
> > -		 * Reset the nr_zones, order and highest_zoneidx before reuse.
> > -		 * Note that kswapd will init kswapd_highest_zoneidx properly
> > -		 * when it starts in the near future.
> > -		 */
> > -		pgdat->nr_zones = 0;
> > -		pgdat->kswapd_order = 0;
> > -		pgdat->kswapd_highest_zoneidx = 0;
> > -		for_each_online_cpu(cpu) {
> > -			struct per_cpu_nodestat *p;
> > -
> > -			p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
> > -			memset(p, 0, sizeof(*p));
> > -		}
> > -	}
> > -
> > -	pgdat->node_start_pfn = 0;
> > +	pgdat = NODE_DATA(nid);
> >  
> >  	/* init node's zones as empty zones, we don't have any present pages.*/
> > -	free_area_init_core_hotplug(nid);
> > +	free_area_init_core_hotplug(pgdat);
> >  
> >  	/*
> >  	 * The node we allocated has no zone fallback lists. For avoiding
> > @@ -1210,6 +1187,7 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
> >  	 * When memory is hot-added, all the memory is in offline state. So
> >  	 * clear all zones' present_pages because they will be updated in
> >  	 * online_pages() and offline_pages().
> > +	 * TODO: should be in free_area_init_core_hotplug?
> >  	 */
> >  	reset_node_managed_pages(pgdat);
> >  	reset_node_present_pages(pgdat);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 1a05669044d3..32d0189de4c5 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -7506,12 +7506,33 @@ static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx,
> >   * NOTE: this function is only called during memory hotplug
> >   */
> >  #ifdef CONFIG_MEMORY_HOTPLUG
> > -void __ref free_area_init_core_hotplug(int nid)
> > +void __ref free_area_init_core_hotplug(struct pglist_data *pgdat)
> >  {
> > +	int nid = pgdat->node_id;
> >  	enum zone_type z;
> > -	pg_data_t *pgdat = NODE_DATA(nid);
> > +	int cpu;
> >  
> >  	pgdat_init_internals(pgdat);
> > +
> > +	if (pgdat->per_cpu_nodestats == &boot_nodestats)
> > +		pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
> > +
> > +	/*
> > +	 * Reset the nr_zones, order and highest_zoneidx before reuse.
> > +	 * Note that kswapd will init kswapd_highest_zoneidx properly
> > +	 * when it starts in the near future.
> > +	 */
> > +	pgdat->nr_zones = 0;
> > +	pgdat->kswapd_order = 0;
> > +	pgdat->kswapd_highest_zoneidx = 0;
> > +	pgdat->node_start_pfn = 0;
> > +	for_each_online_cpu(cpu) {
> > +		struct per_cpu_nodestat *p;
> > +
> > +		p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
> > +		memset(p, 0, sizeof(*p));
> > +	}
> > +
> >  	for (z = 0; z < MAX_NR_ZONES; z++)
> >  		zone_init_internals(&pgdat->node_zones[z], z, nid, 0);
> >  }
> 
> I feel like we should be initializing all of this only once, just after
> allocating the node. There should be no difference between a node we're
> reusing and a "fresh" node. IOW, memory offlining should be resetting
> all state accordingly when the last memory goes away.
> 
> But I might be wrong and this patch looks like an improvement, as you
> say,  without functional change

Yeah, I really wanted to have this simple and straightforward. To be
completely honest I am not even sure this is necessary. Something really
woth looking at.
 
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks!

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] mm, memory_hotplug: reorganize new pgdat initialization
  2022-01-27 13:39   ` Mike Rapoport
@ 2022-01-27 14:45     ` Michal Hocko
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2022-01-27 14:45 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Wei Yang,
	Rafael Aquini

On Thu 27-01-22 15:39:55, Mike Rapoport wrote:
> On Thu, Jan 27, 2022 at 09:53:03AM +0100, Michal Hocko wrote:
[...]
> > +	pgdat->nr_zones = 0;
> > +	pgdat->kswapd_order = 0;
> > +	pgdat->kswapd_highest_zoneidx = 0;
> > +	pgdat->node_start_pfn = 0;
> > +	for_each_online_cpu(cpu) {
> > +		struct per_cpu_nodestat *p;
> > +
> > +		p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
> > +		memset(p, 0, sizeof(*p));
> > +	}
> 
> It seems to me that most of this is already done by free_area_init_node()
> at boot time anyway. Do I miss something?

As already replied to David I really didn't want to change the existing
logic.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
  2022-01-27 13:37   ` Mike Rapoport
@ 2022-01-27 14:47     ` Michal Hocko
  2022-01-27 15:04       ` Mike Rapoport
  2022-02-01  2:41       ` Wei Yang
  0 siblings, 2 replies; 41+ messages in thread
From: Michal Hocko @ 2022-01-27 14:47 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Wei Yang,
	Rafael Aquini

On Thu 27-01-22 15:37:23, Mike Rapoport wrote:
> On Thu, Jan 27, 2022 at 09:53:01AM +0100, Michal Hocko wrote:
[...]
> > +		if (!node_online(nid)) {
> > +			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);
> 
> Do we assume that platform code must allocate node data for all nodes in
> the system? Because if we don't this warning is misleading.

At least x86 does that (init_cpu_to_node). Now that you brought that up
I guess you are right that this could be more misleading than helpful.
What about
			pr_info("Initializing node %d as memoryless\n", nid);
Is this better?

> > +
> > +			/* Allocator not initialized yet */
> > +			pgdat = arch_alloc_nodedata(nid);
> > +			if (!pgdat) {
> > +				pr_err("Cannot allocate %zuB for node %d.\n",
> > +						sizeof(*pgdat), nid);
> > +				continue;
> > +			}
> > +			arch_refresh_nodedata(nid, pgdat);
> > +			free_area_init_memoryless_node(nid);
> > +			/*
> > +			 * not marking this node online because we do not want to
> > +			 * confuse userspace by sysfs files/directories for node
> > +			 * without any memory attached to it (see topology_init)
> > +			 * The pgdat will get fully initialized when a memory is
> > +			 * hotpluged into it by hotadd_init_pgdat
> > +			 */
> > +			continue;
> 
> This can be made slightly more concise if we fall through after
> arch_refresh_nodedata(), e.g. something like
> 
> 			...
> 
> 			arch_refresh_nodedata(nid, pgdat);
> 		}
> 
> 		pgdat = NODE_DATA(nid);
> 		free_area_init_node(nid);
> 
> 		/*
> 		 * Do not mark memoryless node online because we do not want to
> 		 * confuse userspace by sysfs files/directories for node
> 		 * without any memory attached to it (see topology_init)
> 		 * The pgdat will get fully initialized when a memory is
> 		 * hotpluged into it by hotadd_init_pgdat
> 		 */
> 		if (!pgdat->node_present_pages)
> 			continue;
> 
> but I don't feel strongly about it.

I do not have strong preference either way. Unless this is considered
better by more people I would stick with what I have.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
  2022-01-27 12:41   ` [PATCH 2/6] mm: handle uninitialized numa nodes gracefully David Hildenbrand
@ 2022-01-27 14:50     ` Michal Hocko
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2022-01-27 14:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, linux-mm, LKML, Alexey Makhalov, Dennis Zhou,
	Eric Dumazet, Oscar Salvador, Tejun Heo, Christoph Lameter,
	Nico Pache, Wei Yang, Rafael Aquini

On Thu 27-01-22 13:41:16, David Hildenbrand wrote:
> On 27.01.22 09:53, Michal Hocko wrote:
[...]
> > diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
> > index 8dc8a554f774..dd0cf4834eaa 100644
> > --- a/arch/ia64/mm/discontig.c
> > +++ b/arch/ia64/mm/discontig.c
> > @@ -608,11 +608,11 @@ void __init paging_init(void)
> >  	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
> >  }
> >  
> > -pg_data_t *arch_alloc_nodedata(int nid)
> > +pg_data_t * __init arch_alloc_nodedata(int nid)
> >  {
> >  	unsigned long size = compute_pernodesize(nid);
> >  
> > -	return kzalloc(size, GFP_KERNEL);
> > +	return memblock_alloc(size, SMP_CACHE_BYTES);
> 
> I feel like we should have
> 
> long arch_pgdat_size(void) instead and have a generic allocation function.
> 
> But we can clean that up in the future.

I can have a look later (unless somebody beat me to it). I am not even
sure this whole ia64 weirdness is useful these days.

[...]
> > @@ -1445,9 +1445,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> >  
> >  	return ret;
> >  error:
> > -	/* rollback pgdat allocation and others */
> > -	if (new_node)
> > -		rollback_node_hotadd(nid);
> 
> As static rollback_node_hotadd() is unused in this patch, doesn't this
> trigger a warning? IOW, maybe merge at least the rollback_node_hotadd()
> removal into this patch. The arch_free_nodedata() removal can stay separate.

It is my slight preference to have this patch smaller to be easier to
review and therefore I have removed all parts in a follow up. If a
warning is a real deal at this stage then I can change that of course.

> 
> >  	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
> >  		memblock_remove(start, size);
> >  error_mem_hotplug_end:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3589febc6d31..1a05669044d3 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6402,7 +6402,11 @@ static void __build_all_zonelists(void *data)
> >  	if (self && !node_online(self->node_id)) {
> >  		build_zonelists(self);
> >  	} else {
> > -		for_each_online_node(nid) {
> > +		/*
> > +		 * All possible nodes have pgdat preallocated
> 
> ... in free_area_init() ?

Will fix it up.

> > +		 * free_area_init
> > +		 */
> > +		for_each_node(nid) {
> >  			pg_data_t *pgdat = NODE_DATA(nid);
> >  
> >  			build_zonelists(pgdat);
> > @@ -8096,8 +8100,32 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> >  	/* Initialise every node */
> >  	mminit_verify_pageflags_layout();
> >  	setup_nr_node_ids();
> > -	for_each_online_node(nid) {
> > -		pg_data_t *pgdat = NODE_DATA(nid);
> > +	for_each_node(nid) {
> > +		pg_data_t *pgdat;
> > +
> > +		if (!node_online(nid)) {
> > +			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);
> > +
> > +			/* Allocator not initialized yet */
> > +			pgdat = arch_alloc_nodedata(nid);
> > +			if (!pgdat) {
> > +				pr_err("Cannot allocate %zuB for node %d.\n",
> > +						sizeof(*pgdat), nid);
> > +				continue;
> > +			}
> > +			arch_refresh_nodedata(nid, pgdat);
> 
> We could get rid of arch_refresh_nodedata() now and simply merge that
> into arch_alloc_nodedata(). But depends on how we want to proceed with
> arch_alloc_nodedata() eventually.

yeah, I will postpone that to a later follow ups.
 
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks!

Btw now I have realized that I have lost the following hung somewhere:

diff --git a/mm/internal.h b/mm/internal.h
index d80300392a19..43b8ccf56b7f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -718,4 +718,6 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 		      unsigned long addr, int page_nid, int *flags);
 
+DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1a05669044d3..4120cc855673 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6380,7 +6380,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 #define BOOT_PAGESET_BATCH	1
 static DEFINE_PER_CPU(struct per_cpu_pages, boot_pageset);
 static DEFINE_PER_CPU(struct per_cpu_zonestat, boot_zonestats);
-static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
+DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
 static void __build_all_zonelists(void *data)
 {
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
  2022-01-27 14:47     ` Michal Hocko
@ 2022-01-27 15:04       ` Mike Rapoport
  2022-02-01  2:41       ` Wei Yang
  1 sibling, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2022-01-27 15:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Wei Yang,
	Rafael Aquini

On Thu, Jan 27, 2022 at 03:47:40PM +0100, Michal Hocko wrote:
> On Thu 27-01-22 15:37:23, Mike Rapoport wrote:
> > On Thu, Jan 27, 2022 at 09:53:01AM +0100, Michal Hocko wrote:
> [...]
> > > +		if (!node_online(nid)) {
> > > +			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);
> > 
> > Do we assume that platform code must allocate node data for all nodes in
> > the system? Because if we don't this warning is misleading.
> 
> At least x86 does that (init_cpu_to_node). Now that you brought that up
> I guess you are right that this could be more misleading than helpful.

I'm not sure if other architectures allocate memoryless nodes, but for sure
only x86 initializes it with free_area_init_memoryless_node().

> What about
> 			pr_info("Initializing node %d as memoryless\n", nid);
> Is this better?

I think yes.

> > > +
> > > +			/* Allocator not initialized yet */
> > > +			pgdat = arch_alloc_nodedata(nid);
> > > +			if (!pgdat) {
> > > +				pr_err("Cannot allocate %zuB for node %d.\n",
> > > +						sizeof(*pgdat), nid);
> > > +				continue;
> > > +			}
> > > +			arch_refresh_nodedata(nid, pgdat);
> > > +			free_area_init_memoryless_node(nid);
> > > +			/*
> > > +			 * not marking this node online because we do not want to
> > > +			 * confuse userspace by sysfs files/directories for node
> > > +			 * without any memory attached to it (see topology_init)
> > > +			 * The pgdat will get fully initialized when a memory is
> > > +			 * hotpluged into it by hotadd_init_pgdat
> > > +			 */
> > > +			continue;
> > 
> > This can be made slightly more concise if we fall through after
> > arch_refresh_nodedata(), e.g. something like
> > 
> > 			...
> > 
> > 			arch_refresh_nodedata(nid, pgdat);
> > 		}
> > 
> > 		pgdat = NODE_DATA(nid);
> > 		free_area_init_node(nid);
> > 
> > 		/*
> > 		 * Do not mark memoryless node online because we do not want to
> > 		 * confuse userspace by sysfs files/directories for node
> > 		 * without any memory attached to it (see topology_init)
> > 		 * The pgdat will get fully initialized when a memory is
> > 		 * hotpluged into it by hotadd_init_pgdat
> > 		 */
> > 		if (!pgdat->node_present_pages)
> > 			continue;
> > 
> > but I don't feel strongly about it.
> 
> I do not have strong preference either way. Unless this is considered
> better by more people I would stick with what I have.
> -- 
> Michal Hocko
> SUSE Labs

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG
       [not found] ` <20220127085305.20890-2-mhocko@kernel.org>
  2022-01-27 12:27   ` [PATCH 1/6] mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG David Hildenbrand
  2022-01-27 13:36   ` Mike Rapoport
@ 2022-01-28  6:18   ` Oscar Salvador
  2022-02-01  2:13   ` Wei Yang
  3 siblings, 0 replies; 41+ messages in thread
From: Oscar Salvador @ 2022-01-28  6:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Tejun Heo,
	Christoph Lameter, Nico Pache, Wei Yang, Rafael Aquini,
	Michal Hocko

On 2022-01-27 09:53, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> This is a preparatory patch and it doesn't introduce any functional
> change. It merely pulls out arch_alloc_nodedata (and co) outside of
> CONFIG_MEMORY_HOTPLUG because the following patch will need to call 
> this
> from the generic MM code.
> 
> Acked-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
       [not found] ` <20220127085305.20890-3-mhocko@kernel.org>
  2022-01-27 12:41   ` [PATCH 2/6] mm: handle uninitialized numa nodes gracefully David Hildenbrand
  2022-01-27 13:37   ` Mike Rapoport
@ 2022-01-28  6:27   ` Oscar Salvador
  2022-01-31 10:34   ` Michal Hocko
  3 siblings, 0 replies; 41+ messages in thread
From: Oscar Salvador @ 2022-01-28  6:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Tejun Heo,
	Christoph Lameter, Nico Pache, Wei Yang, Rafael Aquini,
	Michal Hocko

On 2022-01-27 09:53, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> We have had several reports [1][2][3] that page allocator blows up when
> an allocation from a possible node is requested. The underlying reason
> is that NODE_DATA for the specific node is not allocated.
> 
> NUMA specific initialization is arch specific and it can vary a lot.
> E.g. x86 tries to initialize all nodes that have some cpu affinity (see
> init_cpu_to_node) but this can be insufficient because the node might 
> be
> cpuless for example.
> 
> One way to address this problem would be to check for !node_online 
> nodes
> when trying to get a zonelist and silently fall back to another node.
> That is unfortunately adding a branch into allocator hot path and it
> doesn't handle any other potential NODE_DATA users.
> 
> This patch takes a different approach (following a lead of [3]) and it
> pre allocates pgdat for all possible nodes in an arch indipendent code
> - free_area_init. All uninitialized nodes are treated as memoryless
> nodes. node_state of the node is not changed because that would lead to
> other side effects - e.g. sysfs representation of such a node and from
> past discussions [4] it is known that some tools might have problems
> digesting that.
> 
> Newly allocated pgdat only gets a minimal initialization and the rest 
> of
> the work is expected to be done by the memory hotplug - 
> hotadd_new_pgdat
> (renamed to hotadd_init_pgdat).
> 
> generic_alloc_nodedata is changed to use the memblock allocator because
> neither page nor slab allocators are available at the stage when all
> pgdats are allocated. Hotplug doesn't allocate pgdat anymore so we can
> use the early boot allocator. The only arch specific implementation is
> ia64 and that is changed to use the early allocator as well.
> 
> Reported-by: Alexey Makhalov <amakhalov@vmware.com>
> Tested-by: Alexey Makhalov <amakhalov@vmware.com>
> Reported-by: Nico Pache <npache@redhat.com>
> Acked-by: Rafael Aquini <raquini@redhat.com>
> Tested-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

With the mentioned fixups:

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/6] mm, memory_hotplug: drop arch_free_nodedata
       [not found] ` <20220127085305.20890-4-mhocko@kernel.org>
  2022-01-27 12:42   ` [PATCH 3/6] mm, memory_hotplug: drop arch_free_nodedata David Hildenbrand
  2022-01-27 13:37   ` Mike Rapoport
@ 2022-01-28  6:29   ` Oscar Salvador
  2022-02-01  2:41   ` Wei Yang
  3 siblings, 0 replies; 41+ messages in thread
From: Oscar Salvador @ 2022-01-28  6:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Tejun Heo,
	Christoph Lameter, Nico Pache, Wei Yang, Rafael Aquini,
	Michal Hocko

On 2022-01-27 09:53, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Prior to "mm: handle uninitialized numa nodes gracefully" memory 
> hotplug
> used to allocate pgdat when memory has been added to a node
> (hotadd_init_pgdat) arch_free_nodedata has been only used in the
> failure path because once the pgdat is exported (to be visible
> by NODA_DATA(nid)) it cannot really be freed because there is no
> synchronization available for that.
> 
> pgdat is allocated for each possible nodes now so the memory hotplug
> doesn't need to do the ever use arch_free_nodedata so drop it.
> 
> This patch doesn't introduce any functional change.
> 
> Acked-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] mm, memory_hotplug: reorganize new pgdat initialization
       [not found] ` <20220127085305.20890-5-mhocko@kernel.org>
  2022-01-27 12:46   ` [PATCH 4/6] mm, memory_hotplug: reorganize new pgdat initialization David Hildenbrand
  2022-01-27 13:39   ` Mike Rapoport
@ 2022-01-28 10:51   ` Oscar Salvador
  2022-02-01  2:42   ` Wei Yang
  3 siblings, 0 replies; 41+ messages in thread
From: Oscar Salvador @ 2022-01-28 10:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Tejun Heo,
	Christoph Lameter, Nico Pache, Wei Yang, Rafael Aquini,
	Michal Hocko

On 2022-01-27 09:53, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> When a !node_online node is brought up it needs a hotplug specific
> initialization because the node could be either uninitialized yet or it
> could have been recycled after previous hotremove. hotadd_init_pgdat is
> responsible for that.
> 
> Internal pgdat state is initialized at two places currently
> 	- hotadd_init_pgdat
> 	- free_area_init_core_hotplug
> There is no real clear cut what should go where but this patch's chosen 
> to
> move the whole internal state initialization into 
> free_area_init_core_hotplug.
> hotadd_init_pgdat is still responsible to pull all the parts together -
> most notably to initialize zonelists because those depend on the
> overall topology.
> 
> This patch doesn't introduce any functional change.
> 
> Acked-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

I have some comments below:

Reviewed-by: Oscar Salvador <osalvador@suse.de>

>  	/*
>  	 * The node we allocated has no zone fallback lists. For avoiding
> @@ -1210,6 +1187,7 @@ static pg_data_t __ref *hotadd_init_pgdat(int 
> nid)
>  	 * When memory is hot-added, all the memory is in offline state. So
>  	 * clear all zones' present_pages because they will be updated in
>  	 * online_pages() and offline_pages().
> +	 * TODO: should be in free_area_init_core_hotplug?
>  	 */
>  	reset_node_managed_pages(pgdat);
>  	reset_node_present_pages(pgdat);

Most likely yes, but more below.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1a05669044d3..32d0189de4c5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7506,12 +7506,33 @@ static void __meminit
> zone_init_internals(struct zone *zone, enum zone_type idx,
>   * NOTE: this function is only called during memory hotplug
>   */
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -void __ref free_area_init_core_hotplug(int nid)
> +void __ref free_area_init_core_hotplug(struct pglist_data *pgdat)
>  {
> +	int nid = pgdat->node_id;
>  	enum zone_type z;
> -	pg_data_t *pgdat = NODE_DATA(nid);
> +	int cpu;
> 
>  	pgdat_init_internals(pgdat);
> +
> +	if (pgdat->per_cpu_nodestats == &boot_nodestats)
> +		pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
> +
> +	/*
> +	 * Reset the nr_zones, order and highest_zoneidx before reuse.
> +	 * Note that kswapd will init kswapd_highest_zoneidx properly
> +	 * when it starts in the near future.
> +	 */
> +	pgdat->nr_zones = 0;
> +	pgdat->kswapd_order = 0;
> +	pgdat->kswapd_highest_zoneidx = 0;
> +	pgdat->node_start_pfn = 0;
> +	for_each_online_cpu(cpu) {
> +		struct per_cpu_nodestat *p;
> +
> +		p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
> +		memset(p, 0, sizeof(*p));
> +	}
> +

I am with David that the cleanest way would be to initialize all of this
(pgdat internals) only once at boot time now that we are initializing 
the
node anyway, and memory-offlining would have to wipe it out once last 
pgdat's
chunk of memory goes away.

I guess free_area_init_core_hotplug() would still need to allocate
alloc_percpu(struct per_cpu_nodestat) for the "new" node since doing it 
at
boot time is not desirable because of the memory waste on unused nodes.

As you said, probably worth looking at but not really something to cover 
on
this patchset at all.

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/6] mm: make free_area_init_node aware of memory less nodes
       [not found] ` <20220127085305.20890-6-mhocko@kernel.org>
  2022-01-27 12:47   ` [PATCH 5/6] mm: make free_area_init_node aware of memory less nodes David Hildenbrand
  2022-01-27 13:34   ` Mike Rapoport
@ 2022-01-28 10:59   ` Oscar Salvador
  2022-02-01  2:43   ` Wei Yang
  3 siblings, 0 replies; 41+ messages in thread
From: Oscar Salvador @ 2022-01-28 10:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Tejun Heo,
	Christoph Lameter, Nico Pache, Wei Yang, Rafael Aquini,
	Michal Hocko

On 2022-01-27 09:53, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> free_area_init_node is also called from memory less node initialization
> path (free_area_init_memoryless_node). It doesn't really make much 
> sense
> to display the physical memory range for those nodes:
> Initmem setup node XX [mem 0x0000000000000000-0x0000000000000000]
> 
> Instead be explicit that the node is memoryless:
> Initmem setup node XX as memoryless
> 
> Acked-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] memcg: do not tweak node in alloc_mem_cgroup_per_node_info
       [not found] ` <20220127085305.20890-7-mhocko@kernel.org>
  2022-01-27 12:50   ` [PATCH 6/6] memcg: do not tweak node in alloc_mem_cgroup_per_node_info David Hildenbrand
@ 2022-01-28 11:01   ` Oscar Salvador
  2022-02-01  2:45   ` Wei Yang
  2 siblings, 0 replies; 41+ messages in thread
From: Oscar Salvador @ 2022-01-28 11:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Tejun Heo,
	Christoph Lameter, Nico Pache, Wei Yang, Rafael Aquini,
	Michal Hocko

On 2022-01-27 09:53, Michal Hocko wrote:
> From: Wei Yang <richard.weiyang@gmail.com>
> 
> alloc_mem_cgroup_per_node_info is allocated for each possible node and
> this used to be a problem because not !node_online nodes didn't have
> appropriate data structure allocated. This has changed by "mm: handle
> uninitialized numa nodes gracefully" so we can drop the special casing
> here.
> 
> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
       [not found] ` <20220127085305.20890-3-mhocko@kernel.org>
                     ` (2 preceding siblings ...)
  2022-01-28  6:27   ` Oscar Salvador
@ 2022-01-31 10:34   ` Michal Hocko
  3 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2022-01-31 10:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, David Hildenbrand, Alexey Makhalov, Dennis Zhou,
	Eric Dumazet, Oscar Salvador, Tejun Heo, Christoph Lameter,
	Nico Pache, Wei Yang, Rafael Aquini

This is an updated version of the patch including acks and reviews. All
other patches in this series have only got acks and reviews at this
stage so I am not reposting the full bunch. Please add to the MM tree
Andrew. Thanks!
---
From ad93c2641c4073213a0f9f0d0cd5286aff0de8e0 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 9 Dec 2021 10:00:02 +0100
Subject: [PATCH] mm: handle uninitialized numa nodes gracefully

We have had several reports [1][2][3] that page allocator blows up when
an allocation from a possible node is requested. The underlying reason
is that NODE_DATA for the specific node is not allocated.

NUMA specific initialization is arch specific and it can vary a lot.
E.g. x86 tries to initialize all nodes that have some cpu affinity (see
init_cpu_to_node) but this can be insufficient because the node might be
cpuless for example.

One way to address this problem would be to check for !node_online nodes
when trying to get a zonelist and silently fall back to another node.
That is unfortunately adding a branch into allocator hot path and it
doesn't handle any other potential NODE_DATA users.

This patch takes a different approach (following a lead of [3]) and it
pre allocates pgdat for all possible nodes in an arch indipendent code
- free_area_init. All uninitialized nodes are treated as memoryless
nodes. node_state of the node is not changed because that would lead to
other side effects - e.g. sysfs representation of such a node and from
past discussions [4] it is known that some tools might have problems
digesting that.

Newly allocated pgdat only gets a minimal initialization and the rest of
the work is expected to be done by the memory hotplug - hotadd_new_pgdat
(renamed to hotadd_init_pgdat).

generic_alloc_nodedata is changed to use the memblock allocator because
neither page nor slab allocators are available at the stage when all
pgdats are allocated. Hotplug doesn't allocate pgdat anymore so we can
use the early boot allocator. The only arch specific implementation is
ia64 and that is changed to use the early allocator as well.

Reported-by: Alexey Makhalov <amakhalov@vmware.com>
Tested-by: Alexey Makhalov <amakhalov@vmware.com>
Reported-by: Nico Pache <npache@redhat.com>
Acked-by: Rafael Aquini <raquini@redhat.com>
Tested-by: Rafael Aquini <raquini@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>

[1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com
[2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com
[3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org
[4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com
---
 arch/ia64/mm/discontig.c       |  4 ++--
 include/linux/memory_hotplug.h |  2 +-
 mm/internal.h                  |  2 ++
 mm/memory_hotplug.c            | 21 +++++++++-----------
 mm/page_alloc.c                | 36 ++++++++++++++++++++++++++++++----
 5 files changed, 46 insertions(+), 19 deletions(-)

diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 8dc8a554f774..dd0cf4834eaa 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -608,11 +608,11 @@ void __init paging_init(void)
 	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
 }
 
-pg_data_t *arch_alloc_nodedata(int nid)
+pg_data_t * __init arch_alloc_nodedata(int nid)
 {
 	unsigned long size = compute_pernodesize(nid);
 
-	return kzalloc(size, GFP_KERNEL);
+	return memblock_alloc(size, SMP_CACHE_BYTES);
 }
 
 void arch_free_nodedata(pg_data_t *pgdat)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 4355983b364d..cdd66bfdf855 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -44,7 +44,7 @@ extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
  */
 #define generic_alloc_nodedata(nid)				\
 ({								\
-	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
+	memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);	\
 })
 /*
  * This definition is just for error path in node hotadd.
diff --git a/mm/internal.h b/mm/internal.h
index d80300392a19..43b8ccf56b7f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -718,4 +718,6 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 		      unsigned long addr, int page_nid, int *flags);
 
+DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2a9627dc784c..fc991831d296 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1162,19 +1162,21 @@ static void reset_node_present_pages(pg_data_t *pgdat)
 }
 
 /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
-static pg_data_t __ref *hotadd_new_pgdat(int nid)
+static pg_data_t __ref *hotadd_init_pgdat(int nid)
 {
 	struct pglist_data *pgdat;
 
 	pgdat = NODE_DATA(nid);
-	if (!pgdat) {
-		pgdat = arch_alloc_nodedata(nid);
-		if (!pgdat)
-			return NULL;
 
+	/*
+	 * NODE_DATA is preallocated (free_area_init) but its internal
+	 * state is not allocated completely. Add missing pieces.
+	 * Completely offline nodes stay around and they just need
+	 * reintialization.
+	 */
+	if (pgdat->per_cpu_nodestats == &boot_nodestats) {
 		pgdat->per_cpu_nodestats =
 			alloc_percpu(struct per_cpu_nodestat);
-		arch_refresh_nodedata(nid, pgdat);
 	} else {
 		int cpu;
 		/*
@@ -1193,8 +1195,6 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid)
 		}
 	}
 
-	/* we can use NODE_DATA(nid) from here */
-	pgdat->node_id = nid;
 	pgdat->node_start_pfn = 0;
 
 	/* init node's zones as empty zones, we don't have any present pages.*/
@@ -1246,7 +1246,7 @@ static int __try_online_node(int nid, bool set_node_online)
 	if (node_online(nid))
 		return 0;
 
-	pgdat = hotadd_new_pgdat(nid);
+	pgdat = hotadd_init_pgdat(nid);
 	if (!pgdat) {
 		pr_err("Cannot online node %d due to NULL pgdat\n", nid);
 		ret = -ENOMEM;
@@ -1445,9 +1445,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 
 	return ret;
 error:
-	/* rollback pgdat allocation and others */
-	if (new_node)
-		rollback_node_hotadd(nid);
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
 		memblock_remove(start, size);
 error_mem_hotplug_end:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..ef9cfc1069e9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6380,7 +6380,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 #define BOOT_PAGESET_BATCH	1
 static DEFINE_PER_CPU(struct per_cpu_pages, boot_pageset);
 static DEFINE_PER_CPU(struct per_cpu_zonestat, boot_zonestats);
-static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
+DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
 static void __build_all_zonelists(void *data)
 {
@@ -6402,7 +6402,11 @@ static void __build_all_zonelists(void *data)
 	if (self && !node_online(self->node_id)) {
 		build_zonelists(self);
 	} else {
-		for_each_online_node(nid) {
+		/*
+		 * All possible nodes have pgdat preallocated
+		 * in free_area_init
+		 */
+		for_each_node(nid) {
 			pg_data_t *pgdat = NODE_DATA(nid);
 
 			build_zonelists(pgdat);
@@ -8096,8 +8100,32 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
 	setup_nr_node_ids();
-	for_each_online_node(nid) {
-		pg_data_t *pgdat = NODE_DATA(nid);
+	for_each_node(nid) {
+		pg_data_t *pgdat;
+
+		if (!node_online(nid)) {
+			pr_info("Initializing node %d as memoryless\n", nid);
+
+			/* Allocator not initialized yet */
+			pgdat = arch_alloc_nodedata(nid);
+			if (!pgdat) {
+				pr_err("Cannot allocate %zuB for node %d.\n",
+						sizeof(*pgdat), nid);
+				continue;
+			}
+			arch_refresh_nodedata(nid, pgdat);
+			free_area_init_memoryless_node(nid);
+			/*
+			 * not marking this node online because we do not want to
+			 * confuse userspace by sysfs files/directories for node
+			 * without any memory attached to it (see topology_init)
+			 * The pgdat will get fully initialized when a memory is
+			 * hotpluged into it by hotadd_init_pgdat
+			 */
+			continue;
+		}
+
+		pgdat = NODE_DATA(nid);
 		free_area_init_node(nid);
 
 		/* Any memory on that node */
-- 
2.30.2

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG
       [not found] ` <20220127085305.20890-2-mhocko@kernel.org>
                     ` (2 preceding siblings ...)
  2022-01-28  6:18   ` Oscar Salvador
@ 2022-02-01  2:13   ` Wei Yang
  3 siblings, 0 replies; 41+ messages in thread
From: Wei Yang @ 2022-02-01  2:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Wei Yang,
	Rafael Aquini, Michal Hocko

On Thu, Jan 27, 2022 at 09:53:00AM +0100, Michal Hocko wrote:
>From: Michal Hocko <mhocko@suse.com>
>
>This is a preparatory patch and it doesn't introduce any functional
>change. It merely pulls out arch_alloc_nodedata (and co) outside of
>CONFIG_MEMORY_HOTPLUG because the following patch will need to call this
>from the generic MM code.
>
>Acked-by: Rafael Aquini <raquini@redhat.com>
>Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
  2022-01-27 14:47     ` Michal Hocko
  2022-01-27 15:04       ` Mike Rapoport
@ 2022-02-01  2:41       ` Wei Yang
  2022-02-01  9:54         ` Michal Hocko
  1 sibling, 1 reply; 41+ messages in thread
From: Wei Yang @ 2022-02-01  2:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Wei Yang,
	Rafael Aquini

On Thu, Jan 27, 2022 at 03:47:40PM +0100, Michal Hocko wrote:
>[...]
>> > +
>> > +			/* Allocator not initialized yet */
>> > +			pgdat = arch_alloc_nodedata(nid);
>> > +			if (!pgdat) {
>> > +				pr_err("Cannot allocate %zuB for node %d.\n",
>> > +						sizeof(*pgdat), nid);
>> > +				continue;
>> > +			}
>> > +			arch_refresh_nodedata(nid, pgdat);
>> > +			free_area_init_memoryless_node(nid);

free_area_init_memoryless_node() seems to be defined used out side
page_alloc.c? It just call free_area_init_node() directly. We want to use the
name to reflect the effect?

>> > +			/*
>> > +			 * not marking this node online because we do not want to
>> > +			 * confuse userspace by sysfs files/directories for node
>> > +			 * without any memory attached to it (see topology_init)
>> > +			 * The pgdat will get fully initialized when a memory is
>> > +			 * hotpluged into it by hotadd_init_pgdat
>> > +			 */

Hmm... which following step would mark the node online? On x86, the node is
onlined in alloc_node_date(). This is not onlined here.

>> > +			continue;
>> 
>> This can be made slightly more concise if we fall through after
>> arch_refresh_nodedata(), e.g. something like
>> 
>> 			...
>> 
>> 			arch_refresh_nodedata(nid, pgdat);
>> 		}
>> 
>> 		pgdat = NODE_DATA(nid);
>> 		free_area_init_node(nid);
>> 
>> 		/*
>> 		 * Do not mark memoryless node online because we do not want to
>> 		 * confuse userspace by sysfs files/directories for node
>> 		 * without any memory attached to it (see topology_init)
>> 		 * The pgdat will get fully initialized when a memory is
>> 		 * hotpluged into it by hotadd_init_pgdat
>> 		 */
>> 		if (!pgdat->node_present_pages)
>> 			continue;
>> 
>> but I don't feel strongly about it.
>
>I do not have strong preference either way. Unless this is considered
>better by more people I would stick with what I have.

I agree with Mike.

>-- 
>Michal Hocko
>SUSE Labs

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/6] mm, memory_hotplug: drop arch_free_nodedata
       [not found] ` <20220127085305.20890-4-mhocko@kernel.org>
                     ` (2 preceding siblings ...)
  2022-01-28  6:29   ` Oscar Salvador
@ 2022-02-01  2:41   ` Wei Yang
  3 siblings, 0 replies; 41+ messages in thread
From: Wei Yang @ 2022-02-01  2:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Wei Yang,
	Rafael Aquini, Michal Hocko

On Thu, Jan 27, 2022 at 09:53:02AM +0100, Michal Hocko wrote:
>From: Michal Hocko <mhocko@suse.com>
>
>Prior to "mm: handle uninitialized numa nodes gracefully" memory hotplug
>used to allocate pgdat when memory has been added to a node
>(hotadd_init_pgdat) arch_free_nodedata has been only used in the
>failure path because once the pgdat is exported (to be visible
>by NODA_DATA(nid)) it cannot really be freed because there is no
>synchronization available for that.
>
>pgdat is allocated for each possible nodes now so the memory hotplug
>doesn't need to do the ever use arch_free_nodedata so drop it.
>
>This patch doesn't introduce any functional change.
>
>Acked-by: Rafael Aquini <raquini@redhat.com>
>Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] mm, memory_hotplug: reorganize new pgdat initialization
       [not found] ` <20220127085305.20890-5-mhocko@kernel.org>
                     ` (2 preceding siblings ...)
  2022-01-28 10:51   ` Oscar Salvador
@ 2022-02-01  2:42   ` Wei Yang
  3 siblings, 0 replies; 41+ messages in thread
From: Wei Yang @ 2022-02-01  2:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Wei Yang,
	Rafael Aquini, Michal Hocko

On Thu, Jan 27, 2022 at 09:53:03AM +0100, Michal Hocko wrote:
>From: Michal Hocko <mhocko@suse.com>
>
>When a !node_online node is brought up it needs a hotplug specific
>initialization because the node could be either uninitialized yet or it
>could have been recycled after previous hotremove. hotadd_init_pgdat is
>responsible for that.
>
>Internal pgdat state is initialized at two places currently
>	- hotadd_init_pgdat
>	- free_area_init_core_hotplug
>There is no real clear cut what should go where but this patch's chosen to
>move the whole internal state initialization into free_area_init_core_hotplug.
>hotadd_init_pgdat is still responsible to pull all the parts together -
>most notably to initialize zonelists because those depend on the overall topology.
>
>This patch doesn't introduce any functional change.
>
>Acked-by: Rafael Aquini <raquini@redhat.com>
>Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/6] mm: make free_area_init_node aware of memory less nodes
       [not found] ` <20220127085305.20890-6-mhocko@kernel.org>
                     ` (2 preceding siblings ...)
  2022-01-28 10:59   ` Oscar Salvador
@ 2022-02-01  2:43   ` Wei Yang
  3 siblings, 0 replies; 41+ messages in thread
From: Wei Yang @ 2022-02-01  2:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Wei Yang,
	Rafael Aquini, Michal Hocko

On Thu, Jan 27, 2022 at 09:53:04AM +0100, Michal Hocko wrote:
>From: Michal Hocko <mhocko@suse.com>
>
>free_area_init_node is also called from memory less node initialization
>path (free_area_init_memoryless_node). It doesn't really make much sense
>to display the physical memory range for those nodes:
>Initmem setup node XX [mem 0x0000000000000000-0x0000000000000000]
>
>Instead be explicit that the node is memoryless:
>Initmem setup node XX as memoryless
>
>Acked-by: Rafael Aquini <raquini@redhat.com>
>Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] memcg: do not tweak node in alloc_mem_cgroup_per_node_info
       [not found] ` <20220127085305.20890-7-mhocko@kernel.org>
  2022-01-27 12:50   ` [PATCH 6/6] memcg: do not tweak node in alloc_mem_cgroup_per_node_info David Hildenbrand
  2022-01-28 11:01   ` Oscar Salvador
@ 2022-02-01  2:45   ` Wei Yang
  2022-02-01  9:51     ` Michal Hocko
  2 siblings, 1 reply; 41+ messages in thread
From: Wei Yang @ 2022-02-01  2:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Wei Yang,
	Rafael Aquini, Michal Hocko

On Thu, Jan 27, 2022 at 09:53:05AM +0100, Michal Hocko wrote:
>From: Wei Yang <richard.weiyang@gmail.com>
>
>alloc_mem_cgroup_per_node_info is allocated for each possible node and
>this used to be a problem because not !node_online nodes didn't have
                                   ^^^ a typo here?

				   !node_online nodes is enough?

>appropriate data structure allocated. This has changed by "mm: handle
>uninitialized numa nodes gracefully" so we can drop the special casing
>here.
>
>Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>Signed-off-by: Michal Hocko <mhocko@suse.com>

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] memcg: do not tweak node in alloc_mem_cgroup_per_node_info
  2022-02-01  2:45   ` Wei Yang
@ 2022-02-01  9:51     ` Michal Hocko
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2022-02-01  9:51 UTC (permalink / raw)
  To: Wei Yang
  Cc: Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Rafael Aquini

On Tue 01-02-22 02:45:44, Wei Yang wrote:
> On Thu, Jan 27, 2022 at 09:53:05AM +0100, Michal Hocko wrote:
> >From: Wei Yang <richard.weiyang@gmail.com>
> >
> >alloc_mem_cgroup_per_node_info is allocated for each possible node and
> >this used to be a problem because not !node_online nodes didn't have
>                                    ^^^ a typo here?
> 
> 				   !node_online nodes is enough?

Yes. Fixed up. Thx!

 
> >appropriate data structure allocated. This has changed by "mm: handle
> >uninitialized numa nodes gracefully" so we can drop the special casing
> >here.
> >
> >Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> >Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> -- 
> Wei Yang
> Help you, Help me

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
  2022-02-01  2:41       ` Wei Yang
@ 2022-02-01  9:54         ` Michal Hocko
  2022-02-03  0:21           ` Wei Yang
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2022-02-01  9:54 UTC (permalink / raw)
  To: Wei Yang
  Cc: Mike Rapoport, Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Rafael Aquini

On Tue 01-02-22 02:41:19, Wei Yang wrote:
> On Thu, Jan 27, 2022 at 03:47:40PM +0100, Michal Hocko wrote:
> >[...]
> >> > +
> >> > +			/* Allocator not initialized yet */
> >> > +			pgdat = arch_alloc_nodedata(nid);
> >> > +			if (!pgdat) {
> >> > +				pr_err("Cannot allocate %zuB for node %d.\n",
> >> > +						sizeof(*pgdat), nid);
> >> > +				continue;
> >> > +			}
> >> > +			arch_refresh_nodedata(nid, pgdat);
> >> > +			free_area_init_memoryless_node(nid);
> 
> free_area_init_memoryless_node() seems to be defined used out side
> page_alloc.c? It just call free_area_init_node() directly. We want to use the
> name to reflect the effect?

yes.

> >> > +			/*
> >> > +			 * not marking this node online because we do not want to
> >> > +			 * confuse userspace by sysfs files/directories for node
> >> > +			 * without any memory attached to it (see topology_init)
> >> > +			 * The pgdat will get fully initialized when a memory is
> >> > +			 * hotpluged into it by hotadd_init_pgdat
> >> > +			 */
> 
> Hmm... which following step would mark the node online? On x86, the node is
> onlined in alloc_node_date(). This is not onlined here.

The comment tries to explain that this happens during the memory
hotplug. Or maybe I have missed your question?

[...]
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
  2022-02-01  9:54         ` Michal Hocko
@ 2022-02-03  0:21           ` Wei Yang
  2022-02-03  7:23             ` Mike Rapoport
  0 siblings, 1 reply; 41+ messages in thread
From: Wei Yang @ 2022-02-03  0:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Wei Yang, Mike Rapoport, Andrew Morton, linux-mm, LKML,
	David Hildenbrand, Alexey Makhalov, Dennis Zhou, Eric Dumazet,
	Oscar Salvador, Tejun Heo, Christoph Lameter, Nico Pache,
	Rafael Aquini

On Tue, Feb 01, 2022 at 10:54:37AM +0100, Michal Hocko wrote:
>On Tue 01-02-22 02:41:19, Wei Yang wrote:
>> On Thu, Jan 27, 2022 at 03:47:40PM +0100, Michal Hocko wrote:
>> [...]
>> >> > +			/*
>> >> > +			 * not marking this node online because we do not want to
>> >> > +			 * confuse userspace by sysfs files/directories for node
>> >> > +			 * without any memory attached to it (see topology_init)
>> >> > +			 * The pgdat will get fully initialized when a memory is
>> >> > +			 * hotpluged into it by hotadd_init_pgdat
>> >> > +			 */
>> 
>> Hmm... which following step would mark the node online? On x86, the node is
>> onlined in alloc_node_date(). This is not onlined here.
>
>The comment tries to explain that this happens during the memory
>hotplug. Or maybe I have missed your question?
>

I am not sure for others, while the comment confused me a little.

Currently in kernel, there are two situations for node online:

  * during memory hotplug
  * during sys-init

For memory hotplug, we allocate pgdat and online node. And current hot-add
process has already put them in two steps:

  1. __try_online_node()
  2. node_set_online()

So emphasize "not online" node here, confuse me a little. It is a natural
thing to not online node until it has memory.

But from another point of view, the comment here is reasonable. During
sys-init, we online node at the same time when creating pgdat. And even for
memory-less node on x86, we online them too.

Well, this is all about the comment. I have tried to grab may head but not
come up with a better idea.

Or maybe this is just my personal feeling, don't bother if no-one else feel
like this.

>[...]
>-- 
>Michal Hocko
>SUSE Labs

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
  2022-02-03  0:21           ` Wei Yang
@ 2022-02-03  7:23             ` Mike Rapoport
  2022-02-03  8:27               ` David Hildenbrand
  2022-02-03  8:39               ` Michal Hocko
  0 siblings, 2 replies; 41+ messages in thread
From: Mike Rapoport @ 2022-02-03  7:23 UTC (permalink / raw)
  To: Wei Yang
  Cc: Michal Hocko, Andrew Morton, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Rafael Aquini

On Thu, Feb 03, 2022 at 12:21:16AM +0000, Wei Yang wrote:
> On Tue, Feb 01, 2022 at 10:54:37AM +0100, Michal Hocko wrote:
> >On Tue 01-02-22 02:41:19, Wei Yang wrote:
> >> On Thu, Jan 27, 2022 at 03:47:40PM +0100, Michal Hocko wrote:
> >> [...]
> >> >> > +			/*
> >> >> > +			 * not marking this node online because we do not want to
> >> >> > +			 * confuse userspace by sysfs files/directories for node
> >> >> > +			 * without any memory attached to it (see topology_init)
> >> >> > +			 * The pgdat will get fully initialized when a memory is
> >> >> > +			 * hotpluged into it by hotadd_init_pgdat
> >> >> > +			 */
> >> 
> >> Hmm... which following step would mark the node online? On x86, the node is
> >> onlined in alloc_node_date(). This is not onlined here.
> >
> >The comment tries to explain that this happens during the memory
> >hotplug. Or maybe I have missed your question?
> >
> 
> I am not sure for others, while the comment confused me a little.
> 
> Currently in kernel, there are two situations for node online:
> 
>   * during memory hotplug
>   * during sys-init
> 
> For memory hotplug, we allocate pgdat and online node. And current hot-add
> process has already put them in two steps:
> 
>   1. __try_online_node()
>   2. node_set_online()
> 
> So emphasize "not online" node here, confuse me a little. It is a natural
> thing to not online node until it has memory.
> 
> But from another point of view, the comment here is reasonable. During
> sys-init, we online node at the same time when creating pgdat. And even for
> memory-less node on x86, we online them too.
> 
> Well, this is all about the comment. I have tried to grab may head but not
> come up with a better idea.
> 
> Or maybe this is just my personal feeling, don't bother if no-one else feel
> like this.

I shuffled the words a bit, maybe this sounds better not only to me :)

/*
 * We do not want to confuse userspace by sysfs files/directories for node
 * without any memory attached to it, so this node is not marked as
 * N_MEMORY and not marked online so that topology_init() won't create
 * sysfs hierarchy for this node.  The pgdat will get fully initialized by
 * hotadd_init_pgdat() when memory is hotpluged into this node
 */

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
  2022-02-03  7:23             ` Mike Rapoport
@ 2022-02-03  8:27               ` David Hildenbrand
  2022-02-03  9:08                 ` Mike Rapoport
  2022-02-03  8:39               ` Michal Hocko
  1 sibling, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2022-02-03  8:27 UTC (permalink / raw)
  To: Mike Rapoport, Wei Yang
  Cc: Michal Hocko, Andrew Morton, linux-mm, LKML, Alexey Makhalov,
	Dennis Zhou, Eric Dumazet, Oscar Salvador, Tejun Heo,
	Christoph Lameter, Nico Pache, Rafael Aquini

On 03.02.22 08:23, Mike Rapoport wrote:
> On Thu, Feb 03, 2022 at 12:21:16AM +0000, Wei Yang wrote:
>> On Tue, Feb 01, 2022 at 10:54:37AM +0100, Michal Hocko wrote:
>>> On Tue 01-02-22 02:41:19, Wei Yang wrote:
>>>> On Thu, Jan 27, 2022 at 03:47:40PM +0100, Michal Hocko wrote:
>>>> [...]
>>>>>>> +			/*
>>>>>>> +			 * not marking this node online because we do not want to
>>>>>>> +			 * confuse userspace by sysfs files/directories for node
>>>>>>> +			 * without any memory attached to it (see topology_init)
>>>>>>> +			 * The pgdat will get fully initialized when a memory is
>>>>>>> +			 * hotpluged into it by hotadd_init_pgdat
>>>>>>> +			 */
>>>>
>>>> Hmm... which following step would mark the node online? On x86, the node is
>>>> onlined in alloc_node_date(). This is not onlined here.
>>>
>>> The comment tries to explain that this happens during the memory
>>> hotplug. Or maybe I have missed your question?
>>>
>>
>> I am not sure for others, while the comment confused me a little.
>>
>> Currently in kernel, there are two situations for node online:
>>
>>   * during memory hotplug
>>   * during sys-init
>>
>> For memory hotplug, we allocate pgdat and online node. And current hot-add
>> process has already put them in two steps:
>>
>>   1. __try_online_node()
>>   2. node_set_online()
>>
>> So emphasize "not online" node here, confuse me a little. It is a natural
>> thing to not online node until it has memory.
>>
>> But from another point of view, the comment here is reasonable. During
>> sys-init, we online node at the same time when creating pgdat. And even for
>> memory-less node on x86, we online them too.
>>
>> Well, this is all about the comment. I have tried to grab may head but not
>> come up with a better idea.
>>
>> Or maybe this is just my personal feeling, don't bother if no-one else feel
>> like this.
> 
> I shuffled the words a bit, maybe this sounds better not only to me :)
> 
> /*
>  * We do not want to confuse userspace by sysfs files/directories for node
>  * without any memory attached to it, so this node is not marked as
>  * N_MEMORY and not marked online so that topology_init() won't create
>  * sysfs hierarchy for this node.  The pgdat will get fully initialized by
>  * hotadd_init_pgdat() when memory is hotpluged into this node
>  */
> 

Note that the topology_init() part might change soon [1] so maybe we
want to rephrase that to "so that no sysfs hierarchy will be created via
register_one_node() for this node." right away.

or (because there is also /sys/devices/system/node/online  )

"so that the node won't be indicated as online to user space, for
example, via register_one_node()."

[1] https://lkml.kernel.org/r/20220128151540.164759-1-david@redhat.com

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
  2022-02-03  7:23             ` Mike Rapoport
  2022-02-03  8:27               ` David Hildenbrand
@ 2022-02-03  8:39               ` Michal Hocko
  2022-02-04 22:54                 ` Andrew Morton
  1 sibling, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2022-02-03  8:39 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton
  Cc: Wei Yang, linux-mm, LKML, David Hildenbrand, Alexey Makhalov,
	Dennis Zhou, Eric Dumazet, Oscar Salvador, Tejun Heo,
	Christoph Lameter, Nico Pache, Rafael Aquini

On Thu 03-02-22 09:23:20, Mike Rapoport wrote:
> On Thu, Feb 03, 2022 at 12:21:16AM +0000, Wei Yang wrote:
> > On Tue, Feb 01, 2022 at 10:54:37AM +0100, Michal Hocko wrote:
> > >On Tue 01-02-22 02:41:19, Wei Yang wrote:
> > >> On Thu, Jan 27, 2022 at 03:47:40PM +0100, Michal Hocko wrote:
> > >> [...]
> > >> >> > +			/*
> > >> >> > +			 * not marking this node online because we do not want to
> > >> >> > +			 * confuse userspace by sysfs files/directories for node
> > >> >> > +			 * without any memory attached to it (see topology_init)
> > >> >> > +			 * The pgdat will get fully initialized when a memory is
> > >> >> > +			 * hotpluged into it by hotadd_init_pgdat
> > >> >> > +			 */
> > >> 
> > >> Hmm... which following step would mark the node online? On x86, the node is
> > >> onlined in alloc_node_date(). This is not onlined here.
> > >
> > >The comment tries to explain that this happens during the memory
> > >hotplug. Or maybe I have missed your question?
> > >
> > 
> > I am not sure for others, while the comment confused me a little.
> > 
> > Currently in kernel, there are two situations for node online:
> > 
> >   * during memory hotplug
> >   * during sys-init
> > 
> > For memory hotplug, we allocate pgdat and online node. And current hot-add
> > process has already put them in two steps:
> > 
> >   1. __try_online_node()
> >   2. node_set_online()
> > 
> > So emphasize "not online" node here, confuse me a little. It is a natural
> > thing to not online node until it has memory.
> > 
> > But from another point of view, the comment here is reasonable. During
> > sys-init, we online node at the same time when creating pgdat. And even for
> > memory-less node on x86, we online them too.
> > 
> > Well, this is all about the comment. I have tried to grab may head but not
> > come up with a better idea.
> > 
> > Or maybe this is just my personal feeling, don't bother if no-one else feel
> > like this.
> 
> I shuffled the words a bit, maybe this sounds better not only to me :)
> 
> /*
>  * We do not want to confuse userspace by sysfs files/directories for node
>  * without any memory attached to it, so this node is not marked as
>  * N_MEMORY and not marked online so that topology_init() won't create
>  * sysfs hierarchy for this node.  The pgdat will get fully initialized by
>  * hotadd_init_pgdat() when memory is hotpluged into this node
>  */

Yes, I like this comment more than mine. Andrew could you replace it in
your tree or should I repost the patch?

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
  2022-02-03  8:27               ` David Hildenbrand
@ 2022-02-03  9:08                 ` Mike Rapoport
  2022-02-03  9:11                   ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Mike Rapoport @ 2022-02-03  9:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Wei Yang, Michal Hocko, Andrew Morton, linux-mm, LKML,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Rafael Aquini

On Thu, Feb 03, 2022 at 09:27:16AM +0100, David Hildenbrand wrote:
> On 03.02.22 08:23, Mike Rapoport wrote:
> > On Thu, Feb 03, 2022 at 12:21:16AM +0000, Wei Yang wrote:
> >> On Tue, Feb 01, 2022 at 10:54:37AM +0100, Michal Hocko wrote:
> >>> On Tue 01-02-22 02:41:19, Wei Yang wrote:
> >>>> On Thu, Jan 27, 2022 at 03:47:40PM +0100, Michal Hocko wrote:
> >>>> [...]
> >>>>>>> +			/*
> >>>>>>> +			 * not marking this node online because we do not want to
> >>>>>>> +			 * confuse userspace by sysfs files/directories for node
> >>>>>>> +			 * without any memory attached to it (see topology_init)
> >>>>>>> +			 * The pgdat will get fully initialized when a memory is
> >>>>>>> +			 * hotpluged into it by hotadd_init_pgdat
> >>>>>>> +			 */
> >>>>
> >>>> Hmm... which following step would mark the node online? On x86, the node is
> >>>> onlined in alloc_node_date(). This is not onlined here.
> >>>
> >>> The comment tries to explain that this happens during the memory
> >>> hotplug. Or maybe I have missed your question?
> >>>
> >>
> >> I am not sure for others, while the comment confused me a little.
> >>
> >> Currently in kernel, there are two situations for node online:
> >>
> >>   * during memory hotplug
> >>   * during sys-init
> >>
> >> For memory hotplug, we allocate pgdat and online node. And current hot-add
> >> process has already put them in two steps:
> >>
> >>   1. __try_online_node()
> >>   2. node_set_online()
> >>
> >> So emphasize "not online" node here, confuse me a little. It is a natural
> >> thing to not online node until it has memory.
> >>
> >> But from another point of view, the comment here is reasonable. During
> >> sys-init, we online node at the same time when creating pgdat. And even for
> >> memory-less node on x86, we online them too.
> >>
> >> Well, this is all about the comment. I have tried to grab may head but not
> >> come up with a better idea.
> >>
> >> Or maybe this is just my personal feeling, don't bother if no-one else feel
> >> like this.
> > 
> > I shuffled the words a bit, maybe this sounds better not only to me :)
> > 
> > /*
> >  * We do not want to confuse userspace by sysfs files/directories for node
> >  * without any memory attached to it, so this node is not marked as
> >  * N_MEMORY and not marked online so that topology_init() won't create
> >  * sysfs hierarchy for this node.  The pgdat will get fully initialized by
> >  * hotadd_init_pgdat() when memory is hotpluged into this node
> >  */
> > 
> 
> Note that the topology_init() part might change soon [1] so maybe we
> want to rephrase that to "so that no sysfs hierarchy will be created via
> register_one_node() for this node." right away.

Heh, this will be your responsibility to update the comment here when you
post non-RFC version ;-)
 
> or (because there is also /sys/devices/system/node/online  )
> 
> "so that the node won't be indicated as online to user space, for
> example, via register_one_node()."
> 
> [1] https://lkml.kernel.org/r/20220128151540.164759-1-david@redhat.com
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
  2022-02-03  9:08                 ` Mike Rapoport
@ 2022-02-03  9:11                   ` David Hildenbrand
  2022-02-03  9:19                     ` Michal Hocko
  0 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2022-02-03  9:11 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Wei Yang, Michal Hocko, Andrew Morton, linux-mm, LKML,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Rafael Aquini

On 03.02.22 10:08, Mike Rapoport wrote:
> On Thu, Feb 03, 2022 at 09:27:16AM +0100, David Hildenbrand wrote:
>> On 03.02.22 08:23, Mike Rapoport wrote:
>>> On Thu, Feb 03, 2022 at 12:21:16AM +0000, Wei Yang wrote:
>>>> On Tue, Feb 01, 2022 at 10:54:37AM +0100, Michal Hocko wrote:
>>>>> On Tue 01-02-22 02:41:19, Wei Yang wrote:
>>>>>> On Thu, Jan 27, 2022 at 03:47:40PM +0100, Michal Hocko wrote:
>>>>>> [...]
>>>>>>>>> +			/*
>>>>>>>>> +			 * not marking this node online because we do not want to
>>>>>>>>> +			 * confuse userspace by sysfs files/directories for node
>>>>>>>>> +			 * without any memory attached to it (see topology_init)
>>>>>>>>> +			 * The pgdat will get fully initialized when a memory is
>>>>>>>>> +			 * hotpluged into it by hotadd_init_pgdat
>>>>>>>>> +			 */
>>>>>>
>>>>>> Hmm... which following step would mark the node online? On x86, the node is
>>>>>> onlined in alloc_node_date(). This is not onlined here.
>>>>>
>>>>> The comment tries to explain that this happens during the memory
>>>>> hotplug. Or maybe I have missed your question?
>>>>>
>>>>
>>>> I am not sure for others, while the comment confused me a little.
>>>>
>>>> Currently in kernel, there are two situations for node online:
>>>>
>>>>   * during memory hotplug
>>>>   * during sys-init
>>>>
>>>> For memory hotplug, we allocate pgdat and online node. And current hot-add
>>>> process has already put them in two steps:
>>>>
>>>>   1. __try_online_node()
>>>>   2. node_set_online()
>>>>
>>>> So emphasize "not online" node here, confuse me a little. It is a natural
>>>> thing to not online node until it has memory.
>>>>
>>>> But from another point of view, the comment here is reasonable. During
>>>> sys-init, we online node at the same time when creating pgdat. And even for
>>>> memory-less node on x86, we online them too.
>>>>
>>>> Well, this is all about the comment. I have tried to grab may head but not
>>>> come up with a better idea.
>>>>
>>>> Or maybe this is just my personal feeling, don't bother if no-one else feel
>>>> like this.
>>>
>>> I shuffled the words a bit, maybe this sounds better not only to me :)
>>>
>>> /*
>>>  * We do not want to confuse userspace by sysfs files/directories for node
>>>  * without any memory attached to it, so this node is not marked as
>>>  * N_MEMORY and not marked online so that topology_init() won't create
>>>  * sysfs hierarchy for this node.  The pgdat will get fully initialized by
>>>  * hotadd_init_pgdat() when memory is hotpluged into this node
>>>  */
>>>
>>
>> Note that the topology_init() part might change soon [1] so maybe we
>> want to rephrase that to "so that no sysfs hierarchy will be created via
>> register_one_node() for this node." right away.
> 
> Heh, this will be your responsibility to update the comment here when you
> post non-RFC version ;-)

I'm usually sending patches against Linus' tree. And I'll post non-RFC
most probably today (after testing on aarch64) ;)

So I'd appreciate if we could just phrase it more generically, as I
tried. But of course, we can try making my life harder ;)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
  2022-02-03  9:11                   ` David Hildenbrand
@ 2022-02-03  9:19                     ` Michal Hocko
  2022-02-03  9:21                       ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2022-02-03  9:19 UTC (permalink / raw)
  To: David Hildenbrand, Mike Rapoport, Andrew Morton
  Cc: Wei Yang, linux-mm, LKML, Alexey Makhalov, Dennis Zhou,
	Eric Dumazet, Oscar Salvador, Tejun Heo, Christoph Lameter,
	Nico Pache, Rafael Aquini

So we will go with the following.
/*
 * We do not want to confuse userspace by sysfs files/directories for node
 * without any memory attached to it, so this node is not marked as
 * N_MEMORY and not marked online so that no sysfs hierarchy will be
 * created via register_one_node for this node. The pgdat will get fully
 * initialized by hotadd_init_pgdat() when memory is hotpluged into this
 * node
 */

OK?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
  2022-02-03  9:19                     ` Michal Hocko
@ 2022-02-03  9:21                       ` David Hildenbrand
  0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2022-02-03  9:21 UTC (permalink / raw)
  To: Michal Hocko, Mike Rapoport, Andrew Morton
  Cc: Wei Yang, linux-mm, LKML, Alexey Makhalov, Dennis Zhou,
	Eric Dumazet, Oscar Salvador, Tejun Heo, Christoph Lameter,
	Nico Pache, Rafael Aquini

On 03.02.22 10:19, Michal Hocko wrote:
> So we will go with the following.
> /*
>  * We do not want to confuse userspace by sysfs files/directories for node
>  * without any memory attached to it, so this node is not marked as
>  * N_MEMORY and not marked online so that no sysfs hierarchy will be
>  * created via register_one_node for this node. The pgdat will get fully
>  * initialized by hotadd_init_pgdat() when memory is hotpluged into this
>  * node
>  */
> 
> OK?

LGTM, thanks!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] mm: handle uninitialized numa nodes gracefully
  2022-02-03  8:39               ` Michal Hocko
@ 2022-02-04 22:54                 ` Andrew Morton
  0 siblings, 0 replies; 41+ messages in thread
From: Andrew Morton @ 2022-02-04 22:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Wei Yang, linux-mm, LKML, David Hildenbrand,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Rafael Aquini

On Thu, 3 Feb 2022 09:39:32 +0100 Michal Hocko <mhocko@suse.com> wrote:

> > I shuffled the words a bit, maybe this sounds better not only to me :)
> > 
> > /*
> >  * We do not want to confuse userspace by sysfs files/directories for node
> >  * without any memory attached to it, so this node is not marked as
> >  * N_MEMORY and not marked online so that topology_init() won't create
> >  * sysfs hierarchy for this node.  The pgdat will get fully initialized by
> >  * hotadd_init_pgdat() when memory is hotpluged into this node
> >  */
> 
> Yes, I like this comment more than mine. Andrew could you replace it in
> your tree or should I repost the patch?

I did this.

--- a/mm/page_alloc.c~mm-handle-uninitialized-numa-nodes-gracefully-fix
+++ a/mm/page_alloc.c
@@ -8155,12 +8155,16 @@ void __init free_area_init(unsigned long
 			}
 			arch_refresh_nodedata(nid, pgdat);
 			free_area_init_memoryless_node(nid);
+
 			/*
-			 * not marking this node online because we do not want to
-			 * confuse userspace by sysfs files/directories for node
-			 * without any memory attached to it (see topology_init)
-			 * The pgdat will get fully initialized when a memory is
-			 * hotpluged into it by hotadd_init_pgdat
+			 * We do not want to confuse userspace by sysfs
+			 * files/directories for node without any memory
+			 * attached to it, so this node is not marked as
+			 * N_MEMORY and not marked online so that no sysfs
+			 * hierarchy will be created via register_one_node for
+			 * it. The pgdat will get fully initialized by
+			 * hotadd_init_pgdat() when memory is hotplugged into
+			 * this node.
 			 */
 			continue;
 		}
_



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] mm, memory_hotplug: reorganize new pgdat initialization
  2022-01-27 14:44     ` Michal Hocko
@ 2022-02-17 10:40       ` Oscar Salvador
  0 siblings, 0 replies; 41+ messages in thread
From: Oscar Salvador @ 2022-02-17 10:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Andrew Morton, linux-mm, LKML,
	Alexey Makhalov, Dennis Zhou, Eric Dumazet, Tejun Heo,
	Christoph Lameter, Nico Pache, Wei Yang, Rafael Aquini

On Thu, Jan 27, 2022 at 03:44:21PM +0100, Michal Hocko wrote:
> > I feel like we should be initializing all of this only once, just after
> > allocating the node. There should be no difference between a node we're
> > reusing and a "fresh" node. IOW, memory offlining should be resetting
> > all state accordingly when the last memory goes away.
> > 
> > But I might be wrong and this patch looks like an improvement, as you
> > say,  without functional change
> 
> Yeah, I really wanted to have this simple and straightforward. To be
> completely honest I am not even sure this is necessary. Something really
> woth looking at.

Seizing the opportunity that I had to look at this code and at x86's
numa init code again I am preparing something to further sort this out and
simplify it a bit.

So I will have something soon unless someone beats me to it first. 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2022-02-17 10:40 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20220127085305.20890-1-mhocko@kernel.org>
     [not found] ` <20220127085305.20890-4-mhocko@kernel.org>
2022-01-27 12:42   ` [PATCH 3/6] mm, memory_hotplug: drop arch_free_nodedata David Hildenbrand
2022-01-27 13:37   ` Mike Rapoport
2022-01-28  6:29   ` Oscar Salvador
2022-02-01  2:41   ` Wei Yang
     [not found] ` <20220127085305.20890-6-mhocko@kernel.org>
2022-01-27 12:47   ` [PATCH 5/6] mm: make free_area_init_node aware of memory less nodes David Hildenbrand
2022-01-27 13:34   ` Mike Rapoport
2022-01-28 10:59   ` Oscar Salvador
2022-02-01  2:43   ` Wei Yang
     [not found] ` <20220127085305.20890-5-mhocko@kernel.org>
2022-01-27 12:46   ` [PATCH 4/6] mm, memory_hotplug: reorganize new pgdat initialization David Hildenbrand
2022-01-27 14:44     ` Michal Hocko
2022-02-17 10:40       ` Oscar Salvador
2022-01-27 13:39   ` Mike Rapoport
2022-01-27 14:45     ` Michal Hocko
2022-01-28 10:51   ` Oscar Salvador
2022-02-01  2:42   ` Wei Yang
     [not found] ` <20220127085305.20890-2-mhocko@kernel.org>
2022-01-27 12:27   ` [PATCH 1/6] mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG David Hildenbrand
2022-01-27 13:36   ` Mike Rapoport
2022-01-28  6:18   ` Oscar Salvador
2022-02-01  2:13   ` Wei Yang
     [not found] ` <20220127085305.20890-3-mhocko@kernel.org>
2022-01-27 12:41   ` [PATCH 2/6] mm: handle uninitialized numa nodes gracefully David Hildenbrand
2022-01-27 14:50     ` Michal Hocko
2022-01-27 13:37   ` Mike Rapoport
2022-01-27 14:47     ` Michal Hocko
2022-01-27 15:04       ` Mike Rapoport
2022-02-01  2:41       ` Wei Yang
2022-02-01  9:54         ` Michal Hocko
2022-02-03  0:21           ` Wei Yang
2022-02-03  7:23             ` Mike Rapoport
2022-02-03  8:27               ` David Hildenbrand
2022-02-03  9:08                 ` Mike Rapoport
2022-02-03  9:11                   ` David Hildenbrand
2022-02-03  9:19                     ` Michal Hocko
2022-02-03  9:21                       ` David Hildenbrand
2022-02-03  8:39               ` Michal Hocko
2022-02-04 22:54                 ` Andrew Morton
2022-01-28  6:27   ` Oscar Salvador
2022-01-31 10:34   ` Michal Hocko
     [not found] ` <20220127085305.20890-7-mhocko@kernel.org>
2022-01-27 12:50   ` [PATCH 6/6] memcg: do not tweak node in alloc_mem_cgroup_per_node_info David Hildenbrand
2022-01-28 11:01   ` Oscar Salvador
2022-02-01  2:45   ` Wei Yang
2022-02-01  9:51     ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).