* [RFC PATCH 0/2] mm: Don't allocate pages on an offline node
@ 2021-12-06  3:33 Nico Pache
  2021-12-06  3:33 ` [RFC PATCH 1/2] include/linux/gfp.h: Do not allocate pages on an offlined node Nico Pache
  2021-12-06  3:33 ` [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes Nico Pache
  0 siblings, 2 replies; 37+ messages in thread
From: Nico Pache @ 2021-12-06  3:33 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm
  Cc: shakeelb, ktkhai, shy828301, guro, vbabka, vdavydov.dev, raquini

Prevent page allocations from occurring on an offlined node by adding a
check in __alloc_pages_node and by correcting the use of for_each_node when
allocating shrinker_info structures.

The first patch is more of a defensive check, while the second patch
directly relates to an issue we have found during testing.

Signed-off-by: Nico Pache <npache@redhat.com>

Nico Pache (2):
  include/linux/gfp.h: Do not allocate pages on an offlined node
  mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes

 include/linux/gfp.h | 5 ++++-
 mm/vmscan.c         | 8 ++++----
 2 files changed, 8 insertions(+), 5 deletions(-)

-- 
2.33.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC PATCH 1/2] include/linux/gfp.h: Do not allocate pages on an offlined node
  2021-12-06  3:33 [RFC PATCH 0/2] mm: Don't allocate pages on an offline node Nico Pache
@ 2021-12-06  3:33 ` Nico Pache
  2021-12-06  3:37   ` Matthew Wilcox
                     ` (2 more replies)
  2021-12-06  3:33 ` [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes Nico Pache
  1 sibling, 3 replies; 37+ messages in thread
From: Nico Pache @ 2021-12-06  3:33 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm
  Cc: shakeelb, ktkhai, shy828301, guro, vbabka, vdavydov.dev, raquini,
	Rafael Aquini

We shouldn't allocate pages on an unavailable node. Add a check for this
in __alloc_pages_node and return NULL to avoid issues further down the
callchain.

Also update the VM_WARN_ON in __alloc_pages_node, which previously skipped
the warning when __GFP_THISNODE was not set in the gfp_mask.

Co-developed-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/linux/gfp.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b976c4177299..e7e18f6d0d9d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -565,7 +565,10 @@ static inline struct page *
 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 {
 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
-	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
+	VM_WARN_ON(!node_online(nid));
+
+	if (!node_online(nid))
+		return NULL;
 
 	return __alloc_pages(gfp_mask, order, nid, NULL);
 }
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06  3:33 [RFC PATCH 0/2] mm: Don't allocate pages on an offline node Nico Pache
  2021-12-06  3:33 ` [RFC PATCH 1/2] include/linux/gfp.h: Do not allocate pages on an offlined node Nico Pache
@ 2021-12-06  3:33 ` Nico Pache
  2021-12-06  8:32   ` David Hildenbrand
                     ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Nico Pache @ 2021-12-06  3:33 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm
  Cc: shakeelb, ktkhai, shy828301, guro, vbabka, vdavydov.dev, raquini

We have run into a panic caused by a shrinker allocation being attempted
on an offlined node.

Our crash analysis has determined that the issue originates from trying
to allocate pages on an offlined node in expand_one_shrinker_info. This
function makes the incorrect assumption that we can allocate on any node.
To correct this, we make sure we only iterate over online nodes.

This assumption can lead to an incorrect address being assigned to ac->zonelist
in the following callchain:
	__alloc_pages
	-> prepare_alloc_pages
	 -> node_zonelist

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
        return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}
If the node is not online, the return of node_zonelist will evaluate to an
invalid pointer of 0x00000 + offset_of(node_zonelists) + (1|0)
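
For reference, gfp_zonelist() only picks between the fallback (0) and
no-fallback (1) zonelists, which is where the (1|0) term above comes from:

static inline int gfp_zonelist(gfp_t flags)
{
#ifdef CONFIG_NUMA
	if (unlikely(flags & __GFP_THISNODE))
		return ZONELIST_NOFALLBACK;
#endif
	return ZONELIST_FALLBACK;
}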

This address is then dereferenced further down the callchain in:
	prepare_alloc_pages
	-> first_zones_zonelist
	 -> next_zones_zonelist
	  -> zonelist_zone_idx

static inline int zonelist_zone_idx(struct zoneref *zoneref)
{
        return zoneref->zone_idx;
}

This leads the system to panic.

We also correct this behavior in alloc_shrinker_info, free_shrinker_info,
and reparent_shrinker_deferred.

Fixes: 2bfd36374edd ("mm: vmscan: consolidate shrinker_maps handling code")
Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/vmscan.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..731564b61e3f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -221,7 +221,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
 	int nid;
 	int size = map_size + defer_size;
 
-	for_each_node(nid) {
+	for_each_online_node(nid) {
 		pn = memcg->nodeinfo[nid];
 		old = shrinker_info_protected(memcg, nid);
 		/* Not yet online memcg */
@@ -256,7 +256,7 @@ void free_shrinker_info(struct mem_cgroup *memcg)
 	struct shrinker_info *info;
 	int nid;
 
-	for_each_node(nid) {
+	for_each_online_node(nid) {
 		pn = memcg->nodeinfo[nid];
 		info = rcu_dereference_protected(pn->shrinker_info, true);
 		kvfree(info);
@@ -274,7 +274,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
 	map_size = shrinker_map_size(shrinker_nr_max);
 	defer_size = shrinker_defer_size(shrinker_nr_max);
 	size = map_size + defer_size;
-	for_each_node(nid) {
+	for_each_online_node(nid) {
 		info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
 		if (!info) {
 			free_shrinker_info(memcg);
@@ -417,7 +417,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
 
 	/* Prevent from concurrent shrinker_info expand */
 	down_read(&shrinker_rwsem);
-	for_each_node(nid) {
+	for_each_online_node(nid) {
 		child_info = shrinker_info_protected(memcg, nid);
 		parent_info = shrinker_info_protected(parent, nid);
 		for (i = 0; i < shrinker_nr_max; i++) {
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 1/2] include/linux/gfp.h: Do not allocate pages on an offlined node
  2021-12-06  3:33 ` [RFC PATCH 1/2] include/linux/gfp.h: Do not allocate pages on an offlined node Nico Pache
@ 2021-12-06  3:37   ` Matthew Wilcox
  2021-12-06  8:29   ` David Hildenbrand
  2021-12-06  9:22   ` Michal Hocko
  2 siblings, 0 replies; 37+ messages in thread
From: Matthew Wilcox @ 2021-12-06  3:37 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, akpm, shakeelb, ktkhai, shy828301, guro,
	vbabka, vdavydov.dev, raquini, Rafael Aquini

On Sun, Dec 05, 2021 at 10:33:37PM -0500, Nico Pache wrote:
> We shouldn't allocate pages on a unavailable node. Add a check for this
> in __alloc_pages_node and return NULL to avoid issues further down the
> callchain.
> 
> Also update the VM_WARN_ON in __alloc_pages_node which could skip this
> warn if the gfp_mask is not GFP_THISNODE.

Why is this not also needed in __folio_alloc_node()?
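
For context, the folio counterpart in the same header currently carries the
same __GFP_THISNODE-gated warning and no offline check, roughly:

static inline
struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
{
	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
	VM_WARN_ON((gfp & __GFP_THISNODE) && !node_online(nid));

	return __folio_alloc(gfp, order, nid, NULL);
}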

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 1/2] include/linux/gfp.h: Do not allocate pages on an offlined node
  2021-12-06  3:33 ` [RFC PATCH 1/2] include/linux/gfp.h: Do not allocate pages on an offlined node Nico Pache
  2021-12-06  3:37   ` Matthew Wilcox
@ 2021-12-06  8:29   ` David Hildenbrand
  2021-12-06  9:22   ` Michal Hocko
  2 siblings, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2021-12-06  8:29 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm, akpm, Michal Hocko
  Cc: shakeelb, ktkhai, shy828301, guro, vbabka, vdavydov.dev, raquini,
	Rafael Aquini

On 06.12.21 04:33, Nico Pache wrote:
> We shouldn't allocate pages on a unavailable node. Add a check for this
> in __alloc_pages_node and return NULL to avoid issues further down the
> callchain.
> 
> Also update the VM_WARN_ON in __alloc_pages_node which could skip this
> warn if the gfp_mask is not GFP_THISNODE.
> 
> Co-developed-by: Rafael Aquini <aquini@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  include/linux/gfp.h | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index b976c4177299..e7e18f6d0d9d 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -565,7 +565,10 @@ static inline struct page *
>  __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>  {
>  	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> -	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
> +	VM_WARN_ON(!node_online(nid));
> +
> +	if (!node_online(nid))
> +		return NULL;
>  
>  	return __alloc_pages(gfp_mask, order, nid, NULL);
>  }
> 

We had a related discussion just a month ago or so, and callers should
make sure to not allocate pages on an offline node. No need for runtime
checks for each and every allocation. The question is usually: where do
callers get the random offline node from. See:

https://lkml.kernel.org/r/20211108202325.20304-1-amakhalov@vmware.com

There was also a discussion about turning this at least into
VM_WARN_ON(!node_online(nid)), however, there are valid points that this
won't really help much. We'd need a BUG_ON, but we'll already crash
properly later when de-referencing the NULL pgdat pointer.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06  3:33 ` [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes Nico Pache
@ 2021-12-06  8:32   ` David Hildenbrand
  2021-12-06  9:22   ` Michal Hocko
  2021-12-06 18:45   ` Yang Shi
  2 siblings, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2021-12-06  8:32 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm, akpm
  Cc: shakeelb, ktkhai, shy828301, guro, vbabka, vdavydov.dev, raquini

On 06.12.21 04:33, Nico Pache wrote:
> We have run into a panic caused by a shrinker allocation being attempted
> on an offlined node.
> 
> Our crash analysis has determined that the issue originates from trying
> to allocate pages on an offlined node in expand_one_shrinker_info. This
> function makes the incorrect assumption that we can allocate on any node.
> To correct this we make sure we only itterate over online nodes.
> 
> This assumption can lead to an incorrect address being assigned to ac->zonelist
> in the following callchain:
> 	__alloc_pages
> 	-> prepare_alloc_pages
> 	 -> node_zonelist
> 
> static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> {
>         return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> }
> if the node is not online the return of node_zonelist will evaluate to a
> invalid pointer of 0x00000 + offset_of(node_zonelists) + (1|0)
> 
> This address is then dereferenced further down the callchain in:
> 	prepare_alloc_pages
> 	-> first_zones_zonelist
>   	 -> next_zones_zonelist
> 	  -> zonelist_zone_idx
> 
> static inline int zonelist_zone_idx(struct zoneref *zoneref)
> {
>         return zoneref->zone_idx;
> }
> 
> Leading the system to panic.
> 
> We also correct this behavior in alloc_shrinker_info, free_shrinker_info,
> and reparent_shrinker_deferred.
> 
> Fixes: 2bfd36374edd ("mm: vmscan: consolidate shrinker_maps handling code")
> Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/vmscan.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fb9584641ac7..731564b61e3f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -221,7 +221,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>  	int nid;
>  	int size = map_size + defer_size;
>  
> -	for_each_node(nid) {
> +	for_each_online_node(nid) {
>  		pn = memcg->nodeinfo[nid];
>  		old = shrinker_info_protected(memcg, nid);
>  		/* Not yet online memcg */
> @@ -256,7 +256,7 @@ void free_shrinker_info(struct mem_cgroup *memcg)
>  	struct shrinker_info *info;
>  	int nid;
>  
> -	for_each_node(nid) {
> +	for_each_online_node(nid) {
>  		pn = memcg->nodeinfo[nid];
>  		info = rcu_dereference_protected(pn->shrinker_info, true);
>  		kvfree(info);
> @@ -274,7 +274,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>  	map_size = shrinker_map_size(shrinker_nr_max);
>  	defer_size = shrinker_defer_size(shrinker_nr_max);
>  	size = map_size + defer_size;
> -	for_each_node(nid) {
> +	for_each_online_node(nid) {
>  		info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
>  		if (!info) {
>  			free_shrinker_info(memcg);
> @@ -417,7 +417,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>  
>  	/* Prevent from concurrent shrinker_info expand */
>  	down_read(&shrinker_rwsem);
> -	for_each_node(nid) {
> +	for_each_online_node(nid) {
>  		child_info = shrinker_info_protected(memcg, nid);
>  		parent_info = shrinker_info_protected(parent, nid);
>  		for (i = 0; i < shrinker_nr_max; i++) {
> 

What happens on memory/cpu hotplug, resulting in a new node getting
onlined? Will the data structures get allocated and the data get
properly set up?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 1/2] include/linux/gfp.h: Do not allocate pages on an offlined node
  2021-12-06  3:33 ` [RFC PATCH 1/2] include/linux/gfp.h: Do not allocate pages on an offlined node Nico Pache
  2021-12-06  3:37   ` Matthew Wilcox
  2021-12-06  8:29   ` David Hildenbrand
@ 2021-12-06  9:22   ` Michal Hocko
  2021-12-07 21:24     ` Nico Pache
  2 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2021-12-06  9:22 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, akpm, shakeelb, ktkhai, shy828301, guro,
	vbabka, vdavydov.dev, raquini, Rafael Aquini

On Sun 05-12-21 22:33:37, Nico Pache wrote:
> We shouldn't allocate pages on a unavailable node. Add a check for this
> in __alloc_pages_node and return NULL to avoid issues further down the
> callchain.
> 
> Also update the VM_WARN_ON in __alloc_pages_node which could skip this
> warn if the gfp_mask is not GFP_THISNODE.

The page allocator trusts callers to know they are doing the right thing, so
that no unnecessary branches have to be implemented and the fast path stays
optimized for performance. Allocating from an !node_online node is a bug in
the caller. One that is unfortunately not really that hard to make, but also
one that is not that common.

That being said I am not really sure we want to introduce this check.

> Co-developed-by: Rafael Aquini <aquini@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  include/linux/gfp.h | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index b976c4177299..e7e18f6d0d9d 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -565,7 +565,10 @@ static inline struct page *
>  __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>  {
>  	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> -	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
> +	VM_WARN_ON(!node_online(nid));
> +
> +	if (!node_online(nid))
> +		return NULL;
>  
>  	return __alloc_pages(gfp_mask, order, nid, NULL);
>  }
> -- 
> 2.33.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06  3:33 ` [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes Nico Pache
  2021-12-06  8:32   ` David Hildenbrand
@ 2021-12-06  9:22   ` Michal Hocko
  2021-12-06  9:24     ` Michal Hocko
                       ` (2 more replies)
  2021-12-06 18:45   ` Yang Shi
  2 siblings, 3 replies; 37+ messages in thread
From: Michal Hocko @ 2021-12-06  9:22 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, akpm, shakeelb, ktkhai, shy828301, guro,
	vbabka, vdavydov.dev, raquini

On Sun 05-12-21 22:33:38, Nico Pache wrote:
> We have run into a panic caused by a shrinker allocation being attempted
> on an offlined node.
> 
> Our crash analysis has determined that the issue originates from trying
> to allocate pages on an offlined node in expand_one_shrinker_info. This
> function makes the incorrect assumption that we can allocate on any node.
> To correct this we make sure we only itterate over online nodes.
> 
> This assumption can lead to an incorrect address being assigned to ac->zonelist
> in the following callchain:
> 	__alloc_pages
> 	-> prepare_alloc_pages
> 	 -> node_zonelist
> 
> static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> {
>         return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> }
> if the node is not online the return of node_zonelist will evaluate to a
> invalid pointer of 0x00000 + offset_of(node_zonelists) + (1|0)
> 
> This address is then dereferenced further down the callchain in:
> 	prepare_alloc_pages
> 	-> first_zones_zonelist
>   	 -> next_zones_zonelist
> 	  -> zonelist_zone_idx
> 
> static inline int zonelist_zone_idx(struct zoneref *zoneref)
> {
>         return zoneref->zone_idx;
> }
> 
> Leading the system to panic.

Thanks for the analysis! Please also add an oops report so that this is
easier to search for. It would also be interesting to see specifics
about the issue. Why was the specific node !online in the first place?
What architecture was this on?

> We also correct this behavior in alloc_shrinker_info, free_shrinker_info,
> and reparent_shrinker_deferred.
> 
> Fixes: 2bfd36374edd ("mm: vmscan: consolidate shrinker_maps handling code")
> Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")

Normally I would split the fix, as it is fixing two issues: one introduced
in 4.19, the other in 5.13.

> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/vmscan.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fb9584641ac7..731564b61e3f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -221,7 +221,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>  	int nid;
>  	int size = map_size + defer_size;
>  
> -	for_each_node(nid) {
> +	for_each_online_node(nid) {
>  		pn = memcg->nodeinfo[nid];
>  		old = shrinker_info_protected(memcg, nid);
>  		/* Not yet online memcg */
> @@ -256,7 +256,7 @@ void free_shrinker_info(struct mem_cgroup *memcg)
>  	struct shrinker_info *info;
>  	int nid;
>  
> -	for_each_node(nid) {
> +	for_each_online_node(nid) {
>  		pn = memcg->nodeinfo[nid];
>  		info = rcu_dereference_protected(pn->shrinker_info, true);
>  		kvfree(info);
> @@ -274,7 +274,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>  	map_size = shrinker_map_size(shrinker_nr_max);
>  	defer_size = shrinker_defer_size(shrinker_nr_max);
>  	size = map_size + defer_size;
> -	for_each_node(nid) {
> +	for_each_online_node(nid) {
>  		info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
>  		if (!info) {
>  			free_shrinker_info(memcg);
> @@ -417,7 +417,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>  
>  	/* Prevent from concurrent shrinker_info expand */
>  	down_read(&shrinker_rwsem);
> -	for_each_node(nid) {
> +	for_each_online_node(nid) {
>  		child_info = shrinker_info_protected(memcg, nid);
>  		parent_info = shrinker_info_protected(parent, nid);
>  		for (i = 0; i < shrinker_nr_max; i++) {
> -- 
> 2.33.1

This doesn't seem complete. Slab shrinkers are used in the reclaim
context. Previously offline nodes could be onlined later and this would
lead to a NULL ptr dereference because there is no hook to allocate new
shrinker infos. This would also be really impractical because this would
have to update all existing memcgs...

To be completely honest I am not really sure this is a practical problem
because some architectures allocate (aka make online) all possible nodes
reported by the platform. There are major inconsistencies there. Maybe
that should be unified, so that problems like this one do not really
have to add complexity to the code.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06  9:22   ` Michal Hocko
@ 2021-12-06  9:24     ` Michal Hocko
  2021-12-06 10:45     ` David Hildenbrand
  2021-12-07 21:34     ` Nico Pache
  2 siblings, 0 replies; 37+ messages in thread
From: Michal Hocko @ 2021-12-06  9:24 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, akpm, shakeelb, ktkhai, shy828301, guro,
	vbabka, vdavydov.dev, raquini, David Hildenbrand

[Cc David. I have only now noticed he has replied to this thread already
 pointing out the offline->online case]

On Mon 06-12-21 10:23:00, Michal Hocko wrote:
> On Sun 05-12-21 22:33:38, Nico Pache wrote:
> > We have run into a panic caused by a shrinker allocation being attempted
> > on an offlined node.
> > 
> > Our crash analysis has determined that the issue originates from trying
> > to allocate pages on an offlined node in expand_one_shrinker_info. This
> > function makes the incorrect assumption that we can allocate on any node.
> > To correct this we make sure we only itterate over online nodes.
> > 
> > This assumption can lead to an incorrect address being assigned to ac->zonelist
> > in the following callchain:
> > 	__alloc_pages
> > 	-> prepare_alloc_pages
> > 	 -> node_zonelist
> > 
> > static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> > {
> >         return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> > }
> > if the node is not online the return of node_zonelist will evaluate to a
> > invalid pointer of 0x00000 + offset_of(node_zonelists) + (1|0)
> > 
> > This address is then dereferenced further down the callchain in:
> > 	prepare_alloc_pages
> > 	-> first_zones_zonelist
> >   	 -> next_zones_zonelist
> > 	  -> zonelist_zone_idx
> > 
> > static inline int zonelist_zone_idx(struct zoneref *zoneref)
> > {
> >         return zoneref->zone_idx;
> > }
> > 
> > Leading the system to panic.
> 
> Thanks for the analysis! Please also add an oops report so that this is
> easier to search for. It would be also interesting to see specifics
> about the issue. Why was the specific node !online in the first place?
> What architecture was this on?
> 
> > We also correct this behavior in alloc_shrinker_info, free_shrinker_info,
> > and reparent_shrinker_deferred.
> > 
> > Fixes: 2bfd36374edd ("mm: vmscan: consolidate shrinker_maps handling code")
> > Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
> 
> Normally I would split the fix as it is fixing two issues one introduced
> in 4.19 the other in 5.13.
> 
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/vmscan.c | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index fb9584641ac7..731564b61e3f 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -221,7 +221,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> >  	int nid;
> >  	int size = map_size + defer_size;
> >  
> > -	for_each_node(nid) {
> > +	for_each_online_node(nid) {
> >  		pn = memcg->nodeinfo[nid];
> >  		old = shrinker_info_protected(memcg, nid);
> >  		/* Not yet online memcg */
> > @@ -256,7 +256,7 @@ void free_shrinker_info(struct mem_cgroup *memcg)
> >  	struct shrinker_info *info;
> >  	int nid;
> >  
> > -	for_each_node(nid) {
> > +	for_each_online_node(nid) {
> >  		pn = memcg->nodeinfo[nid];
> >  		info = rcu_dereference_protected(pn->shrinker_info, true);
> >  		kvfree(info);
> > @@ -274,7 +274,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> >  	map_size = shrinker_map_size(shrinker_nr_max);
> >  	defer_size = shrinker_defer_size(shrinker_nr_max);
> >  	size = map_size + defer_size;
> > -	for_each_node(nid) {
> > +	for_each_online_node(nid) {
> >  		info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
> >  		if (!info) {
> >  			free_shrinker_info(memcg);
> > @@ -417,7 +417,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
> >  
> >  	/* Prevent from concurrent shrinker_info expand */
> >  	down_read(&shrinker_rwsem);
> > -	for_each_node(nid) {
> > +	for_each_online_node(nid) {
> >  		child_info = shrinker_info_protected(memcg, nid);
> >  		parent_info = shrinker_info_protected(parent, nid);
> >  		for (i = 0; i < shrinker_nr_max; i++) {
> > -- 
> > 2.33.1
> 
> This doesn't seen complete. Slab shrinkers are used in the reclaim
> context. Previously offline nodes could be onlined later and this would
> lead to NULL ptr because there is no hook to allocate new shrinker
> infos. This would be also really impractical because this would have to
> update all existing memcgs...
> 
> To be completely honest I am not really sure this is a practical problem
> because some architectures allocate (aka make online) all possible nodes
> reported by the platform. There are major inconsistencies there. Maybe
> that should be unified, so that problems like this one do not really
> have to add a complexity to the code.
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06  9:22   ` Michal Hocko
  2021-12-06  9:24     ` Michal Hocko
@ 2021-12-06 10:45     ` David Hildenbrand
  2021-12-06 10:54       ` Michal Hocko
                         ` (2 more replies)
  2021-12-07 21:34     ` Nico Pache
  2 siblings, 3 replies; 37+ messages in thread
From: David Hildenbrand @ 2021-12-06 10:45 UTC (permalink / raw)
  To: Michal Hocko, Nico Pache
  Cc: linux-kernel, linux-mm, akpm, shakeelb, ktkhai, shy828301, guro,
	vbabka, vdavydov.dev, raquini

> This doesn't seen complete. Slab shrinkers are used in the reclaim
> context. Previously offline nodes could be onlined later and this would
> lead to NULL ptr because there is no hook to allocate new shrinker
> infos. This would be also really impractical because this would have to
> update all existing memcgs...

Instead of going through the trouble of updating...

...  maybe just keep for_each_node() and check if the target node is
offline. If it's offline, just allocate from the first online node.
After all, we're not using __GFP_THISNODE, so there are no guarantees
either way ...
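
A minimal sketch of that idea in expand_one_shrinker_info() (just to
illustrate the fallback, not a tested patch):

	for_each_node(nid) {
		/* offline nodes have no pgdat/zonelist yet, borrow one */
		int alloc_nid = node_online(nid) ? nid : first_online_node;

		...
		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, alloc_nid);
		...
	}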

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 10:45     ` David Hildenbrand
@ 2021-12-06 10:54       ` Michal Hocko
  2021-12-06 11:00         ` David Hildenbrand
  2021-12-06 13:19       ` Kirill Tkhai
  2021-12-07 21:40       ` Nico Pache
  2 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2021-12-06 10:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nico Pache, linux-kernel, linux-mm, akpm, shakeelb, ktkhai,
	shy828301, guro, vbabka, vdavydov.dev, raquini

On Mon 06-12-21 11:45:54, David Hildenbrand wrote:
> > This doesn't seen complete. Slab shrinkers are used in the reclaim
> > context. Previously offline nodes could be onlined later and this would
> > lead to NULL ptr because there is no hook to allocate new shrinker
> > infos. This would be also really impractical because this would have to
> > update all existing memcgs...
> 
> Instead of going through the trouble of updating...
> 
> ...  maybe just keep for_each_node() and check if the target node is
> offline. If it's offline, just allocate from the first online node.
> After all, we're not using __GFP_THISNODE, so there are no guarantees
> either way ...

This looks like another way to paper over a deeper underlying problem
IMHO. Fundamentally we have a problem that some pgdats are not allocated
and that causes a lot of headaches. Not to mention that node_online
is just adding to the confusion because it doesn't really tell you anything
about the logical state of the node.

I think we really should get rid of this approach rather than play
whack-a-mole. We should really drop all notion of node_online and
instead allocate pgdat for each possible node. Arch specific code should
make sure that zone lists are properly initialized.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 10:54       ` Michal Hocko
@ 2021-12-06 11:00         ` David Hildenbrand
  2021-12-06 11:22           ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2021-12-06 11:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Nico Pache, linux-kernel, linux-mm, akpm, shakeelb, ktkhai,
	shy828301, guro, vbabka, vdavydov.dev, raquini

On 06.12.21 11:54, Michal Hocko wrote:
> On Mon 06-12-21 11:45:54, David Hildenbrand wrote:
>>> This doesn't seen complete. Slab shrinkers are used in the reclaim
>>> context. Previously offline nodes could be onlined later and this would
>>> lead to NULL ptr because there is no hook to allocate new shrinker
>>> infos. This would be also really impractical because this would have to
>>> update all existing memcgs...
>>
>> Instead of going through the trouble of updating...
>>
>> ...  maybe just keep for_each_node() and check if the target node is
>> offline. If it's offline, just allocate from the first online node.
>> After all, we're not using __GFP_THISNODE, so there are no guarantees
>> either way ...
> 
> This looks like another way to paper over a deeper underlying problem
> IMHO. Fundamentally we have a problem that some pgdata are not allocated
> and that causes a lot of headache. Not to mention that node_online
> is just adding to a confusion because it doesn't really tell anything
> about the logical state of the node.
> 
> I think we really should get rid of this approach rather than play a
> whack-a-mole. We should really drop all notion of node_online and
> instead allocate pgdat for each possible node. Arch specific code should
> make sure that zone lists are properly initialized.
> 

I'm not sure whack-a-mole really applies here. It's just the
for_each_node_* calls that need love. In other cases, we shouldn't
really stumble over an offline node.

Someone deliberately decided to use "for_each_node()" instead of
for_each_online_node() without taking care of online vs. offline
semantics. That's just a BUG and needs fixing IMHO.

After all, we do need patches to backport, and reworking pgdat init isn't
really something feasible for that I think. And I have heard of PPC systems
that can hotplug thousands of nodes ...

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 11:00         ` David Hildenbrand
@ 2021-12-06 11:22           ` Michal Hocko
  2021-12-06 12:43             ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2021-12-06 11:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nico Pache, linux-kernel, linux-mm, akpm, shakeelb, ktkhai,
	shy828301, guro, vbabka, vdavydov.dev, raquini

On Mon 06-12-21 12:00:50, David Hildenbrand wrote:
> On 06.12.21 11:54, Michal Hocko wrote:
> > On Mon 06-12-21 11:45:54, David Hildenbrand wrote:
> >>> This doesn't seen complete. Slab shrinkers are used in the reclaim
> >>> context. Previously offline nodes could be onlined later and this would
> >>> lead to NULL ptr because there is no hook to allocate new shrinker
> >>> infos. This would be also really impractical because this would have to
> >>> update all existing memcgs...
> >>
> >> Instead of going through the trouble of updating...
> >>
> >> ...  maybe just keep for_each_node() and check if the target node is
> >> offline. If it's offline, just allocate from the first online node.
> >> After all, we're not using __GFP_THISNODE, so there are no guarantees
> >> either way ...
> > 
> > This looks like another way to paper over a deeper underlying problem
> > IMHO. Fundamentally we have a problem that some pgdata are not allocated
> > and that causes a lot of headache. Not to mention that node_online
> > is just adding to a confusion because it doesn't really tell anything
> > about the logical state of the node.
> > 
> > I think we really should get rid of this approach rather than play a
> > whack-a-mole. We should really drop all notion of node_online and
> > instead allocate pgdat for each possible node. Arch specific code should
> > make sure that zone lists are properly initialized.
> > 
> 
> I'm not sure if it's rally whack-a-mole really applies. It's just the
> for_each_node_* calls that need love. In other cases, we shouldn't
> really stumble over an offline node.
> 
> Someone deliberately decided to use "for_each_node()" instead of
> for_each_online_node() without taking care of online vs. offline
> semantics. That's just a BUG and needs fixing IMHO.

Well, I would argue this is too much to ask of the API users. It is
also a trap that is just too easy to fall into. Turning for_each_node
into for_each_online_node will not solve the problem just by itself.
As you have pointed out, an offline node can become online, and without a
hotplug notifier there is no way for_each_online_node would be a proper
fit for anything that allocates memory only for online nodes. See the
trap? People think they are doing the right thing but they are not, in
fact.

Now practically speaking, a !node_online node (note I am intentionally
avoiding saying offline and online as that has a completely different
semantic) shouldn't really happen on some architectures. x86 should
allocate a pgdat for each possible node. I do not know what the
architecture was in this case but we already have another report for x86
that remains unexplained.

My experience so far tells me that we have two major problems in this
area: arch-inconsistent behavior and really confusing APIs. We should be
addressing those rather than adding more hacks.

> After all, we do need patches to backport, reworking pgdat init isn't
> really something feasible for that I think. And I heard of PPC that can
> hotplug thousands of nodes ...

If this turns out to be a big problem then we can think of putting pgdat
on a diet.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 11:22           ` Michal Hocko
@ 2021-12-06 12:43             ` David Hildenbrand
  2021-12-06 13:06               ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2021-12-06 12:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Nico Pache, linux-kernel, linux-mm, akpm, shakeelb, ktkhai,
	shy828301, guro, vbabka, vdavydov.dev, raquini

>> I'm not sure if it's rally whack-a-mole really applies. It's just the
>> for_each_node_* calls that need love. In other cases, we shouldn't
>> really stumble over an offline node.
>>
>> Someone deliberately decided to use "for_each_node()" instead of
>> for_each_online_node() without taking care of online vs. offline
>> semantics. That's just a BUG and needs fixing IMHO.
> 
> Well, I would argue this is too much to ask for the API users. It is
> also a trap that is just too easy to fall into. Turning for_each_node
> into for_each_online_node will not solve the problem just by itself.
> As you have pointed out an offline node can become online and without a
> hotplug notifier there is no way for_each_online_node would be a proper
> fit for anything that allocates memory only for online nodes. See the
> trap? People think they are doing the right thing but they are not in
> fact.

I agree that it requires care in the caller. And some callers don't seem
to know that the nid they hold in their hand is garbage and should be
used with care. And this is pretty much undocumented.

> 
> Now practically speaking !node_online should not apear node_online (note
> I am attentionally avoiding to say offline and online as that has a
> completely different semantic) shouldn't really happen for some
> architectures. x86 should allocate pgdat for each possible node. I do
> not know what was the architecture in this case but we already have
> another report for x86 that remains unexplained.

So we'd allocate the pgdat although all we want is just a zonelist. The
obvious alternative is to implement the fallback where reasonable -- for
example, in the page allocator. It knows the fallback order:
build_zonelists(). That's pretty much all we need the preferred_nid for.

So just making prepare_alloc_pages()/node_zonelist() deal with a missing
pgdat could make sense as well. Something like:


diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b976c4177299..2d2649e78766 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -508,9 +508,14 @@ static inline int gfp_zonelist(gfp_t flags)
  *
  * For the case of non-NUMA systems the NODE_DATA() gets optimized to
  * &contig_page_data at compile-time.
+ *
+ * If the node does not have a pgdat yet, returns the zonelist of the
+ * first online node.
  */
 static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
 {
+       if (unlikely(!NODE_DATA(nid)))
+               nid = first_online_node;
        return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
 }


But of course, there might be value in a proper node-aware fallback list
as we have in build_zonelists() -- but it remains questionable if the
difference for these corner cases would be relevant in practice.

Further, if we could have thousands of nodes, we'd have to update each
and every one when building zone lists ...

Removing the hotadd_new_pgdat() stuff does sound appealing, though.

-- 
Thanks,

David / dhildenb


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 12:43             ` David Hildenbrand
@ 2021-12-06 13:06               ` Michal Hocko
  2021-12-06 13:47                 ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2021-12-06 13:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nico Pache, linux-kernel, linux-mm, akpm, shakeelb, ktkhai,
	shy828301, guro, vbabka, vdavydov.dev, raquini

On Mon 06-12-21 13:43:27, David Hildenbrand wrote:
[...]
> > Now practically speaking !node_online should not apear node_online (note
> > I am attentionally avoiding to say offline and online as that has a
> > completely different semantic) shouldn't really happen for some
> > architectures. x86 should allocate pgdat for each possible node. I do
> > not know what was the architecture in this case but we already have
> > another report for x86 that remains unexplained.
> 
> So we'd allocate the pgdat although all we want is just a zonelist. The
> obvious alternative is to implement the fallback where reasonable -- for
> example, in the page allocator. It knows the fallback order:
> build_zonelists(). That's pretty much all we need the preferred_nid for.
> 
> So just making prepare_alloc_pages()/node_zonelist() deal with a missing
> pgdat could make sense as well. Something like:
> 
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index b976c4177299..2d2649e78766 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -508,9 +508,14 @@ static inline int gfp_zonelist(gfp_t flags)
>   *
>   * For the case of non-NUMA systems the NODE_DATA() gets optimized to
>   * &contig_page_data at compile-time.
> + *
> + * If the node does not have a pgdat yet, returns the zonelist of the
> + * first online node.
>   */
>  static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>  {
> +       if (unlikely(!NODE_DATA(nid)))
> +               nid = first_online_node;
>         return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
>  }

This is certainly possible. But it a) adds a branch to the hot path and
b) it doesn't solve any other potential dereference of a garbage node.
 
> But of course, there might be value in a proper node-aware fallback list
> as we have in build_zonelists() -- but it remains questionable if the
> difference for these corner cases would be relevant in practice.

Only the platform knows the proper node topology and that includes
memoryless nodes. So they should be setting up the nodes properly and we
shouldn't be dealing with this in the allocator nor in any other code.

> Further, if we could have thousands of nodes, we'd have to update each
> and every one when building zone lists ...

Why would that be a practical problem?

> Removing the hotadd_new_pgdat() stuff does sound appealing, though.

Yes, our hotplug code could be simplified as well. All we really need is
for the arch code to initialize all the possible nodes.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 10:45     ` David Hildenbrand
  2021-12-06 10:54       ` Michal Hocko
@ 2021-12-06 13:19       ` Kirill Tkhai
  2021-12-06 13:24         ` Michal Hocko
                           ` (2 more replies)
  2021-12-07 21:40       ` Nico Pache
  2 siblings, 3 replies; 37+ messages in thread
From: Kirill Tkhai @ 2021-12-06 13:19 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko, Nico Pache
  Cc: linux-kernel, linux-mm, akpm, shakeelb, shy828301, guro, vbabka,
	vdavydov.dev, raquini

On 06.12.2021 13:45, David Hildenbrand wrote:
>> This doesn't seen complete. Slab shrinkers are used in the reclaim
>> context. Previously offline nodes could be onlined later and this would
>> lead to NULL ptr because there is no hook to allocate new shrinker
>> infos. This would be also really impractical because this would have to
>> update all existing memcgs...
> 
> Instead of going through the trouble of updating...
> 
> ...  maybe just keep for_each_node() and check if the target node is
> offline. If it's offline, just allocate from the first online node.
> After all, we're not using __GFP_THISNODE, so there are no guarantees
> either way ...

Hm, can't we add shrinker maps allocation to __try_online_node() in addition
to this patch? 
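
Something like the following in vmscan.c, called from __try_online_node()
(the helper name and locking here are just a sketch):

int alloc_shrinker_info_node(int nid)
{
	struct mem_cgroup *memcg;
	struct shrinker_info *info;
	int defer_size = shrinker_defer_size(shrinker_nr_max);
	int size = shrinker_map_size(shrinker_nr_max) + defer_size;

	down_write(&shrinker_rwsem);
	memcg = mem_cgroup_iter(NULL, NULL, NULL);
	do {
		/* allocate the per-node shrinker_info for every existing memcg */
		info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
		if (!info) {
			mem_cgroup_iter_break(NULL, memcg);
			up_write(&shrinker_rwsem);
			return -ENOMEM;
		}
		info->nr_deferred = (atomic_long_t *)(info + 1);
		info->map = (void *)info->nr_deferred + defer_size;
		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
	up_write(&shrinker_rwsem);
	return 0;
}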

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 13:19       ` Kirill Tkhai
@ 2021-12-06 13:24         ` Michal Hocko
  2021-12-08 19:00           ` Nico Pache
  2021-12-06 18:42         ` Yang Shi
  2021-12-07 21:45         ` Nico Pache
  2 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2021-12-06 13:24 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: David Hildenbrand, Nico Pache, linux-kernel, linux-mm, akpm,
	shakeelb, shy828301, guro, vbabka, vdavydov.dev, raquini

On Mon 06-12-21 16:19:12, Kirill Tkhai wrote:
> On 06.12.2021 13:45, David Hildenbrand wrote:
> >> This doesn't seen complete. Slab shrinkers are used in the reclaim
> >> context. Previously offline nodes could be onlined later and this would
> >> lead to NULL ptr because there is no hook to allocate new shrinker
> >> infos. This would be also really impractical because this would have to
> >> update all existing memcgs...
> > 
> > Instead of going through the trouble of updating...
> > 
> > ...  maybe just keep for_each_node() and check if the target node is
> > offline. If it's offline, just allocate from the first online node.
> > After all, we're not using __GFP_THISNODE, so there are no guarantees
> > either way ...
> 
> Hm, can't we add shrinker maps allocation to __try_online_node() in addition
> to this patch? 

Either that or through a hotplug notifier (which would be a better
solution). But allocating a new shrinker map for each memcg would have
to be done, as has been mentioned earlier.
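
Roughly, assuming a helper along the lines of the alloc_shrinker_info_node()
sketched earlier in the thread (notifier shape only, untested):

static int shrinker_info_memory_callback(struct notifier_block *nb,
					 unsigned long action, void *arg)
{
	struct memory_notify *mn = arg;
	int ret = 0;

	/* a node is about to gain memory for the first time */
	if (action == MEM_GOING_ONLINE && mn->status_change_nid >= 0)
		ret = alloc_shrinker_info_node(mn->status_change_nid);

	return notifier_from_errno(ret);
}

	/* registered once at init time */
	hotplug_memory_notifier(shrinker_info_memory_callback, 0);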

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 13:06               ` Michal Hocko
@ 2021-12-06 13:47                 ` David Hildenbrand
  2021-12-06 14:06                   ` Michal Hocko
  2021-12-06 14:15                   ` Michal Hocko
  0 siblings, 2 replies; 37+ messages in thread
From: David Hildenbrand @ 2021-12-06 13:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Nico Pache, linux-kernel, linux-mm, akpm, shakeelb, ktkhai,
	shy828301, guro, vbabka, vdavydov.dev, raquini

On 06.12.21 14:06, Michal Hocko wrote:
> On Mon 06-12-21 13:43:27, David Hildenbrand wrote:
> [...]
>>> Now practically speaking !node_online should not apear node_online (note
>>> I am attentionally avoiding to say offline and online as that has a
>>> completely different semantic) shouldn't really happen for some
>>> architectures. x86 should allocate pgdat for each possible node. I do
>>> not know what was the architecture in this case but we already have
>>> another report for x86 that remains unexplained.
>>
>> So we'd allocate the pgdat although all we want is just a zonelist. The
>> obvious alternative is to implement the fallback where reasonable -- for
>> example, in the page allocator. It knows the fallback order:
>> build_zonelists(). That's pretty much all we need the preferred_nid for.
>>
>> So just making prepare_alloc_pages()/node_zonelist() deal with a missing
>> pgdat could make sense as well. Something like:
>>
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index b976c4177299..2d2649e78766 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -508,9 +508,14 @@ static inline int gfp_zonelist(gfp_t flags)
>>   *
>>   * For the case of non-NUMA systems the NODE_DATA() gets optimized to
>>   * &contig_page_data at compile-time.
>> + *
>> + * If the node does not have a pgdat yet, returns the zonelist of the
>> + * first online node.
>>   */
>>  static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>>  {
>> +       if (unlikely(!NODE_DATA(nid)))
>> +               nid = first_online_node;
>>         return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
>>  }
> 
> This is certainly possible. But it a) adds a branch to the hotpath and
> b) it doesn't solve any other potential dereference of garbage node.

I don't think a) is a problem, but it's easy to measure. Agreed on b),
however, the page allocator has been the most prominent source of error
reports for this.

>  
>> But of course, there might be value in a proper node-aware fallback list
>> as we have in build_zonelists() -- but it remains questionable if the
>> difference for these corner cases would be relevant in practice.
> 
> Only the platform knows the proper node topology and that includes
> memory less nodes. So they should be setting up a node properly and we
> shouldn't be dealing with this at the allocator nor any other code.

I *think* there are cases where the topology of a new node is only known
once it actually gets used. For example, I remember talking to CXL and
there are ideas to have a pool of possible nodes, which can get used
dynamically for CXL memory. Of course, some kind of reconfiguration
could be imaginable.

> 
>> Further, if we could have thousands of nodes, we'd have to update each
>> and every one when building zone lists ...
> 
> Why would that be a practical problem?

We'll need at least

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5952749ad40..e5d958abc7cc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6382,7 +6382,7 @@ static void __build_all_zonelists(void *data)
        if (self && !node_online(self->node_id)) {
                build_zonelists(self);
        } else {
-               for_each_online_node(nid) {
+               for_each_node(nid) {
                        pg_data_t *pgdat = NODE_DATA(nid);

                        build_zonelists(pgdat);


But there might be more missing. Onlining a new zone will get more
expensive in setups with a lot of possible nodes (x86-64 shouldn't
really be an issue in that regard).

If we want stable backports, we'll want something simple upfront.


-- 
Thanks,

David / dhildenb


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 13:47                 ` David Hildenbrand
@ 2021-12-06 14:06                   ` Michal Hocko
  2021-12-06 14:08                     ` David Hildenbrand
  2021-12-06 14:15                   ` Michal Hocko
  1 sibling, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2021-12-06 14:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nico Pache, linux-kernel, linux-mm, akpm, shakeelb, ktkhai,
	shy828301, guro, vbabka, vdavydov.dev, raquini

On Mon 06-12-21 14:47:13, David Hildenbrand wrote:
> On 06.12.21 14:06, Michal Hocko wrote:
> > On Mon 06-12-21 13:43:27, David Hildenbrand wrote:
> > [...]
> >>> Now practically speaking !node_online should not apear node_online (note
> >>> I am attentionally avoiding to say offline and online as that has a
> >>> completely different semantic) shouldn't really happen for some
> >>> architectures. x86 should allocate pgdat for each possible node. I do
> >>> not know what was the architecture in this case but we already have
> >>> another report for x86 that remains unexplained.
> >>
> >> So we'd allocate the pgdat although all we want is just a zonelist. The
> >> obvious alternative is to implement the fallback where reasonable -- for
> >> example, in the page allocator. It knows the fallback order:
> >> build_zonelists(). That's pretty much all we need the preferred_nid for.
> >>
> >> So just making prepare_alloc_pages()/node_zonelist() deal with a missing
> >> pgdat could make sense as well. Something like:
> >>
> >>
> >> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> >> index b976c4177299..2d2649e78766 100644
> >> --- a/include/linux/gfp.h
> >> +++ b/include/linux/gfp.h
> >> @@ -508,9 +508,14 @@ static inline int gfp_zonelist(gfp_t flags)
> >>   *
> >>   * For the case of non-NUMA systems the NODE_DATA() gets optimized to
> >>   * &contig_page_data at compile-time.
> >> + *
> >> + * If the node does not have a pgdat yet, returns the zonelist of the
> >> + * first online node.
> >>   */
> >>  static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> >>  {
> >> +       if (unlikely(!NODE_DATA(nid)))
> >> +               nid = first_online_node;
> >>         return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> >>  }
> > 
> > This is certainly possible. But it a) adds a branch to the hotpath and
> > b) it doesn't solve any other potential dereference of garbage node.
> 
> I don't think a) is  a problem but it's easy to measure. Agreed to b),
> however, the page allocator has been the most prominent source of error
> reports for this.
> 
> >  
> >> But of course, there might be value in a proper node-aware fallback list
> >> as we have in build_zonelists() -- but it remains questionable if the
> >> difference for these corner cases would be relevant in practice.
> > 
> > Only the platform knows the proper node topology and that includes
> > memory less nodes. So they should be setting up a node properly and we
> > shouldn't be dealing with this at the allocator nor any other code.
> 
> I *think* there are cases where the topology of a new node is only know
> once it actually gets used. For example, I remember talking to CXL and
> there are ideas to have a pool of possible nodes, which can get used
> dynamically for CXL memory. Of course, some kind of reconfiguration
> could be imaginable.

I am not really familiar with CXL, but if a dynamic reconfiguration is
needed then this would only require updating the zonelist, which should
be a trivial thing to do. The code using the node doesn't really have to
be aware of that as the pgdat will stay unchanged.

> >> Further, if we could have thousands of nodes, we'd have to update each
> >> and every one when building zone lists ...
> > 
> > Why would that be a practical problem?
> 
> We'll need at least
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c5952749ad40..e5d958abc7cc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6382,7 +6382,7 @@ static void __build_all_zonelists(void *data)
>         if (self && !node_online(self->node_id)) {
>                 build_zonelists(self);
>         } else {
> -               for_each_online_node(nid) {
> +               for_each_node(nid) {
>                         pg_data_t *pgdat = NODE_DATA(nid);
> 
>                         build_zonelists(pgdat);

If we do have a preallocated pgdat for all possible nodes, that
for_each_online_node doesn't really need to exist.
 
> But there might be more missing. Onlining a new zone will get more
> expensive in setups with a lot of possible nodes (x86-64 shouldn't
> really be an issue in that regard).

Honestly, I am not really concerned about platforms with too many nodes
without any memory. If they want to shoot themselves in the foot then that's
their choice. We can optimize for those if they ever prove to be standard.
 
> If we want stable backports, we'll want something simple upfront.

For stable backports I would be fine with doing your NODE_DATA check in
the allocator. In upstream I think we should be aiming for a more robust
solution that is also easier to maintain further down the line. Even if
that is an investment at this moment because the initialization code is
a mess.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 14:06                   ` Michal Hocko
@ 2021-12-06 14:08                     ` David Hildenbrand
  2021-12-06 14:21                       ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2021-12-06 14:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Nico Pache, linux-kernel, linux-mm, akpm, shakeelb, ktkhai,
	shy828301, guro, vbabka, vdavydov.dev, raquini


>> But there might be more missing. Onlining a new zone will get more
>> expensive in setups with a lot of possible nodes (x86-64 shouldn't
>> really be an issue in that regard).
> 
> Honestly, I am not really concerned by platforms with too many nodes
> without any memory. If they want to shoot their feet then that's their
> choice. We can optimize for those if they ever prove to be standar.
>  
>> If we want stable backports, we'll want something simple upfront.
> 
> For stable backports I would be fine by doing your NODE_DATA check in
> the allocator. In upstream I think we should be aiming for a more robust
> solution that is also easier to maintain further down the line. Even if
> that is an investment at this momemnt because the initialization code is
> a mess.
> 

Agreed. I would be curious *why* we decided to dynamically allocate the
pgdat. Is this just a historical coincidence or was there a real reason
not to allocate it for all possible nodes during boot?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 13:47                 ` David Hildenbrand
  2021-12-06 14:06                   ` Michal Hocko
@ 2021-12-06 14:15                   ` Michal Hocko
  1 sibling, 0 replies; 37+ messages in thread
From: Michal Hocko @ 2021-12-06 14:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nico Pache, linux-kernel, linux-mm, akpm, shakeelb, ktkhai,
	shy828301, guro, vbabka, vdavydov.dev, raquini

On Mon 06-12-21 14:47:13, David Hildenbrand wrote:
> On 06.12.21 14:06, Michal Hocko wrote:
[...]
> > This is certainly possible. But it a) adds a branch to the hotpath and
> > b) it doesn't solve any other potential dereference of garbage node.
> 
> I don't think a) is  a problem but it's easy to measure.

Let me just clarify on this one. This particular branch will indeed be
hard to tie to any performance changes. On a single call of the
allocator it is nothing. But it is a condition that will be evaluated by
every caller while it won't ever trigger on most systems and workloads.

In other words, it will cause a lot of wasted cpu cycles long term. A
more expensive one-time initialization is worth it in order to keep the
allocator fast path lighter IMHO.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 14:08                     ` David Hildenbrand
@ 2021-12-06 14:21                       ` Michal Hocko
  2021-12-06 14:30                         ` Vlastimil Babka
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2021-12-06 14:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nico Pache, linux-kernel, linux-mm, akpm, shakeelb, ktkhai,
	shy828301, guro, vbabka, vdavydov.dev, raquini

On Mon 06-12-21 15:08:10, David Hildenbrand wrote:
> 
> >> But there might be more missing. Onlining a new zone will get more
> >> expensive in setups with a lot of possible nodes (x86-64 shouldn't
> >> really be an issue in that regard).
> > 
> > Honestly, I am not really concerned by platforms with too many nodes
> > without any memory. If they want to shoot their feet then that's their
> > choice. We can optimize for those if they ever prove to be standard.
> >  
> >> If we want stable backports, we'll want something simple upfront.
> > 
> > For stable backports I would be fine by doing your NODE_DATA check in
> > the allocator. In upstream I think we should be aiming for a more robust
> > solution that is also easier to maintain further down the line. Even if
> > that is an investment at this moment because the initialization code is
> > a mess.
> > 
> 
> Agreed. I would be curious *why* we decided to dynamically allocate the
> pgdat. Is this just a historical coincidence or was there a real reason to
> not allocate it for all possible nodes during boot?

I don't know but if I was to guess the most likely explanation would be
that the numa init code was in a similar order as now and it was easier
to simply allocate a pgdat when a new one was onlined.
9af3c2dea3a3 ("[PATCH] pgdat allocation for new node add (call pgdat allocation)")
doesn't really tell much.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 14:21                       ` Michal Hocko
@ 2021-12-06 14:30                         ` Vlastimil Babka
  2021-12-06 14:53                           ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2021-12-06 14:30 UTC (permalink / raw)
  To: Michal Hocko, David Hildenbrand
  Cc: Nico Pache, linux-kernel, linux-mm, akpm, shakeelb, ktkhai,
	shy828301, guro, vdavydov.dev, raquini

On 12/6/21 15:21, Michal Hocko wrote:
> On Mon 06-12-21 15:08:10, David Hildenbrand wrote:
>> 
>> >> But there might be more missing. Onlining a new zone will get more
>> >> expensive in setups with a lot of possible nodes (x86-64 shouldn't
>> >> really be an issue in that regard).
>> > 
>> > Honestly, I am not really concerned by platforms with too many nodes
>> > without any memory. If they want to shoot their feet then that's their
>> > choice. We can optimize for those if they ever prove to be standard.
>> >  
>> >> If we want stable backports, we'll want something simple upfront.
>> > 
>> > For stable backports I would be fine by doing your NODE_DATA check in
>> > the allocator. In upstream I think we should be aiming for a more robust
>> > solution that is also easier to maintain further down the line. Even if
>> > that is an investment at this moment because the initialization code is
>> > a mess.
>> > 
>> 
>> Agreed. I would be curious *why* we decided to dynamically allocate the
>> pgdat. Is this just a historical coincidence or was there a real reason to
>> not allocate it for all possible nodes during boot?
> 
> I don't know but if I was to guess the most likely explanation would be
> that the numa init code was in a similar order as now and it was easier
> to simply allocate a pgdat when a new one was onlined.
> 9af3c2dea3a3 ("[PATCH] pgdat allocation for new node add (call pgdat allocation)")
> doesn't really tell much.

I don't know if that's true for pgdat specifically, but generally IMHO the
advantages of allocating during/after online, instead of for each possible node, are
- memory savings when some possible node is actually never online
- at least in some cases, the allocations can be local to the node in
question, where the advantages are
  - faster access
  - less memory occupied on nodes that are earlier online, especially node 0

So while the approach of allocate on boot for all possible nodes instead of
just online nodes has advantages of being generally safer and simpler (no
memory hotplug callbacks etc), we should also be careful not to overdo this
approach so we don't end up with Node 0 memory filled with structures used
for nodes 1-X that are just onlined later. I imagine that could be a problem
even for "sane" archs that don't have tons of possible, but offline nodes.

Concretely, pgdat should probably be fine, but things like all shrinkers?
Maybe less so.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 14:30                         ` Vlastimil Babka
@ 2021-12-06 14:53                           ` Michal Hocko
  2021-12-06 18:26                             ` Yang Shi
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2021-12-06 14:53 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Hildenbrand, Nico Pache, linux-kernel, linux-mm, akpm,
	shakeelb, ktkhai, shy828301, guro, vdavydov.dev, raquini

On Mon 06-12-21 15:30:37, Vlastimil Babka wrote:
> On 12/6/21 15:21, Michal Hocko wrote:
> > On Mon 06-12-21 15:08:10, David Hildenbrand wrote:
> >> 
> >> >> But there might be more missing. Onlining a new zone will get more
> >> >> expensive in setups with a lot of possible nodes (x86-64 shouldn't
> >> >> really be an issue in that regard).
> >> > 
> >> > Honestly, I am not really concerned by platforms with too many nodes
> >> > without any memory. If they want to shoot their feet then that's their
> >> > choice. We can optimize for those if they ever prove to be standard.
> >> >  
> >> >> If we want stable backports, we'll want something simple upfront.
> >> > 
> >> > For stable backports I would be fine by doing your NODE_DATA check in
> >> > the allocator. In upstream I think we should be aiming for a more robust
> >> > solution that is also easier to maintain further down the line. Even if
> >> > that is an investment at this moment because the initialization code is
> >> > a mess.
> >> > 
> >> 
> >> Agreed. I would be curious *why* we decided to dynamically allocate the
> >> pgdat. Is this just a historical coincidence or was there a real reason to
> >> not allocate it for all possible nodes during boot?
> > 
> > I don't know but if I was to guess the most likely explanation would be
> > that the numa init code was in a similar order as now and it was easier
> > to simply allocate a pgdat when a new one was onlined.
> > 9af3c2dea3a3 ("[PATCH] pgdat allocation for new node add (call pgdat allocation)")
> > doesn't really tell much.
> 
> I don't know if that's true for pgdat specifically, but generally IMHO the
> advantages of allocating during/after online, instead of for each possible node, are
> - memory savings when some possible node is actually never online
> - at least in some cases, the allocations can be local to the node in
> question, where the advantages are
>   - faster access
>   - less memory occupied on nodes that are earlier online, especially node 0
> 
> So while the approach of allocate on boot for all possible nodes instead of
> just online nodes has advantages of being generally safer and simpler (no
> memory hotplug callbacks etc), we should also be careful not to overdo this
> approach so we don't end up with Node 0 memory filled with structures used
> for nodes 1-X that are just onlined later. I imagine that could be a problem
> even for "sane" archs that don't have tons of possible, but offline nodes.

Yes this can indeed turn out to be a problem as the memory allocations
scales not only with numa nodes but memcgs as well. The later one being
a more visible one.

> Concretely, pgdat should probably be fine, but things like all shrinkers?
> Maybe less so.

Yeah, right. But for that purpose the concept of online_node is just
misleading. You would need a check whether the node is populated with
memory and implement hotplug notifiers.
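
For illustration, such a hotplug notifier might look roughly like the sketch
below. alloc_shrinker_info_node() stands in for a hypothetical helper that
would walk all existing memcgs and allocate their shrinker_info for the new
node; it is not an existing function:

static int shrinker_info_memory_callback(struct notifier_block *self,
					 unsigned long action, void *arg)
{
	struct memory_notify *mn = arg;
	int ret = 0;

	if (action == MEM_GOING_ONLINE)
		/*
		 * Hypothetical helper: allocate per-memcg shrinker_info
		 * for the node that is about to come online.
		 */
		ret = alloc_shrinker_info_node(mn->status_change_nid);

	return notifier_from_errno(ret);
}

Such a callback would presumably be registered with hotplug_memory_notifier()
during boot, like other memory hotplug users do.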

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 14:53                           ` Michal Hocko
@ 2021-12-06 18:26                             ` Yang Shi
  2021-12-07 10:15                               ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Yang Shi @ 2021-12-06 18:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, David Hildenbrand, Nico Pache,
	Linux Kernel Mailing List, Linux MM, Andrew Morton, Shakeel Butt,
	Kirill Tkhai, Roman Gushchin, Vladimir Davydov, raquini

On Mon, Dec 6, 2021 at 6:53 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 06-12-21 15:30:37, Vlastimil Babka wrote:
> > On 12/6/21 15:21, Michal Hocko wrote:
> > > On Mon 06-12-21 15:08:10, David Hildenbrand wrote:
> > >>
> > >> >> But there might be more missing. Onlining a new zone will get more
> > >> >> expensive in setups with a lot of possible nodes (x86-64 shouldn't
> > >> >> really be an issue in that regard).
> > >> >
> > >> > Honestly, I am not really concerned by platforms with too many nodes
> > >> > without any memory. If they want to shoot their feet then that's their
> > >> > choice. We can optimize for those if they ever prove to be standard.
> > >> >
> > >> >> If we want stable backports, we'll want something simple upfront.
> > >> >
> > >> > For stable backports I would be fine by doing your NODE_DATA check in
> > >> > the allocator. In upstream I think we should be aiming for a more robust
> > >> > solution that is also easier to maintain further down the line. Even if
> > >> > that is an investment at this moment because the initialization code is
> > >> > a mess.
> > >> >
> > >>
> > >> Agreed. I would be curious *why* we decided to dynamically allocate the
> > >> pgdat. Is this just a historical coincidence or was there a real reason to
> > >> not allocate it for all possible nodes during boot?
> > >
> > > I don't know but if I was to guess the most likely explanation would be
> > > that the numa init code was in a similar order as now and it was easier
> > > to simply allocate a pgdat when a new one was onlined.
> > > 9af3c2dea3a3 ("[PATCH] pgdat allocation for new node add (call pgdat allocation)")
> > > doesn't really tell much.
> >
> > I don't know if that's true for pgdat specifically, but generally IMHO the
> > advantages of allocating during/after online, instead of for each possible node, are
> > - memory savings when some possible node is actually never online
> > - at least in some cases, the allocations can be local to the node in
> > question, where the advantages are
> >   - faster access
> >   - less memory occupied on nodes that are earlier online, especially node 0
> >
> > So while the approach of allocate on boot for all possible nodes instead of
> > just online nodes has advantages of being generally safer and simpler (no
> > memory hotplug callbacks etc), we should also be careful not to overdo this
> > approach so we don't end up with Node 0 memory filled with structures used
> > for nodes 1-X that are just onlined later. I imagine that could be a problem
> > even for "sane" archs that don't have tons of possible, but offline nodes.
>
> Yes this can indeed turn out to be a problem as the memory allocations
> scales not only with numa nodes but memcgs as well. The later one being
> a more visible one.
>
> > Concretely, pgdat should probably be fine, but things like all shrinkers?
> > Maybe less so.
>
> Yeah, right. But for that purpose the concept of online_node is just
> misleading. You would need a check whether the node is populated with
> memory and implement hotplug notifiers.

Yes, the downside is memory waste. I think it is a known problem since
memcg has per node data (a.k.a. mem_cgroup_per_node_info) which holds
lruvec and shrinker infos. And the comment in the code of
alloc_mem_cgroup_per_node_info() does say:

"TODO: this routine can waste much memory for nodes which will never
be onlined. It's better to use memory hotplug callback function."

But IMHO the memory usage should actually not be that bad for memcg-heavy
use cases since there should not be too many "never onlined"
nodes for such workloads?

>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 13:19       ` Kirill Tkhai
  2021-12-06 13:24         ` Michal Hocko
@ 2021-12-06 18:42         ` Yang Shi
  2021-12-06 19:01           ` David Hildenbrand
  2021-12-07 21:45         ` Nico Pache
  2 siblings, 1 reply; 37+ messages in thread
From: Yang Shi @ 2021-12-06 18:42 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: David Hildenbrand, Michal Hocko, Nico Pache,
	Linux Kernel Mailing List, Linux MM, Andrew Morton, Shakeel Butt,
	Roman Gushchin, Vlastimil Babka, Vladimir Davydov, raquini

On Mon, Dec 6, 2021 at 5:19 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 06.12.2021 13:45, David Hildenbrand wrote:
> >> This doesn't seem complete. Slab shrinkers are used in the reclaim
> >> context. Previously offline nodes could be onlined later and this would
> >> lead to NULL ptr because there is no hook to allocate new shrinker
> >> infos. This would be also really impractical because this would have to
> >> update all existing memcgs...
> >
> > Instead of going through the trouble of updating...
> >
> > ...  maybe just keep for_each_node() and check if the target node is
> > offline. If it's offline, just allocate from the first online node.
> > After all, we're not using __GFP_THISNODE, so there are no guarantees
> > either way ...
>
> Hm, can't we add shrinker maps allocation to __try_online_node() in addition
> to this patch?

I think the below fix (an example, doesn't cover all affected
callsites) should be good enough for now? It doesn't touch the hot
path of the page allocator.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..1252a33f7c28 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -222,13 +222,15 @@ static int expand_one_shrinker_info(struct
mem_cgroup *memcg,
        int size = map_size + defer_size;

        for_each_node(nid) {
+               int tmp = nid;
                pn = memcg->nodeinfo[nid];
                old = shrinker_info_protected(memcg, nid);
                /* Not yet online memcg */
                if (!old)
                        return 0;
-
-               new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
+               if (!node_online(nid))
+                       tmp = -1;
+               new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
                if (!new)
                        return -ENOMEM;

It used to use kvmalloc instead of kvmalloc_node(). The commit
86daf94efb11d7319fbef5e480018c4807add6ef ("mm/memcontrol.c: allocate
shrinker_map on appropriate NUMA node") changed to use *_node()
version. The justification was that "kswapd is always bound to
specific node. So allocate shrinker_map from the related NUMA node to
respect its NUMA locality." There is no kswapd for offlined node, so
just allocate shrinker info on node 0. This is also what
alloc_mem_cgroup_per_node_info() does.

Making the memcg per-node data allocation memory hotplug aware should
be solved in a separate patchset IMHO.
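
For context, the fallback in alloc_mem_cgroup_per_node_info() referred to
here looks roughly like this (the same lines also appear verbatim in a patch
draft later in this thread):

	int tmp = node;

	if (!node_state(node, N_NORMAL_MEMORY))
		tmp = -1;
	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, tmp);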

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06  3:33 ` [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes Nico Pache
  2021-12-06  8:32   ` David Hildenbrand
  2021-12-06  9:22   ` Michal Hocko
@ 2021-12-06 18:45   ` Yang Shi
  2 siblings, 0 replies; 37+ messages in thread
From: Yang Shi @ 2021-12-06 18:45 UTC (permalink / raw)
  To: Nico Pache
  Cc: Linux Kernel Mailing List, Linux MM, Andrew Morton, Shakeel Butt,
	Kirill Tkhai, Roman Gushchin, Vlastimil Babka, Vladimir Davydov,
	raquini

On Sun, Dec 5, 2021 at 7:34 PM Nico Pache <npache@redhat.com> wrote:
>
> We have run into a panic caused by a shrinker allocation being attempted
> on an offlined node.
>
> Our crash analysis has determined that the issue originates from trying
> to allocate pages on an offlined node in expand_one_shrinker_info. This
> function makes the incorrect assumption that we can allocate on any node.
> To correct this we make sure we only iterate over online nodes.
>
> This assumption can lead to an incorrect address being assigned to ac->zonelist
> in the following callchain:
>         __alloc_pages
>         -> prepare_alloc_pages
>          -> node_zonelist
>
> static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> {
>         return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> }
> if the node is not online the return of node_zonelist will evaluate to an
> invalid pointer of 0x00000 + offset_of(node_zonelists) + (1|0)
>
> This address is then dereferenced further down the callchain in:
>         prepare_alloc_pages
>         -> first_zones_zonelist
>          -> next_zones_zonelist
>           -> zonelist_zone_idx
>
> static inline int zonelist_zone_idx(struct zoneref *zoneref)
> {
>         return zoneref->zone_idx;
> }
>
> Leading the system to panic.
>
> We also correct this behavior in alloc_shrinker_info, free_shrinker_info,
> and reparent_shrinker_deferred.
>
> Fixes: 2bfd36374edd ("mm: vmscan: consolidate shrinker_maps handling code")
> Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")

I think the correct fix tag should be: 86daf94efb11 ("mm/memcontrol.c:
allocate shrinker_map on appropriate NUMA node") regardless of how we
will fix this problem.

> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/vmscan.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fb9584641ac7..731564b61e3f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -221,7 +221,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>         int nid;
>         int size = map_size + defer_size;
>
> -       for_each_node(nid) {
> +       for_each_online_node(nid) {
>                 pn = memcg->nodeinfo[nid];
>                 old = shrinker_info_protected(memcg, nid);
>                 /* Not yet online memcg */
> @@ -256,7 +256,7 @@ void free_shrinker_info(struct mem_cgroup *memcg)
>         struct shrinker_info *info;
>         int nid;
>
> -       for_each_node(nid) {
> +       for_each_online_node(nid) {
>                 pn = memcg->nodeinfo[nid];
>                 info = rcu_dereference_protected(pn->shrinker_info, true);
>                 kvfree(info);
> @@ -274,7 +274,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>         map_size = shrinker_map_size(shrinker_nr_max);
>         defer_size = shrinker_defer_size(shrinker_nr_max);
>         size = map_size + defer_size;
> -       for_each_node(nid) {
> +       for_each_online_node(nid) {
>                 info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
>                 if (!info) {
>                         free_shrinker_info(memcg);
> @@ -417,7 +417,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>
>         /* Prevent from concurrent shrinker_info expand */
>         down_read(&shrinker_rwsem);
> -       for_each_node(nid) {
> +       for_each_online_node(nid) {
>                 child_info = shrinker_info_protected(memcg, nid);
>                 parent_info = shrinker_info_protected(parent, nid);
>                 for (i = 0; i < shrinker_nr_max; i++) {
> --
> 2.33.1
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 18:42         ` Yang Shi
@ 2021-12-06 19:01           ` David Hildenbrand
  2021-12-06 21:28             ` Yang Shi
  2021-12-07 10:55             ` Michal Hocko
  0 siblings, 2 replies; 37+ messages in thread
From: David Hildenbrand @ 2021-12-06 19:01 UTC (permalink / raw)
  To: Yang Shi, Kirill Tkhai
  Cc: Michal Hocko, Nico Pache, Linux Kernel Mailing List, Linux MM,
	Andrew Morton, Shakeel Butt, Roman Gushchin, Vlastimil Babka,
	Vladimir Davydov, raquini

On 06.12.21 19:42, Yang Shi wrote:
> On Mon, Dec 6, 2021 at 5:19 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 06.12.2021 13:45, David Hildenbrand wrote:
>>>> This doesn't seem complete. Slab shrinkers are used in the reclaim
>>>> context. Previously offline nodes could be onlined later and this would
>>>> lead to NULL ptr because there is no hook to allocate new shrinker
>>>> infos. This would be also really impractical because this would have to
>>>> update all existing memcgs...
>>>
>>> Instead of going through the trouble of updating...
>>>
>>> ...  maybe just keep for_each_node() and check if the target node is
>>> offline. If it's offline, just allocate from the first online node.
>>> After all, we're not using __GFP_THISNODE, so there are no guarantees
>>> either way ...
>>
>> Hm, can't we add shrinker maps allocation to __try_online_node() in addition
>> to this patch?
> 
> I think the below fix (an example, doesn't cover all affected
> callsites) should be good enough for now? It doesn't touch the hot
> path of the page allocator.
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fb9584641ac7..1252a33f7c28 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -222,13 +222,15 @@ static int expand_one_shrinker_info(struct
> mem_cgroup *memcg,
>         int size = map_size + defer_size;
> 
>         for_each_node(nid) {
> +               int tmp = nid;
>                 pn = memcg->nodeinfo[nid];
>                 old = shrinker_info_protected(memcg, nid);
>                 /* Not yet online memcg */
>                 if (!old)
>                         return 0;
> -
> -               new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> +               if (!node_online(nid))
> +                       tmp = -1;
> +               new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
>                 if (!new)
>                         return -ENOMEM;
> 
> It used to use kvmalloc instead of kvmalloc_node(). The commit
> 86daf94efb11d7319fbef5e480018c4807add6ef ("mm/memcontrol.c: allocate
> shrinker_map on appropriate NUMA node") changed to use *_node()
> version. The justification was that "kswapd is always bound to
> specific node. So allocate shrinker_map from the related NUMA node to
> respect its NUMA locality." There is no kswapd for offlined node, so
> just allocate shrinker info on node 0. This is also what
> alloc_mem_cgroup_per_node_info() does.

Yes, that's what I refer to as fixing it in the caller -- similar to
[1]. Michal's point is to not require such node_online() checks at all,
neither in the caller nor in the buddy.

I see 2 options short-term

1) What we have in [1].
2) What I proposed in [2], fixing it for all such instances until we
have something better.

Long term I tend to agree that what Michal proposes is better.

Short term I tend to like [2], because it avoids having to mess with all
such instances to eventually get it right and the temporary overhead
until we have the code reworked should be really negligible ...



[1] https://lkml.kernel.org/r/20211108202325.20304-1-amakhalov@vmware.com
[2]
https://lkml.kernel.org/r/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 19:01           ` David Hildenbrand
@ 2021-12-06 21:28             ` Yang Shi
  2021-12-07 10:15               ` David Hildenbrand
  2021-12-07 10:55             ` Michal Hocko
  1 sibling, 1 reply; 37+ messages in thread
From: Yang Shi @ 2021-12-06 21:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kirill Tkhai, Michal Hocko, Nico Pache,
	Linux Kernel Mailing List, Linux MM, Andrew Morton, Shakeel Butt,
	Roman Gushchin, Vlastimil Babka, Vladimir Davydov, raquini

On Mon, Dec 6, 2021 at 11:01 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 06.12.21 19:42, Yang Shi wrote:
> > On Mon, Dec 6, 2021 at 5:19 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>
> >> On 06.12.2021 13:45, David Hildenbrand wrote:
> >>>> This doesn't seem complete. Slab shrinkers are used in the reclaim
> >>>> context. Previously offline nodes could be onlined later and this would
> >>>> lead to NULL ptr because there is no hook to allocate new shrinker
> >>>> infos. This would be also really impractical because this would have to
> >>>> update all existing memcgs...
> >>>
> >>> Instead of going through the trouble of updating...
> >>>
> >>> ...  maybe just keep for_each_node() and check if the target node is
> >>> offline. If it's offline, just allocate from the first online node.
> >>> After all, we're not using __GFP_THISNODE, so there are no guarantees
> >>> either way ...
> >>
> >> Hm, can't we add shrinker maps allocation to __try_online_node() in addition
> >> to this patch?
> >
> > I think the below fix (an example, doesn't cover all affected
> > callsites) should be good enough for now? It doesn't touch the hot
> > path of the page allocator.
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index fb9584641ac7..1252a33f7c28 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -222,13 +222,15 @@ static int expand_one_shrinker_info(struct
> > mem_cgroup *memcg,
> >         int size = map_size + defer_size;
> >
> >         for_each_node(nid) {
> > +               int tmp = nid;
> >                 pn = memcg->nodeinfo[nid];
> >                 old = shrinker_info_protected(memcg, nid);
> >                 /* Not yet online memcg */
> >                 if (!old)
> >                         return 0;
> > -
> > -               new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> > +               if (!node_online(nid))
> > +                       tmp = -1;
> > +               new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
> >                 if (!new)
> >                         return -ENOMEM;
> >
> > It used to use kvmalloc instead of kvmalloc_node(). The commit
> > 86daf94efb11d7319fbef5e480018c4807add6ef ("mm/memcontrol.c: allocate
> > shrinker_map on appropriate NUMA node") changed to use *_node()
> > version. The justification was that "kswapd is always bound to
> > specific node. So allocate shrinker_map from the related NUMA node to
> > respect its NUMA locality." There is no kswapd for offlined node, so
> > just allocate shrinker info on node 0. This is also what
> > alloc_mem_cgroup_per_node_info() does.
>
> Yes, that's what I refer to as fixing it in the caller -- similar to
> [1]. Michal's point is to not require such node_online() checks at all,
> neither in the caller nor in the buddy.
>
> I see 2 options short-term
>
> 1) What we have in [1].
> 2) What I proposed in [2], fixing it for all such instances until we
> have something better.
>
> Long term I tend to agree that what Michal proposes is better.
>
> Short term I tend to like [2], because it avoids having to mess with all
> such instances to eventually get it right and the temporary overhead
> until we have the code reworked should be really negligible ...

Thanks, David. Basically either option looks fine to me. But I'm a
little bit concerned about [2]. It silently changes the node requested
by the callers. It actually papers over potential bugs? And what if
the callers specify __GFP_THISNODE (I didn't search if such callers
really exist in the current code)?

How's about a helper function, for example, called
kvmalloc_best_node()? It does:

void * kvmalloc_best_node(unsigned long size, int flag, int nid)
{
    bool onlined = node_online(nid);

    WARN_ON_ONCE((flag & __GFP_THISNODE) && !onlined);

    if (!onlined)
        nid = -1;

    return kvmalloc_node(size, GFP_xxx, nid);
}
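
With a helper along those lines, a caller such as expand_one_shrinker_info()
could keep its plain for_each_node() loop and do something like the
following (illustrative only, assuming the sketch above):

		new = kvmalloc_best_node(sizeof(*new) + size, GFP_KERNEL, nid);
		if (!new)
			return -ENOMEM;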

>
>
>
> [1] https://lkml.kernel.org/r/20211108202325.20304-1-amakhalov@vmware.com
> [2]
> https://lkml.kernel.org/r/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com
>
> --
> Thanks,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 18:26                             ` Yang Shi
@ 2021-12-07 10:15                               ` Michal Hocko
  0 siblings, 0 replies; 37+ messages in thread
From: Michal Hocko @ 2021-12-07 10:15 UTC (permalink / raw)
  To: Yang Shi
  Cc: Vlastimil Babka, David Hildenbrand, Nico Pache,
	Linux Kernel Mailing List, Linux MM, Andrew Morton, Shakeel Butt,
	Kirill Tkhai, Roman Gushchin, Vladimir Davydov, raquini

On Mon 06-12-21 10:26:32, Yang Shi wrote:
[...]
> But IMHO the memory usage should actually not be that bad for memcg-heavy
> use cases since there should not be too many "never onlined"
> nodes for such workloads?

Hardware with a very sparse node topology is really scarce. It is more likely
on ppc (LPARs), but even then we are talking about a really low number of
nodes. At least this is my experience.

So while the memory wasting is possible, it doesn't seem to be a really
pressing problem. I would be more careful about code which scales with
MAX_NUMNODES because that can be really large.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 21:28             ` Yang Shi
@ 2021-12-07 10:15               ` David Hildenbrand
  0 siblings, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2021-12-07 10:15 UTC (permalink / raw)
  To: Yang Shi
  Cc: Kirill Tkhai, Michal Hocko, Nico Pache,
	Linux Kernel Mailing List, Linux MM, Andrew Morton, Shakeel Butt,
	Roman Gushchin, Vlastimil Babka, Vladimir Davydov, raquini

>>
>> Short term I tend to like [2], because it avoids having to mess with all
>> such instances to eventually get it right and the temporary overhead
>> until we have the code reworked should be really negligible ...
> 
> Thanks, David. Basically either option looks fine to me. But I'm a
> little bit concerned about [2]. It silently changes the node requested
> by the callers. It actually papers over potential bugs? And what if

Hi,

It's just a preferred node, so we do have a node fallback already via
the zonelist to other nodes for proper online nodes -- and would have
the proper node fallback when preallocating all pgdat.

*Not* doing the fallback with a preferred node that is not online could
be considered a BUG instead.

Note that [2] was just a quick draft. We might have to do some minor
adjustments to handle __GFP_THISNODE properly.

But after all, we have:

VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));

in __alloc_pages_node() and __folio_alloc_node(). So it might not be
worth adjusting at all.

> the callers specify __GFP_THISNODE (I didn't search if such callers
> really exist in the current code)?
> 
> How's about a helper function, for example, called
> kvmalloc_best_node()? It does:
> 
> void * kvmalloc_best_node(unsigned long size, int flag, int nid)
> {
>     bool onlined = node_online(nid);
> 
>     WARN_ON_ONCE((flag & __GFP_THISNODE) && !onlined);
> 
>     if (!onlined)
>         nid = -1;
> 
>     return kvmalloc_node(size, GFP_xxx, nid);
> }

We still have to "fix" each and every affected for_each_node() ... code
until we have preallocation of pgdat for all possible nodes.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 19:01           ` David Hildenbrand
  2021-12-06 21:28             ` Yang Shi
@ 2021-12-07 10:55             ` Michal Hocko
  1 sibling, 0 replies; 37+ messages in thread
From: Michal Hocko @ 2021-12-07 10:55 UTC (permalink / raw)
  To: David Hildenbrand, Nico Pache, Rafael Aquini
  Cc: Yang Shi, Kirill Tkhai, Linux Kernel Mailing List, Linux MM,
	Andrew Morton, Shakeel Butt, Roman Gushchin, Vlastimil Babka,
	Vladimir Davydov, raquini

On Mon 06-12-21 20:01:34, David Hildenbrand wrote:
[...]
> Yes, that's what I refer to as fixing it in the caller -- similar to
> [1]. Michal's point is to not require such node_online() checks at all,
> neither in the caller nor in the buddy.
> 
> I see 2 options short-term
> 
> 1) What we have in [1].
> 2) What I proposed in [2], fixing it for all such instances until we
> have something better.
> 
> Long term I tend to agree that what Michal proposes is better.
> 
> Short term I tend to like [2], because it avoids having to mess with all
> such instances to eventually get it right and the temporary overhead
> until we have the code reworked should be really negligible ...

I do dislike both but if I were to choose between the two,
then 2 is surely more targeted. We really do not want to spread this
into bulk/pcp or whatever other allocator there is. The problem is that
somebody might still try to access NODE_DATA (e.g. via a helper that
hides that fact).

Anyway, I am not sure whether the authors of the patch can reproduce the
problem and whether they can run test code on their machine. If yes,
it would be great to try with http://lkml.kernel.org/r/Ya89aqij6nMwJrIZ@dhcp22.suse.cz
that I have just sent.

> [1] https://lkml.kernel.org/r/20211108202325.20304-1-amakhalov@vmware.com
> [2]
> https://lkml.kernel.org/r/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com
> 
> -- 
> Thanks,
> 
> David / dhildenb

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 1/2] include/linux/gfp.h: Do not allocate pages on a offlined node
  2021-12-06  9:22   ` Michal Hocko
@ 2021-12-07 21:24     ` Nico Pache
  0 siblings, 0 replies; 37+ messages in thread
From: Nico Pache @ 2021-12-07 21:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, akpm, shakeelb, ktkhai, shy828301, guro,
	vbabka, vdavydov.dev, raquini, Rafael Aquini



On 12/6/21 04:22, Michal Hocko wrote:
> On Sun 05-12-21 22:33:37, Nico Pache wrote:
>> We shouldn't allocate pages on a unavailable node. Add a check for this
>> in __alloc_pages_node and return NULL to avoid issues further down the
>> callchain.
>>
>> Also update the VM_WARN_ON in __alloc_pages_node which could skip this
>> warn if the gfp_mask is not GFP_THISNODE.
> 
> Page allocator trusts callers to know they are doing a right thing so
> that no unnecessary branches have to be implemented and the fast path is
> really optimized for performance. Allocating from an !node_online node
> is a bug in the caller. One that is not really that hard to make
> unfortunately but also one that is not that common.

Thank you for your review,

That makes sense. I will drop this patch on my second posting to avoid any
performance regression.

Cheers,
-- Nico

> 
> That being said I am not really sure we want to introduce this check.
> 
>> Co-developed-by: Rafael Aquini <aquini@redhat.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>  include/linux/gfp.h | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index b976c4177299..e7e18f6d0d9d 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -565,7 +565,10 @@ static inline struct page *
>>  __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>>  {
>>  	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> -	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
>> +	VM_WARN_ON(!node_online(nid));
>> +
>> +	if (!node_online(nid))
>> +		return NULL;
>>  
>>  	return __alloc_pages(gfp_mask, order, nid, NULL);
>>  }
>> -- 
>> 2.33.1
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06  9:22   ` Michal Hocko
  2021-12-06  9:24     ` Michal Hocko
  2021-12-06 10:45     ` David Hildenbrand
@ 2021-12-07 21:34     ` Nico Pache
  2 siblings, 0 replies; 37+ messages in thread
From: Nico Pache @ 2021-12-07 21:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, akpm, shakeelb, ktkhai, shy828301, guro,
	vbabka, vdavydov.dev, raquini



On 12/6/21 04:22, Michal Hocko wrote:
> On Sun 05-12-21 22:33:38, Nico Pache wrote:
>> We have run into a panic caused by a shrinker allocation being attempted
>> on an offlined node.
>>
>> Our crash analysis has determined that the issue originates from trying
>> to allocate pages on an offlined node in expand_one_shrinker_info. This
>> function makes the incorrect assumption that we can allocate on any node.
>> To correct this we make sure we only iterate over online nodes.
>>
>> This assumption can lead to an incorrect address being assigned to ac->zonelist
>> in the following callchain:
>> 	__alloc_pages
>> 	-> prepare_alloc_pages
>> 	 -> node_zonelist
>>
>> static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>> {
>>         return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
>> }
>> if the node is not online the return of node_zonelist will evaluate to an
>> invalid pointer of 0x00000 + offset_of(node_zonelists) + (1|0)
>>
>> This address is then dereferenced further down the callchain in:
>> 	prepare_alloc_pages
>> 	-> first_zones_zonelist
>>   	 -> next_zones_zonelist
>> 	  -> zonelist_zone_idx
>>
>> static inline int zonelist_zone_idx(struct zoneref *zoneref)
>> {
>>         return zoneref->zone_idx;
>> }
>>
>> Leading the system to panic.
> 
> Thanks for the analysis! Please also add an oops report so that this is
> easier to search for. It would be also interesting to see specifics
> about the issue. Why was the specific node !online in the first place?
> What architecture was this on?

Here is the Oops report. I will also add it to my commit message on the second
posting! This was x86 btw, but it has also been hit on PPC64.

[  362.179917] RIP: 0010:prepare_alloc_pages.constprop.0+0xc6/0x150
[  362.186639] Code: 03 80 c9 80 83 e2 03 83 fa 01 0f 44 c1 41 89 04 24 c1 eb 0c
48 8b 55 08 83 e3 01 8b 75 1c 48 8b 7d 00 88 5d 20 48 85 d2 75 6b <3b> 77 08 72
66 48 89 7d 10 b8 01 00 00 00 48 83 c4 08 5b 5d 41 5c
[  362.207604] RSP: 0018:ffffb4ba31427bc0 EFLAGS: 00010246
[  362.213443] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000081
[  362.221412] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[  362.229380] RBP: ffffb4ba31427bf8 R08: 0000000000000000 R09: ffffb4ba31427bf4
[  362.237347] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb4ba31427bf0
[  362.245316] R13: 0000000000000002 R14: ffff9f2fb3788000 R15: 0000000000000078
[  362.253285] FS:  00007fbc57bfd740(0000) GS:ffff9f4c7d780000(0000)
knlGS:0000000000000000
[  362.262322] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  362.268739] CR2: 0000000000002088 CR3: 000002004cb58002 CR4: 00000000007706e0
[  362.276707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  362.284675] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  362.292644] PKRU: 55555554
[  362.295669] Call Trace:
[  362.298402]  __alloc_pages+0x9d/0x210
[  362.302501]  kmalloc_large_node+0x40/0x90
[  362.306988]  __kmalloc_node+0x3ac/0x480
[  362.311279]  kvmalloc_node+0x46/0x80
[  362.315276]  expand_one_shrinker_info+0x84/0x190
[  362.320437]  prealloc_shrinker+0x166/0x1c0
[  362.325015]  alloc_super+0x2ba/0x330
[  362.329011]  ? __fput_sync+0x30/0x30
[  362.333003]  ? set_anon_super+0x40/0x40
[  362.337288]  sget_fc+0x6c/0x2f0
[  362.340798]  ? mqueue_create+0x20/0x20
[  362.344992]  get_tree_keyed+0x2f/0xc0
[  362.349086]  vfs_get_tree+0x25/0xb0
[  362.352982]  fc_mount+0xe/0x30
[  362.356397]  mq_init_ns+0x105/0x1a0
[  362.360291]  copy_ipcs+0x129/0x220
[  362.364093]  create_new_namespaces+0xa1/0x2e0
[  362.368960]  unshare_nsproxy_namespaces+0x55/0xa0
[  362.374216]  ksys_unshare+0x198/0x380
[  362.378310]  __x64_sys_unshare+0xe/0x20
[  362.382595]  do_syscall_64+0x3b/0x90
[  362.386597]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  362.392247] RIP: 0033:0x7fbc57d14d7b
[  362.396242] Code: 73 01 c3 48 8b 0d ad 70 0e 00 f7 d8 64 89 01 48 83 c8 ff c3
66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 01 00 00 0f 05 <48> 3d 01 f0
ff ff 73 01 c3 48 8b 0d 7d 70 0e 00 f7 d8 64 89 01 48
[  362.417204] RSP: 002b:00007fbc4dc73f88 EFLAGS: 00000206 ORIG_RAX:
0000000000000110
[  362.425664] RAX: ffffffffffffffda RBX: 0000560144b09578 RCX: 00007fbc57d14d7b
[  362.433634] RDX: 0000000000000010 RSI: 00007fbc4dc73f90 RDI: 0000000008000000
[  362.441602] RBP: 0000560144b095a0 R08: 0000000000000000 R09: 0000000000000000
[  362.449573] R10: 0000000000000000 R11: 0000000000000206 R12: 00007fbc4dc73f90
[  362.457542] R13: 000000000000006a R14: 0000560144e7e5a0 R15: 00007fff5dec8e10

> 
>> We also correct this behavior in alloc_shrinker_info, free_shrinker_info,
>> and reparent_shrinker_deferred.
>>
>> Fixes: 2bfd36374edd ("mm: vmscan: consolidate shrinker_maps handling code")
>> Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
> 
> Normally I would split the fix as it is fixing two issues one introduced
> in 4.19 the other in 5.13.

These are both commits related to the function: one introduces it while the
other moves it to a separate file. But as Yang Shi pointed out, the better commit
to blame is 86daf94efb11 ("mm/memcontrol.c: allocate shrinker_map on appropriate
NUMA node") which actually made the change that would have caused the allocator
to go for an !online node.

> 
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>  mm/vmscan.c | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index fb9584641ac7..731564b61e3f 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -221,7 +221,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>  	int nid;
>>  	int size = map_size + defer_size;
>>  
>> -	for_each_node(nid) {
>> +	for_each_online_node(nid) {
>>  		pn = memcg->nodeinfo[nid];
>>  		old = shrinker_info_protected(memcg, nid);
>>  		/* Not yet online memcg */
>> @@ -256,7 +256,7 @@ void free_shrinker_info(struct mem_cgroup *memcg)
>>  	struct shrinker_info *info;
>>  	int nid;
>>  
>> -	for_each_node(nid) {
>> +	for_each_online_node(nid) {
>>  		pn = memcg->nodeinfo[nid];
>>  		info = rcu_dereference_protected(pn->shrinker_info, true);
>>  		kvfree(info);
>> @@ -274,7 +274,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>>  	map_size = shrinker_map_size(shrinker_nr_max);
>>  	defer_size = shrinker_defer_size(shrinker_nr_max);
>>  	size = map_size + defer_size;
>> -	for_each_node(nid) {
>> +	for_each_online_node(nid) {
>>  		info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
>>  		if (!info) {
>>  			free_shrinker_info(memcg);
>> @@ -417,7 +417,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>>  
>>  	/* Prevent from concurrent shrinker_info expand */
>>  	down_read(&shrinker_rwsem);
>> -	for_each_node(nid) {
>> +	for_each_online_node(nid) {
>>  		child_info = shrinker_info_protected(memcg, nid);
>>  		parent_info = shrinker_info_protected(parent, nid);
>>  		for (i = 0; i < shrinker_nr_max; i++) {
>> -- 
>> 2.33.1
> 
> This doesn't seem complete. Slab shrinkers are used in the reclaim
> context. Previously offline nodes could be onlined later and this would
> lead to NULL ptr because there is no hook to allocate new shrinker
> infos. This would be also really impractical because this would have to
> update all existing memcgs...
> 
> To be completely honest I am not really sure this is a practical problem
> because some architectures allocate (aka make online) all possible nodes
> reported by the platform. There are major inconsistencies there. Maybe
> that should be unified, so that problems like this one do not really
> have to add a complexity to the code.

I'm currently working on a solution that will use register_one_node to perform the
memcg node allocation for all memcgs. I will post it once I've verified all
potential corner cases.

Cheers,
-- Nico

> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 10:45     ` David Hildenbrand
  2021-12-06 10:54       ` Michal Hocko
  2021-12-06 13:19       ` Kirill Tkhai
@ 2021-12-07 21:40       ` Nico Pache
  2 siblings, 0 replies; 37+ messages in thread
From: Nico Pache @ 2021-12-07 21:40 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: linux-kernel, linux-mm, akpm, shakeelb, ktkhai, shy828301, guro,
	vbabka, vdavydov.dev, raquini



On 12/6/21 05:45, David Hildenbrand wrote:
>> This doesn't seem complete. Slab shrinkers are used in the reclaim
>> context. Previously offline nodes could be onlined later and this would
>> lead to NULL ptr because there is no hook to allocate new shrinker
>> infos. This would be also really impractical because this would have to
>> update all existing memcgs...
> 
> Instead of going through the trouble of updating...
> 
> ...  maybe just keep for_each_node() and check if the target node is
> offline. If it's offline, just allocate from the first online node.
> After all, we're not using __GFP_THISNODE, so there are no guarantees
> either way ...

Thanks for your reviews David :)

Yes, I believe this would be a viable solution. At this point it looks like our
short term solution would be something along these lines. Long term we would want
to either allocate all pgdats (which will waste memory) or update the memcgs
with the appropriate node information once a node is onlined. I'm currently
working on a solution that does the latter.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 13:19       ` Kirill Tkhai
  2021-12-06 13:24         ` Michal Hocko
  2021-12-06 18:42         ` Yang Shi
@ 2021-12-07 21:45         ` Nico Pache
  2 siblings, 0 replies; 37+ messages in thread
From: Nico Pache @ 2021-12-07 21:45 UTC (permalink / raw)
  To: Kirill Tkhai, David Hildenbrand, Michal Hocko
  Cc: linux-kernel, linux-mm, akpm, shakeelb, shy828301, guro, vbabka,
	vdavydov.dev, raquini



On 12/6/21 08:19, Kirill Tkhai wrote:
> On 06.12.2021 13:45, David Hildenbrand wrote:
>>> This doesn't seem complete. Slab shrinkers are used in the reclaim
>>> context. Previously offline nodes could be onlined later and this would
>>> lead to NULL ptr because there is no hook to allocate new shrinker
>>> infos. This would be also really impractical because this would have to
>>> update all existing memcgs...
>>
>> Instead of going through the trouble of updating...
>>
>> ...  maybe just keep for_each_node() and check if the target node is
>> offline. If it's offline, just allocate from the first online node.
>> After all, we're not using __GFP_THISNODE, so there are no guarantees
>> either way ...
> 
> Hm, can't we add shrinker maps allocation to __try_online_node() in addition
> to this patch? 
> 
Thanks for the feedback :) I am currently working on a solution similar to this.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-06 13:24         ` Michal Hocko
@ 2021-12-08 19:00           ` Nico Pache
  0 siblings, 0 replies; 37+ messages in thread
From: Nico Pache @ 2021-12-08 19:00 UTC (permalink / raw)
  To: Michal Hocko, Kirill Tkhai, David Hildenbrand
  Cc: linux-kernel, linux-mm, akpm, shakeelb, shy828301, guro, vbabka,
	vdavydov.dev, raquini



On 12/6/21 08:24, Michal Hocko wrote:
> On Mon 06-12-21 16:19:12, Kirill Tkhai wrote:
>> On 06.12.2021 13:45, David Hildenbrand wrote:
>>>> This doesn't seem complete. Slab shrinkers are used in the reclaim
>>>> context. Previously offline nodes could be onlined later and this would
>>>> lead to NULL ptr because there is no hook to allocate new shrinker
>>>> infos. This would be also really impractical because this would have to
>>>> update all existing memcgs...
>>>
>>> Instead of going through the trouble of updating...
>>>
>>> ...  maybe just keep for_each_node() and check if the target node is
>>> offline. If it's offline, just allocate from the first online node.
>>> After all, we're not using __GFP_THISNODE, so there are no guarantees
>>> either way ...
>>
>> Hm, can't we add shrinker maps allocation to __try_online_node() in addition
>> to this patch? 
> 
> Either that or through hotplug notifier (which would be a better
> solution). But allocating a new shrinker map for each memcg would have
> to be done as has been mentioned earlier.

I took a stab at this approach. It may be incomplete but please let me know what
you think. This would go on top of this series.

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c5c403f4be6..6c842382fa73 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -520,6 +520,7 @@ static inline struct mem_cgroup *page_memcg_check(struct
page *page)
 	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
 }

+int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node);
 #ifdef CONFIG_MEMCG_KMEM
 /*
  * folio_memcg_kmem - Check if the folio has the memcg_kmem flag set.
diff --git a/include/linux/node.h b/include/linux/node.h
index bb21fd631b16..5e8c737ea751 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -19,7 +19,7 @@
 #include <linux/cpumask.h>
 #include <linux/list.h>
 #include <linux/workqueue.h>
-
+#include <linux/memcontrol.h>
 /**
  * struct node_hmem_attrs - heterogeneous memory performance attributes
  *
@@ -118,6 +118,7 @@ extern int __register_one_node(int nid);
 /* Registers an online node */
 static inline int register_one_node(int nid)
 {
+	struct mem_cgroup *memcg;
 	int error = 0;

 	if (node_online(nid)) {
@@ -130,6 +131,14 @@ static inline int register_one_node(int nid)
 			return error;
 		/* link memory sections under this node */
 		link_mem_sections(nid, start_pfn, end_pfn, MEMINIT_EARLY);
+		/* Iterate over memcgs and update nodeinfo  */
+		memcg = mem_cgroup_iter(NULL, NULL, NULL);
+		do {
+			if (alloc_mem_cgroup_per_node_info(memcg,nid)) {
+				mem_cgroup_iter_break(NULL, memcg);
+				return error;
+			}
+		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
 	}

 	return error;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6863a834ed42..2d55fad3229b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5041,18 +5041,11 @@ struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
 	return idr_find(&mem_cgroup_idr, id);
 }

-static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
+int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 {
 	struct mem_cgroup_per_node *pn;
 	int tmp = node;
-	/*
-	 * This routine is called against possible nodes.
-	 * But it's BUG to call kmalloc() against offline node.
-	 *
-	 * TODO: this routine can waste much memory for nodes which will
-	 *       never be onlined. It's better to use memory hotplug callback
-	 *       function.
-	 */
+
 	if (!node_state(node, N_NORMAL_MEMORY))
 		tmp = -1;
 	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, tmp);
@@ -5130,7 +5123,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	if (!memcg->vmstats_percpu)
 		goto fail;

-	for_each_node(node)
+	for_each_online_node(node)
 		if (alloc_mem_cgroup_per_node_info(memcg, node))
 			goto fail;


^ permalink raw reply related	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2021-12-08 19:00 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-06  3:33 [RFC PATCH 0/2] mm: Dont allocate pages on a offline node Nico Pache
2021-12-06  3:33 ` [RFC PATCH 1/2] include/linux/gfp.h: Do not allocate pages on a offlined node Nico Pache
2021-12-06  3:37   ` Matthew Wilcox
2021-12-06  8:29   ` David Hildenbrand
2021-12-06  9:22   ` Michal Hocko
2021-12-07 21:24     ` Nico Pache
2021-12-06  3:33 ` [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes Nico Pache
2021-12-06  8:32   ` David Hildenbrand
2021-12-06  9:22   ` Michal Hocko
2021-12-06  9:24     ` Michal Hocko
2021-12-06 10:45     ` David Hildenbrand
2021-12-06 10:54       ` Michal Hocko
2021-12-06 11:00         ` David Hildenbrand
2021-12-06 11:22           ` Michal Hocko
2021-12-06 12:43             ` David Hildenbrand
2021-12-06 13:06               ` Michal Hocko
2021-12-06 13:47                 ` David Hildenbrand
2021-12-06 14:06                   ` Michal Hocko
2021-12-06 14:08                     ` David Hildenbrand
2021-12-06 14:21                       ` Michal Hocko
2021-12-06 14:30                         ` Vlastimil Babka
2021-12-06 14:53                           ` Michal Hocko
2021-12-06 18:26                             ` Yang Shi
2021-12-07 10:15                               ` Michal Hocko
2021-12-06 14:15                   ` Michal Hocko
2021-12-06 13:19       ` Kirill Tkhai
2021-12-06 13:24         ` Michal Hocko
2021-12-08 19:00           ` Nico Pache
2021-12-06 18:42         ` Yang Shi
2021-12-06 19:01           ` David Hildenbrand
2021-12-06 21:28             ` Yang Shi
2021-12-07 10:15               ` David Hildenbrand
2021-12-07 10:55             ` Michal Hocko
2021-12-07 21:45         ` Nico Pache
2021-12-07 21:40       ` Nico Pache
2021-12-07 21:34     ` Nico Pache
2021-12-06 18:45   ` Yang Shi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.