linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/1] Don't allocate pages on an offline node
@ 2021-12-07 22:40 Nico Pache
  2021-12-07 22:40 ` [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes Nico Pache
  0 siblings, 1 reply; 17+ messages in thread
From: Nico Pache @ 2021-12-07 22:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm
  Cc: shakeelb, ktkhai, shy828301, guro, vbabka, vdavydov.dev, raquini,
	mhocko, david

Prevent page allocations from occurring on an offlined node by adding a
check to confirm the node is online and, if it is not, using the closest
node. Some further work is needed for my previous solution to be
complete. This provides a fix for the problem while the more complete
solution is being worked on.

V2:
 * Drop the first patch, which would introduce a regression by adding a
   branch in the hotpath.
 * Remove the for_each_online_node iteration introduced, as that will
   require further work for memcg and hotplug to work correctly. Long
   term, a rework will be needed to either allocate all pgdats or update
   all memcgs when a new node is onlined.

Signed-off-by: Nico Pache <npache@redhat.com>

Nico Pache (1):
  mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes

 mm/vmscan.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

-- 
2.33.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-07 22:40 [PATCH v2 0/1] Don't allocate pages on an offline node Nico Pache
@ 2021-12-07 22:40 ` Nico Pache
  2021-12-07 23:34   ` Matthew Wilcox
  2021-12-07 23:44   ` Andrew Morton
  0 siblings, 2 replies; 17+ messages in thread
From: Nico Pache @ 2021-12-07 22:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm
  Cc: shakeelb, ktkhai, shy828301, guro, vbabka, vdavydov.dev, raquini,
	mhocko, david

We have run into a panic caused by a shrinker allocation being attempted
on an offlined node.

Our crash analysis has determined that the issue originates from trying
to allocate pages on an offlined node in expand_one_shrinker_info. This
function makes the incorrect assumption that we can allocate on any node.
To correct this, we make sure the node is online before attempting an
allocation. If it is not online, choose the closest node.

This assumption can lead to an incorrect address being assigned to ac->zonelist
in the following callchain:
	__alloc_pages
	-> prepare_alloc_pages
	 -> node_zonelist

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
        return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}
If the node is not online, the return value of node_zonelist() evaluates
to an invalid pointer: 0x00000 + offsetof(pg_data_t, node_zonelists) + (1|0)

This address is then dereferenced further down the callchain in:
	prepare_alloc_pages
	-> first_zones_zonelist
  	 -> next_zones_zonelist
	  -> zonelist_zone_idx

static inline int zonelist_zone_idx(struct zoneref *zoneref)
{
        return zoneref->zone_idx;
}

Leading the system to panic.

[  362.132167] BUG: unable to handle page fault for address: 0000000000002088
[  362.139854] #PF: supervisor read access in kernel mode
[  362.145595] #PF: error_code(0x0000) - not-present page
[  362.151334] PGD 0 P4D 0
[  362.154166] Oops: 0000 [#1] SMP PTI
[  362.158064] CPU: 62 PID: 16612 Comm: stress-ng Kdump:
[  362.167876] Hardware name:
[  362.179917] RIP: 0010:prepare_alloc_pages.constprop.0+0xc6/0x150
[  362.207604] RSP: 0018:ffffb4ba31427bc0 EFLAGS: 00010246
[  362.213443] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000081
[  362.221412] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[  362.229380] RBP: ffffb4ba31427bf8 R08: 0000000000000000 R09: ffffb4ba31427bf4
[  362.237347] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb4ba31427bf0
[  362.245316] R13: 0000000000000002 R14: ffff9f2fb3788000 R15: 0000000000000078
[  362.253285] FS:  00007fbc57bfd740(0000) GS:ffff9f4c7d780000(0000) knlGS:0000000000000000
[  362.262322] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  362.268739] CR2: 0000000000002088 CR3: 000002004cb58002 CR4: 00000000007706e0
[  362.276707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  362.284675] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  362.292644] PKRU: 55555554
[  362.295669] Call Trace:
[  362.298402]  __alloc_pages+0x9d/0x210
[  362.302501]  kmalloc_large_node+0x40/0x90
[  362.306988]  __kmalloc_node+0x3ac/0x480
[  362.311279]  kvmalloc_node+0x46/0x80
[  362.315276]  expand_one_shrinker_info+0x84/0x190
[  362.320437]  prealloc_shrinker+0x166/0x1c0
[  362.325015]  alloc_super+0x2ba/0x330
[  362.329011]  ? __fput_sync+0x30/0x30
[  362.333003]  ? set_anon_super+0x40/0x40
[  362.337288]  sget_fc+0x6c/0x2f0
[  362.340798]  ? mqueue_create+0x20/0x20
[  362.344992]  get_tree_keyed+0x2f/0xc0
[  362.349086]  vfs_get_tree+0x25/0xb0
[  362.352982]  fc_mount+0xe/0x30
[  362.356397]  mq_init_ns+0x105/0x1a0
[  362.360291]  copy_ipcs+0x129/0x220
[  362.364093]  create_new_namespaces+0xa1/0x2e0
[  362.368960]  unshare_nsproxy_namespaces+0x55/0xa0
[  362.374216]  ksys_unshare+0x198/0x380
[  362.378310]  __x64_sys_unshare+0xe/0x20
[  362.382595]  do_syscall_64+0x3b/0x90
[  362.386597]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  362.392247] RIP: 0033:0x7fbc57d14d7b
[  362.417204] RSP: 002b:00007fbc4dc73f88 EFLAGS: 00000206 ORIG_RAX: 0000000000000110
[  362.425664] RAX: ffffffffffffffda RBX: 0000560144b09578 RCX: 00007fbc57d14d7b
[  362.433634] RDX: 0000000000000010 RSI: 00007fbc4dc73f90 RDI: 0000000008000000
[  362.441602] RBP: 0000560144b095a0 R08: 0000000000000000 R09: 0000000000000000
[  362.449573] R10: 0000000000000000 R11: 0000000000000206 R12: 00007fbc4dc73f90
[  362.457542] R13: 000000000000006a R14: 0000560144e7e5a0 R15: 00007fff5dec8e10

Fixes: 86daf94efb11 ("mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node")
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/vmscan.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..5ed5b5d29d47 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
 	int size = map_size + defer_size;
 
 	for_each_node(nid) {
+		int tmp = nid;
 		pn = memcg->nodeinfo[nid];
 		old = shrinker_info_protected(memcg, nid);
 		/* Not yet online memcg */
 		if (!old)
 			return 0;
 
-		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
+		if(!node_online(nid))
+			tmp = numa_mem_id();
+		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
 		if (!new)
 			return -ENOMEM;
 
-- 
2.33.1



* Re: [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-07 22:40 ` [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes Nico Pache
@ 2021-12-07 23:34   ` Matthew Wilcox
  2021-12-08  0:25     ` Nico Pache
  2021-12-07 23:44   ` Andrew Morton
  1 sibling, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2021-12-07 23:34 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, akpm, shakeelb, ktkhai, shy828301, guro,
	vbabka, vdavydov.dev, raquini, mhocko, david

On Tue, Dec 07, 2021 at 05:40:13PM -0500, Nico Pache wrote:
> +++ b/mm/vmscan.c
> @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>  	int size = map_size + defer_size;
>  
>  	for_each_node(nid) {
> +		int tmp = nid;
>  		pn = memcg->nodeinfo[nid];
>  		old = shrinker_info_protected(memcg, nid);
>  		/* Not yet online memcg */
>  		if (!old)
>  			return 0;
>  
> -		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> +		if(!node_online(nid))
> +			tmp = numa_mem_id();
> +		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
>  		if (!new)

Why should this be fixed here and not in, say, kvmalloc_node()?


* Re: [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-07 22:40 ` [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes Nico Pache
  2021-12-07 23:34   ` Matthew Wilcox
@ 2021-12-07 23:44   ` Andrew Morton
  2021-12-08  0:26     ` Yang Shi
  2021-12-08  0:40     ` Nico Pache
  1 sibling, 2 replies; 17+ messages in thread
From: Andrew Morton @ 2021-12-07 23:44 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, shakeelb, ktkhai, shy828301, guro,
	vbabka, vdavydov.dev, raquini, mhocko, david

On Tue,  7 Dec 2021 17:40:13 -0500 Nico Pache <npache@redhat.com> wrote:

> We have run into a panic caused by a shrinker allocation being attempted
> on an offlined node.
>
> Our crash analysis has determined that the issue originates from trying
> to allocate pages on an offlined node in expand_one_shrinker_info. This
> function makes the incorrect assumption that we can allocate on any node.
> To correct this we make sure the node is online before tempting an
> allocation. If it is not online choose the closest node.

This isn't fully accurate, is it?  We could allocate on a node which is
presently offline but which was previously onlined, by testing
NODE_DATA(nid).

It isn't entirely clear to me from the v1 discussion why this approach
isn't being taken?

AFAICT the proposed patch is *already* taking this approach, by having
no protection against a concurrent or subsequent node offlining?

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>  	int size = map_size + defer_size;
>  
>  	for_each_node(nid) {
> +		int tmp = nid;

Not `tmp', please.  Better to use an identifier which explains the
variable's use.  target_nid?

And a newline after defining locals, please.

>  		pn = memcg->nodeinfo[nid];
>  		old = shrinker_info_protected(memcg, nid);
>  		/* Not yet online memcg */
>  		if (!old)
>  			return 0;
>  
> -		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> +		if(!node_online(nid))

s/if(/if (/

> +			tmp = numa_mem_id();
> +		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
>  		if (!new)
>  			return -ENOMEM;
>  

And a code comment fully explaining what's going on here?


* Re: [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-07 23:34   ` Matthew Wilcox
@ 2021-12-08  0:25     ` Nico Pache
  2021-12-08  1:53       ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Nico Pache @ 2021-12-08  0:25 UTC (permalink / raw)
  To: Matthew Wilcox, mhocko
  Cc: linux-kernel, linux-mm, akpm, shakeelb, ktkhai, shy828301, guro,
	vbabka, vdavydov.dev, raquini, david



On 12/7/21 18:34, Matthew Wilcox wrote:
> On Tue, Dec 07, 2021 at 05:40:13PM -0500, Nico Pache wrote:
>> +++ b/mm/vmscan.c
>> @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>  	int size = map_size + defer_size;
>>  
>>  	for_each_node(nid) {
>> +		int tmp = nid;
>>  		pn = memcg->nodeinfo[nid];
>>  		old = shrinker_info_protected(memcg, nid);
>>  		/* Not yet online memcg */
>>  		if (!old)
>>  			return 0;
>>  
>> -		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>> +		if(!node_online(nid))
>> +			tmp = numa_mem_id();
>> +		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
>>  		if (!new)
> 
> Why should this be fixed here and not in, say, kvmalloc_node()?

According to Michal, the caller should be responsible for making sure it is
allocating on a correct node. This avoids adding branches to hot paths and
wasting cycles. I'm not opposed to moving it to kvmalloc_node(), but that may
result in masking other issues from other callers.
> 



* Re: [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-07 23:44   ` Andrew Morton
@ 2021-12-08  0:26     ` Yang Shi
  2021-12-08  0:33       ` Nico Pache
  2022-01-10 17:09       ` Rafael Aquini
  2021-12-08  0:40     ` Nico Pache
  1 sibling, 2 replies; 17+ messages in thread
From: Yang Shi @ 2021-12-08  0:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nico Pache, Linux Kernel Mailing List, Linux MM, Shakeel Butt,
	Kirill Tkhai, Roman Gushchin, Vlastimil Babka, Vladimir Davydov,
	raquini, Michal Hocko, David Hildenbrand

On Tue, Dec 7, 2021 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue,  7 Dec 2021 17:40:13 -0500 Nico Pache <npache@redhat.com> wrote:
>
> > We have run into a panic caused by a shrinker allocation being attempted
> > on an offlined node.
> >
> > Our crash analysis has determined that the issue originates from trying
> > to allocate pages on an offlined node in expand_one_shrinker_info. This
> > function makes the incorrect assumption that we can allocate on any node.
> > To correct this we make sure the node is online before tempting an
> > allocation. If it is not online choose the closest node.
>
> This isn't fully accurate, is it?  We could allocate on a node which is
> presently offline but which was previously onlined, by testing
> NODE_DATA(nid).
>
> It isn't entirely clear to me from the v1 discussion why this approach
> isn't being taken?
>
> AFAICT the proposed patch is *already* taking this approach, by having
> no protection against a concurrent or subsequent node offlining?

AFAICT, we have not reached agreement on how to fix it yet. I saw 3
proposals at least:

1. From Michal, allocate node data for all possible nodes.
https://lore.kernel.org/all/Ya89aqij6nMwJrIZ@dhcp22.suse.cz/T/#u

2. What this patch does. Proposed originally from
https://lore.kernel.org/all/20211108202325.20304-1-amakhalov@vmware.com/T/#u

3. From David, fix in node_zonelist().
https://lore.kernel.org/all/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com/T/#u

>
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> >       int size = map_size + defer_size;
> >
> >       for_each_node(nid) {
> > +             int tmp = nid;
>
> Not `tmp', please.  Better to use an identifier which explains the
> variable's use.  target_nid?
>
> And a newline after defining locals, please.
>
> >               pn = memcg->nodeinfo[nid];
> >               old = shrinker_info_protected(memcg, nid);
> >               /* Not yet online memcg */
> >               if (!old)
> >                       return 0;
> >
> > -             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> > +             if(!node_online(nid))
>
> s/if(/if (/
>
> > +                     tmp = numa_mem_id();
> > +             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
> >               if (!new)
> >                       return -ENOMEM;
> >
>
> And a code comment fully explaining what's going on here?


* Re: [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-08  0:26     ` Yang Shi
@ 2021-12-08  0:33       ` Nico Pache
  2021-12-08  1:23         ` Yang Shi
  2022-01-10 17:09       ` Rafael Aquini
  1 sibling, 1 reply; 17+ messages in thread
From: Nico Pache @ 2021-12-08  0:33 UTC (permalink / raw)
  To: Yang Shi, Andrew Morton
  Cc: Linux Kernel Mailing List, Linux MM, Shakeel Butt, Kirill Tkhai,
	Roman Gushchin, Vlastimil Babka, Vladimir Davydov, raquini,
	Michal Hocko, David Hildenbrand



On 12/7/21 19:26, Yang Shi wrote:
> On Tue, Dec 7, 2021 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>> On Tue,  7 Dec 2021 17:40:13 -0500 Nico Pache <npache@redhat.com> wrote:
>>
>>> We have run into a panic caused by a shrinker allocation being attempted
>>> on an offlined node.
>>>
>>> Our crash analysis has determined that the issue originates from trying
>>> to allocate pages on an offlined node in expand_one_shrinker_info. This
>>> function makes the incorrect assumption that we can allocate on any node.
>>> To correct this we make sure the node is online before tempting an
>>> allocation. If it is not online choose the closest node.
>>
>> This isn't fully accurate, is it?  We could allocate on a node which is
>> presently offline but which was previously onlined, by testing
>> NODE_DATA(nid).
>>
>> It isn't entirely clear to me from the v1 discussion why this approach
>> isn't being taken?
>>
>> AFAICT the proposed patch is *already* taking this approach, by having
>> no protection against a concurrent or subsequent node offlining?
> 
> AFAICT, we have not reached agreement on how to fix it yet. I saw 3
> proposals at least:
> 
> 1. From Michal, allocate node data for all possible nodes.
> https://lore.kernel.org/all/Ya89aqij6nMwJrIZ@dhcp22.suse.cz/T/#u
> 
> 2. What this patch does. Proposed originally from
> https://lore.kernel.org/all/20211108202325.20304-1-amakhalov@vmware.com/T/#u

Correct me if I'm wrong, but isn't that a different caller? This patch fixes
the issue in expand_one_shrinker_info.

> 3. From David, fix in node_zonelist().
> https://lore.kernel.org/all/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com/T/#u
> 
>>
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>>       int size = map_size + defer_size;
>>>
>>>       for_each_node(nid) {
>>> +             int tmp = nid;
>>
>> Not `tmp', please.  Better to use an identifier which explains the
>> variable's use.  target_nid?
>>
>> And a newline after defining locals, please.
>>
>>>               pn = memcg->nodeinfo[nid];
>>>               old = shrinker_info_protected(memcg, nid);
>>>               /* Not yet online memcg */
>>>               if (!old)
>>>                       return 0;
>>>
>>> -             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>>> +             if(!node_online(nid))
>>
>> s/if(/if (/
>>
>>> +                     tmp = numa_mem_id();
>>> +             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
>>>               if (!new)
>>>                       return -ENOMEM;
>>>
>>
>> And a code comment fully explaining what's going on here?
> 



* Re: [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-07 23:44   ` Andrew Morton
  2021-12-08  0:26     ` Yang Shi
@ 2021-12-08  0:40     ` Nico Pache
  2021-12-08  7:54       ` Michal Hocko
  1 sibling, 1 reply; 17+ messages in thread
From: Nico Pache @ 2021-12-08  0:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, shakeelb, ktkhai, shy828301, guro,
	vbabka, vdavydov.dev, raquini, mhocko, david



On 12/7/21 18:44, Andrew Morton wrote:
> On Tue,  7 Dec 2021 17:40:13 -0500 Nico Pache <npache@redhat.com> wrote:
> 
>> We have run into a panic caused by a shrinker allocation being attempted
>> on an offlined node.
>>
>> Our crash analysis has determined that the issue originates from trying
>> to allocate pages on an offlined node in expand_one_shrinker_info. This
>> function makes the incorrect assumption that we can allocate on any node.
>> To correct this we make sure the node is online before tempting an
>> allocation. If it is not online choose the closest node.
> 
> This isn't fully accurate, is it?  We could allocate on a node which is
> presently offline but which was previously onlined, by testing
> NODE_DATA(nid).

Thanks for the review! I took your changes below into consideration for my V3.

My knowledge of offlined/onlined nodes is quite limited, but after looking
into it, it doesn't seem like anything clears the state of NODE_DATA(nid)
after a try_offline_node() is attempted. So theoretically the panic we saw
would not happen. What is the expected behavior of trying to allocate a page
on an offline node?

> 
> It isn't entirely clear to me from the v1 discussion why this approach
> isn't being taken?
> 
> AFAICT the proposed patch is *already* taking this approach, by having
> no protection against a concurrent or subsequent node offlining?
> 
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>  	int size = map_size + defer_size;
>>  
>>  	for_each_node(nid) {
>> +		int tmp = nid;
> 
> Not `tmp', please.  Better to use an identifier which explains the
> variable's use.  target_nid?
> 
> And a newline after defining locals, please.
> 
>>  		pn = memcg->nodeinfo[nid];
>>  		old = shrinker_info_protected(memcg, nid);
>>  		/* Not yet online memcg */
>>  		if (!old)
>>  			return 0;
>>  
>> -		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>> +		if(!node_online(nid))
> 
> s/if(/if (/
> 
>> +			tmp = numa_mem_id();
>> +		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
>>  		if (!new)
>>  			return -ENOMEM;
>>  
> 
> And a code comment fully explaining what's going on here?
> 



* Re: [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-08  0:33       ` Nico Pache
@ 2021-12-08  1:23         ` Yang Shi
  2021-12-08  1:26           ` Yang Shi
  0 siblings, 1 reply; 17+ messages in thread
From: Yang Shi @ 2021-12-08  1:23 UTC (permalink / raw)
  To: Nico Pache
  Cc: Andrew Morton, Linux Kernel Mailing List, Linux MM, Shakeel Butt,
	Kirill Tkhai, Roman Gushchin, Vlastimil Babka, Vladimir Davydov,
	raquini, Michal Hocko, David Hildenbrand

On Tue, Dec 7, 2021 at 4:33 PM Nico Pache <npache@redhat.com> wrote:
>
>
>
> On 12/7/21 19:26, Yang Shi wrote:
> > On Tue, Dec 7, 2021 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >>
> >> On Tue,  7 Dec 2021 17:40:13 -0500 Nico Pache <npache@redhat.com> wrote:
> >>
> >>> We have run into a panic caused by a shrinker allocation being attempted
> >>> on an offlined node.
> >>>
> >>> Our crash analysis has determined that the issue originates from trying
> >>> to allocate pages on an offlined node in expand_one_shrinker_info. This
> >>> function makes the incorrect assumption that we can allocate on any node.
> >>> To correct this we make sure the node is online before tempting an
> >>> allocation. If it is not online choose the closest node.
> >>
> >> This isn't fully accurate, is it?  We could allocate on a node which is
> >> presently offline but which was previously onlined, by testing
> >> NODE_DATA(nid).
> >>
> >> It isn't entirely clear to me from the v1 discussion why this approach
> >> isn't being taken?
> >>
> >> AFAICT the proposed patch is *already* taking this approach, by having
> >> no protection against a concurrent or subsequent node offlining?
> >
> > AFAICT, we have not reached agreement on how to fix it yet. I saw 3
> > proposals at least:
> >
> > 1. From Michal, allocate node data for all possible nodes.
> > https://lore.kernel.org/all/Ya89aqij6nMwJrIZ@dhcp22.suse.cz/T/#u
> >
> > 2. What this patch does. Proposed originally from
> > https://lore.kernel.org/all/20211108202325.20304-1-amakhalov@vmware.com/T/#u
>
> Correct me if im wrong, but isn't that a different caller? This patch fixes the
> issue in expand_one_shrinker_info.

Yes, different caller, but same approach. The con with this approach
is that we have to fix all the places. It seems Michal and David are not
fans of this approach, IIRC.

>
> > 3. From David, fix in node_zonelist().
> > https://lore.kernel.org/all/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com/T/#u
> >
> >>
> >>> --- a/mm/vmscan.c
> >>> +++ b/mm/vmscan.c
> >>> @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> >>>       int size = map_size + defer_size;
> >>>
> >>>       for_each_node(nid) {
> >>> +             int tmp = nid;
> >>
> >> Not `tmp', please.  Better to use an identifier which explains the
> >> variable's use.  target_nid?
> >>
> >> And a newline after defining locals, please.
> >>
> >>>               pn = memcg->nodeinfo[nid];
> >>>               old = shrinker_info_protected(memcg, nid);
> >>>               /* Not yet online memcg */
> >>>               if (!old)
> >>>                       return 0;
> >>>
> >>> -             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> >>> +             if(!node_online(nid))
> >>
> >> s/if(/if (/
> >>
> >>> +                     tmp = numa_mem_id();
> >>> +             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
> >>>               if (!new)
> >>>                       return -ENOMEM;
> >>>
> >>
> >> And a code comment fully explaining what's going on here?
> >
>


* Re: [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-08  1:23         ` Yang Shi
@ 2021-12-08  1:26           ` Yang Shi
  2021-12-08  7:59             ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Yang Shi @ 2021-12-08  1:26 UTC (permalink / raw)
  To: Nico Pache
  Cc: Andrew Morton, Linux Kernel Mailing List, Linux MM, Shakeel Butt,
	Kirill Tkhai, Roman Gushchin, Vlastimil Babka, Vladimir Davydov,
	raquini, Michal Hocko, David Hildenbrand

On Tue, Dec 7, 2021 at 5:23 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, Dec 7, 2021 at 4:33 PM Nico Pache <npache@redhat.com> wrote:
> >
> >
> >
> > On 12/7/21 19:26, Yang Shi wrote:
> > > On Tue, Dec 7, 2021 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >>
> > >> On Tue,  7 Dec 2021 17:40:13 -0500 Nico Pache <npache@redhat.com> wrote:
> > >>
> > >>> We have run into a panic caused by a shrinker allocation being attempted
> > >>> on an offlined node.
> > >>>
> > >>> Our crash analysis has determined that the issue originates from trying
> > >>> to allocate pages on an offlined node in expand_one_shrinker_info. This
> > >>> function makes the incorrect assumption that we can allocate on any node.
> > >>> To correct this we make sure the node is online before tempting an
> > >>> allocation. If it is not online choose the closest node.
> > >>
> > >> This isn't fully accurate, is it?  We could allocate on a node which is
> > >> presently offline but which was previously onlined, by testing
> > >> NODE_DATA(nid).
> > >>
> > >> It isn't entirely clear to me from the v1 discussion why this approach
> > >> isn't being taken?
> > >>
> > >> AFAICT the proposed patch is *already* taking this approach, by having
> > >> no protection against a concurrent or subsequent node offlining?
> > >
> > > AFAICT, we have not reached agreement on how to fix it yet. I saw 3
> > > proposals at least:
> > >
> > > 1. From Michal, allocate node data for all possible nodes.
> > > https://lore.kernel.org/all/Ya89aqij6nMwJrIZ@dhcp22.suse.cz/T/#u
> > >
> > > 2. What this patch does. Proposed originally from
> > > https://lore.kernel.org/all/20211108202325.20304-1-amakhalov@vmware.com/T/#u
> >
> > Correct me if im wrong, but isn't that a different caller? This patch fixes the
> > issue in expand_one_shrinker_info.
>
> Yes, different caller, but same approach. The cons with this approach

And the same underlying problem.

> is we have to fix all the places. It seems Michal and David are not
> fans for this approach IIRC.
>
> >
> > > 3. From David, fix in node_zonelist().
> > > https://lore.kernel.org/all/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com/T/#u
> > >
> > >>
> > >>> --- a/mm/vmscan.c
> > >>> +++ b/mm/vmscan.c
> > >>> @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> > >>>       int size = map_size + defer_size;
> > >>>
> > >>>       for_each_node(nid) {
> > >>> +             int tmp = nid;
> > >>
> > >> Not `tmp', please.  Better to use an identifier which explains the
> > >> variable's use.  target_nid?
> > >>
> > >> And a newline after defining locals, please.
> > >>
> > >>>               pn = memcg->nodeinfo[nid];
> > >>>               old = shrinker_info_protected(memcg, nid);
> > >>>               /* Not yet online memcg */
> > >>>               if (!old)
> > >>>                       return 0;
> > >>>
> > >>> -             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> > >>> +             if(!node_online(nid))
> > >>
> > >> s/if(/if (/
> > >>
> > >>> +                     tmp = numa_mem_id();
> > >>> +             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
> > >>>               if (!new)
> > >>>                       return -ENOMEM;
> > >>>
> > >>
> > >> And a code comment fully explaining what's going on here?
> > >
> >


* Re: [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-08  0:25     ` Nico Pache
@ 2021-12-08  1:53       ` Andrew Morton
  0 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2021-12-08  1:53 UTC (permalink / raw)
  To: Nico Pache
  Cc: Matthew Wilcox, mhocko, linux-kernel, linux-mm, shakeelb, ktkhai,
	shy828301, guro, vbabka, vdavydov.dev, raquini, david

On Tue, 7 Dec 2021 19:25:25 -0500 Nico Pache <npache@redhat.com> wrote:

> 
> 
> On 12/7/21 18:34, Matthew Wilcox wrote:
> > On Tue, Dec 07, 2021 at 05:40:13PM -0500, Nico Pache wrote:
> >> +++ b/mm/vmscan.c
> >> @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> >>  	int size = map_size + defer_size;
> >>  
> >>  	for_each_node(nid) {
> >> +		int tmp = nid;
> >>  		pn = memcg->nodeinfo[nid];
> >>  		old = shrinker_info_protected(memcg, nid);
> >>  		/* Not yet online memcg */
> >>  		if (!old)
> >>  			return 0;
> >>  
> >> -		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> >> +		if(!node_online(nid))
> >> +			tmp = numa_mem_id();
> >> +		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
> >>  		if (!new)
> > 
> > Why should this be fixed here and not in, say, kvmalloc_node()?
> 
> according to Michal, the caller should be responsible for making sure it is
> allocating on a correct node. This avoids adding branches to hot-paths and
> wasting cycles. Im not opposed to moving it to kvmalloc_node, but it may result
> in masking other issues from other callers.
> > 

Yes, kvmalloc_node(nid) should allocate on `nid', or should fail.

A new kvmalloc_try_node() or whatever would express this idea.


* Re: [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-08  0:40     ` Nico Pache
@ 2021-12-08  7:54       ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2021-12-08  7:54 UTC (permalink / raw)
  To: Nico Pache
  Cc: Andrew Morton, linux-kernel, linux-mm, shakeelb, ktkhai,
	shy828301, guro, vbabka, vdavydov.dev, raquini, david

On Tue 07-12-21 19:40:33, Nico Pache wrote:
> 
> 
> On 12/7/21 18:44, Andrew Morton wrote:
> > On Tue,  7 Dec 2021 17:40:13 -0500 Nico Pache <npache@redhat.com> wrote:
> > 
> >> We have run into a panic caused by a shrinker allocation being attempted
> >> on an offlined node.
> >>
> >> Our crash analysis has determined that the issue originates from trying
> >> to allocate pages on an offlined node in expand_one_shrinker_info. This
> >> function makes the incorrect assumption that we can allocate on any node.
> >> To correct this we make sure the node is online before tempting an
> >> allocation. If it is not online choose the closest node.
> > 
> > This isn't fully accurate, is it?  We could allocate on a node which is
> > presently offline but which was previously onlined, by testing
> > NODE_DATA(nid).
> 
> Thanks for the review! I took your changes below into consideration for my V3.
> 
> My knowledge of offlined/onlined nodes is quite limited, but after looking into
> it, it doesn't seem like anything clears the state of NODE_DATA(nid) after a
> try_offline_node is attempted. So theoretically the panic we saw would not
> happen. What is the expected behavior of trying to allocate a page on an
> offline node?

To fall back (in the zonelist order) into the other node. If
__GFP_THISNODE is specified then simply fail the allocation.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-08  1:26           ` Yang Shi
@ 2021-12-08  7:59             ` Michal Hocko
  2021-12-13 19:10               ` Yang Shi
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2021-12-08  7:59 UTC (permalink / raw)
  To: Yang Shi
  Cc: Nico Pache, Andrew Morton, Linux Kernel Mailing List, Linux MM,
	Shakeel Butt, Kirill Tkhai, Roman Gushchin, Vlastimil Babka,
	Vladimir Davydov, raquini, David Hildenbrand

On Tue 07-12-21 17:26:32, Yang Shi wrote:
> On Tue, Dec 7, 2021 at 5:23 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Tue, Dec 7, 2021 at 4:33 PM Nico Pache <npache@redhat.com> wrote:
> > >
> > >
> > >
> > > On 12/7/21 19:26, Yang Shi wrote:
[...]
> > > > AFAICT, we have not reached agreement on how to fix it yet. I saw 3
> > > > proposals at least:
> > > >
> > > > 1. From Michal, allocate node data for all possible nodes.
> > > > https://lore.kernel.org/all/Ya89aqij6nMwJrIZ@dhcp22.suse.cz/T/#u
> > > >
> > > > 2. What this patch does. Proposed originally from
> > > > https://lore.kernel.org/all/20211108202325.20304-1-amakhalov@vmware.com/T/#u
> > >
> > > Correct me if I'm wrong, but isn't that a different caller? This patch fixes the
> > > issue in expand_one_shrinker_info.
> >
> > Yes, different caller, but same approach. The cons with this approach
> 
> And the same underlying problem.
> 
> > is we have to fix all the places. It seems Michal and David are not
> > fans of this approach, IIRC.

Yes, agreed. We definitely do not want to spread this node_offline
oddity all over the place. There are two different ways to approach this.
Either we handle node_offline nodes at the page allocator level when
setting the proper zonelist (ideally protect that by a static key for
setups which have these nodes) or we allocate pgdat for all possible
nodes. I would prefer the second because that is more robust; it is less
likely to blow up when somebody does
	for_each_node(nid)
		something(NODE_DATA(nid))

The discussion is ongoing at the original thread where Alexey Makhalov
reported a similar problem (the subthread is
http://lkml.kernel.org/r/Ya89aqij6nMwJrIZ@dhcp22.suse.cz)
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-08  7:59             ` Michal Hocko
@ 2021-12-13 19:10               ` Yang Shi
  0 siblings, 0 replies; 17+ messages in thread
From: Yang Shi @ 2021-12-13 19:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Nico Pache, Andrew Morton, Linux Kernel Mailing List, Linux MM,
	Shakeel Butt, Kirill Tkhai, Roman Gushchin, Vlastimil Babka,
	Vladimir Davydov, raquini, David Hildenbrand

On Tue, Dec 7, 2021 at 11:59 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 07-12-21 17:26:32, Yang Shi wrote:
> > On Tue, Dec 7, 2021 at 5:23 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Tue, Dec 7, 2021 at 4:33 PM Nico Pache <npache@redhat.com> wrote:
> > > >
> > > >
> > > >
> > > > On 12/7/21 19:26, Yang Shi wrote:
> [...]
> > > > > AFAICT, we have not reached agreement on how to fix it yet. I saw 3
> > > > > proposals at least:
> > > > >
> > > > > 1. From Michal, allocate node data for all possible nodes.
> > > > > https://lore.kernel.org/all/Ya89aqij6nMwJrIZ@dhcp22.suse.cz/T/#u
> > > > >
> > > > > 2. What this patch does. Proposed originally from
> > > > > https://lore.kernel.org/all/20211108202325.20304-1-amakhalov@vmware.com/T/#u
> > > >
> > > > Correct me if I'm wrong, but isn't that a different caller? This patch fixes the
> > > > issue in expand_one_shrinker_info.
> > >
> > > Yes, different caller, but same approach. The cons with this approach
> >
> > And the same underlying problem.
> >
> > > is we have to fix all the places. It seems Michal and David are not
> > > fans of this approach, IIRC.
>
> Yes, agreed. We definitely do not want to spread this node_offline
> oddity all over the place. There are two different ways to approach this.
> Either we handle node_offline nodes at the page allocator level when
> setting the proper zonelist (ideally protect that by a static key for
> setups which have these nodes) or we allocate pgdat for all possible
> nodes. I would prefer the second because that is more robust; it is less
> likely to blow up when somebody does
>         for_each_node(nid)
>                 something(NODE_DATA(nid))
>
> The discussion is ongoing at the original thread where Alexey Makhalov
> reported a similar problem (the subthread is
> http://lkml.kernel.org/r/Ya89aqij6nMwJrIZ@dhcp22.suse.cz)

Thanks, Michal. Yeah, it seems more straightforward and more robust to
me. I'm not familiar with memory hotplug, hopefully it doesn't break
hotplug.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2021-12-08  0:26     ` Yang Shi
  2021-12-08  0:33       ` Nico Pache
@ 2022-01-10 17:09       ` Rafael Aquini
  2022-01-10 17:16         ` Michal Hocko
  1 sibling, 1 reply; 17+ messages in thread
From: Rafael Aquini @ 2022-01-10 17:09 UTC (permalink / raw)
  To: Yang Shi
  Cc: Andrew Morton, Nico Pache, Linux Kernel Mailing List, Linux MM,
	Shakeel Butt, Kirill Tkhai, Roman Gushchin, Vlastimil Babka,
	Vladimir Davydov, raquini, Michal Hocko, David Hildenbrand

On Tue, Dec 07, 2021 at 04:26:28PM -0800, Yang Shi wrote:
> On Tue, Dec 7, 2021 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Tue,  7 Dec 2021 17:40:13 -0500 Nico Pache <npache@redhat.com> wrote:
> >
> > > We have run into a panic caused by a shrinker allocation being attempted
> > > on an offlined node.
> > >
> > > Our crash analysis has determined that the issue originates from trying
> > > to allocate pages on an offlined node in expand_one_shrinker_info. This
> > > function makes the incorrect assumption that we can allocate on any node.
> > > To correct this we make sure the node is online before attempting an
> > > allocation. If it is not online, choose the closest node.
> >
> > This isn't fully accurate, is it?  We could allocate on a node which is
> > presently offline but which was previously onlined, by testing
> > NODE_DATA(nid).
> >
> > It isn't entirely clear to me from the v1 discussion why this approach
> > isn't being taken?
> >
> > AFAICT the proposed patch is *already* taking this approach, by having
> > no protection against a concurrent or subsequent node offlining?
> 
> AFAICT, we have not reached agreement on how to fix it yet. I saw 3
> proposals at least:
> 
> 1. From Michal, allocate node data for all possible nodes.
> https://lore.kernel.org/all/Ya89aqij6nMwJrIZ@dhcp22.suse.cz/T/#u
> 
> 2. What this patch does. Proposed originally from
> https://lore.kernel.org/all/20211108202325.20304-1-amakhalov@vmware.com/T/#u
>

This patch doesn't cut it, because it only fixes up one caller. The issue,
however, boils down to any iteration like

...
   for_each_possible_cpu(cpu) {
      struct page *page = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
...

where topology allows for possible CPUs in a yet-to-be-onlined node to
cause node_zonelist() to produce a bogus zonelist pointer when
populating ac->zonelist within prepare_alloc_pages().

I just bisected a PPC host crash at boot, while initializing the
cgroup subsystem, down to the following commit:

    commit bd0e7491a931f5a2960555b10b9551464ff8cc8e
    Author: Vlastimil Babka <vbabka@suse.cz>
    Date:   Sat May 22 01:59:38 2021 +0200

        mm, slub: convert kmem_cpu_slab protection to local_lock

The commit, obviously, is not responsible for the crash; the only "sin"
committed by commit bd0e7491a93, as far as the crash is concerned, is to
increase the size of struct kmem_cache_cpu by one local_lock_t, which
in the worst-case scenario -- !CONFIG_PREEMPT_RT && CONFIG_DEBUG_LOCK_ALLOC --
ends up creating an overhead of 56 bytes for each kmem_cache_cpu created,
burning faster through the bootstrap pre-allocated embedded per-cpu pool.

---8<---
[    0.001888] BUG: Kernel NULL pointer dereference on read at 0x00001508
[    0.001900] Faulting instruction address: 0xc0000000005277fc
[    0.001905] Oops: Kernel access of bad area, sig: 11 [#1]
[    0.001909] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[    0.001915] Modules linked in:
[    0.001919] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.14.0+ #24
[    0.001925] NIP:  c0000000005277fc LR: c0000000005276d8 CTR: 0000000000000000
[    0.001930] REGS: c000000002c936d0 TRAP: 0380   Not tainted  (5.14.0+)
[    0.001935] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 24024222  XER: 00000002
[    0.001948] CFAR: c0000000005276e0 IRQMASK: 0
[    0.001948] GPR00: c0000000005276d8 c000000002c93970 c000000002c95500 0000000000001500
[    0.001948] GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    0.001948] GPR08: 0000000000000081 0000000000000001 0000000000001500 0000000044024222
[    0.001948] GPR12: 00000003fa6a0000 c0000000049d0000 0000000000000000 c0000000053f0780
[    0.001948] GPR16: 0000000000000000 0000000000000000 0000000000000008 c000000002cec160
[    0.001948] GPR20: 0000000000000000 0000000000000cc2 c000000002cf0a30 c000000002cf02c0
[    0.001948] GPR24: 0000000000000001 0000000000000001 0000000000000000 0000000000000000
[    0.001948] GPR28: 0000000000000000 c000000002c939e8 c000000002c939f0 0000000000000000
[    0.002007] NIP [c0000000005277fc] prepare_alloc_pages.constprop.0+0x20c/0x280
[    0.002014] LR [c0000000005276d8] prepare_alloc_pages.constprop.0+0xe8/0x280
[    0.002020] Call Trace:
[    0.002022] [c000000002c93970] [c0000000005276c4] prepare_alloc_pages.constprop.0+0xd4/0x280 (unreliable)
[    0.002030] [c000000002c939c0] [c000000000531270] __alloc_pages+0xb0/0x3b0
[    0.002036] [c000000002c93a50] [c0000000004cd4d4] pcpu_alloc_pages.constprop.0+0x144/0x2a0
[    0.002044] [c000000002c93ae0] [c0000000004cd714] pcpu_populate_chunk+0x64/0x2b0
[    0.002050] [c000000002c93b90] [c0000000004d1a14] pcpu_alloc+0x944/0xd80
[    0.002056] [c000000002c93cb0] [c0000000005abcd4] mem_cgroup_alloc+0x144/0x470
[    0.002063] [c000000002c93d30] [c0000000005bcedc] mem_cgroup_css_alloc+0xac/0x390
[    0.002070] [c000000002c93d80] [c00000000203181c] cgroup_init_subsys+0xf4/0x260
[    0.002078] [c000000002c93e30] [c000000002031d40] cgroup_init+0x1f4/0x528
[    0.002085] [c000000002c93f00] [c0000000020051ec] start_kernel+0x64c/0x6b0
[    0.002091] [c000000002c93f90] [c00000000000d39c] start_here_common+0x1c/0x600
[    0.002098] Instruction dump:
[    0.002101] 61280080 2c0a0001 7d28489e 913d0000 57ffa7fe 9bfe0020 809e001c e8be0008
[    0.002112] e87e0000 2c250000 7c6a1b78 40820058 <81230008> 7c044840 4180004c f95e0010
[    0.002124] ---[ end trace 7392155448beabaa ]---
[    0.002127]
[    1.002133] Kernel panic - not syncing: Attempted to kill the idle task!
[    1.002195] ------------[ cut here ]------------
--->8---


> 3. From David, fix in node_zonelist().
> https://lore.kernel.org/all/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com/T/#u

It seems to me that (3) is the simplest and most effective way to cope with this case.

Cheers,
-- Rafael

> >
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> > >       int size = map_size + defer_size;
> > >
> > >       for_each_node(nid) {
> > > +             int tmp = nid;
> >
> > Not `tmp', please.  Better to use an identifier which explains the
> > variable's use.  target_nid?
> >
> > And a newline after defining locals, please.
> >
> > >               pn = memcg->nodeinfo[nid];
> > >               old = shrinker_info_protected(memcg, nid);
> > >               /* Not yet online memcg */
> > >               if (!old)
> > >                       return 0;
> > >
> > > -             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
> > > +             if(!node_online(nid))
> >
> > s/if(/if (/
> >
> > > +                     tmp = numa_mem_id();
> > > +             new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
> > >               if (!new)
> > >                       return -ENOMEM;
> > >
> >
> > And a code comment fully explaining what's going on here?
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2022-01-10 17:09       ` Rafael Aquini
@ 2022-01-10 17:16         ` Michal Hocko
  2022-01-10 17:21           ` Rafael Aquini
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2022-01-10 17:16 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Yang Shi, Andrew Morton, Nico Pache, Linux Kernel Mailing List,
	Linux MM, Shakeel Butt, Kirill Tkhai, Roman Gushchin,
	Vlastimil Babka, Vladimir Davydov, raquini, David Hildenbrand

On Mon 10-01-22 12:09:50, Rafael Aquini wrote:
[...]
> > 3. From David, fix in node_zonelist().
> > https://lore.kernel.org/all/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com/T/#u
> 
> It seems to me that (3) is the simplest and effective way to cope with this case

Did you have chance to look at http://lkml.kernel.org/r/20211214100732.26335-1-mhocko@kernel.org
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
  2022-01-10 17:16         ` Michal Hocko
@ 2022-01-10 17:21           ` Rafael Aquini
  0 siblings, 0 replies; 17+ messages in thread
From: Rafael Aquini @ 2022-01-10 17:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yang Shi, Andrew Morton, Nico Pache, Linux Kernel Mailing List,
	Linux MM, Shakeel Butt, Kirill Tkhai, Roman Gushchin,
	Vlastimil Babka, Vladimir Davydov, raquini, David Hildenbrand

On Mon, Jan 10, 2022 at 06:16:23PM +0100, Michal Hocko wrote:
> On Mon 10-01-22 12:09:50, Rafael Aquini wrote:
> [...]
> > > 3. From David, fix in node_zonelist().
> > > https://lore.kernel.org/all/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com/T/#u
> > 
> > It seems to me that (3) is the simplest and effective way to cope with this case
> 
> Did you have chance to look at http://lkml.kernel.org/r/20211214100732.26335-1-mhocko@kernel.org

I'll take a closer look at it. 

Thanks Michal.

-- Rafael


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2022-01-10 17:21 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-07 22:40 [PATCH v2 0/1] Dont allocate pages on a offline node Nico Pache
2021-12-07 22:40 ` [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes Nico Pache
2021-12-07 23:34   ` Matthew Wilcox
2021-12-08  0:25     ` Nico Pache
2021-12-08  1:53       ` Andrew Morton
2021-12-07 23:44   ` Andrew Morton
2021-12-08  0:26     ` Yang Shi
2021-12-08  0:33       ` Nico Pache
2021-12-08  1:23         ` Yang Shi
2021-12-08  1:26           ` Yang Shi
2021-12-08  7:59             ` Michal Hocko
2021-12-13 19:10               ` Yang Shi
2022-01-10 17:09       ` Rafael Aquini
2022-01-10 17:16         ` Michal Hocko
2022-01-10 17:21           ` Rafael Aquini
2021-12-08  0:40     ` Nico Pache
2021-12-08  7:54       ` Michal Hocko
