[mm-unstable v3 PATCH 1/3] Revert "mm: don't warn if the node is offlined"

All of lore.kernel.org
 help / color / mirror / Atom feed

* [mm-unstable v3 PATCH 1/3] Revert "mm: don't warn if the node is offlined"
@ 2022-11-08 18:43 Yang Shi
  2022-11-08 18:43 ` [mm-unstable v3 PATCH 2/3] mm: khugepaged: allow page allocation fallback to eligible nodes Yang Shi
  2022-11-08 18:43 ` [mm-unstable v3 PATCH 3/3] mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE Yang Shi
  0 siblings, 2 replies; 8+ messages in thread
From: Yang Shi @ 2022-11-08 18:43 UTC (permalink / raw)
  To: zokeefe, mhocko, akpm; +Cc: shy828301, linux-mm, linux-kernel

This reverts commit 6baf85a005ad36dd8586e5c3de598f8c3059dc91.
---
 include/linux/gfp.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 594d6dee5646..ef4aea3b356e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -218,6 +218,7 @@ static inline struct page *
 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 {
 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
 
 	return __alloc_pages(gfp_mask, order, nid, NULL);
 }
@@ -226,6 +227,7 @@ static inline
 struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
 {
 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+	VM_WARN_ON((gfp & __GFP_THISNODE) && !node_online(nid));
 
 	return __folio_alloc(gfp, order, nid, NULL);
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [mm-unstable v3 PATCH 2/3] mm: khugepaged: allow page allocation fallback to eligible nodes
  2022-11-08 18:43 [mm-unstable v3 PATCH 1/3] Revert "mm: don't warn if the node is offlined" Yang Shi
@ 2022-11-08 18:43 ` Yang Shi
  2022-11-08 19:20   ` Andrew Morton
  2022-11-08 18:43 ` [mm-unstable v3 PATCH 3/3] mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE Yang Shi
  1 sibling, 1 reply; 8+ messages in thread
From: Yang Shi @ 2022-11-08 18:43 UTC (permalink / raw)
  To: zokeefe, mhocko, akpm
  Cc: shy828301, linux-mm, linux-kernel, syzbot+0044b22d177870ee974f,
	Zach O'Keefe

Syzbot reported the below splat:

WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
Modules linked in:
CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
 hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
 madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
 madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
 madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
 do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
 do_madvise mm/madvise.c:1432 [inline]
 __do_sys_madvise mm/madvise.c:1432 [inline]
 __se_sys_madvise mm/madvise.c:1430 [inline]
 __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f6b48a4eef9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
 </TASK>

The khugepaged code would pick up the node with the most hit as the preferred
node, and also tries to do some balance if several nodes have the same
hit record.  Basically it does conceptually:
    * If the target_node <= last_target_node, then iterate from
last_target_node + 1 to MAX_NUMNODES (1024 on default config)
    * If the max_value == node_load[nid], then target_node = nid

But there is a corner case, paritucularly for MADV_COLLAPSE, that the
non-existing node may be returned as preferred node.

Assuming the system has 2 nodes, the target_node is 0 and the
last_target_node is 1, if MADV_COLLAPSE path is hit, the max_value may
be 0, then it may return 2 for target_node, but it is actually not
existing (offline), so the warn is triggered.

The node balance was introduced by commit 9f1b868a13ac ("mm: thp:
khugepaged: add policy for finding target node") to satisfy
"numactl --interleave=all".  But interleaving is a mere hint rather than
something that has hard requirements.

So use nodemask to record the nodes which have the same hit record, the
hugepage allocation could fallback to those nodes.  And remove
__GFP_THISNODE since it does disallow fallback.  And if the nodemask
just has one node set, it means there is one single node has the most
hit record, the nodemask approach actually behaves like __GFP_THISNODE.

Reported-by: syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Zach O'Keefe <zokeefe@googel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
v3: * Corrected the commit log a little bit.

 mm/khugepaged.c | 32 ++++++++++++++------------------
 1 file changed, 14 insertions(+), 18 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ea0d186bc9d4..572ce7dbf4b0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -97,8 +97,8 @@ struct collapse_control {
 	/* Num pages scanned per node */
 	u32 node_load[MAX_NUMNODES];
 
-	/* Last target selected in hpage_collapse_find_target_node() */
-	int last_target_node;
+	/* nodemask for allocation fallback */
+	nodemask_t alloc_nmask;
 };
 
 /**
@@ -734,7 +734,6 @@ static void khugepaged_alloc_sleep(void)
 
 struct collapse_control khugepaged_collapse_control = {
 	.is_khugepaged = true,
-	.last_target_node = NUMA_NO_NODE,
 };
 
 static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
@@ -783,16 +782,11 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
 			target_node = nid;
 		}
 
-	/* do some balance if several nodes have the same hit record */
-	if (target_node <= cc->last_target_node)
-		for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
-		     nid++)
-			if (max_value == cc->node_load[nid]) {
-				target_node = nid;
-				break;
-			}
+	for_each_online_node(nid) {
+		if (max_value == cc->node_load[nid])
+			node_set(nid, cc->alloc_nmask);
+	}
 
-	cc->last_target_node = target_node;
 	return target_node;
 }
 #else
@@ -802,9 +796,10 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
 }
 #endif
 
-static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp, int node)
+static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp, int node,
+				      nodemask_t *nmask)
 {
-	*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
+	*hpage = __alloc_pages(gfp, HPAGE_PMD_ORDER, node, nmask);
 	if (unlikely(!*hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
 		return false;
@@ -955,12 +950,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
 			      struct collapse_control *cc)
 {
-	/* Only allocate from the target node */
 	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
-		     GFP_TRANSHUGE) | __GFP_THISNODE;
+		     GFP_TRANSHUGE);
 	int node = hpage_collapse_find_target_node(cc);
 
-	if (!hpage_collapse_alloc_page(hpage, gfp, node))
+	if (!hpage_collapse_alloc_page(hpage, gfp, node, &cc->alloc_nmask))
 		return SCAN_ALLOC_HUGE_PAGE_FAIL;
 	if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
 		return SCAN_CGROUP_CHARGE_FAIL;
@@ -1144,6 +1138,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		goto out;
 
 	memset(cc->node_load, 0, sizeof(cc->node_load));
+	nodes_clear(cc->alloc_nmask);
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
@@ -2078,6 +2073,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 	present = 0;
 	swap = 0;
 	memset(cc->node_load, 0, sizeof(cc->node_load));
+	nodes_clear(cc->alloc_nmask);
 	rcu_read_lock();
 	xas_for_each(&xas, page, start + HPAGE_PMD_NR - 1) {
 		if (xas_retry(&xas, page))
@@ -2581,7 +2577,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	if (!cc)
 		return -ENOMEM;
 	cc->is_khugepaged = false;
-	cc->last_target_node = NUMA_NO_NODE;
 
 	mmgrab(mm);
 	lru_add_drain_all();
@@ -2607,6 +2602,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		}
 		mmap_assert_locked(mm);
 		memset(cc->node_load, 0, sizeof(cc->node_load));
+		nodes_clear(cc->alloc_nmask);
 		if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
 			struct file *file = get_file(vma->vm_file);
 			pgoff_t pgoff = linear_page_index(vma, addr);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [mm-unstable v3 PATCH 3/3] mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE
  2022-11-08 18:43 [mm-unstable v3 PATCH 1/3] Revert "mm: don't warn if the node is offlined" Yang Shi
  2022-11-08 18:43 ` [mm-unstable v3 PATCH 2/3] mm: khugepaged: allow page allocation fallback to eligible nodes Yang Shi
@ 2022-11-08 18:43 ` Yang Shi
  2022-11-08 22:15   ` Michal Hocko
  2022-11-23 19:30   ` [mm-unstable v4 PATCH] " Yang Shi
  1 sibling, 2 replies; 8+ messages in thread
From: Yang Shi @ 2022-11-08 18:43 UTC (permalink / raw)
  To: zokeefe, mhocko, akpm
  Cc: shy828301, linux-mm, linux-kernel, syzbot+0044b22d177870ee974f

Syzbot reported the below splat:

WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node
include/linux/gfp.h:221 [inline]
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221
hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221
alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
Modules linked in:
CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted
6.1.0-rc1-syzkaller-00454-ga70385240892 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 10/11/2022
RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc
ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9
96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
 hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
 madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
 madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
 madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
 do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
 do_madvise mm/madvise.c:1432 [inline]
 __do_sys_madvise mm/madvise.c:1432 [inline]
 __se_sys_madvise mm/madvise.c:1430 [inline]
 __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f6b48a4eef9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89
f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
 </TASK>

It is because khugepaged allocates pages with __GFP_THISNODE, but the
preferred node is bogus.  The previous patch fixed the khugepaged
code to avoid allocating page from non-existing node.  But it is still
racy against memory hotremove.  There is no synchronization with the
memory hotplug so it is possible that memory gets offline during a
longer taking scanning.

So this warning still seems not quite helpful because:
  * There is no guarantee the node is online for __GFP_THISNODE context
    for all the callsites.
  * Kernel just fails the allocation regardless the warning, and it looks
    all callsites handle the allocation failure gracefully.

Although while the warning has helped to identify a buggy code, it is not
safe in general and this warning could panic the system with panic-on-warn
configuration which tends to be used surprisingly often.  So replace
VM_WARN_ON to pr_warn().  And the warning will be triggered if
__GFP_NOWARN is set since the allocator would print out warning for such
case if __GFP_NOWARN is not set.

Reported-by: syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Cc: Zach O'Keefe <zokeefe@google.com>
---
v3: * Reverted the old commit from mm-unstable which remove the VM_WARN
      (patch 1/3).
    * Incorporated the suggestion from Michal to use pr_warn.

v2: * Added patch 1/2.
    * Reworded the commit log per Michal.

 include/linux/gfp.h | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index ef4aea3b356e..60a1c70ec85c 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,6 +210,16 @@ alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct p
 	return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
 }
 
+static inline void warn_if_node_offline(int nid, gfp_t gfp_mask)
+{
+	gfp_t gfp = gfp_mask & (__GFP_THISNODE|__GFP_NOWARN);
+
+	if ((gfp == (__GFP_THISNODE|__GFP_NOWARN)) && !node_online(nid)) {
+		pr_warn("%pGg allocation from offline node %d\n", &gfp, nid);
+		dump_stack();
+	}
+}
+
 /*
  * Allocate pages, preferring the node given as nid. The node must be valid and
  * online. For more general interface, see alloc_pages_node().
@@ -218,7 +228,7 @@ static inline struct page *
 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 {
 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
-	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
+	warn_if_node_offline(nid, gfp_mask);
 
 	return __alloc_pages(gfp_mask, order, nid, NULL);
 }
@@ -227,7 +237,7 @@ static inline
 struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
 {
 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
-	VM_WARN_ON((gfp & __GFP_THISNODE) && !node_online(nid));
+	warn_if_node_offline(nid, gfp);
 
 	return __folio_alloc(gfp, order, nid, NULL);
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [mm-unstable v3 PATCH 2/3] mm: khugepaged: allow page allocation fallback to eligible nodes
  2022-11-08 18:43 ` [mm-unstable v3 PATCH 2/3] mm: khugepaged: allow page allocation fallback to eligible nodes Yang Shi
@ 2022-11-08 19:20   ` Andrew Morton
  2022-11-08 19:33     ` Yang Shi
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2022-11-08 19:20 UTC (permalink / raw)
  To: Yang Shi
  Cc: zokeefe, mhocko, linux-mm, linux-kernel,
	syzbot+0044b22d177870ee974f, Zach O'Keefe

On Tue,  8 Nov 2022 10:43:56 -0800 Yang Shi <shy828301@gmail.com> wrote:

> Syzbot reported the below splat:
> 
> ...
>

"mm: don't warn if the node is offlined" remains in
mm-hotfixes-unstable, so I'll just drop it.

> 
> Reported-by: syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com
> Suggested-by: Zach O'Keefe <zokeefe@google.com>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Zach O'Keefe <zokeefe@googel.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Yang Shi <shy828301@gmail.com>

I should add cc:stable, yes?

Are we able to identify a Fixes: target?



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [mm-unstable v3 PATCH 2/3] mm: khugepaged: allow page allocation fallback to eligible nodes
  2022-11-08 19:20   ` Andrew Morton
@ 2022-11-08 19:33     ` Yang Shi
  0 siblings, 0 replies; 8+ messages in thread
From: Yang Shi @ 2022-11-08 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: zokeefe, mhocko, linux-mm, linux-kernel,
	syzbot+0044b22d177870ee974f, Zach O'Keefe

On Tue, Nov 8, 2022 at 11:20 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue,  8 Nov 2022 10:43:56 -0800 Yang Shi <shy828301@gmail.com> wrote:
>
> > Syzbot reported the below splat:
> >
> > ...
> >
>
> "mm: don't warn if the node is offlined" remains in
> mm-hotfixes-unstable, so I'll just drop it.

Yes, please.

>
> >
> > Reported-by: syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com
> > Suggested-by: Zach O'Keefe <zokeefe@google.com>
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Reviewed-by: Zach O'Keefe <zokeefe@googel.com>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
>
> I should add cc:stable, yes?

This problem should just happen with MADV_COLLAPSE, which is merged
for 6.1, so we should not cc to stable.

>
> Are we able to identify a Fixes: target?

Please use this target: 7d8faaf15545 mm/madvise: introduce
MADV_COLLAPSE sync hugepage collapse

>
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [mm-unstable v3 PATCH 3/3] mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE
  2022-11-08 18:43 ` [mm-unstable v3 PATCH 3/3] mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE Yang Shi
@ 2022-11-08 22:15   ` Michal Hocko
  2022-11-09 19:33     ` Yang Shi
  2022-11-23 19:30   ` [mm-unstable v4 PATCH] " Yang Shi
  1 sibling, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2022-11-08 22:15 UTC (permalink / raw)
  To: Yang Shi
  Cc: zokeefe, akpm, linux-mm, linux-kernel, syzbot+0044b22d177870ee974f

On Tue 08-11-22 10:43:57, Yang Shi wrote:
> Syzbot reported the below splat:
> 
> WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node
> include/linux/gfp.h:221 [inline]
> WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221
> hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221
> alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> Modules linked in:
> CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted
> 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 10/11/2022
> RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
> RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc
> ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9
> 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
> RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
> RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
> FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000)
> knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  <TASK>
>  collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
>  hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
>  madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
>  madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
>  madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
>  do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
>  do_madvise mm/madvise.c:1432 [inline]
>  __do_sys_madvise mm/madvise.c:1432 [inline]
>  __se_sys_madvise mm/madvise.c:1430 [inline]
>  __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7f6b48a4eef9
> Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89
> f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
> f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
> RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
> RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
> RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
> R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
>  </TASK>
> 
> It is because khugepaged allocates pages with __GFP_THISNODE, but the
> preferred node is bogus.  The previous patch fixed the khugepaged
> code to avoid allocating page from non-existing node.  But it is still
> racy against memory hotremove.  There is no synchronization with the
> memory hotplug so it is possible that memory gets offline during a
> longer taking scanning.
> 
> So this warning still seems not quite helpful because:
>   * There is no guarantee the node is online for __GFP_THISNODE context
>     for all the callsites.
>   * Kernel just fails the allocation regardless the warning, and it looks
>     all callsites handle the allocation failure gracefully.
> 
> Although while the warning has helped to identify a buggy code, it is not
> safe in general and this warning could panic the system with panic-on-warn
> configuration which tends to be used surprisingly often.  So replace
> VM_WARN_ON to pr_warn().  And the warning will be triggered if
> __GFP_NOWARN is set since the allocator would print out warning for such
> case if __GFP_NOWARN is not set.
> 
> Reported-by: syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Cc: Zach O'Keefe <zokeefe@google.com>

Acked-by: Michal Hocko <mhocko@suse.com>

minor nit

> ---
> v3: * Reverted the old commit from mm-unstable which remove the VM_WARN
>       (patch 1/3).
>     * Incorporated the suggestion from Michal to use pr_warn.
> 
> v2: * Added patch 1/2.
>     * Reworded the commit log per Michal.
> 
>  include/linux/gfp.h | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index ef4aea3b356e..60a1c70ec85c 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -210,6 +210,16 @@ alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct p
>  	return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
>  }
>  
> +static inline void warn_if_node_offline(int nid, gfp_t gfp_mask)
> +{
> +	gfp_t gfp = gfp_mask & (__GFP_THISNODE|__GFP_NOWARN);

I would use rather this_node or similarly descriptive name. gfp sounds
like a pasm but this is only a subset of it. If you really want to
improve the readability then you can restructure the condition a bit

	if (gfp_mask & (__GFP_THISNODE|__GFP_NOWARN)) != (__GFP_THISNODE|__GFP_NOWARN))
		return;
	
	if (node_online(nid)) 
		return;
	
	pr_warn("%pGg allocation from offline node %d\n", &gfp, nid);
	dump_stack();

> +
> +	if ((gfp == (__GFP_THISNODE|__GFP_NOWARN)) && !node_online(nid)) {
> +		pr_warn("%pGg allocation from offline node %d\n", &gfp, nid);
> +		dump_stack();
> +	}
> +}
> +
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [mm-unstable v3 PATCH 3/3] mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE
  2022-11-08 22:15   ` Michal Hocko
@ 2022-11-09 19:33     ` Yang Shi
  0 siblings, 0 replies; 8+ messages in thread
From: Yang Shi @ 2022-11-09 19:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: zokeefe, akpm, linux-mm, linux-kernel, syzbot+0044b22d177870ee974f

On Tue, Nov 8, 2022 at 2:15 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 08-11-22 10:43:57, Yang Shi wrote:
> > Syzbot reported the below splat:
> >
> > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node
> > include/linux/gfp.h:221 [inline]
> > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221
> > hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221
> > alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > Modules linked in:
> > CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted
> > 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 10/11/2022
> > RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
> > RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc
> > ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9
> > 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
> > RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
> > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
> > RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
> > FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000)
> > knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> >  <TASK>
> >  collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
> >  hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
> >  madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
> >  madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
> >  madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
> >  do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
> >  do_madvise mm/madvise.c:1432 [inline]
> >  __do_sys_madvise mm/madvise.c:1432 [inline]
> >  __se_sys_madvise mm/madvise.c:1430 [inline]
> >  __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
> >  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> >  do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> >  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > RIP: 0033:0x7f6b48a4eef9
> > Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89
> > f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
> > f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> > RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
> > RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
> > RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
> > RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
> > R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
> >  </TASK>
> >
> > It is because khugepaged allocates pages with __GFP_THISNODE, but the
> > preferred node is bogus.  The previous patch fixed the khugepaged
> > code to avoid allocating page from non-existing node.  But it is still
> > racy against memory hotremove.  There is no synchronization with the
> > memory hotplug so it is possible that memory gets offline during a
> > longer taking scanning.
> >
> > So this warning still seems not quite helpful because:
> >   * There is no guarantee the node is online for __GFP_THISNODE context
> >     for all the callsites.
> >   * Kernel just fails the allocation regardless the warning, and it looks
> >     all callsites handle the allocation failure gracefully.
> >
> > Although while the warning has helped to identify a buggy code, it is not
> > safe in general and this warning could panic the system with panic-on-warn
> > configuration which tends to be used surprisingly often.  So replace
> > VM_WARN_ON to pr_warn().  And the warning will be triggered if
> > __GFP_NOWARN is set since the allocator would print out warning for such
> > case if __GFP_NOWARN is not set.
> >
> > Reported-by: syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Cc: Zach O'Keefe <zokeefe@google.com>
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> minor nit
>
> > ---
> > v3: * Reverted the old commit from mm-unstable which remove the VM_WARN
> >       (patch 1/3).
> >     * Incorporated the suggestion from Michal to use pr_warn.
> >
> > v2: * Added patch 1/2.
> >     * Reworded the commit log per Michal.
> >
> >  include/linux/gfp.h | 14 ++++++++++++--
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index ef4aea3b356e..60a1c70ec85c 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -210,6 +210,16 @@ alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct p
> >       return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
> >  }
> >
> > +static inline void warn_if_node_offline(int nid, gfp_t gfp_mask)
> > +{
> > +     gfp_t gfp = gfp_mask & (__GFP_THISNODE|__GFP_NOWARN);
>
> I would use rather this_node or similarly descriptive name. gfp sounds

this_node sounds good to me.

> like a pasm but this is only a subset of it. If you really want to
> improve the readability then you can restructure the condition a bit

Just because I thought "if (gfp_mask & (__GFP_THISNODE|__GFP_NOWARN))
!= (__GFP_THISNODE|__GFP_NOWARN))" is too long, so I added gfp to
store the intermediate result.

>
>         if (gfp_mask & (__GFP_THISNODE|__GFP_NOWARN)) != (__GFP_THISNODE|__GFP_NOWARN))
>                 return;
>
>         if (node_online(nid))
>                 return;
>
>         pr_warn("%pGg allocation from offline node %d\n", &gfp, nid);
>         dump_stack();
>
> > +
> > +     if ((gfp == (__GFP_THISNODE|__GFP_NOWARN)) && !node_online(nid)) {
> > +             pr_warn("%pGg allocation from offline node %d\n", &gfp, nid);
> > +             dump_stack();
> > +     }
> > +}
> > +
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [mm-unstable v4 PATCH] mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE
  2022-11-08 18:43 ` [mm-unstable v3 PATCH 3/3] mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE Yang Shi
  2022-11-08 22:15   ` Michal Hocko
@ 2022-11-23 19:30   ` Yang Shi
  1 sibling, 0 replies; 8+ messages in thread
From: Yang Shi @ 2022-11-23 19:30 UTC (permalink / raw)
  To: zokeefe, mhocko, akpm; +Cc: shy828301, linux-mm, linux-kernel

Syzbot reported the below splat:

WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node
include/linux/gfp.h:221 [inline]
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221
hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221
alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
Modules linked in:
CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted
6.1.0-rc1-syzkaller-00454-ga70385240892 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 10/11/2022
RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc
ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9
96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
 hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
 madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
 madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
 madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
 do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
 do_madvise mm/madvise.c:1432 [inline]
 __do_sys_madvise mm/madvise.c:1432 [inline]
 __se_sys_madvise mm/madvise.c:1430 [inline]
 __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f6b48a4eef9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89
f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
 </TASK>

It is because khugepaged allocates pages with __GFP_THISNODE, but the
preferred node is bogus.  The previous patch fixed the khugepaged
code to avoid allocating page from non-existing node.  But it is still
racy against memory hotremove.  There is no synchronization with the
memory hotplug so it is possible that memory gets offline during a
longer taking scanning.

So this warning still seems not quite helpful because:
  * There is no guarantee the node is online for __GFP_THISNODE context
    for all the callsites.
  * Kernel just fails the allocation regardless the warning, and it looks
    all callsites handle the allocation failure gracefully.

Although while the warning has helped to identify a buggy code, it is not
safe in general and this warning could panic the system with panic-on-warn
configuration which tends to be used surprisingly often.  So replace
VM_WARN_ON to pr_warn().  And the warning will be triggered if
__GFP_NOWARN is set since the allocator would print out warning for such
case if __GFP_NOWARN is not set.

Reported-by: syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
v4: * Renamed nid to this_node and gfp to warn_gfp to make the code
      more readable per Michal.
    * Collected ack from Michal.
v3: * Reverted the old commit from mm-unstable which remove the VM_WARN
      (patch 1/3).
    * Incorporated the suggestion from Michal to use pr_warn.
v2: * Added patch 1/2.
    * Reworded the commit log per Michal.

 include/linux/gfp.h | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index ef4aea3b356e..7e144b81c6c4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,6 +210,20 @@ alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct p
 	return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
 }
 
+static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
+{
+	gfp_t warn_gfp = gfp_mask & (__GFP_THISNODE|__GFP_NOWARN);
+
+	if (warn_gfp != (__GFP_THISNODE|__GFP_NOWARN))
+		return;
+
+	if(node_online(this_node))
+		return;
+
+	pr_warn("%pGg allocation from offline node %d\n", &warn_gfp, this_node);
+	dump_stack();
+}
+
 /*
  * Allocate pages, preferring the node given as nid. The node must be valid and
  * online. For more general interface, see alloc_pages_node().
@@ -218,7 +232,7 @@ static inline struct page *
 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 {
 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
-	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
+	warn_if_node_offline(nid, gfp_mask);
 
 	return __alloc_pages(gfp_mask, order, nid, NULL);
 }
@@ -227,7 +241,7 @@ static inline
 struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
 {
 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
-	VM_WARN_ON((gfp & __GFP_THISNODE) && !node_online(nid));
+	warn_if_node_offline(nid, gfp);
 
 	return __folio_alloc(gfp, order, nid, NULL);
 }
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-11-23 19:35 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-08 18:43 [mm-unstable v3 PATCH 1/3] Revert "mm: don't warn if the node is offlined" Yang Shi
2022-11-08 18:43 ` [mm-unstable v3 PATCH 2/3] mm: khugepaged: allow page allocation fallback to eligible nodes Yang Shi
2022-11-08 19:20   ` Andrew Morton
2022-11-08 19:33     ` Yang Shi
2022-11-08 18:43 ` [mm-unstable v3 PATCH 3/3] mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE Yang Shi
2022-11-08 22:15   ` Michal Hocko
2022-11-09 19:33     ` Yang Shi
2022-11-23 19:30   ` [mm-unstable v4 PATCH] " Yang Shi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.