From: Yang Shi <shy828301@gmail.com>
To: "Zach O'Keefe" <zokeefe@google.com>
Cc: mhocko@suse.com, akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [v2 PATCH 1/2] mm: khugepaged: allow page allocation fallback to eligible nodes
Date: Fri, 4 Nov 2022 13:39:09 -0700
Message-ID: <CAHbLzkrd5S-6Pm2w93ZX2dSsgb6f5rbJY_ObN1BhMg+pZQiesQ@mail.gmail.com>
In-Reply-To: <Y2RViRKdDQw2cONa@google.com>

On Thu, Nov 3, 2022 at 4:58 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Nov 03 14:36, Yang Shi wrote:
> > Syzbot reported the below splat:
> >
> > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
> > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > Modules linked in:
> > CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
> > RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
> > RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
> > RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
> > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
> > RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
> > FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> >  <TASK>
> >  collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
> >  hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
> >  madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
> >  madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
> >  madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
> >  do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
> >  do_madvise mm/madvise.c:1432 [inline]
> >  __do_sys_madvise mm/madvise.c:1432 [inline]
> >  __se_sys_madvise mm/madvise.c:1430 [inline]
> >  __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
> >  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> >  do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> >  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > RIP: 0033:0x7f6b48a4eef9
> > Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> > RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
> > RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
> > RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
> > RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
> > R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
> >  </TASK>
> >
> > The khugepaged code picks the node with the most hits as the preferred
> > node, and also tries to do some balancing if several nodes have the
> > same hit count.  Conceptually it does:
> >     * If target_node <= last_target_node, then iterate from
> >       last_target_node + 1 to MAX_NUMNODES (1024 with the default
> >       config)
> >     * If max_value == node_load[nid], then target_node = nid
> >
> > But there is a corner case, particularly for MADV_COLLAPSE, where a
> > non-existent node may be returned as the preferred node.
> >
> > Assume the system has 2 nodes, target_node is 0, and last_target_node
> > is 1.  When the MADV_COLLAPSE path is hit, max_value may be 0, so 2 may
> > be returned as target_node, but that node does not exist (it is
> > offline), hence the warning is triggered.
> >
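
To make the overflow concrete, below is a minimal userspace sketch of
the old selection logic (simplified from mm/khugepaged.c; the 2-node
setup and the all-zero node_load[] are illustrative, not taken from the
report):

	#include <stdio.h>

	#define MAX_NUMNODES 1024	/* default kernel config value */

	static int node_load[MAX_NUMNODES];	/* hits per node; all zero here */

	/* simplified copy of the old hpage_collapse_find_target_node() */
	static int find_target_node(int last_target_node)
	{
		int nid, target_node = 0, max_value = 0;

		/* find first node with max normal pages hit */
		for (nid = 0; nid < MAX_NUMNODES; nid++)
			if (node_load[nid] > max_value) {
				max_value = node_load[nid];
				target_node = nid;
			}

		/* do some balance if several nodes have the same hit record */
		if (target_node <= last_target_node)
			for (nid = last_target_node + 1; nid < MAX_NUMNODES; nid++)
				if (max_value == node_load[nid]) {
					target_node = nid;
					break;
				}

		return target_node;
	}

	int main(void)
	{
		/* 2-node system, no hits recorded, last target was node 1 */
		printf("target_node = %d\n", find_target_node(1));	/* prints 2 */
		return 0;
	}
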
> > The node balance was introduced by commit 9f1b868a13ac ("mm: thp:
> > khugepaged: add policy for finding target node") to satisfy
> > "numactl --interleave=all".  But interleaving is just a hint rather
> > than a hard requirement.
> >
> > So use a nodemask to record the nodes which have the same hit count;
> > the hugepage allocation can then fall back to those nodes.  Also remove
> > __GFP_THISNODE since it disallows fallback.  If the nodemask has just
> > one node set, meaning a single node had the most hits, the nodemask
> > approach behaves exactly like __GFP_THISNODE.
> >
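
The crux of the fix is that __alloc_pages() takes a preferred node plus
a nodemask, so the allocation may fall back, but only within the mask.
An illustrative kernel-style fragment (not part of the patch; the node
numbers are made up):

	nodemask_t alloc_nmask = NODE_MASK_NONE;
	struct page *page;

	node_set(0, alloc_nmask);	/* nodes 0 and 2 tied on hits */
	node_set(2, alloc_nmask);

	/* try node 0 first; may fall back to node 2, but nowhere else */
	page = __alloc_pages(GFP_TRANSHUGE, HPAGE_PMD_ORDER, 0, &alloc_nmask);
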
> > Reported-by: syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com
> > Suggested-by: Zach O'Keefe <zokeefe@google.com>
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
>
> Reviewed-by: Zach O'Keefe <zokeefe@google.com>

Thanks.

>
> > ---
> >  mm/khugepaged.c | 32 ++++++++++++++------------------
> >  1 file changed, 14 insertions(+), 18 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index ea0d186bc9d4..572ce7dbf4b0 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -97,8 +97,8 @@ struct collapse_control {
> >       /* Num pages scanned per node */
> >       u32 node_load[MAX_NUMNODES];
> >
> > -     /* Last target selected in hpage_collapse_find_target_node() */
> > -     int last_target_node;
> > +     /* nodemask for allocation fallback */
> > +     nodemask_t alloc_nmask;
> >  };
> >
> >  /**
> > @@ -734,7 +734,6 @@ static void khugepaged_alloc_sleep(void)
> >
> >  struct collapse_control khugepaged_collapse_control = {
> >       .is_khugepaged = true,
> > -     .last_target_node = NUMA_NO_NODE,
> >  };
> >
> >  static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
> > @@ -783,16 +782,11 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
> >                       target_node = nid;
> >               }
> >
> > -     /* do some balance if several nodes have the same hit record */
> > -     if (target_node <= cc->last_target_node)
> > -             for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
> > -                  nid++)
> > -                     if (max_value == cc->node_load[nid]) {
> > -                             target_node = nid;
> > -                             break;
> > -                     }
> > +     for_each_online_node(nid) {
> > +             if (max_value == cc->node_load[nid])
> > +                     node_set(nid, cc->alloc_nmask);
> > +     }
> >
> > -     cc->last_target_node = target_node;
> >       return target_node;
> >  }
> >  #else
> > @@ -802,9 +796,10 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
> >  }
> >  #endif
> >
> > -static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp, int node)
> > +static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp, int node,
> > +                                   nodemask_t *nmask)
> >  {
> > -     *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
> > +     *hpage = __alloc_pages(gfp, HPAGE_PMD_ORDER, node, nmask);
> >       if (unlikely(!*hpage)) {
> >               count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> >               return false;
> > @@ -955,12 +950,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> >  static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
> >                             struct collapse_control *cc)
> >  {
> > -     /* Only allocate from the target node */
> >       gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> > -                  GFP_TRANSHUGE) | __GFP_THISNODE;
> > +                  GFP_TRANSHUGE);
> >       int node = hpage_collapse_find_target_node(cc);
> >
> > -     if (!hpage_collapse_alloc_page(hpage, gfp, node))
> > +     if (!hpage_collapse_alloc_page(hpage, gfp, node, &cc->alloc_nmask))
> >               return SCAN_ALLOC_HUGE_PAGE_FAIL;
> >       if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
> >               return SCAN_CGROUP_CHARGE_FAIL;
> > @@ -1144,6 +1138,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> >               goto out;
> >
> >       memset(cc->node_load, 0, sizeof(cc->node_load));
> > +     nodes_clear(cc->alloc_nmask);
> >       pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> >       for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> >            _pte++, _address += PAGE_SIZE) {
> > @@ -2078,6 +2073,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> >       present = 0;
> >       swap = 0;
> >       memset(cc->node_load, 0, sizeof(cc->node_load));
> > +     nodes_clear(cc->alloc_nmask);
> >       rcu_read_lock();
> >       xas_for_each(&xas, page, start + HPAGE_PMD_NR - 1) {
> >               if (xas_retry(&xas, page))
> > @@ -2581,7 +2577,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >       if (!cc)
> >               return -ENOMEM;
> >       cc->is_khugepaged = false;
> > -     cc->last_target_node = NUMA_NO_NODE;
> >
> >       mmgrab(mm);
> >       lru_add_drain_all();
> > @@ -2607,6 +2602,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >               }
> >               mmap_assert_locked(mm);
> >               memset(cc->node_load, 0, sizeof(cc->node_load));
> > +             nodes_clear(cc->alloc_nmask);
> >               if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
> >                       struct file *file = get_file(vma->vm_file);
> >                       pgoff_t pgoff = linear_page_index(vma, addr);
> > --
> > 2.26.3
> >
>
> Thanks for the patch, Yang! Looks good. khugepaged selftest is good too.

Thanks for running the test. I ran it too.
