From: Suren Baghdasaryan <surenb@google.com>
To: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	Yang Shi <shy828301@gmail.com>, Michal Hocko <mhocko@suse.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	syzbot <syzbot+b591856e0f0139f83023@syzkaller.appspotmail.com>,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, syzkaller-bugs@googlegroups.com
Subject: Re: [syzbot] [mm?] kernel BUG in vma_replace_policy
Date: Fri, 15 Sep 2023 11:05:14 -0700
Message-ID: <CAJuCfpE8jnvL23W6fY4_HZf-969aEgvR3-LGRTUC-SFhPFju+w@mail.gmail.com>
In-Reply-To: <CAJuCfpGO4bRZaDJv5Vuf2wLh3t4hE=5EqDObm_UfcQk4B08PrQ@mail.gmail.com>

On Fri, Sep 15, 2023 at 9:09 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Sep 14, 2023 at 9:26 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > On Thu, 14 Sep 2023, Suren Baghdasaryan wrote:
> > > On Thu, Sep 14, 2023 at 9:24 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > On Thu, Sep 14, 2023 at 08:53:59PM +0000, Suren Baghdasaryan wrote:
> > > > > On Thu, Sep 14, 2023 at 8:00 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > On Thu, Sep 14, 2023 at 7:09 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > >
> > > > > > > On Thu, Sep 14, 2023 at 06:20:56PM +0000, Suren Baghdasaryan wrote:
> > > > > > > > I think I found the problem, and the explanation is much simpler. While
> > > > > > > > walking the page range, queue_folios_pte_range() encounters an
> > > > > > > > unmovable page and returns 1. That causes a break out of the loop
> > > > > > > > inside walk_page_range(), so no more VMAs get locked. After that, the
> > > > > > > > loop calling mbind_range() walks over all VMAs, even the ones that
> > > > > > > > were skipped by queue_folios_pte_range(), and that triggers this BUG
> > > > > > > > assertion.
> > > > > > > >
> > > > > > > > Thinking what's the right way to handle this situation (what's the
> > > > > > > > expected behavior here)...
> > > > > > > > I think the safest way would be to modify walk_page_range() and make
> > > > > > > > it continue calling process_vma_walk_lock() for all VMAs in the range
> > > > > > > > even when __walk_page_range() returns a positive err. Any objection or
> > > > > > > > alternative suggestions?
> > > > > > >
> > > > > > > So we only return 1 here if MPOL_MF_MOVE* & MPOL_MF_STRICT were
> > > > > > > specified.  That means we're going to return an error, no matter what,
> > > > > > > and there's no point in calling mbind_range().  Right?
> > > > > > >
> > > > > > > +++ b/mm/mempolicy.c
> > > > > > > @@ -1334,6 +1334,8 @@ static long do_mbind(unsigned long start, unsigned long len,
> > > > > > >         ret = queue_pages_range(mm, start, end, nmask,
> > > > > > >                           flags | MPOL_MF_INVERT, &pagelist, true);
> > > > > > >
> > > > > > > +       if (ret == 1)
> > > > > > > +               ret = -EIO;
> > > > > > >         if (ret < 0) {
> > > > > > >                 err = ret;
> > > > > > >                 goto up_out;
> > > > > > >
> > > > > > > (I don't really understand this code, so it can't be this simple, can
> > > > > > > it?  Why don't we just return -EIO from queue_folios_pte_range() if
> > > > > > > this is the right answer?)
> > > > > >
> > > > > > Yeah, I'm trying to understand the expected behavior of this function
> > > > > > to make sure we are not missing anything. I tried a simple fix that I
> > > > > > suggested in my previous email and it works but I want to understand a
> > > > > > bit more about this function's logic before posting the fix.
> > > > >
> > > > > So, the current functionality is that even after queue_pages_range()
> > > > > encounters an unmovable page, terminates the loop, and returns 1,
> > > > > mbind_range() will still be called for the whole range
> > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1345),
> > > > > all pages in the pagelist will still be migrated
> > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1355),
> > > > > and only after that will the -EIO code be returned
> > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1362).
> > > > > So, if we follow Matthew's suggestion, we will be altering the current
> > > > > behavior, which I assume is not what we want to do.
> > > >
> > > > Right, I'm intentionally changing the behaviour.  My thinking is
> > > > that mbind(MPOL_MF_MOVE | MPOL_MF_STRICT) is going to fail.  Should
> > > > such a failure actually move the movable pages before reporting that
> > > > it failed?  I don't know.
> > > >
> > > > > The simple fix I was thinking about, which would not alter this
> > > > > behavior, is something like this:
> > > >
> > > > I don't like it, but can we run it past syzbot to be sure it solves the
> > > > issue and we're not chasing a ghost here?
> > >
> > > Yes, I just finished running the reproducer on both upstream and
> > > linux-next builds listed in
> > > https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 and the
> > > problem does not happen anymore.
> > > I'm fine with your suggestion too, I just wanted to point out that it
> > > would introduce a change in behavior. Let me know how you want to proceed.
> >
> > Well done, identifying the mysterious cause of this problem:
> > I'm glad to hear that you've now verified that hypothesis.
> >
> > You're right, it would be a regression to follow Matthew's suggestion.
> >
> > Traditionally, modulo bugs and inconsistencies, the queue_pages_range()
> > phase of do_mbind() has done the best it can, gathering all the pages it
> > can that need migration, even if some were missed; and proceeds to do the
> > mbind_range() phase if there was nothing "seriously" wrong (a gap causing
> > -EFAULT).  Then at the end, if MPOL_MF_STRICT was set, and not all the
> > pages could be migrated (or MOVE was not specified and not all pages
> > were well placed), it returns -EIO rather than 0 to inform the caller
> > that not all could be done.
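
For reference, the do_mbind() tail described above looks roughly like this
(a simplified, abridged sketch of mm/mempolicy.c from around that time,
not the verbatim code):

	ret = queue_pages_range(mm, start, end, nmask,
				flags | MPOL_MF_INVERT, &pagelist, true);
	if (ret < 0) {		/* "seriously" wrong, e.g. a gap -> -EFAULT */
		err = ret;
		goto up_out;
	}

	/* ret == 1 (unmovable page seen) still falls through to mbind_range() */
	vma_iter_init(&vmi, mm, start);
	prev = vma_prev(&vmi);
	for_each_vma_range(vmi, vma, end) {
		err = mbind_range(&vmi, vma, &prev, start, end, new);
		if (err)
			break;
	}

	if (!err) {
		if (!list_empty(&pagelist))
			nr_failed = migrate_pages(&pagelist, new_folio, NULL,
						  start, MIGRATE_SYNC,
						  MR_MEMPOLICY_MBIND, NULL);
		/* only now is the STRICT failure reported to the caller */
		if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
			err = -EIO;
	}
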
> >
> > There have been numerous tweaks, but I think most importantly
> > 5.3's d883544515aa ("mm: mempolicy: make the behavior consistent when
> > MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") added those "return 1"s
> > which stop the pagewalk early.  In my opinion, not an improvement - makes
> > it harder to get mbind() to do the best job it can (or is it justified as
> > what you're asking for if you say STRICT?).
> >
> > But whatever, it would be a further regression for mbind() not to have
> > done the mbind_range(), even though it goes on to return -EIO.
> >
> > I had a bad first reaction to your walk_page_range() patch (was expecting
> > to see vma_start_write()s in mbind_range()), but perhaps your patch is
> > exactly what process_mm_walk_lock() does now demand.
> >
> > [Why is Hugh responding on this?  Because I have some long-standing
> > mm/mempolicy.c patches to submit next week, but in reviewing what I
> > could or could not afford to get into at this time, had decided I'd
> > better stay out of queue_pages_range() for now - beyond the trivial
> > preferring an MPOL_MF_WRLOCK flag to your bool lock_vma.]
>
> Thanks for the feedback, Hugh!
> Yeah, this positive-err handling is kinda weird. If this behavior (do
> as much as possible even if we fail eventually) is specific to mbind(),
> then we could keep walk_page_range() as-is and lock the VMAs inside
> the loop that calls mbind_range(), conditional on ret being positive.
> That would be the simplest solution IMHO. But if we expect
> walk_page_range() to always apply the requested page_walk_lock policy
> to all VMAs, even if one of the mm_walk_ops callbacks returns a
> positive error somewhere in the middle of the walk, then my fix would
> handle that. So, to me the important question is how we want
> walk_page_range() to behave in these conditions. I think we should
> answer that first and document it. Then the fix will be easy.

I looked at all the cases where we perform a page walk while locking
VMAs, and mbind() seems to be the only one that would require
walk_page_range() to lock all VMAs even for a failed walk. So, I
suggest this fix instead, and I can also document that when
walk_page_range() fails it might not apply the page_walk_lock policy
to all of the VMAs.

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 42b5567e3773..cbc584e9b6ca 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1342,6 +1342,9 @@ static long do_mbind(unsigned long start, unsigned long len,
         vma_iter_init(&vmi, mm, start);
         prev = vma_prev(&vmi);
         for_each_vma_range(vmi, vma, end) {
+                /* If queue_pages_range() failed then not all VMAs might be locked */
+                if (ret)
+                        vma_start_write(vma);
                 err = mbind_range(&vmi, vma, &prev, start, end, new);
                 if (err)
                         break;

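For context, the reason some VMAs can be left unlocked is the early break
in walk_page_range()'s per-VMA loop; roughly (a simplified sketch of
mm/pagewalk.c, not the verbatim code):

	/* walk_page_range(): iterate over the VMAs covering [start, end) */
	vma = find_vma(mm, start);
	do {
		if (!vma || start < vma->vm_start) {
			/* a hole: there is no VMA to lock here */
			walk.vma = NULL;
			next = vma ? min(end, vma->vm_start) : end;
		} else {
			/* PGWALK_WRLOCK write-locks the VMA right here */
			process_vma_walk_lock(vma, ops->walk_lock);
			walk.vma = vma;
			next = min(end, vma->vm_end);
			vma = find_vma(mm, vma->vm_end);
			err = __walk_page_range(start, next, &walk);
		}
		/*
		 * A positive return from the callbacks (e.g.
		 * queue_folios_pte_range() returning 1 on an unmovable page)
		 * breaks out here, so the remaining VMAs in the range never
		 * reach process_vma_walk_lock().
		 */
		if (err)
			break;
	} while (start = next, start < end);
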
If this looks good I'll post the patch. Matthew, Hugh, anyone else?

>
>
> >
> > Hugh
