From: Jason Gunthorpe <jgg@ziepe.ca>
To: Matthew Wilcox <willy@infradead.org>
Cc: Jann Horn <jannh@google.com>, John Hubbard <jhubbard@nvidia.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	kernel list <linux-kernel@vger.kernel.org>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Jan Kara <jack@suse.cz>, stable <stable@vger.kernel.org>
Subject: Re: [PATCH v2] mm/gup: fix try_grab_compound_head() race with split_huge_page()
Date: Fri, 18 Jun 2021 11:58:30 -0300
Message-ID: <20210618145830.GZ1096940@ziepe.ca>
In-Reply-To: <YMykiGuZYMqF7DuU@casper.infradead.org>

On Fri, Jun 18, 2021 at 02:50:00PM +0100, Matthew Wilcox wrote:
> On Fri, Jun 18, 2021 at 10:25:56AM -0300, Jason Gunthorpe wrote:
> > On Tue, Jun 15, 2021 at 02:09:38PM +0200, Jann Horn wrote:
> > > On Tue, Jun 15, 2021 at 8:37 AM John Hubbard <jhubbard@nvidia.com> wrote:
> > > > On 6/14/21 6:20 PM, Jann Horn wrote:
> > > > > @@ -55,8 +72,23 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
> > > > >       if (WARN_ON_ONCE(page_ref_count(head) < 0))
> > > > >               return NULL;
> > > > >       if (unlikely(!page_cache_add_speculative(head, refs)))
> > > > >               return NULL;
> > > > > +
> > > > > +     /*
> > > > > +      * At this point we have a stable reference to the head page; but it
> > > > > +      * could be that between the compound_head() lookup and the refcount
> > > > > +      * increment, the compound page was split, in which case we'd end up
> > > > > +      * holding a reference on a page that has nothing to do with the page
> > > > > +      * we were given anymore.
> > > > > +      * So now that the head page is stable, recheck that the pages still
> > > > > +      * belong together.
> > > > > +      */
> > > > > +     if (unlikely(compound_head(page) != head)) {
> > > >
> > > > I was just wondering about what all could happen here. Such as: page gets split,
> > > > reallocated into a different-sized compound page, one that still has page pointing
> > > > to head. I think that's OK, because we don't look at or change other huge page
> > > > fields.
> > > >
> > > > But I thought I'd mention the idea in case anyone else has any clever ideas about
> > > > how this simple check might be insufficient here. It seems fine to me, but I
> > > > routinely lack enough imagination about concurrent operations. :)
> > > 
> > > Hmmm... I think the scariest aspect here is probably the interaction
> > > with concurrent allocation of a compound page on architectures with
> > > store-store reordering (like ARM). *If* the page allocator handled
> > > compound pages with lockless, non-atomic percpu freelists, I think it
> > > might be possible that the zeroing of tail_page->compound_head in
> > > put_page() could be reordered after the page has been freed,
> > > reallocated and set to refcount 1 again?
> > 
> > Oh wow, yes, this all looks sketchy! Doing an RCU access to page->head
> > is a really challenging thing :\
> > 
> > On the simplified store side:
> > 
> >   page->head = my_compound
> >   *ptep = page
> > 
> > There must be some kind of release barrier between those two
> > operations or this is all broken... That definitely deserves a comment.
> 
> set_compound_head() includes a WRITE_ONCE.  Is that enough, or does it
> need an smp_wmb()?

Probably; at least the generic code maps smp_store_release() to
__smp_wmb().

I think Jann was making the argument that there is going to be some
other release operation due to locking between the two above, eg a
lock unlock or something.
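
Spelled out, the store-side ordering we need is something like this
(just a sketch of the barriers, not the actual call sites):

   /* allocation/mapping side */
   set_compound_head(tail, head);     /* WRITE_ONCE of head+1 into the tail */
   smp_wmb();                         /* or any release op, eg a lock release */
   set_pte_at(mm, addr, ptep, pte);   /* publish the page through the pte */

As long as some release operation orders initializing the compound
page before installing the pte, the lockless reader side can pair
with it through the READ_ONCEs.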

> > Ideally we'd use smp_store_release to install the *pte :\
> > 
> > Assuming we cover the release barrier, I would think the algorithm
> > should be broadly:
> > 
> >  struct page *target_page = READ_ONCE(pte)
> >  struct page *target_folio = READ_ONCE(target_page->head)
> 
> compound_head() includes a READ_ONCE already.

Ah, see I obviously haven't memorized that detail :\

> >  page_cache_add_speculative(target_folio, refs)
> 
> That's spelled folio_ref_try_add_rcu() right now.

That seems like a much better name.

> >  if (target_folio != READ_ONCE(target_page->head) ||
> >      target_page != READ_ONCE(pte))
> >     goto abort
> > 
> > Which is what this patch does but I would like to see the
> > READ_ONCE's.
> 
> ... you want them to be uninlined from compound_head(), et al?

Not really (though see below); I was mostly looking at the pte, which
just goes through pte_val() with no READ_ONCE in there.
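
Eg the lockless recheck in gup_pte_range() is, if I remember the code
right, shaped like:

   if (unlikely(pte_val(pte) != pte_val(*ptep)))
           goto pte_unmap;

and I'd rather the reread of *ptep went through a READ_ONCE so the
tearing and ordering assumptions are explicit:

   if (unlikely(pte_val(pte) != pte_val(READ_ONCE(*ptep))))
           goto pte_unmap;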

> > And there possibly should be two try_grab_compound_head()'s since we
> > don't need this overhead on the fully locked path, especially the
> > double atomic on page_ref_add()
> 
> There's only one atomic on page_ref_add(). 

Look at the original patch; it adds this:

+		else
+			page_ref_add(page, refs * (GUP_PIN_COUNTING_BIAS - 1));

Where page is the folio, so that is now two atomics to take the same
reference. This happens because we can't do hpage_pincount_available()
before having initially locked the folio, so we can no longer
precompute what 'refs' to give to the first folio_ref_try_add_rcu().
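
Roughly, as a sketch of the two shapes using the folio names:

   /* with this patch: two atomic RMWs on the folio refcount */
   folio_ref_try_add_rcu(folio, refs);
   /* ... recheck compound_head(page) == &folio->page ... */
   page_ref_add(&folio->page, refs * (GUP_PIN_COUNTING_BIAS - 1));

   /* what a locked path could keep doing: one atomic RMW */
   folio_ref_add(folio, refs * GUP_PIN_COUNTING_BIAS);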

> And you need more of this overhead on the fully locked path than you
> realise; the page might be split without holding the mmap_sem, for
> example.

Fully locked here means holding the PTL spinlocks, so we know the pte
cannot change and, in particular, the refcount of the folio can't go
to zero. And compound_head cannot change while the refcount is
elevated.
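
Ie the locked path is just this shape (a sketch, assuming the PTL is
held):

   spin_lock(ptl);
   pte = *ptep;                  /* stable while we hold the PTL */
   page = pte_page(pte);
   head = compound_head(page);   /* can't be split while still mapped */
   page_ref_add(head, refs);     /* the mapping holds a ref, plain add is fine */
   spin_unlock(ptl);

No speculative get and no recheck required.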

Keep in mind we also do this in gup:

 folio_ref_try_add_rcu(READ_ONCE(target_page->head), 1)
 [..]
 folio_put_refs(READ_ONCE(target_page->head), 1)

Which makes me wonder: why do we have the READ_ONCE inside
compound_head()?
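
For reference, compound_head() is currently (quoting from memory):

   static inline struct page *compound_head(struct page *page)
   {
           unsigned long head = READ_ONCE(page->compound_head);

           if (unlikely(head & 1))
                   return (struct page *)(head - 1);
           return page;
   }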

I'm reading the commit message of 1d798ca3f164 ("mm: make
compound_head() robust"), and to me that looks like another special
lockless algorithm that should have the READ_ONCE in it, not the
general code.

Jason
