From: Jan Kara <jack@suse.cz>
To: John Hubbard <jhubbard@nvidia.com>
Cc: "Jan Kara" <jack@suse.cz>, "Michal Hocko" <mhocko@kernel.org>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	"Yu Zhao" <yuzhao@google.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Arnaldo Carvalho de Melo" <acme@kernel.org>,
	"Alexander Shishkin" <alexander.shishkin@linux.intel.com>,
	"Jiri Olsa" <jolsa@redhat.com>,
	"Namhyung Kim" <namhyung@kernel.org>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Hugh Dickins" <hughd@google.com>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Andrea Arcangeli" <aarcange@redhat.com>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>,
	"David Rientjes" <rientjes@google.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Lance Roy" <ldr709@gmail.com>,
	"Ralph Campbell" <rcampbell@nvidia.com>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	"Dave Airlie" <airlied@redhat.com>,
	"Thomas Hellstrom" <thellstrom@vmware.com>,
	"Souptick Joarder" <jrdr.linux@gmail.com>,
	"Mel Gorman" <mgorman@suse.de>,
	"Mike Kravetz" <mike.kravetz@oracle.com>,
	"Huang Ying" <ying.huang@intel.com>,
	"Aaron Lu" <ziqian.lzq@antfin.com>,
	"Omar Sandoval" <osandov@fb.com>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Vineeth Remanan Pillai" <vpillai@digitalocean.com>,
	"Daniel Jordan" <daniel.m.jordan@oracle.com>,
	"Mike Rapoport" <rppt@linux.ibm.com>,
	"Joel Fernandes" <joel@joelfernandes.org>,
	"Mark Rutland" <mark.rutland@arm.com>,
	"Alexander Duyck" <alexander.h.duyck@linux.intel.com>,
	"Pavel Tatashin" <pavel.tatashin@microsoft.com>,
	"David Hildenbrand" <david@redhat.com>,
	"Juergen Gross" <jgross@suse.com>,
	"Anthony Yznaga" <anthony.yznaga@oracle.com>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Darrick J . Wong" <darrick.wong@oracle.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	"Paul E. McKenney" <paulmck@kernel.org>
Subject: Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely
Date: Wed, 2 Oct 2019 11:24:47 +0200	[thread overview]
Message-ID: <20191002092447.GC9320@quack2.suse.cz> (raw)
In-Reply-To: <d7371e9b-c68b-1793-d10d-c1e7a504855c@nvidia.com>

On Tue 01-10-19 11:43:30, John Hubbard wrote:
> On 10/1/19 12:10 AM, Jan Kara wrote:
> > On Mon 30-09-19 10:57:08, John Hubbard wrote:
> >> On 9/30/19 2:20 AM, Jan Kara wrote:
> >>> On Fri 27-09-19 12:31:41, John Hubbard wrote:
> >>>> On 9/27/19 5:33 AM, Michal Hocko wrote:
> >>>>> On Thu 26-09-19 20:26:46, John Hubbard wrote:
> >>>>>> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
> >> ...
> >> 2. Your code above is after the "pmd = READ_ONCE(*pmdp)" line, so by then,
> >> it's already found the pte based on reading a stale pmd. So checking the
> >> pte seems like it's checking the wrong thing; it's too late for this case,
> >> right?
> > 
> > Well, if PMD is getting freed, all PTEs in it should be cleared by that
> > time, shouldn't they? So although we read from stale PMD, either we already
> > see cleared PTE or the check pte_val(pte) != pte_val(*ptep) will fail and
> > so we never actually succeed in getting stale PTE entry (at least unless
> > the page table page that used to be PMD can get freed and reused
> > - which is not the case in the example you've shown above).
> 
> Right, that's not what the example shows, but there is nothing here to prevent
> the page table pages from being freed and re-used.
> 
> > So I still don't see a problem. That being said I don't feel being expert
> > in this particular area. I just always thought GUP prevents races like this
> > by the scheme I describe so I'd like to see what I'm missing :).
> 
> I'm very much still "in training" here, so I hope I'm not wasting everyone's 
> time. But I feel confident in stating at least this much:
> 
> There are two distinct lockless synchronization mechanisms here, each
> protecting against a different issue, and it's important not to conflate
> them and think that one protects against the other. I still see a hole in
> (2) below. The mechanisms are:
> 
> 1) Protection against a page (not the page table itself) getting freed
> while get_user_pages*() is trying to take a reference to it. This is avoided
> by the try-get and the re-checking of pte values that you mention above.
> It's an elegant little thing, too. :)
> 
> 2) Protection against page tables getting freed while a get_user_pages_fast()
> call is in progress. This relies on disabling interrupts in gup_fast(), while
> firing interrupts in the freeing path (via TLB flushing IPIs), and on memory
> barriers (currently missing!) to avoid leaking memory references outside of
> the irq-disabled region.

OK, so you are concerned that the page table walk of pgd->p4d->pud->pmd
will get prefetched by the CPU before interrupts are disabled, that
somewhere in the middle of the walk the IPI flushing TLBs on this CPU will
be served, allowing munmap() on another CPU to proceed and free the page
tables, and that the rest of the walk will then happen on freed and
possibly reused pages. Do I understand you correctly?

Realistically, I don't think this can happen as I'd expect the CPU to throw
away the speculation state on interrupt. But that's just my expectation and
I agree that I don't see anything in Documentation/memory-barriers.txt that
would prevent what you are concerned about. Let's ask Paul :)

Paul, we are discussing here a possible race between
mm/gup.c:__get_user_pages_fast() and mm/mmap.c:unmap_region(). The former
has code like:

local_irq_save(flags);
load pgd from current->mm
load p4d from pgd
load pud from p4d
load pmd from pud
...
local_irq_restore(flags);

while the second has code like:

unmap_region()
  walk pgd
    walk p4d
      walk pud
        walk pmd
          clear ptes
          flush tlb
  free page tables

Now the serialization between these two relies on the fact that flushing
TLBs from unmap_region() requires an IPI to be served on each CPU, so on a
naive understanding unmap_region() shouldn't be able to get to the 'free
page tables' part until __get_user_pages_fast() enables interrupts again.
But as John points out, we don't see anything in
Documentation/memory-barriers.txt that would actually guarantee this. So do
we really need something like smp_rmb() after disabling interrupts in
__get_user_pages_fast(), or is the race John is concerned about impossible?
Thanks!

> This one has a problem, because Documentation/memory-barriers.txt points out:
> 
> INTERRUPT DISABLING FUNCTIONS
> -----------------------------
> 
> Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
> (RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
> barriers are required in such a situation, they must be provided from some
> other means.
> 
> 
> ...and so I'm suggesting that we need something approximately like this:
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 23a9f9c9d377..1678d50a2d8b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2415,7 +2415,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>         if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
>             gup_fast_permitted(start, end)) {
>                 local_irq_disable();
> +               smp_mb();
>                 gup_pgd_range(addr, end, gup_flags, pages, &nr);
> +               smp_mb();
>                 local_irq_enable();
>                 ret = nr;

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


Thread overview: 44+ messages

2018-01-08 22:56 [PATCH] mm: don't expose page to fast gup before it's ready Yu Zhao
2018-01-09  8:46 ` Michal Hocko
2018-01-09 10:10   ` Yu Zhao
2018-01-31 23:07     ` Andrew Morton
2019-05-14 21:25     ` Andrew Morton
2019-05-14 23:07       ` Yu Zhao
2019-09-14  7:05         ` [PATCH v2] mm: don't expose page to fast gup prematurely Yu Zhao
2019-09-24 11:23           ` Kirill A. Shutemov
2019-09-24 22:05             ` Yu Zhao
2019-09-25 12:17               ` Kirill A. Shutemov
2019-09-26  3:58                 ` Yu Zhao
2019-09-24 23:24           ` [PATCH v3 1/4] mm: remove unnecessary smp_wmb() in collapse_huge_page() Yu Zhao
2019-09-24 23:24             ` [PATCH v3 2/4] mm: don't expose hugetlb page to fast gup prematurely Yu Zhao
2019-09-24 23:24             ` [PATCH v3 3/4] mm: don't expose non-hugetlb " Yu Zhao
2019-09-25  8:25               ` Peter Zijlstra
2019-09-25 22:26                 ` Yu Zhao
2019-09-26 10:20                   ` Kirill A. Shutemov
2019-09-27  3:26                     ` John Hubbard
2019-09-27  5:06                       ` Yu Zhao
2019-10-01 22:31                         ` John Hubbard
2019-10-02  0:00                           ` Yu Zhao
2019-09-27 12:33                       ` Michal Hocko
2019-09-27 18:31                         ` Yu Zhao
2019-09-27 19:31                         ` John Hubbard
2019-09-29 22:47                           ` John Hubbard
2019-09-30  9:20                           ` Jan Kara
2019-09-30 17:57                             ` John Hubbard
2019-10-01  7:10                               ` Jan Kara
2019-10-01  8:36                                 ` Peter Zijlstra
2019-10-01  8:40                                   ` Jan Kara
2019-10-01 18:43                                 ` John Hubbard
2019-10-02  9:24                                   ` Jan Kara [this message]
2019-10-02 17:33                                     ` John Hubbard
2019-09-24 23:24             ` [PATCH v3 4/4] mm: remove unnecessary smp_wmb() in __SetPageUptodate() Yu Zhao
2019-09-24 23:50               ` Matthew Wilcox
2019-09-25 22:03                 ` Yu Zhao
