Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely

From: John Hubbard <jhubbard@nvidia.com>
To: Jan Kara <jack@suse.cz>
Cc: "Michal Hocko" <mhocko@kernel.org>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	"Yu Zhao" <yuzhao@google.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Arnaldo Carvalho de Melo" <acme@kernel.org>,
	"Alexander Shishkin" <alexander.shishkin@linux.intel.com>,
	"Jiri Olsa" <jolsa@redhat.com>,
	"Namhyung Kim" <namhyung@kernel.org>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Hugh Dickins" <hughd@google.com>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Andrea Arcangeli" <aarcange@redhat.com>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>,
	"David Rientjes" <rientjes@google.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Lance Roy" <ldr709@gmail.com>,
	"Ralph Campbell" <rcampbell@nvidia.com>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	"Dave Airlie" <airlied@redhat.com>,
	"Thomas Hellstrom" <thellstrom@vmware.com>,
	"Souptick Joarder" <jrdr.linux@gmail.com>,
	"Mel Gorman" <mgorman@suse.de>,
	"Mike Kravetz" <mike.kravetz@oracle.com>,
	"Huang Ying" <ying.huang@intel.com>,
	"Aaron Lu" <ziqian.lzq@antfin.com>,
	"Omar Sandoval" <osandov@fb.com>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Vineeth Remanan Pillai" <vpillai@digitalocean.com>,
	"Daniel Jordan" <daniel.m.jordan@oracle.com>,
	"Mike Rapoport" <rppt@linux.ibm.com>,
	"Joel Fernandes" <joel@joelfernandes.org>,
	"Mark Rutland" <mark.rutland@arm.com>,
	"Alexander Duyck" <alexander.h.duyck@linux.intel.com>,
	"Pavel Tatashin" <pavel.tatashin@microsoft.com>,
	"David Hildenbrand" <david@redhat.com>,
	"Juergen Gross" <jgross@suse.com>,
	"Anthony Yznaga" <anthony.yznaga@oracle.com>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Darrick J . Wong" <darrick.wong@oracle.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely
Date: Tue, 1 Oct 2019 11:43:30 -0700
Message-ID: <d7371e9b-c68b-1793-d10d-c1e7a504855c@nvidia.com>
In-Reply-To: <20191001071008.GA25062@quack2.suse.cz>

On 10/1/19 12:10 AM, Jan Kara wrote:
> On Mon 30-09-19 10:57:08, John Hubbard wrote:
>> On 9/30/19 2:20 AM, Jan Kara wrote:
>>> On Fri 27-09-19 12:31:41, John Hubbard wrote:
>>>> On 9/27/19 5:33 AM, Michal Hocko wrote:
>>>>> On Thu 26-09-19 20:26:46, John Hubbard wrote:
>>>>>> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
>> ...
>> 2. Your code above is after the "pmd = READ_ONCE(*pmdp)" line, so by then,
>> it's already found the pte based on reading a stale pmd. So checking the
>> pte seems like it's checking the wrong thing--it's too late, for this case,
>> right?
> 
> Well, if PMD is getting freed, all PTEs in it should be cleared by that
> time, shouldn't they? So although we read from stale PMD, either we already
> see cleared PTE or the check pte_val(pte) != pte_val(*ptep) will fail and
> so we never actually succeed in getting stale PTE entry (at least unless
> the page table page that used to be PMD can get freed and reused
> - which is not the case in the example you've shown above).
>

Right, that's not what the example shows, but there is nothing here to prevent
the page table pages from being freed and re-used.

 
> So I still don't see a problem. That being said I don't feel being expert
> in this particular area. I just always thought GUP prevents races like this
> by the scheme I describe so I'd like to see what I'm missing :).
> 

I'm very much still "in training" here, so I hope I'm not wasting everyone's 
time. But I feel confident in stating at least this much:

There are two distinct lockless synchronization mechanisms here, each protecting
against a different issue, and it's important not to conflate them and think that
one protects against the other. I still see a hole in (2) below. The mechanisms 
are:

1) Protection against a page (not the page table itself) getting freed
while get_user_pages*() is trying to take a reference to it. This is avoided
by the try-get and the re-checking of pte values that you mention above.
It's an elegant little thing, too. :)

2) Protection against page tables getting freed while a get_user_pages_fast()
call is in progress. This relies on disabling interrupts in gup_fast(), while
the freeing path fires interrupts (IPIs, via TLB flushing). It also relies on
memory barriers (doh--missing!) to keep memory references from leaking outside
of the irq-disabled region.
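
To make (1) concrete, here is a minimal userspace sketch of the try-get plus
pte re-check pattern, written with C11 atomics. It is only a model: the
model_* names, the refcount layout, and the race_hook callback (there purely
so the race window can be exercised deterministically) are all assumptions
for illustration, not the actual mm/gup.c code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct page and a pte slot; not the kernel types. */
struct model_page {
	atomic_int refcount;	/* 0 means the page is already being freed */
};

typedef _Atomic(uintptr_t) model_pte_t;

/* Take a reference only if the count has not already dropped to zero. */
static bool model_try_get(struct model_page *page)
{
	int old = atomic_load(&page->refcount);

	while (old != 0) {
		/* On failure, old is refreshed with the current count; retry. */
		if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void model_put(struct model_page *page)
{
	atomic_fetch_sub(&page->refcount, 1);
}

/*
 * Lockless lookup: snapshot the pte, try to pin the page it points to,
 * then re-read the pte. If it changed in between, drop the reference
 * and fail. race_hook is a hypothetical, test-only callback that runs
 * inside the race window.
 */
static struct model_page *model_gup_one(model_pte_t *ptep,
					void (*race_hook)(model_pte_t *))
{
	uintptr_t pte = atomic_load(ptep);
	struct model_page *page = (struct model_page *)pte;

	if (page == NULL)
		return NULL;
	if (!model_try_get(page))
		return NULL;

	if (race_hook)
		race_hook(ptep);

	if (atomic_load(ptep) != pte) {
		model_put(page);
		return NULL;
	}
	return page;
}

/* Simulates a concurrent unmap clearing the pte inside the race window. */
static void model_clear_pte(model_pte_t *ptep)
{
	atomic_store(ptep, (uintptr_t)0);
}
```

Calling model_gup_one(&pte, model_clear_pte) returns NULL and leaves the
refcount where it started, which is the "never succeed in getting a stale
entry" behavior described above. Note that this models mechanism (1) only.
It assumes the memory behind ptep is still a live page table, and keeping
that true is the job of mechanism (2).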

This one has a problem, because Documentation/memory-barriers.txt points out:

INTERRUPT DISABLING FUNCTIONS
-----------------------------

Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
(RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
barriers are required in such a situation, they must be provided from some
other means.


...and so I'm suggesting that we need something approximately like this:

diff --git a/mm/gup.c b/mm/gup.c
index 23a9f9c9d377..1678d50a2d8b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2415,7 +2415,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
        if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
            gup_fast_permitted(start, end)) {
                local_irq_disable();
+               smp_mb();
                gup_pgd_range(addr, end, gup_flags, pages, &nr);
+               smp_mb();
                local_irq_enable();
                ret = nr;
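
For anyone who wants to poke at the compiler-barrier versus memory-barrier
distinction outside the kernel, a rough userspace analogue of the patch's
shape is sketched here with C11 fences. The assumptions are loud ones:
userspace has no local_irq_disable(), so model_irq_disable() and
model_irq_enable() are hypothetical stand-ins that provide only the
compiler-barrier effect the documentation describes, and model_smp_mb() uses
atomic_thread_fence(memory_order_seq_cst) as an approximation of smp_mb().
model_entry and model_walk() are likewise made up for illustration.

```c
#include <stdatomic.h>

/*
 * Hypothetical stand-ins. atomic_signal_fence() constrains compiler
 * reordering only; no CPU fence instruction is emitted for it.
 */
static inline void model_irq_disable(void)
{
	atomic_signal_fence(memory_order_seq_cst);
}

static inline void model_irq_enable(void)
{
	atomic_signal_fence(memory_order_seq_cst);
}

/* Full fence: orders both the compiler and the CPU. */
static inline void model_smp_mb(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/* Hypothetical "page table" slot read by the walker. */
static atomic_int model_entry = 42;

/* Same bracketing as the proposed change to get_user_pages_fast(). */
static int model_walk(void)
{
	int val;

	model_irq_disable();
	model_smp_mb();		/* keep the load from moving above this point */
	val = atomic_load_explicit(&model_entry, memory_order_relaxed);
	model_smp_mb();		/* keep the load from moving below this point */
	model_irq_enable();

	return val;
}
```

Without the two model_smp_mb() calls, nothing in this sketch would stop the
CPU from performing the relaxed load outside the bracketed region, which is
the concern with (2).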


thanks,
-- 
John Hubbard
NVIDIA