From: Jan Kara
Date: Wed, 2 Oct 2019 11:24:47 +0200
To: John Hubbard
Cc: Jan Kara, Michal Hocko, "Kirill A. Shutemov", Yu Zhao, Peter Zijlstra,
	Andrew Morton, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Vlastimil Babka,
	Hugh Dickins, Jérôme Glisse, Andrea Arcangeli, "Aneesh Kumar K. V",
	David Rientjes, Matthew Wilcox, Lance Roy, Ralph Campbell,
	Jason Gunthorpe, Dave Airlie, Thomas Hellstrom, Souptick Joarder,
	Mel Gorman, Mike Kravetz, Huang Ying, Aaron Lu, Omar Sandoval,
	Thomas Gleixner, Vineeth Remanan Pillai, Daniel Jordan,
	Mike Rapoport, Joel Fernandes, Mark Rutland, Alexander Duyck,
	Pavel Tatashin, David Hildenbrand, Juergen Gross, Anthony Yznaga,
	Johannes Weiner, "Darrick J. Wong", linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, "Paul E. McKenney"
McKenney" Subject: Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely Message-ID: <20191002092447.GC9320@quack2.suse.cz> References: <20190925082530.GD4536@hirez.programming.kicks-ass.net> <20190925222654.GA180125@google.com> <20190926102036.od2wamdx2s7uznvq@box> <9465df76-0229-1b44-5646-5cced1bc1718@nvidia.com> <20190927123056.GE26848@dhcp22.suse.cz> <20190930092003.GA22118@quack2.suse.cz> <6bba357a-1706-7cdb-8a11-359157a21ae8@nvidia.com> <20191001071008.GA25062@quack2.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.10.1 (2018-07-13) X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue 01-10-19 11:43:30, John Hubbard wrote: > On 10/1/19 12:10 AM, Jan Kara wrote: > > On Mon 30-09-19 10:57:08, John Hubbard wrote: > >> On 9/30/19 2:20 AM, Jan Kara wrote: > >>> On Fri 27-09-19 12:31:41, John Hubbard wrote: > >>>> On 9/27/19 5:33 AM, Michal Hocko wrote: > >>>>> On Thu 26-09-19 20:26:46, John Hubbard wrote: > >>>>>> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote: > >> ... > >> 2. Your code above is after the "pmd = READ_ONCE(*pmdp)" line, so by then, > >> it's already found the pte based on reading a stale pmd. So checking the > >> pte seems like it's checking the wrong thing--it's too late, for this case, > >> right? > > > > Well, if PMD is getting freed, all PTEs in it should be cleared by that > > time, shouldn't they? So although we read from stale PMD, either we already > > see cleared PTE or the check pte_val(pte) != pte_val(*ptep) will fail and > > so we never actually succeed in getting stale PTE entry (at least unless > > the page table page that used to be PMD can get freed and reused > > - which is not the case in the example you've shown above). > > Right, that's not what the example shows, but there is nothing here to prevent > the page table pages from being freed and re-used. > > > So I still don't see a problem. That being said I don't feel being expert > > in this particular area. I just always thought GUP prevents races like this > > by the scheme I describe so I'd like to see what I'm missing :). > > I'm very much still "in training" here, so I hope I'm not wasting everyone's > time. But I feel confident in stating at least this much: > > There are two distinct lockless synchronization mechanisms here, each > protecting against a different issue, and it's important not to conflate > them and think that one protects against the other. I still see a hole in > (2) below. The mechanisms are: > > 1) Protection against a page (not the page table itself) getting freed > while get_user_pages*() is trying to take a reference to it. This is avoided > by the try-get and the re-checking of pte values that you mention above. > It's an elegant little thing, too. :) > > 2) Protection against page tables getting freed while a get_user_pages_fast() > call is in progress. This relies on disabling interrupts in gup_fast(), while > firing interrupts in the freeing path (via tlb flushing, IPIs). And on memory > barriers (doh--missing!) to avoid leaking memory references outside of the > irq disabling. 
OK, so you are concerned that the page table walk of pgd->p4d->pud->pmd will
get prefetched by the CPU before interrupts are disabled, that somewhere in
the middle of the walk an IPI flushing TLBs on the CPU will be served,
allowing munmap() on another CPU to proceed and free page tables, and that
the rest of the walk will then happen on freed and possibly reused pages. Do
I understand you right?

Realistically, I don't think this can happen, as I'd expect the CPU to throw
away the speculation state on interrupt. But that's just my expectation, and
I agree that I don't see anything in Documentation/memory-barriers.txt that
would prevent what you are concerned about. Let's ask Paul :)

Paul, we are discussing here a possible race between
mm/gup.c:__get_user_pages_fast() and mm/mmap.c:unmap_region(). The first has
code like:

	local_irq_save(flags);
	load pgd from current->mm
	load p4d from pgd
	load pud from p4d
	load pmd from pud
	...
	local_irq_restore(flags);

while the second has code like:

	unmap_region()
	  walk pgd
	  walk p4d
	  walk pud
	  walk pmd
	  clear ptes
	  flush tlb
	  free page tables

Now the serialization between these two relies on the fact that flushing
TLBs from unmap_region() requires an IPI to be served on each CPU, so in the
naive understanding unmap_region() shouldn't be able to get to the 'free
page tables' part until __get_user_pages_fast() enables interrupts again.
But as John points out, we don't see anything in
Documentation/memory-barriers.txt that would actually guarantee this. So do
we really need something like smp_rmb() after disabling interrupts in
__get_user_pages_fast(), or is the race John is concerned about impossible?
Thanks!

> This one has a problem, because Documentation/memory-barriers.txt points
> out:
>
> INTERRUPT DISABLING FUNCTIONS
> -----------------------------
>
> Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
> (RELEASE equivalent) will act as compiler barriers only. So if memory or I/O
> barriers are required in such a situation, they must be provided from some
> other means.
>
>
> ...and so I'm suggesting that we need something approximately like this:
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 23a9f9c9d377..1678d50a2d8b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2415,7 +2415,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
>  	    gup_fast_permitted(start, end)) {
>  		local_irq_disable();
> +		smp_mb();
>  		gup_pgd_range(addr, end, gup_flags, pages, &nr);
> +		smp_mb();
>  		local_irq_enable();
>  		ret = nr;

								Honza
-- 
Jan Kara
SUSE Labs, CR
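For concreteness, a hedged sketch of what John's proposed barriers would make
the fast-GUP critical section look like - simplified from the
get_user_pages_fast() of this era, with the slow-path fallback and error
handling omitted and an illustrative function name; the smp_mb() placement
mirrors the diff above and is a proposal under discussion, not committed
code:

	int get_user_pages_fast_sketch(unsigned long start, int nr_pages,
				       unsigned int gup_flags,
				       struct page **pages)
	{
		unsigned long addr = start & PAGE_MASK;
		unsigned long end = addr + (unsigned long)nr_pages * PAGE_SIZE;
		unsigned long flags;
		int nr = 0;

		local_irq_save(flags);
		/*
		 * Proposed: full barrier so that no load of the
		 * pgd->p4d->pud->pmd walk can be hoisted above the point
		 * where this CPU stops servicing the TLB-flush IPIs that
		 * unmap_region() waits for...
		 */
		smp_mb();
		gup_pgd_range(addr, end, gup_flags, pages, &nr);
		/* ...and no load can sink below re-enabling interrupts. */
		smp_mb();
		local_irq_restore(flags);

		return nr;
	}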