Subject: Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely
From: John Hubbard
To: Jan Kara
CC: Michal Hocko, Kirill A. Shutemov, Yu Zhao, Peter Zijlstra, Andrew Morton,
 Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
 Namhyung Kim, Vlastimil Babka, Hugh Dickins, Jérôme Glisse, Andrea Arcangeli,
 Aneesh Kumar K. V, David Rientjes, Matthew Wilcox, Lance Roy, Ralph Campbell,
 Jason Gunthorpe, Dave Airlie, Thomas Hellstrom, Souptick Joarder, Mel Gorman,
 Mike Kravetz, Huang Ying, Aaron Lu, Omar Sandoval, Thomas Gleixner,
 Vineeth Remanan Pillai, Daniel Jordan, Mike Rapoport, Joel Fernandes,
 Mark Rutland, Alexander Duyck, Pavel Tatashin, David Hildenbrand,
 Juergen Gross, Anthony Yznaga, Johannes Weiner, Darrick J. Wong,
 Paul E. McKenney
Date: Wed, 2 Oct 2019 10:33:03 -0700
In-Reply-To: <20191002092447.GC9320@quack2.suse.cz>

On 10/2/19 2:24 AM, Jan Kara wrote:
>
> OK, so you are concerned that the page table walk of pgd->p4d->pud->pmd
> will get prefetched by the CPU before interrupts are disabled, somewhere in
> the middle of the walk an IPI flushing TLBs on the CPU will happen, allowing
> munmap() on another CPU to proceed and free page tables, so the rest of the
> walk will happen on freed and possibly reused pages. Do I understand you
> right?
>

Yes, that's it.

thanks,
--
John Hubbard
NVIDIA

> Realistically, I don't think this can happen as I'd expect the CPU to throw
> away the speculation state on interrupt. But that's just my expectation and
> I agree that I don't see anything in Documentation/memory-barriers.txt that
> would prevent what you are concerned about. Let's ask Paul :)
>
> Paul, we are discussing here a possible race between
> mm/gup.c:__get_user_pages_fast() and mm/mmap.c:unmap_region(). The first
> has code like:
>
>	local_irq_save(flags);
>	load pgd from current->mm
>	load p4d from pgd
>	load pud from p4d
>	load pmd from pud
>	...
>	local_irq_restore(flags);
>
> while the second has code like:
>
>	unmap_region()
>	  walk pgd
>	  walk p4d
>	  walk pud
>	  walk pmd
>	    clear ptes
>	    flush tlb
>	  free page tables
>
> Now the serialization between these two relies on the fact that flushing
> TLBs from unmap_region() requires an IPI to be served on each CPU, so in a
> naive understanding unmap_region() shouldn't be able to get to the 'free
> unused page tables' part until __get_user_pages_fast() enables interrupts
> again. Now, as John points out, we don't see anything in
> Documentation/memory-barriers.txt that would actually guarantee this. So do
> we really need something like smp_rmb() after disabling interrupts in
> __get_user_pages_fast(), or is the race John is concerned about impossible?
> Thanks!
>
>> This one has a problem, because Documentation/memory-barriers.txt points out:
>>
>> INTERRUPT DISABLING FUNCTIONS
>> -----------------------------
>>
>> Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
>> (RELEASE equivalent) will act as compiler barriers only. So if memory or I/O
>> barriers are required in such a situation, they must be provided from some
>> other means.
>>
>>
>> ...and so I'm suggesting that we need something approximately like this:
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 23a9f9c9d377..1678d50a2d8b 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2415,7 +2415,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>>  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
>>  	    gup_fast_permitted(start, end)) {
>>  		local_irq_disable();
>> +		smp_mb();
>>  		gup_pgd_range(addr, end, gup_flags, pages, &nr);
>> +		smp_mb();
>>  		local_irq_enable();
>>  		ret = nr;
>
>								Honza
>