Date: Tue, 1 Oct 2019 09:10:08 +0200
From: Jan Kara
To: John Hubbard
Cc: Jan Kara, Michal Hocko, Kirill A. Shutemov, Yu Zhao, Peter Zijlstra,
 Andrew Morton, Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
 Jiri Olsa, Namhyung Kim, Vlastimil Babka, Hugh Dickins, Jérôme Glisse,
 Andrea Arcangeli, Aneesh Kumar K.V, David Rientjes, Matthew Wilcox,
 Lance Roy, Ralph Campbell, Jason Gunthorpe, Dave Airlie, Thomas Hellstrom,
 Souptick Joarder, Mel Gorman, Mike Kravetz, Huang Ying, Aaron Lu,
 Omar Sandoval, Thomas Gleixner, Vineeth Remanan Pillai, Daniel Jordan,
 Mike Rapoport, Joel Fernandes, Mark Rutland, Alexander Duyck,
 Pavel Tatashin, David Hildenbrand, Juergen Gross, Anthony Yznaga,
 Johannes Weiner, Darrick J. Wong,
Wong" , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely Message-ID: <20191001071008.GA25062@quack2.suse.cz> References: <20190924232459.214097-1-yuzhao@google.com> <20190924232459.214097-3-yuzhao@google.com> <20190925082530.GD4536@hirez.programming.kicks-ass.net> <20190925222654.GA180125@google.com> <20190926102036.od2wamdx2s7uznvq@box> <9465df76-0229-1b44-5646-5cced1bc1718@nvidia.com> <20190927123056.GE26848@dhcp22.suse.cz> <20190930092003.GA22118@quack2.suse.cz> <6bba357a-1706-7cdb-8a11-359157a21ae8@nvidia.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <6bba357a-1706-7cdb-8a11-359157a21ae8@nvidia.com> User-Agent: Mutt/1.10.1 (2018-07-13) X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Mon 30-09-19 10:57:08, John Hubbard wrote: > On 9/30/19 2:20 AM, Jan Kara wrote: > > On Fri 27-09-19 12:31:41, John Hubbard wrote: > > > On 9/27/19 5:33 AM, Michal Hocko wrote: > > > > On Thu 26-09-19 20:26:46, John Hubbard wrote: > > > > > On 9/26/19 3:20 AM, Kirill A. Shutemov wrote: > ... > > > CPU 0 CPU 1 > > > -------- --------------- > > > get_user_pages_fast() > > > > > > do_unmap() > > > unmap_region() > > > free_pgtables() > > > /* > > > * speculative reads, re-ordered > > > * by the CPU at run time, not > > > * compile time. The function calls > > > * are not really in this order, but > > > * the corresponding reads could be. > > > */ > > > gup_pgd_range() > > > gup_p4d_range() > > > gup_pud_range() > > > gup_pmd_range() > > > pmd = READ_ONCE(*pmdp) > > > /* This is stale */ > > > > > > tlb_finish_mmu() > > > tlb_flush_mmu() > > > tlb_flush_mmu_tlbonly() > > > tlb_flush() > > > flush_tlb_mm > > > flush_tlb_mm_range > > > flush_tlb_others > > > native_flush_tlb_others > > > smp_call_function_many: IPIs > > > ...blocks until CPU1 reenables > > > interrupts > > > > > > local_irq_disable() > > > ...continue the page walk based > > > on stale/freed data... > > > > Yes, but then we have: > > > > head = try_get_compound_head(page, 1); > > > > which has an atomic operation providing barrier in it and then we have: > > > > if (unlikely(pte_val(pte) != pte_val(*ptep))) { > > put_page(head); > > goto pte_unmap; > > } > > > > So we reload PTE again and check the value didn't change. Which should > > prevent the race you explain above. Or do I miss anything? > > > Well, a couple of questions: > > 1. Is there *really* a memory barrier in try_get_compound_head()? Because > I only see a READ_ONCE compile barrier, which would mean that run time > reordering is still possible. try_get_compound_head() has page_cache_add_speculative() which is atomic_add_unless() which is guaranteed to provide ordering. > 2. Your code above is after the "pmd = READ_ONCE(*pmdp)" line, so by then, > it's already found the pte based on reading a stale pmd. So checking the > pte seems like it's checking the wrong thing--it's too late, for this case, > right? Well, if PMD is getting freed, all PTEs in it should be cleared by that time, shouldn't they? So although we read from stale PMD, either we already see cleared PTE or the check pte_val(pte) != pte_val(*ptep) will fail and so we never actually succeed in getting stale PTE entry (at least unless the page table page that used to be PMD can get freed and reused - which is not the case in the example you've shown above). 
So I still don't see a problem. That being said, I don't claim to be an
expert in this particular area. I just always thought GUP prevents races
like this by the scheme I describe, so I'd like to see what I'm missing :).

								Honza
-- 
Jan Kara
SUSE Labs, CR