From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
From: John Hubbard
To: Vasily Gorbik, Jason Gunthorpe
Cc: Linus Torvalds, Gerald Schaefer, Alexander Gordeev, Peter Zijlstra,
 Dave Hansen, LKML, linux-mm, linux-arch, Andrew Morton, Russell King,
 Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
 Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger,
 Dave Hansen, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
 Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
 linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens,
 Christian Borntraeger, Claudio Imbrenda
Date: Tue, 15 Sep 2020 10:31:28 -0700
Message-ID: <9daa9203-d164-ec78-8a8d-30b8b22cb1da@nvidia.com>
References: <20200911200511.GC1221970@ziepe.ca>

On 9/11/20 1:36 PM, Vasily Gorbik wrote:
> Currently, to make sure that every page table entry is read just once,
> the gup_fast walks perform READ_ONCE and pass the pXd value down to the
> next gup_pXd_range function by value, e.g.:
>
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                          unsigned int flags, struct page **pages, int *nr)
> ...
>         pudp = pud_offset(&p4d, addr);
>
> This function passes a reference to that local value copy to pXd_offset,
> and might get the very same pointer in return. This happens when the
> level is folded (on most arches), and that pointer should not be iterated.
>
> On s390, because each task might have 5-, 4- or 3-level address
> translation, and hence different levels folded, the logic is more
> complex, and a non-iterable pointer to a local copy leads to severe
> problems.
>
> Here is an example of what happens with gup_fast on s390, for a task
> with 3-level paging, crossing a 2 GB pud boundary:
>
> // addr = 0x1007ffff000, end = 0x10080001000
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                          unsigned int flags, struct page **pages, int *nr)
> {
>         unsigned long next;
>         pud_t *pudp;
>
>         // pud_offset returns &p4d itself (a pointer to a value on stack)
>         pudp = pud_offset(&p4d, addr);
>         do {
>                 // on second iteration reading "random" stack value
>                 pud_t pud = READ_ONCE(*pudp);
>
>                 // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
>                 next = pud_addr_end(addr, end);
>                 ...
>         } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
>
>         return 1;
> }
>
> This happens since s390 moved to the common gup code with
> commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
> and commit 1a42010cdc26 ("s390/mm: convert to the generic
> get_user_pages_fast code"). s390 tried to mimic static level folding by
> changing the pXd_offset primitives to always calculate the top-level page
> table offset in pgd_offset, and to just return the value passed in when
> pXd_offset has to act as folded.
>
> What is crucial for gup_fast, and what has been overlooked, is that
> PxD_SIZE/MASK, and thus pXd_addr_end, should also change correspondingly.
> And the latter is not possible with dynamic folding.
>
> To fix the issue, in addition to the pXd values, pass the original
> pXdp pointers down to the gup_pXd_range functions, and introduce
> pXd_offset_lockless helpers, which take an additional pXd
> entry value parameter. This has already been discussed in
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
>
> Cc: # 5.2+
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Reviewed-by: Gerald Schaefer
> Reviewed-by: Alexander Gordeev
> Signed-off-by: Vasily Gorbik
> ---

Looks cleaner than I'd dared hope for. :)
Reviewed-by: John Hubbard

thanks,
--
John Hubbard
NVIDIA

> v2: added brackets &pgd -> &(pgd)
>
>  arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
>  include/linux/pgtable.h         | 10 ++++++++
>  mm/gup.c                        | 18 +++++++-------
>  3 files changed, 49 insertions(+), 21 deletions(-)
>
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 7eb01a5459cd..b55561cc8786 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
>
>  #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
>
> -static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
> +static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address)
>  {
> -        if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> -                return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
> -        return (p4d_t *) pgd;
> +        if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> +                return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
> +        return (p4d_t *) pgdp;
>  }
> +#define p4d_offset_lockless p4d_offset_lockless
>
> -static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> +static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
>  {
> -        if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> -                return (pud_t *) p4d_deref(*p4d) + pud_index(address);
> -        return (pud_t *) p4d;
> +        return p4d_offset_lockless(pgdp, *pgdp, address);
> +}
> +
> +static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address)
> +{
> +        if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> +                return (pud_t *) p4d_deref(p4d) + pud_index(address);
> +        return (pud_t *) p4dp;
> +}
> +#define pud_offset_lockless pud_offset_lockless
> +
> +static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address)
> +{
> +        return pud_offset_lockless(p4dp, *p4dp, address);
>  }
>  #define pud_offset pud_offset
>
> -static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
> +static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address)
> +{
> +        if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> +                return (pmd_t *) pud_deref(pud) + pmd_index(address);
> +        return (pmd_t *) pudp;
> +}
> +#define pmd_offset_lockless pmd_offset_lockless
> +
> +static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address)
>  {
> -        if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> -                return (pmd_t *) pud_deref(*pud) + pmd_index(address);
> -        return (pmd_t *) pud;
> +        return pmd_offset_lockless(pudp, *pudp, address);
>  }
>  #define pmd_offset pmd_offset
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..90654cb63e9e 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
>  #define mm_pmd_folded(mm)        __is_defined(__PAGETABLE_PMD_FOLDED)
>  #endif
>
> +#ifndef p4d_offset_lockless
> +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
> +#endif
> +#ifndef pud_offset_lockless
> +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
> +#endif
> +#ifndef pmd_offset_lockless
> +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address)
> +#endif
> +
>  /*
>   * p?d_leaf() - true if this entry is a final mapping to a physical address.
>   * This differs from p?d_huge() by the fact that they are always available (if
> diff --git a/mm/gup.c b/mm/gup.c
> index e5739a1974d5..578bf5bd8bf8 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>          return 1;
>  }
>
> -static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> +static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end,
>                           unsigned int flags, struct page **pages, int *nr)
>  {
>          unsigned long next;
>          pmd_t *pmdp;
>
> -        pmdp = pmd_offset(&pud, addr);
> +        pmdp = pmd_offset_lockless(pudp, pud, addr);
>          do {
>                  pmd_t pmd = READ_ONCE(*pmdp);
>
> @@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>          return 1;
>  }
>
> -static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
> +static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end,
>                           unsigned int flags, struct page **pages, int *nr)
>  {
>          unsigned long next;
>          pud_t *pudp;
>
> -        pudp = pud_offset(&p4d, addr);
> +        pudp = pud_offset_lockless(p4dp, p4d, addr);
>          do {
>                  pud_t pud = READ_ONCE(*pudp);
>
> @@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                          if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
>                                           PUD_SHIFT, next, flags, pages, nr))
>                                  return 0;
> -                } else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
> +                } else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr))
>                          return 0;
>          } while (pudp++, addr = next, addr != end);
>
>          return 1;
>  }
>
> -static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
> +static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end,
>                           unsigned int flags, struct page **pages, int *nr)
>  {
>          unsigned long next;
>          p4d_t *p4dp;
>
> -        p4dp = p4d_offset(&pgd, addr);
> +        p4dp = p4d_offset_lockless(pgdp, pgd, addr);
>          do {
>                  p4d_t p4d = READ_ONCE(*p4dp);
>
> @@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>                          if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
>                                           P4D_SHIFT, next, flags, pages, nr))
>                                  return 0;
> -                } else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
> +                } else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
>                          return 0;
>          } while (p4dp++, addr = next, addr != end);
>
> @@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>                          if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
>                                           PGDIR_SHIFT, next, flags, pages, nr))
>                                  return;
> -                } else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
> +                } else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr))
>                          return;
>          } while (pgdp++, addr = next, addr != end);
>  }
>
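
For reference, the failure mode the commit message describes can be reproduced
outside the kernel. The following is a minimal user-space sketch, not kernel
code: entry_t, folded_offset() and walk() are hypothetical stand-ins for pXd_t,
a folded pXd_offset() and a gup_pXd_range()-style walker. It only illustrates
that an offset helper which is handed the address of a local copy returns a
pointer into the caller's stack frame, so incrementing that pointer can never
reach the next real table entry.

#include <stdio.h>

/* hypothetical stand-in for a page table entry type (pXd_t) */
typedef unsigned long entry_t;

/* a tiny "top-level table" with two entries, like a pgd with two slots */
static entry_t table[2] = { 0x111, 0x222 };

/*
 * Stand-in for a folded pXd_offset(): there is no lower-level table,
 * so it simply hands back the pointer it was given.
 */
static entry_t *folded_offset(entry_t *entry)
{
        return entry;
}

/* Stand-in for a gup_pXd_range()-style walker; entry arrives by value. */
static void walk(entry_t entry)
{
        entry_t *p = folded_offset(&entry);   /* points at the stack copy */

        printf("real entry 0 at %p, walker's pointer at %p\n",
               (void *)&table[0], (void *)p);
        /*
         * p + 1 points just past the local copy on the stack, not at
         * table[1]; a "p++" loop here walks the stack, not the table.
         */
        printf("p + 1 = %p, &table[1] = %p (not the same)\n",
               (void *)(p + 1), (void *)&table[1]);
}

int main(void)
{
        walk(table[0]);
        return 0;
}

On architectures with static folding this never matters, because the folded
level's PXD_SIZE equals PGDIR_SIZE and the loop body runs exactly once. The
pXd_offset_lockless() helpers in the patch address the dynamic-folding case by
receiving both the real pointer (pXdp) and the already-read value (pXd), so the
folded path can return the real, iterable pointer while the non-folded path
still works from the stable READ_ONCE value.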