From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.9 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,NICE_REPLY_A, SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id AF611C43461 for ; Tue, 8 Sep 2020 05:08:31 +0000 (UTC) Received: from lists.ozlabs.org (lists.ozlabs.org [203.11.71.2]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 95A9921532 for ; Tue, 8 Sep 2020 05:08:30 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 95A9921532 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=csgroup.eu Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=linuxppc-dev-bounces+linuxppc-dev=archiver.kernel.org@lists.ozlabs.org Received: from bilbo.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) by lists.ozlabs.org (Postfix) with ESMTP id 4BltTS3v6szDqQk for ; Tue, 8 Sep 2020 15:08:28 +1000 (AEST) Authentication-Results: lists.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=csgroup.eu (client-ip=93.17.236.30; helo=pegase1.c-s.fr; envelope-from=christophe.leroy@csgroup.eu; receiver=) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=csgroup.eu Received: from pegase1.c-s.fr (pegase1.c-s.fr [93.17.236.30]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 4BltRT5MjwzDqHw for ; Tue, 8 Sep 2020 15:06:43 +1000 (AEST) Received: from localhost (mailhub1-int [192.168.12.234]) by localhost (Postfix) with ESMTP id 4BltRK1g9Yz9tyWb; Tue, 8 Sep 2020 07:06:37 +0200 (CEST) X-Virus-Scanned: Debian amavisd-new at c-s.fr Received: from pegase1.c-s.fr ([192.168.12.234]) by localhost (pegase1.c-s.fr [192.168.12.234]) (amavisd-new, port 10024) with ESMTP id NDyhNxhMFPxL; Tue, 8 Sep 2020 07:06:37 +0200 (CEST) Received: from messagerie.si.c-s.fr (messagerie.si.c-s.fr [192.168.25.192]) by pegase1.c-s.fr (Postfix) with ESMTP id 4BltRG5mjlz9tyWZ; Tue, 8 Sep 2020 07:06:34 +0200 (CEST) Received: from localhost (localhost [127.0.0.1]) by messagerie.si.c-s.fr (Postfix) with ESMTP id 7661B8B78B; Tue, 8 Sep 2020 07:06:35 +0200 (CEST) X-Virus-Scanned: amavisd-new at c-s.fr Received: from messagerie.si.c-s.fr ([127.0.0.1]) by localhost (messagerie.si.c-s.fr [127.0.0.1]) (amavisd-new, port 10023) with ESMTP id HWiz5tBjyEqc; Tue, 8 Sep 2020 07:06:35 +0200 (CEST) Received: from [192.168.4.90] (unknown [192.168.4.90]) by messagerie.si.c-s.fr (Postfix) with ESMTP id 3409E8B768; Tue, 8 Sep 2020 07:06:31 +0200 (CEST) Subject: Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding To: Gerald Schaefer , Jason Gunthorpe , John Hubbard References: <20200907180058.64880-1-gerald.schaefer@linux.ibm.com> <20200907180058.64880-2-gerald.schaefer@linux.ibm.com> From: Christophe Leroy Message-ID: <82fbe8f9-f199-5fc2-4168-eb43ad0b0346@csgroup.eu> Date: Tue, 8 Sep 2020 07:06:23 +0200 User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.12.0 MIME-Version: 1.0 In-Reply-To: <20200907180058.64880-2-gerald.schaefer@linux.ibm.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: fr Content-Transfer-Encoding: 8bit X-BeenThere: linuxppc-dev@lists.ozlabs.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Peter Zijlstra , Catalin Marinas , Dave Hansen , linux-mm , Paul Mackerras , linux-sparc , Alexander Gordeev , Claudio Imbrenda , Will Deacon , linux-arch , linux-s390 , Vasily Gorbik , Richard Weinberger , linux-x86 , Russell King , Christian Borntraeger , Ingo Molnar , Andrey Ryabinin , Jeff Dike , Arnd Bergmann , Heiko Carstens , linux-um , Borislav Petkov , Andy Lutomirski , Thomas Gleixner , linux-arm , Linus Torvalds , LKML , Andrew Morton , linux-power , Mike Rapoport Errors-To: linuxppc-dev-bounces+linuxppc-dev=archiver.kernel.org@lists.ozlabs.org Sender: "Linuxppc-dev" Le 07/09/2020 à 20:00, Gerald Schaefer a écrit : > From: Alexander Gordeev > > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast > code") introduced a subtle but severe bug on s390 with gup_fast, due to > dynamic page table folding. > > The question "What would it require for the generic code to work for s390" > has already been discussed here > https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 > and ended with a promising approach here > https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1 > which in the end unfortunately didn't quite work completely. > > We tried to mimic static level folding by changing pgd_offset to always > calculate top level page table offset, and do nothing in folded pXd_offset. > What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do > not reflect this dynamic behaviour, and still act like static 5-level > page tables. > [...] > > Fix this by introducing new pXd_addr_end_folded helpers, which take an > additional pXd entry value parameter, that can be used on s390 > to determine the correct page table level and return corresponding > end / boundary. With that, the pointer iteration will always > happen in gup_pgd_range for s390. No change for other architectures > introduced. Not sure pXd_addr_end_folded() is the best understandable name, allthough I don't have any alternative suggestion at the moment. Maybe could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ? Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded() > > Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") > Cc: # 5.2+ > Reviewed-by: Gerald Schaefer > Signed-off-by: Alexander Gordeev > Signed-off-by: Gerald Schaefer > --- > arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++ > include/linux/pgtable.h | 16 +++++++++++++ > mm/gup.c | 8 +++---- > 3 files changed, 62 insertions(+), 4 deletions(-) > > diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h > index 7eb01a5459cd..027206e4959d 100644 > --- a/arch/s390/include/asm/pgtable.h > +++ b/arch/s390/include/asm/pgtable.h > @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm) > } > #define mm_pmd_folded(mm) mm_pmd_folded(mm) > > +/* > + * With dynamic page table levels on s390, the static pXd_addr_end() functions > + * will not return corresponding dynamic boundaries. This is no problem as long > + * as only pXd pointers are passed down during page table walk, because > + * pXd_offset() will simply return the given pointer for folded levels, and the > + * pointer iteration over a range simply happens at the correct page table > + * level. > + * It is however a problem with gup_fast, or other places walking the page > + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead > + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to > + * a stack variable, which cannot be used for pointer iteration at the correct > + * level. Instead, the iteration then has to happen by going up to pgd level > + * again. To allow this, provide pXd_addr_end_folded() functions with an > + * additional pXd value parameter, which can be used on s390 to determine the > + * folding level and return the corresponding boundary. > + */ > +static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end) What does 'rste' stands for ? Isn't this line a bit long ? > +{ > + unsigned long type = (rste & _REGION_ENTRY_TYPE_MASK) >> 2; > + unsigned long size = 1UL << (_SEGMENT_SHIFT + type * 11); > + unsigned long boundary = (addr + size) & ~(size - 1); > + > + /* > + * FIXME The below check is for internal testing only, to be removed > + */ > + VM_BUG_ON(type < (_REGION_ENTRY_TYPE_R3 >> 2)); > + > + return (boundary - 1) < (end - 1) ? boundary : end; > +} > + > +#define pgd_addr_end_folded pgd_addr_end_folded > +static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end) > +{ > + return rste_addr_end_folded(pgd_val(pgd), addr, end); > +} > + > +#define p4d_addr_end_folded p4d_addr_end_folded > +static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end) > +{ > + return rste_addr_end_folded(p4d_val(p4d), addr, end); > +} > + > static inline int mm_has_pgste(struct mm_struct *mm) > { > #ifdef CONFIG_PGSTE > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index e8cbc2e795d5..981c4c2a31fe 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -681,6 +681,22 @@ static inline int arch_unmap_one(struct mm_struct *mm, > }) > #endif > > +#ifndef pgd_addr_end_folded > +#define pgd_addr_end_folded(pgd, addr, end) pgd_addr_end(addr, end) > +#endif > + > +#ifndef p4d_addr_end_folded > +#define p4d_addr_end_folded(p4d, addr, end) p4d_addr_end(addr, end) > +#endif > + > +#ifndef pud_addr_end_folded > +#define pud_addr_end_folded(pud, addr, end) pud_addr_end(addr, end) > +#endif > + > +#ifndef pmd_addr_end_folded > +#define pmd_addr_end_folded(pmd, addr, end) pmd_addr_end(addr, end) > +#endif > + > /* > * When walking page tables, we usually want to skip any p?d_none entries; > * and any p?d_bad entries - reporting the error before resetting to none. > diff --git a/mm/gup.c b/mm/gup.c > index bd883a112724..ba4aace5d0f4 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, > do { > pmd_t pmd = READ_ONCE(*pmdp); > > - next = pmd_addr_end(addr, end); > + next = pmd_addr_end_folded(pmd, addr, end); > if (!pmd_present(pmd)) > return 0; > > @@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, > do { > pud_t pud = READ_ONCE(*pudp); > > - next = pud_addr_end(addr, end); > + next = pud_addr_end_folded(pud, addr, end); > if (unlikely(!pud_present(pud))) > return 0; > if (unlikely(pud_huge(pud))) { > @@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, > do { > p4d_t p4d = READ_ONCE(*p4dp); > > - next = p4d_addr_end(addr, end); > + next = p4d_addr_end_folded(p4d, addr, end); > if (p4d_none(p4d)) > return 0; > BUILD_BUG_ON(p4d_huge(p4d)); > @@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end, > do { > pgd_t pgd = READ_ONCE(*pgdp); > > - next = pgd_addr_end(addr, end); > + next = pgd_addr_end_folded(pgd, addr, end); > if (pgd_none(pgd)) > return; > if (unlikely(pgd_huge(pgd))) { > Christophe