* [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding @ 2020-09-07 18:00 Gerald Schaefer 2020-09-07 18:00 ` [RFC PATCH v2 1/3] " Gerald Schaefer ` (4 more replies) 0 siblings, 5 replies; 62+ messages in thread From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw) To: Jason Gunthorpe, John Hubbard Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda This is v2 of an RFC previously discussed here: https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/ Patch 1 is a fix for a regression in gup_fast on s390, after our conversion to common gup_fast code. It will introduce special helper functions pXd_addr_end_folded(), which have to be used in places where pagetable walk is done w/o lock and with READ_ONCE, so currently only in gup_fast. Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end() themselves by adding an extra pXd value parameter. That was suggested by Jason during v1 discussion, because he is already thinking of some other places where he might want to switch to the READ_ONCE logic for pagetable walks. In general, that would be the cleanest / safest solution, but there is some impact on other architectures and common code, hence the new and greatly enlarged recipient list. Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline functions instead of #defines, so that we get some type checking for the new pXd value parameter. Not sure about Fixes/stable tags for the generic solution. 
Only patch 1 fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might still be nice to have in stable, to ease future backports, but I guess "nice to have" does not really qualify for stable backports. Changes in v2: - Pick option 2 from v1 discussion (pXd_addr_end_folded helpers) - Add patch 2 + 3 for more generic approach Alexander Gordeev (3): mm/gup: fix gup_fast with dynamic page table folding mm: make pXd_addr_end() functions page-table entry aware mm: make generic pXd_addr_end() macros inline functions arch/arm/include/asm/pgtable-2level.h | 2 +- arch/arm/mm/idmap.c | 6 ++-- arch/arm/mm/mmu.c | 8 ++--- arch/arm64/kernel/hibernate.c | 16 +++++---- arch/arm64/kvm/mmu.c | 16 ++++----- arch/arm64/mm/kasan_init.c | 8 ++--- arch/arm64/mm/mmu.c | 25 +++++++------- arch/powerpc/mm/book3s64/radix_pgtable.c | 7 ++-- arch/powerpc/mm/hugetlbpage.c | 6 ++-- arch/s390/include/asm/pgtable.h | 42 ++++++++++++++++++++++++ arch/s390/mm/page-states.c | 8 ++--- arch/s390/mm/pageattr.c | 8 ++--- arch/s390/mm/vmem.c | 8 ++--- arch/sparc/mm/hugetlbpage.c | 6 ++-- arch/um/kernel/tlb.c | 8 ++--- arch/x86/mm/init_64.c | 15 ++++----- arch/x86/mm/kasan_init_64.c | 16 ++++----- include/asm-generic/pgtable-nop4d.h | 2 +- include/asm-generic/pgtable-nopmd.h | 2 +- include/asm-generic/pgtable-nopud.h | 2 +- include/linux/pgtable.h | 38 ++++++++++++--------- mm/gup.c | 8 ++--- mm/ioremap.c | 8 ++--- mm/kasan/init.c | 17 +++++----- mm/madvise.c | 4 +-- mm/memory.c | 40 +++++++++++----------- mm/mlock.c | 18 +++++++--- mm/mprotect.c | 8 ++--- mm/pagewalk.c | 8 ++--- mm/swapfile.c | 8 ++--- mm/vmalloc.c | 16 ++++----- 31 files changed, 219 insertions(+), 165 deletions(-) -- 2.17.1 ^ permalink raw reply [flat|nested] 62+ messages in thread
* [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-07 18:00 [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding Gerald Schaefer @ 2020-09-07 18:00 ` Gerald Schaefer 2020-09-08 5:06 ` Christophe Leroy 2020-09-08 14:30 ` Dave Hansen 2020-09-07 18:00 ` [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware Gerald Schaefer ` (3 subsequent siblings) 4 siblings, 2 replies; 62+ messages in thread From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw) To: Jason Gunthorpe, John Hubbard Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda From: Alexander Gordeev <agordeev@linux.ibm.com> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") introduced a subtle but severe bug on s390 with gup_fast, due to dynamic page table folding. The question "What would it require for the generic code to work for s390" has already been discussed here https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 and ended with a promising approach here https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1 which in the end unfortunately didn't quite work completely. We tried to mimic static level folding by changing pgd_offset to always calculate top level page table offset, and do nothing in folded pXd_offset. What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do not reflect this dynamic behaviour, and still act like static 5-level page tables. 
Here is an example of what happens with gup_fast on s390, for a task with 3-level paging, crossing a 2 GB pud boundary:

// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
			 unsigned int flags, struct page **pages, int *nr)
{
	unsigned long next;
	pud_t *pudp;

	// pud_offset returns &p4d itself (a pointer to a value on stack)
	pudp = pud_offset(&p4d, addr);
	do {
		// on second iteration reading "random" stack value
		pud_t pud = READ_ONCE(*pudp);

		// next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
		next = pud_addr_end(addr, end);
		...
	} while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

	return 1;
}

pud_addr_end = 0x10080000000 is correct, but the previous pgd/p4d_addr_end should also have returned that limit, instead of the 5-level static pgd/p4d limits with PUD_SIZE/MASK != PGDIR_SIZE/MASK. Then the "end" parameter for gup_pud_range would also have been 0x10080000000, and we would not iterate further in gup_pud_range, but rather go back and (correctly) do it in gup_pgd_range. So, for the second iteration in gup_pud_range, we will increase pudp, which pointed to a stack value and not the real pud table. This new pudp will then point to whatever lies behind the p4d stack value. In general, this happens to be the previously read pgd, but it probably could also be something different, depending on compiler decisions. Most unfortunately, if it happens to be the pgd value, which is the same as the p4d / pud due to folding, it is a valid and present entry. So after the increment, we would still point to the same pud entry. The addr however has been increased in the second iteration, so that we now have different pmd/pte_index values, which will result in very wrong behaviour for the remaining gup_pmd/pte_range calls. We will effectively operate on an address minus 2 GB, due to the missing pudp increase.
In the "good case", if nothing is mapped there, we will fall back to the slow gup path. But if something is mapped there, and valid for gup_fast, we will end up (silently) getting references on the wrong pages and also add the wrong pages to the **pages result array. This can cause data corruption. Fix this by introducing new pXd_addr_end_folded helpers, which take an additional pXd entry value parameter, that can be used on s390 to determine the correct page table level and return corresponding end / boundary. With that, the pointer iteration will always happen in gup_pgd_range for s390. No change for other architectures introduced. Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") Cc: <stable@vger.kernel.org> # 5.2+ Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> --- arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++ include/linux/pgtable.h | 16 +++++++++++++ mm/gup.c | 8 +++---- 3 files changed, 62 insertions(+), 4 deletions(-) diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 7eb01a5459cd..027206e4959d 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm) } #define mm_pmd_folded(mm) mm_pmd_folded(mm) +/* + * With dynamic page table levels on s390, the static pXd_addr_end() functions + * will not return corresponding dynamic boundaries. This is no problem as long + * as only pXd pointers are passed down during page table walk, because + * pXd_offset() will simply return the given pointer for folded levels, and the + * pointer iteration over a range simply happens at the correct page table + * level. 
+ * It is however a problem with gup_fast, or other places walking the page + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to + * a stack variable, which cannot be used for pointer iteration at the correct + * level. Instead, the iteration then has to happen by going up to pgd level + * again. To allow this, provide pXd_addr_end_folded() functions with an + * additional pXd value parameter, which can be used on s390 to determine the + * folding level and return the corresponding boundary. + */ +static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end) +{ + unsigned long type = (rste & _REGION_ENTRY_TYPE_MASK) >> 2; + unsigned long size = 1UL << (_SEGMENT_SHIFT + type * 11); + unsigned long boundary = (addr + size) & ~(size - 1); + + /* + * FIXME The below check is for internal testing only, to be removed + */ + VM_BUG_ON(type < (_REGION_ENTRY_TYPE_R3 >> 2)); + + return (boundary - 1) < (end - 1) ? 
boundary : end; +} + +#define pgd_addr_end_folded pgd_addr_end_folded +static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end) +{ + return rste_addr_end_folded(pgd_val(pgd), addr, end); +} + +#define p4d_addr_end_folded p4d_addr_end_folded +static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end) +{ + return rste_addr_end_folded(p4d_val(p4d), addr, end); +} + static inline int mm_has_pgste(struct mm_struct *mm) { #ifdef CONFIG_PGSTE diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e8cbc2e795d5..981c4c2a31fe 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -681,6 +681,22 @@ static inline int arch_unmap_one(struct mm_struct *mm, }) #endif +#ifndef pgd_addr_end_folded +#define pgd_addr_end_folded(pgd, addr, end) pgd_addr_end(addr, end) +#endif + +#ifndef p4d_addr_end_folded +#define p4d_addr_end_folded(p4d, addr, end) p4d_addr_end(addr, end) +#endif + +#ifndef pud_addr_end_folded +#define pud_addr_end_folded(pud, addr, end) pud_addr_end(addr, end) +#endif + +#ifndef pmd_addr_end_folded +#define pmd_addr_end_folded(pmd, addr, end) pmd_addr_end(addr, end) +#endif + /* * When walking page tables, we usually want to skip any p?d_none entries; * and any p?d_bad entries - reporting the error before resetting to none. 
diff --git a/mm/gup.c b/mm/gup.c index bd883a112724..ba4aace5d0f4 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, do { pmd_t pmd = READ_ONCE(*pmdp); - next = pmd_addr_end(addr, end); + next = pmd_addr_end_folded(pmd, addr, end); if (!pmd_present(pmd)) return 0; @@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, do { pud_t pud = READ_ONCE(*pudp); - next = pud_addr_end(addr, end); + next = pud_addr_end_folded(pud, addr, end); if (unlikely(!pud_present(pud))) return 0; if (unlikely(pud_huge(pud))) { @@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, do { p4d_t p4d = READ_ONCE(*p4dp); - next = p4d_addr_end(addr, end); + next = p4d_addr_end_folded(p4d, addr, end); if (p4d_none(p4d)) return 0; BUILD_BUG_ON(p4d_huge(p4d)); @@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end, do { pgd_t pgd = READ_ONCE(*pgdp); - next = pgd_addr_end(addr, end); + next = pgd_addr_end_folded(pgd, addr, end); if (pgd_none(pgd)) return; if (unlikely(pgd_huge(pgd))) { -- 2.17.1 ^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-07 18:00 ` [RFC PATCH v2 1/3] " Gerald Schaefer @ 2020-09-08 5:06 ` Christophe Leroy 2020-09-08 12:09 ` Christian Borntraeger 2020-09-08 14:30 ` Dave Hansen 1 sibling, 1 reply; 62+ messages in thread From: Christophe Leroy @ 2020-09-08 5:06 UTC (permalink / raw) To: Gerald Schaefer, Jason Gunthorpe, John Hubbard Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon, linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86, Russell King, Christian Borntraeger, Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds, Mike Rapoport Le 07/09/2020 à 20:00, Gerald Schaefer a écrit : > From: Alexander Gordeev <agordeev@linux.ibm.com> > > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast > code") introduced a subtle but severe bug on s390 with gup_fast, due to > dynamic page table folding. > > The question "What would it require for the generic code to work for s390" > has already been discussed here > https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 > and ended with a promising approach here > https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1 > which in the end unfortunately didn't quite work completely. > > We tried to mimic static level folding by changing pgd_offset to always > calculate top level page table offset, and do nothing in folded pXd_offset. > What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do > not reflect this dynamic behaviour, and still act like static 5-level > page tables. > [...] 
> > Fix this by introducing new pXd_addr_end_folded helpers, which take an > additional pXd entry value parameter, that can be used on s390 > to determine the correct page table level and return corresponding > end / boundary. With that, the pointer iteration will always > happen in gup_pgd_range for s390. No change for other architectures > introduced. Not sure pXd_addr_end_folded() is the most understandable name, although I don't have any alternative suggestion at the moment. Maybe it could be something like pXd_addr_end_fixup(), as it will disappear in the next patch, or pXd_addr_end_gup()? Also, if it happens to be acceptable to get patch 2 into stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded(). > > Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") > Cc: <stable@vger.kernel.org> # 5.2+ > Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> > Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> > Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> > --- > arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++ > include/linux/pgtable.h | 16 +++++++++++++ > mm/gup.c | 8 +++---- > 3 files changed, 62 insertions(+), 4 deletions(-) > > diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h > index 7eb01a5459cd..027206e4959d 100644 > --- a/arch/s390/include/asm/pgtable.h > +++ b/arch/s390/include/asm/pgtable.h > @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm) > } > #define mm_pmd_folded(mm) mm_pmd_folded(mm) > > +/* > + * With dynamic page table levels on s390, the static pXd_addr_end() functions > + * will not return corresponding dynamic boundaries. 
This is no problem as long > + * as only pXd pointers are passed down during page table walk, because > + * pXd_offset() will simply return the given pointer for folded levels, and the > + * pointer iteration over a range simply happens at the correct page table > + * level. > + * It is however a problem with gup_fast, or other places walking the page > + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead > + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to > + * a stack variable, which cannot be used for pointer iteration at the correct > + * level. Instead, the iteration then has to happen by going up to pgd level > + * again. To allow this, provide pXd_addr_end_folded() functions with an > + * additional pXd value parameter, which can be used on s390 to determine the > + * folding level and return the corresponding boundary. > + */ > +static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end) What does 'rste' stand for? Isn't this line a bit long? > +{ > + unsigned long type = (rste & _REGION_ENTRY_TYPE_MASK) >> 2; > + unsigned long size = 1UL << (_SEGMENT_SHIFT + type * 11); > + unsigned long boundary = (addr + size) & ~(size - 1); > + > + /* > + * FIXME The below check is for internal testing only, to be removed > + */ > + VM_BUG_ON(type < (_REGION_ENTRY_TYPE_R3 >> 2)); > + > + return (boundary - 1) < (end - 1) ? 
boundary : end; > +} > + > +#define pgd_addr_end_folded pgd_addr_end_folded > +static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end) > +{ > + return rste_addr_end_folded(pgd_val(pgd), addr, end); > +} > + > +#define p4d_addr_end_folded p4d_addr_end_folded > +static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end) > +{ > + return rste_addr_end_folded(p4d_val(p4d), addr, end); > +} > + > static inline int mm_has_pgste(struct mm_struct *mm) > { > #ifdef CONFIG_PGSTE > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index e8cbc2e795d5..981c4c2a31fe 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -681,6 +681,22 @@ static inline int arch_unmap_one(struct mm_struct *mm, > }) > #endif > > +#ifndef pgd_addr_end_folded > +#define pgd_addr_end_folded(pgd, addr, end) pgd_addr_end(addr, end) > +#endif > + > +#ifndef p4d_addr_end_folded > +#define p4d_addr_end_folded(p4d, addr, end) p4d_addr_end(addr, end) > +#endif > + > +#ifndef pud_addr_end_folded > +#define pud_addr_end_folded(pud, addr, end) pud_addr_end(addr, end) > +#endif > + > +#ifndef pmd_addr_end_folded > +#define pmd_addr_end_folded(pmd, addr, end) pmd_addr_end(addr, end) > +#endif > + > /* > * When walking page tables, we usually want to skip any p?d_none entries; > * and any p?d_bad entries - reporting the error before resetting to none. 
> diff --git a/mm/gup.c b/mm/gup.c > index bd883a112724..ba4aace5d0f4 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, > do { > pmd_t pmd = READ_ONCE(*pmdp); > > - next = pmd_addr_end(addr, end); > + next = pmd_addr_end_folded(pmd, addr, end); > if (!pmd_present(pmd)) > return 0; > > @@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, > do { > pud_t pud = READ_ONCE(*pudp); > > - next = pud_addr_end(addr, end); > + next = pud_addr_end_folded(pud, addr, end); > if (unlikely(!pud_present(pud))) > return 0; > if (unlikely(pud_huge(pud))) { > @@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, > do { > p4d_t p4d = READ_ONCE(*p4dp); > > - next = p4d_addr_end(addr, end); > + next = p4d_addr_end_folded(p4d, addr, end); > if (p4d_none(p4d)) > return 0; > BUILD_BUG_ON(p4d_huge(p4d)); > @@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end, > do { > pgd_t pgd = READ_ONCE(*pgdp); > > - next = pgd_addr_end(addr, end); > + next = pgd_addr_end_folded(pgd, addr, end); > if (pgd_none(pgd)) > return; > if (unlikely(pgd_huge(pgd))) { > Christophe ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-08 5:06 ` Christophe Leroy @ 2020-09-08 12:09 ` Christian Borntraeger 2020-09-08 12:40 ` Christophe Leroy 0 siblings, 1 reply; 62+ messages in thread From: Christian Borntraeger @ 2020-09-08 12:09 UTC (permalink / raw) To: Christophe Leroy, Gerald Schaefer, Jason Gunthorpe, John Hubbard Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon, linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86, Russell King, Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds, Mike Rapoport On 08.09.20 07:06, Christophe Leroy wrote: > > > Le 07/09/2020 à 20:00, Gerald Schaefer a écrit : >> From: Alexander Gordeev <agordeev@linux.ibm.com> >> >> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast >> code") introduced a subtle but severe bug on s390 with gup_fast, due to >> dynamic page table folding. >> >> The question "What would it require for the generic code to work for s390" >> has already been discussed here >> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 >> and ended with a promising approach here >> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1 >> which in the end unfortunately didn't quite work completely. >> >> We tried to mimic static level folding by changing pgd_offset to always >> calculate top level page table offset, and do nothing in folded pXd_offset. >> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do >> not reflect this dynamic behaviour, and still act like static 5-level >> page tables. >> > > [...] 
> >> Fix this by introducing new pXd_addr_end_folded helpers, which take an >> additional pXd entry value parameter, that can be used on s390 >> to determine the correct page table level and return corresponding >> end / boundary. With that, the pointer iteration will always >> happen in gup_pgd_range for s390. No change for other architectures >> introduced. > > Not sure pXd_addr_end_folded() is the best understandable name, allthough I don't have any alternative suggestion at the moment. > Maybe could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ? > > Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded() Given that this fixes a data corruption issue, wouldn't it be best to go forward with this patch ASAP and then handle the other patches on top, with all the time that we need? > > >> >> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") >> Cc: <stable@vger.kernel.org> # 5.2+ >> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> >> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> >> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> >> --- >> arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++ >> include/linux/pgtable.h | 16 +++++++++++++ >> mm/gup.c | 8 +++---- >> 3 files changed, 62 insertions(+), 4 deletions(-) >> >> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h >> index 7eb01a5459cd..027206e4959d 100644 >> --- a/arch/s390/include/asm/pgtable.h >> --- a/arch/s390/include/asm/pgtable.h >> @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm) >> } >> #define mm_pmd_folded(mm) mm_pmd_folded(mm) >> +/* >> + * With dynamic page table levels on s390, the static pXd_addr_end() functions >> + * will not return corresponding dynamic boundaries. 
This is no problem as long >> + * as only pXd pointers are passed down during page table walk, because >> + * pXd_offset() will simply return the given pointer for folded levels, and the >> + * pointer iteration over a range simply happens at the correct page table >> + * level. >> + * It is however a problem with gup_fast, or other places walking the page >> + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead >> + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to >> + * a stack variable, which cannot be used for pointer iteration at the correct >> + * level. Instead, the iteration then has to happen by going up to pgd level >> + * again. To allow this, provide pXd_addr_end_folded() functions with an >> + * additional pXd value parameter, which can be used on s390 to determine the >> + * folding level and return the corresponding boundary. >> + */ >> +static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end) > > What does 'rste' stands for ? > > Isn't this line a bit long ? 'rste' is the region/segment table entry, according to the architecture. On our platform, each page table level has a different granularity (segment table -> 1 MB, region 3rd table -> 2 GB, region 2nd table -> 4 TB, region 1st table -> 8 PB). ST, R3, R2 and R1 entries have the same format and are thus often called crste (combined region and segment table entry). ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-08 12:09 ` Christian Borntraeger @ 2020-09-08 12:40 ` Christophe Leroy 2020-09-08 13:38 ` Gerald Schaefer 0 siblings, 1 reply; 62+ messages in thread From: Christophe Leroy @ 2020-09-08 12:40 UTC (permalink / raw) To: Christian Borntraeger, Gerald Schaefer, Jason Gunthorpe, John Hubbard Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon, linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86, Russell King, Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds, Mike Rapoport Le 08/09/2020 à 14:09, Christian Borntraeger a écrit : > > > On 08.09.20 07:06, Christophe Leroy wrote: >> >> >> Le 07/09/2020 à 20:00, Gerald Schaefer a écrit : >>> From: Alexander Gordeev <agordeev@linux.ibm.com> >>> >>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast >>> code") introduced a subtle but severe bug on s390 with gup_fast, due to >>> dynamic page table folding. >>> >>> The question "What would it require for the generic code to work for s390" >>> has already been discussed here >>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 >>> and ended with a promising approach here >>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1 >>> which in the end unfortunately didn't quite work completely. >>> >>> We tried to mimic static level folding by changing pgd_offset to always >>> calculate top level page table offset, and do nothing in folded pXd_offset. >>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do >>> not reflect this dynamic behaviour, and still act like static 5-level >>> page tables. >>> >> >> [...] 
>> >>> Fix this by introducing new pXd_addr_end_folded helpers, which take an >>> additional pXd entry value parameter, that can be used on s390 >>> to determine the correct page table level and return corresponding >>> end / boundary. With that, the pointer iteration will always >>> happen in gup_pgd_range for s390. No change for other architectures >>> introduced. >> >> Not sure pXd_addr_end_folded() is the best understandable name, allthough I don't have any alternative suggestion at the moment. >> Maybe could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ? >> >> Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded() > > given that this fixes a data corruption issue, wouldnt it be the best to go forward > with this patch ASAP and then handle the other patches on top with all the time that > we need? I have no strong opinion on this, but it feels rather tricky to have to change the generic part of GUP to use a new function, then revert that change in the following patch, just because you want the first patch in stable and not the second one. Regardless, I was wondering: why do we need a reference to the pXd at all when calling pXd_addr_end()? Couldn't s390 retrieve the pXd by using the pXd_offset() dance with the passed addr? Christophe ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-08 12:40 ` Christophe Leroy @ 2020-09-08 13:38 ` Gerald Schaefer 0 siblings, 0 replies; 62+ messages in thread From: Gerald Schaefer @ 2020-09-08 13:38 UTC (permalink / raw) To: Christophe Leroy Cc: Christian Borntraeger, Jason Gunthorpe, John Hubbard, Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon, linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86, Russell King, Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds, Mike Rapoport On Tue, 8 Sep 2020 14:40:10 +0200 Christophe Leroy <christophe.leroy@csgroup.eu> wrote: > > > Le 08/09/2020 à 14:09, Christian Borntraeger a écrit : > > > > > > On 08.09.20 07:06, Christophe Leroy wrote: > >> > >> > >> Le 07/09/2020 à 20:00, Gerald Schaefer a écrit : > >>> From: Alexander Gordeev <agordeev@linux.ibm.com> > >>> > >>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast > >>> code") introduced a subtle but severe bug on s390 with gup_fast, due to > >>> dynamic page table folding. > >>> > >>> The question "What would it require for the generic code to work for s390" > >>> has already been discussed here > >>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 > >>> and ended with a promising approach here > >>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1 > >>> which in the end unfortunately didn't quite work completely. > >>> > >>> We tried to mimic static level folding by changing pgd_offset to always > >>> calculate top level page table offset, and do nothing in folded pXd_offset. > >>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do > >>> not reflect this dynamic behaviour, and still act like static 5-level > >>> page tables. 
> >>> > >> > >> [...] > >> > >>> > >>> Fix this by introducing new pXd_addr_end_folded helpers, which take an > >>> additional pXd entry value parameter, that can be used on s390 > >>> to determine the correct page table level and return corresponding > >>> end / boundary. With that, the pointer iteration will always > >>> happen in gup_pgd_range for s390. No change for other architectures > >>> introduced. > >> > >> Not sure pXd_addr_end_folded() is the best understandable name, allthough I don't have any alternative suggestion at the moment. > >> Maybe could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ? > >> > >> Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded() > > > > given that this fixes a data corruption issue, wouldnt it be the best to go forward > > with this patch ASAP and then handle the other patches on top with all the time that > > we need? > > I have no strong opinion on this, but I feel rather tricky to have to > change generic part of GUP to use a new fonction then revert that change > in the following patch, just because you want the first patch in stable > and not the second one. > > Regardless, I was wondering, why do we need a reference to the pXd at > all when calling pXd_addr_end() ? > > Couldn't S390 retrieve the pXd by using the pXd_offset() dance with the > passed addr ? Apart from the performance impact of re-doing what has already been done by the caller, I think we would also break the READ_ONCE semantics. After all, pXd_offset() would also require some pXd pointer input, which we don't have. So we would need to start over again from mm->pgd. Also, it seems to be more in line with other primitives that take a pXd value or pointer. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-07 18:00 ` [RFC PATCH v2 1/3] " Gerald Schaefer 2020-09-08 5:06 ` Christophe Leroy @ 2020-09-08 14:30 ` Dave Hansen 2020-09-08 17:59 ` Gerald Schaefer 2020-09-09 12:29 ` Gerald Schaefer 1 sibling, 2 replies; 62+ messages in thread From: Dave Hansen @ 2020-09-08 14:30 UTC (permalink / raw) To: Gerald Schaefer, Jason Gunthorpe, John Hubbard Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On 9/7/20 11:00 AM, Gerald Schaefer wrote: > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast > code") introduced a subtle but severe bug on s390 with gup_fast, due to > dynamic page table folding. Would it be fair to say that the "fake" page table entries s390 allocates on the stack are what's causing the trouble here? That might be a nice thing to open up with here. "Dynamic page table folding" really means nothing to me. > @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, > do { > pmd_t pmd = READ_ONCE(*pmdp); > > - next = pmd_addr_end(addr, end); > + next = pmd_addr_end_folded(pmd, addr, end); > if (!pmd_present(pmd)) > return 0; It looks like you fix this up later, but this would be a problem if left this way. There's no documentation for whether I use pmd_addr_end_folded() or pmd_addr_end() when writing a page table walker. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-08 14:30 ` Dave Hansen @ 2020-09-08 17:59 ` Gerald Schaefer 2020-09-09 12:29 ` Gerald Schaefer 1 sibling, 0 replies; 62+ messages in thread From: Gerald Schaefer @ 2020-09-08 17:59 UTC (permalink / raw) To: Dave Hansen Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Tue, 8 Sep 2020 07:30:50 -0700 Dave Hansen <dave.hansen@intel.com> wrote: > On 9/7/20 11:00 AM, Gerald Schaefer wrote: > > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast > > code") introduced a subtle but severe bug on s390 with gup_fast, due to > > dynamic page table folding. > > Would it be fair to say that the "fake" page table entries s390 > allocates on the stack are what's causing the trouble here? That might > be a nice thing to open up with here. "Dynamic page table folding" > really means nothing to me. We do not really allocate anything on the stack; it is the generic logic from gup_fast that passes over pXd values (read once before) and uses pointers to such (stack) variables instead of real pXd pointers. Add to that the fact that we just return the passed-in pointer in pXd_offset() for folded levels. That works similarly on x86 IIUC, but with static folding, and thus also with proper pXd_addr_end() results, because of statically (and correspondingly) defined PxD_INDEX/SHIFT.
We always have static 5-level PxD_INDEX/SHIFT, and that cannot really be made dynamic, so we just make pXd_addr_end() dynamic instead, and that requires the pXd value to determine the correct pagetable level. Still makes my head spin when trying to explain, sorry. It is a very special s390 oddity, or let's call it a "feature", because I don't think any other architecture has "dynamic pagetable folding" capability, depending on process requirements, for whatever it is worth... > > > @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, > > do { > > pmd_t pmd = READ_ONCE(*pmdp); > > > > - next = pmd_addr_end(addr, end); > > + next = pmd_addr_end_folded(pmd, addr, end); > > if (!pmd_present(pmd)) > > return 0; > > It looks like you fix this up later, but this would be a problem if left > this way. There's no documentation for whether I use > pmd_addr_end_folded() or pmd_addr_end() when writing a page table walker. Yes, that is very unfortunate. We did have some lengthy comment in include/linux/pgtable.h where the pXd_addr_end(_folded) were defined. But that was moved to arch/s390/include/asm/pgtable.h in this version, probably because we already had the generalization in mind, where we would not need such an explanation in the common header any more. So, looking at that comment might help better understand the issue that we have with dynamic page table folding and READ_ONCE-style pagetable walkers. Thanks for pointing that out; the comment should definitely go into include/linux/pgtable.h again. At least if we would still go for that "s390 fix first, generalization second" approach, but it seems we have other / better options now. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-08 14:30 ` Dave Hansen 2020-09-08 17:59 ` Gerald Schaefer @ 2020-09-09 12:29 ` Gerald Schaefer 2020-09-09 16:18 ` Dave Hansen 1 sibling, 1 reply; 62+ messages in thread From: Gerald Schaefer @ 2020-09-09 12:29 UTC (permalink / raw) To: Dave Hansen Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Tue, 8 Sep 2020 07:30:50 -0700 Dave Hansen <dave.hansen@intel.com> wrote: > On 9/7/20 11:00 AM, Gerald Schaefer wrote: > > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast > > code") introduced a subtle but severe bug on s390 with gup_fast, due to > > dynamic page table folding. > > Would it be fair to say that the "fake" page table entries s390 > allocates on the stack are what's causing the trouble here? That might > be a nice thing to open up with here. "Dynamic page table folding" > really means nothing to me. Sorry, I guess my previous reply does not really explain "what the heck is dynamic page table folding?". On s390, we can have different number of page table levels for different processes / mms. We always start with 3 levels, and update dynamically on process demand to 4 or 5 levels, hence the dynamic folding. Still, the PxD_SIZE/SHIFT is defined statically, so that e.g. pXd_addr_end() will not reflect this dynamic behavior. For the various pagetable walkers using pXd_addr_end() (w/o READ_ONCE logic) this is no problem. 
With static folding, iteration over the folded levels will always happen at pgd level (top-level folding). For s390, we stay at the respective level and iterate there (dynamic middle-level folding), only returning to pgd level if there really were 5 levels. This only works well as long as there are real pagetable pointers involved, that can also be used for iteration. For gup_fast, or any other future pagetable walkers using the READ_ONCE logic w/o lock, that is not true. There are pointers involved to local pXd values on the stack, because of the READ_ONCE logic, and our middle-level iteration will suddenly iterate over such stack pointers instead of pagetable pointers. This will be addressed by making the pXd_addr_end() dynamic, for which we need to see the pXd value in order to determine its level / type. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-09 12:29 ` Gerald Schaefer @ 2020-09-09 16:18 ` Dave Hansen 2020-09-09 17:25 ` Gerald Schaefer 0 siblings, 1 reply; 62+ messages in thread From: Dave Hansen @ 2020-09-09 16:18 UTC (permalink / raw) To: Gerald Schaefer Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On 9/9/20 5:29 AM, Gerald Schaefer wrote: > This only works well as long there are real pagetable pointers involved, > that can also be used for iteration. For gup_fast, or any other future > pagetable walkers using the READ_ONCE logic w/o lock, that is not true. > There are pointers involved to local pXd values on the stack, because of > the READ_ONCE logic, and our middle-level iteration will suddenly iterate > over such stack pointers instead of pagetable pointers. By "There are pointers involved to local pXd values on the stack", did you mean "locate" instead of "local"? That sentence confused me. Which code is it, exactly that allocates these troublesome on-stack pXd values, btw? > This will be addressed by making the pXd_addr_end() dynamic, for which > we need to see the pXd value in order to determine its level / type. Thanks for the explanation! ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-09 16:18 ` Dave Hansen @ 2020-09-09 17:25 ` Gerald Schaefer 2020-09-09 18:03 ` Jason Gunthorpe 0 siblings, 1 reply; 62+ messages in thread From: Gerald Schaefer @ 2020-09-09 17:25 UTC (permalink / raw) To: Dave Hansen Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Wed, 9 Sep 2020 09:18:46 -0700 Dave Hansen <dave.hansen@intel.com> wrote: > On 9/9/20 5:29 AM, Gerald Schaefer wrote: > > This only works well as long there are real pagetable pointers involved, > > that can also be used for iteration. For gup_fast, or any other future > > pagetable walkers using the READ_ONCE logic w/o lock, that is not true. > > There are pointers involved to local pXd values on the stack, because of > > the READ_ONCE logic, and our middle-level iteration will suddenly iterate > > over such stack pointers instead of pagetable pointers. > > By "There are pointers involved to local pXd values on the stack", did > you mean "locate" instead of "local"? That sentence confused me. > > Which code is it, exactly that allocates these troublesome on-stack pXd > values, btw? It is the gup_pXd_range() call sequence in mm/gup.c. It starts in gup_pgd_range() with "pgdp = pgd_offset(current->mm, addr)" and then the "pgd_t pgd = READ_ONCE(*pgdp)" which creates the first local stack variable "pgd". The next-level call to gup_p4d_range() gets this "pgd" value as input, but not the original pgdp pointer where it was read from. 
This is already the essential difference to other pagetable walkers like e.g. walk_pXd_range() in mm/pagewalk.c, where the original pointer is passed through. With READ_ONCE, that pointer must not be further de-referenced, so instead the value is passed over. In gup_p4d_range() we then have "p4dp = p4d_offset(&pgd, addr)", with &pgd being a pointer to the passed-over pgd value, so that's the first pXd pointer that does not point directly to the pXd in the page table, but to a local stack variable. With folded p4d, p4d_offset(&pgd, addr) will simply return the passed-in &pgd pointer, so we now also have p4dp point to that. That continues with "p4d_t p4d = READ_ONCE(*p4dp)", and that second stack variable is passed to gup_huge_pud() and so on. Due to inlining, all those variables will not really be passed anywhere, but simply sit on the stack. So far, IIUC, that would also happen on x86 (or everywhere else actually) for folded levels, i.e. some pXd_offset() calls would simply return the passed-in (stack) value pointer. This works as designed, and it will not lead to the "iteration over stack pointer" for anybody but s390, because the pXd_addr_end() boundaries usually take care that you always return to pgd level for iteration, and that is the only level with a real pagetable pointer. For s390, we stay at the first non-folded level and do the iteration there, which is fine for other pagetable walkers using the original pointers, but not for the READ_ONCE-style gup_fast. I actually had to draw myself a picture to get some hold of this, or rather a walk-through with a certain pud-crossing range in a folded 3-level scenario. Not sure if I would have understood my explanation above w/o that, but I hope you can make some sense out of it. Or draw yourself a picture :-) ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-09 17:25 ` Gerald Schaefer @ 2020-09-09 18:03 ` Jason Gunthorpe 2020-09-10 9:39 ` Alexander Gordeev 2020-09-10 13:11 ` Gerald Schaefer 0 siblings, 2 replies; 62+ messages in thread From: Jason Gunthorpe @ 2020-09-09 18:03 UTC (permalink / raw) To: Gerald Schaefer Cc: Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Wed, Sep 09, 2020 at 07:25:34PM +0200, Gerald Schaefer wrote: > I actually had to draw myself a picture to get some hold of > this, or rather a walk-through with a certain pud-crossing > range in a folded 3-level scenario. Not sure if I would have > understood my explanation above w/o that, but I hope you can > make some sense out of it. Or draw yourself a picture :-) What I don't understand is how anything works on S390 today? If the fix is only to change pxx_addr_end() then generic code like mm/pagewalk.c will iterate over a *different list* of page table entries. Its choice of entries to look at is entirely driven by pxx_addr_end(). Which suggests to me that mm/pagewalk.c also doesn't work properly today on S390 and this issue is not really about stack variables? Fundamentally, pXX_offset() and pXX_addr_end() must be consistent together: if pXX_offset() is folded then pXX_addr_end() must cause a single iteration of that level. Jason ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-09 18:03 ` Jason Gunthorpe @ 2020-09-10 9:39 ` Alexander Gordeev 2020-09-10 13:02 ` Jason Gunthorpe 2020-09-10 17:35 ` Linus Torvalds 2020-09-10 13:11 ` Gerald Schaefer 1 sibling, 2 replies; 62+ messages in thread From: Alexander Gordeev @ 2020-09-10 9:39 UTC (permalink / raw) To: Jason Gunthorpe Cc: Gerald Schaefer, Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Wed, Sep 09, 2020 at 03:03:24PM -0300, Jason Gunthorpe wrote: > On Wed, Sep 09, 2020 at 07:25:34PM +0200, Gerald Schaefer wrote: > > I actually had to draw myself a picture to get some hold of > > this, or rather a walk-through with a certain pud-crossing > > range in a folded 3-level scenario. Not sure if I would have > > understood my explanation above w/o that, but I hope you can > > make some sense out of it. Or draw yourself a picture :-) > > What I don't understand is how does anything work with S390 today? > > If the fix is only to change pxx_addr_end() then than generic code > like mm/pagewalk.c will iterate over a *different list* of page table > entries. > > It's choice of entries to look at is entirely driven by pxx_addr_end(). > > Which suggest to me that mm/pagewalk.c also doesn't work properly > today on S390 and this issue is not really about stack variables? > > Fundamentally if pXX_offset() and pXX_addr_end() must be consistent > together, if pXX_offset() is folded then pXX_addr_end() must cause a > single iteration of that level. 
Your observation is correct. Another way to describe the problem is that the existing pXd_addr_end helpers could be applied to mismatching levels on s390 (e.g. p4d_addr_end applied to a pud, or pgd_addr_end applied to a p4d). As you noticed, all *_pXd_range iterators could be called with address ranges that exceed a single pXd table. However, when that happens with pointers to real page tables (passed to the *_pXd_range iterators) we still operate on valid tables, which just (luckily for us) happened to be folded. Thus we still reference correct table entries. It is only the gup_fast case that exposes the issue. It hits because pointers to stack copies are passed to the gup_pXd_range iterators, not pointers to the real page tables themselves. As Gerald mentioned, it is very difficult to explain in a clear way. Hopefully, one could make sense out of it. > Jason ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 9:39 ` Alexander Gordeev @ 2020-09-10 13:02 ` Jason Gunthorpe 2020-09-10 13:28 ` Gerald Schaefer 2020-09-10 17:57 ` Gerald Schaefer 1 sibling, 2 replies; 62+ messages in thread From: Jason Gunthorpe @ 2020-09-10 13:02 UTC (permalink / raw) To: Alexander Gordeev Cc: Gerald Schaefer, Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote: > As Gerald mentioned, it is very difficult to explain in a clear way. > Hopefully, one could make sense out of it. I would say the page table API requires this invariant: pud = pud_offset(p4d, addr); do { WARN_ON(pud != pud_offset(p4d, addr)); next = pud_addr_end(addr, end); } while (pud++, addr = next, addr != end); ie pud++ is supposed to be a shortcut for pud_offset(p4d, next) While S390 does not follow this. Fixing addr_end brings it into alignment by preventing pud++ from happening. The only currently known side effect is that gup_fast crashes, but it sure is an unexpected thing. This suggests another fix, which is to say that pud++ is undefined and pud_offset() must always be called, but I think that would cause worse codegen on all other archs. Jason ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 13:02 ` Jason Gunthorpe @ 2020-09-10 13:28 ` Gerald Schaefer 2020-09-10 15:10 ` Jason Gunthorpe 2020-09-10 17:57 ` Gerald Schaefer 1 sibling, 1 reply; 62+ messages in thread From: Gerald Schaefer @ 2020-09-10 13:28 UTC (permalink / raw) To: Jason Gunthorpe Cc: Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Thu, 10 Sep 2020 10:02:33 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote: > > > As Gerald mentioned, it is very difficult to explain in a clear way. > > Hopefully, one could make sense ot of it. > > I would say the page table API requires this invariant: > > pud = pud_offset(p4d, addr); > do { > WARN_ON(pud != pud_offset(p4d, addr); > next = pud_addr_end(addr, end); > } while (pud++, addr = next, addr != end); > > ie pud++ is supposed to be a shortcut for > pud_offset(p4d, next) > > While S390 does not follow this. Fixing addr_end brings it into > alignment by preventing pud++ from happening. > > The only currently known side effect is that gup_fast crashes, but it > sure is an unexpected thing. It only is unexpected in a "top-level folding" world, see my other reply. Consider it an optimization, which was possible because of how our dynamic folding works, and e.g. because we can determine the correct pagetable level from a pXd value in pXd_offset. 
> This suggests another fix, which is to say that pud++ is undefined and > pud_offset() must always be called, but I think that would cause worse > codegen on all other archs. There really is nothing to fix for s390 outside of gup_fast, or other potential future READ_ONCE pagetable walkers. We do take the side-effect of the generic change on all other pagetable walkers for s390, but it really is rather a slight degradation than a fix. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 13:28 ` Gerald Schaefer @ 2020-09-10 15:10 ` Jason Gunthorpe 2020-09-10 17:07 ` Gerald Schaefer 0 siblings, 1 reply; 62+ messages in thread From: Jason Gunthorpe @ 2020-09-10 15:10 UTC (permalink / raw) To: Gerald Schaefer, Anshuman Khandual Cc: Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Thu, Sep 10, 2020 at 03:28:03PM +0200, Gerald Schaefer wrote: > On Thu, 10 Sep 2020 10:02:33 -0300 > Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote: > > > > > As Gerald mentioned, it is very difficult to explain in a clear way. > > > Hopefully, one could make sense ot of it. > > > > I would say the page table API requires this invariant: > > > > pud = pud_offset(p4d, addr); > > do { > > WARN_ON(pud != pud_offset(p4d, addr); > > next = pud_addr_end(addr, end); > > } while (pud++, addr = next, addr != end); > > > > ie pud++ is supposed to be a shortcut for > > pud_offset(p4d, next) > > > > While S390 does not follow this. Fixing addr_end brings it into > > alignment by preventing pud++ from happening. > > > > The only currently known side effect is that gup_fast crashes, but it > > sure is an unexpected thing. > > It only is unexpected in a "top-level folding" world, see my other reply. > Consider it an optimization, which was possible because of how our dynamic > folding works, and e.g. 
because we can determine the correct pagetable > level from a pXd value in pXd_offset. No, I disagree. The page walker API the arch presents has to have well-defined semantics. For instance, there is an effort to define tests and invariants for the page table accesses to bring this understanding and uniformity: mm/debug_vm_pgtable.c If we fix S390 using the pX_addr_end() change then the above should be updated with an invariant to check it. I've added Anshuman for some thoughts.. For better or worse, that invariant does exclude arches from using other folding techniques. The other solution would be to address the other side of != and adjust the pud++ e.g. replace pud++ with something like: pud = pud_next_entry(p4d, pud, next) Such that: pud_next_entry(p4d, pud, next) === pud_offset(p4d, next) In which case the invariant changes to 'callers can never do pointer arithmetic on the result of pXX_offset()' which is a bit harder to enforce. Jason ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 15:10 ` Jason Gunthorpe @ 2020-09-10 17:07 ` Gerald Schaefer 2020-09-10 17:19 ` Jason Gunthorpe 0 siblings, 1 reply; 62+ messages in thread From: Gerald Schaefer @ 2020-09-10 17:07 UTC (permalink / raw) To: Jason Gunthorpe Cc: Anshuman Khandual, Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Thu, 10 Sep 2020 12:10:26 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Thu, Sep 10, 2020 at 03:28:03PM +0200, Gerald Schaefer wrote: > > On Thu, 10 Sep 2020 10:02:33 -0300 > > Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote: > > > > > > > As Gerald mentioned, it is very difficult to explain in a clear way. > > > > Hopefully, one could make sense ot of it. > > > > > > I would say the page table API requires this invariant: > > > > > > pud = pud_offset(p4d, addr); > > > do { > > > WARN_ON(pud != pud_offset(p4d, addr); > > > next = pud_addr_end(addr, end); > > > } while (pud++, addr = next, addr != end); > > > > > > ie pud++ is supposed to be a shortcut for > > > pud_offset(p4d, next) > > > > > > While S390 does not follow this. Fixing addr_end brings it into > > > alignment by preventing pud++ from happening. > > > > > > The only currently known side effect is that gup_fast crashes, but it > > > sure is an unexpected thing. > > > > It only is unexpected in a "top-level folding" world, see my other reply. 
> > Consider it an optimization, which was possible because of how our dynamic > > folding works, and e.g. because we can determine the correct pagetable > > level from a pXd value in pXd_offset. > > No, I disagree. The page walker API the arch presents has to have well > defined semantics. For instance, there is an effort to define tests > and invarients for the page table accesses to bring this understanding > and uniformity: > > mm/debug_vm_pgtable.c > > If we fix S390 using the pX_addr_end() change then the above should be > updated with an invariant to check it. I've added Anshuman for some > thoughts.. We are very aware of those tests, and actually a big supporter of the idea. Also part of the supported architectures already, and it has already helped us find / fix some s390 oddities. However, we did not see any issues wrt to our pagetable walking, neither with the current version, nor with the new generic approach. We do currently see other issues, Anshuman will know what I mean :-) > For better or worse, that invariant does exclude arches from using > other folding techniques. > > The other solution would be to address the other side of != and adjust > the pud++ > > eg replcae pud++ with something like: > pud = pud_next_entry(p4d, pud, next) > > Such that: > pud_next_entry(p4d, pud, next) === pud_offset(p4d, next) > > In which case the invarient changes to 'callers can never do pointer > arithmetic on the result of pXX_offset()' which is a bit harder to > enforce. I might have lost track a bit. Are we still talking about possible functional impacts of either our current pagetable walking with s390 (apart from gup_fast), or the proposed generic change (for s390, or others?)? Or is this rather some (other) generic issue / idea that you have, in order to put "some more structure / enforcement" to generic pagetable walkers? ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 17:07 ` Gerald Schaefer @ 2020-09-10 17:19 ` Jason Gunthorpe 0 siblings, 0 replies; 62+ messages in thread From: Jason Gunthorpe @ 2020-09-10 17:19 UTC (permalink / raw) To: Gerald Schaefer Cc: Anshuman Khandual, Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Thu, Sep 10, 2020 at 07:07:57PM +0200, Gerald Schaefer wrote: > I might have lost track a bit. Are we still talking about possible > functional impacts of either our current pagetable walking with s390 > (apart from gup_fast), or the proposed generic change (for s390, or > others?)? I'm looking for an more understandable explanation what is wrong with the S390 implementation. If the page operations require the invariant I described then it is quite easy to explain the problem and understand the solution. Jason ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 13:02 ` Jason Gunthorpe 2020-09-10 13:28 ` Gerald Schaefer @ 2020-09-10 17:57 ` Gerald Schaefer 2020-09-10 23:21 ` Jason Gunthorpe 1 sibling, 1 reply; 62+ messages in thread From: Gerald Schaefer @ 2020-09-10 17:57 UTC (permalink / raw) To: Jason Gunthorpe Cc: Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Thu, 10 Sep 2020 10:02:33 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote: > > > As Gerald mentioned, it is very difficult to explain in a clear way. > > Hopefully, one could make sense ot of it. > > I would say the page table API requires this invariant: > > pud = pud_offset(p4d, addr); > do { > WARN_ON(pud != pud_offset(p4d, addr); > next = pud_addr_end(addr, end); > } while (pud++, addr = next, addr != end); > > ie pud++ is supposed to be a shortcut for > pud_offset(p4d, next) > Hmm, IIUC, all architectures with static folding will simply return the passed-in p4d pointer for pud_offset(p4d, addr), for 3-level pagetables. There is no difference for s390. For gup_fast, that p4d pointer is not really a pointer to a value in a pagetable, but to some local copy of such a value, and not just for s390. So, pud = p4d = pointer to copy, and increasing that pud pointer cannot be the same as pud_offset(p4d, next). I do see your point however, at last I think :-) My problem is that I do not see where we would have an s390-specific issue here. 
Maybe my understanding of how it works for others with static folding is wrong. That would explain my difficulties in getting your point... > While S390 does not follow this. Fixing addr_end brings it into > alignment by preventing pud++ from happening. Exactly, only that nobody seems to follow it, IIUC. Fixing it up with pXd_addr_end was my impression of what we need to do, in order to have it work the same way as for others. > The only currently known side effect is that gup_fast crashes, but it > sure is an unexpected thing. Well, from my understanding it feels more unexpected that something that is supposed to be a pointer to an entry in a page table, really is just a pointer to some copy somewhere. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 17:57 ` Gerald Schaefer @ 2020-09-10 23:21 ` Jason Gunthorpe 0 siblings, 0 replies; 62+ messages in thread From: Jason Gunthorpe @ 2020-09-10 23:21 UTC (permalink / raw) To: Gerald Schaefer Cc: Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Thu, Sep 10, 2020 at 07:57:49PM +0200, Gerald Schaefer wrote: > On Thu, 10 Sep 2020 10:02:33 -0300 > Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote: > > > > > As Gerald mentioned, it is very difficult to explain in a clear way. > > > Hopefully, one could make sense out of it. > > > > I would say the page table API requires this invariant: > > > > pud = pud_offset(p4d, addr); > > do { > > WARN_ON(pud != pud_offset(p4d, addr)); > > next = pud_addr_end(addr, end); > > } while (pud++, addr = next, addr != end); > > > > ie pud++ is supposed to be a shortcut for > > pud_offset(p4d, next) > > > > Hmm, IIUC, all architectures with static folding will simply return > the passed-in p4d pointer for pud_offset(p4d, addr), for 3-level > pagetables. It is probably moot now, but since other arches don't crash, they also return pud_addr_end() == end, so the loop only does one iteration, i.e. pud == pud_offset(p4d, addr) for all iterations, as the pud++ never happens. Which is what this addr_end patch does for s390.. Jason ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 9:39 ` Alexander Gordeev 2020-09-10 13:02 ` Jason Gunthorpe @ 2020-09-10 17:35 ` Linus Torvalds 2020-09-10 18:13 ` Jason Gunthorpe 1 sibling, 1 reply; 62+ messages in thread From: Linus Torvalds @ 2020-09-10 17:35 UTC (permalink / raw) To: Alexander Gordeev Cc: Jason Gunthorpe, Gerald Schaefer, Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Thu, Sep 10, 2020 at 2:40 AM Alexander Gordeev <agordeev@linux.ibm.com> wrote: > > It is only gup_fast case that exposes the issue. It hits because > pointers to stack copies are passed to gup_pXd_range iterators, not > pointers to real page tables itself. Can we possibly change fast-gup to not do the stack copies? I'd actually rather do something like that, than the "addr_end" thing. As you say, none of the other page table walking code does what the GUP code does, and I don't think it's required. The GUP code is kind of strange, I'm not quite sure why. Some of it unusually came from the powerpc code that handled their special odd hugepage model, and that may be why it's so different. How painful would it be to just pass the pmd (etc) _pointers_ around, rather than do the odd "take the address of local copies"? Linus ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 17:35 ` Linus Torvalds @ 2020-09-10 18:13 ` Jason Gunthorpe 2020-09-10 18:33 ` Linus Torvalds 2020-09-10 21:22 ` [RFC PATCH v2 1/3] " John Hubbard 0 siblings, 2 replies; 62+ messages in thread From: Jason Gunthorpe @ 2020-09-10 18:13 UTC (permalink / raw) To: Linus Torvalds Cc: Alexander Gordeev, Gerald Schaefer, Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Thu, Sep 10, 2020 at 10:35:38AM -0700, Linus Torvalds wrote: > On Thu, Sep 10, 2020 at 2:40 AM Alexander Gordeev > <agordeev@linux.ibm.com> wrote: > > > > It is only gup_fast case that exposes the issue. It hits because > > pointers to stack copies are passed to gup_pXd_range iterators, not > > pointers to real page tables itself. > > Can we possibly change fast-gup to not do the stack copies? > > I'd actually rather do something like that, than the "addr_end" thing. > As you say, none of the other page table walking code does what the > GUP code does, and I don't think it's required. As I understand it, the requirement is because fast-gup walks without the page table spinlock, or mmap_sem held so it must READ_ONCE the *pXX. It then checks that it is a valid page table pointer, then calls pXX_offset(). The arch implementation of pXX_offset() derefs again the passed pXX pointer. So it defeats the READ_ONCE and the 2nd load could observe something that is no longer a page table pointer and crash. 
Passing it the address of the stack value is a way to force pXX_offset() to use the READ_ONCE result which has already been tested to be a page table pointer. Other page walking code that holds the mmap_sem tends to use pmd_trans_unstable() which solves this problem by injecting a barrier. The load hidden in pte_offset() after a pmd_trans_unstable() can't be re-ordered and will only see a page table entry under the mmap_sem. However, I think that logic would have been much clearer following the GUP model of READ_ONCE vs extra reads and a hidden barrier. At least it took me a long time to work it out :( I also think there are real bugs here where places are reading *pXX multiple times without locking the page table. One was found recently in the wild in the huge tlb code IIRC. The mm/pagewalk.c has these missing READ_ONCE bugs too. So.. To change away from the stack option I think we'd have to pass the READ_ONCE value to pXX_offset() as an extra argument instead of it derefing the pointer internally. Jason ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 18:13 ` Jason Gunthorpe @ 2020-09-10 18:33 ` Linus Torvalds 2020-09-10 19:10 ` Gerald Schaefer 2020-09-10 21:22 ` [RFC PATCH v2 1/3] " John Hubbard 1 sibling, 1 reply; 62+ messages in thread From: Linus Torvalds @ 2020-09-10 18:33 UTC (permalink / raw) To: Jason Gunthorpe Cc: Alexander Gordeev, Gerald Schaefer, Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Thu, Sep 10, 2020 at 11:13 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > So.. To change away from the stack option I think we'd have to pass > the READ_ONCE value to pXX_offset() as an extra argument instead of it > derefing the pointer internally. Yeah, but I think that would actually be the better model than passing an address to a random stack location. It's also effectively what we do in some other places, eg the whole logic with "orig" in the regular pte fault handling is basically doing unlocked loads of the pte, various decisions on that, and then doing a final "is this still the same pte" after it has gotten the page table lock. (And yes, those other pte fault handling cases are different, since they _do_ hold the mmap lock, so they know the page *tables* are stable, and it's only the last level that then gets re-checked against the pte once the pte itself has also been stabilized with the page table lock). 
So I think it would actually be a better conceptual match to make the page table walking interface be "here, this is the value I read once carefully, and this is the address, now give me the next address". The folded case would then just return the address it was given, and the non-folded case would return the inner page table based on the value. I dunno. I don't actually feel all that strongly about this, so whatever works, I guess. Linus ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 18:33 ` Linus Torvalds @ 2020-09-10 19:10 ` Gerald Schaefer 2020-09-10 19:32 ` Linus Torvalds 0 siblings, 1 reply; 62+ messages in thread From: Gerald Schaefer @ 2020-09-10 19:10 UTC (permalink / raw) To: Linus Torvalds Cc: Jason Gunthorpe, Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Thu, 10 Sep 2020 11:33:17 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Sep 10, 2020 at 11:13 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > So.. To change away from the stack option I think we'd have to pass > > the READ_ONCE value to pXX_offset() as an extra argument instead of it > > derefing the pointer internally. > > Yeah, but I think that would actually be the better model than passing > an address to a random stack location. > > It's also effectively what we do in some other places, eg the whole > logic with "orig" in the regular pte fault handling is basically doing > unlocked loads of the pte, various decisions on that, and then doing a > final "is this still the same pte" after it has gotten the page table > lock. 
That sounds a lot like the pXd_offset_orig() from Martin's first approach in this thread: https://lore.kernel.org/linuxppc-dev/20190418100218.0a4afd51@mschwideX1/ It is also the "Patch 1" option from the start of this thread: https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/ I guess I chose wrongly there; I should have had more trust in Martin's approach, and not tried so hard to do it like others... So, maybe we can start over again, from that patch option. It would of course also initially introduce some gup-specific helpers, like with the other approach. It seemed harder to generalize when I thought about it back then, but I guess it should not be a lot harder than the _addr_end stuff. Or, maybe this time, just not to risk Christian getting a heart attack, we could go for the gup-specific helper first, so that we would at least have a fix for the possible s390 data corruption. Jason, would you agree that we send a new RFC, this time with the pXd_offset_orig() approach, and have that accepted as a short-term fix? Or would you rather also wait for some proper generic change? I had lost that option from my radar, so I cannot really judge how much more effort it would be. I'm on vacation next week anyway, but Alexander or Vasily (who did the option 1 patch) could look into this further. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 19:10 ` Gerald Schaefer @ 2020-09-10 19:32 ` Linus Torvalds 2020-09-10 21:59 ` Jason Gunthorpe 0 siblings, 1 reply; 62+ messages in thread From: Linus Torvalds @ 2020-09-10 19:32 UTC (permalink / raw) To: Gerald Schaefer Cc: Jason Gunthorpe, Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Thu, Sep 10, 2020 at 12:11 PM Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > > That sounds a lot like the pXd_offset_orig() from Martins first approach > in this thread: > https://lore.kernel.org/linuxppc-dev/20190418100218.0a4afd51@mschwideX1/ I have to admit to finding that name horrible, but aside from that, yes. I don't think "pXd_offset_orig()" makes any sense as a name. Yes, "orig" may make sense as the variable name (as in "this was the original value we read"), but a function name should describe what it *does*, not what the arguments are. Plus "original" doesn't make sense to me anyway, since we're not modifying it. To me, "original" means that there's a final version too, which this interface in no way implies. It's just "this is the value we already read". ("orig" does make some sense in that fault path - because by definition we *are* going to modify the page table entry, that's the whole point of the fault - we need to do something to not keep faulting. 
But here, we're not at all necessarily modifying the page table contents, we're just following them and reading the values once) Of course, I don't know what a better name would be to describe what is actually going on, I'm just explaining why I hate that naming. *Maybe* something like just "pXd_offset_value()" together with a comment explaining that it's given the upper pXd pointer _and_ the value behind it, and it needs to return the next level offset? I dunno. "value" doesn't really seem horribly descriptive either, but at least it doesn't feel actively misleading to me. Yeah, I get hung up on naming sometimes. I don't tend to care much about private local variables ("i" is a perfectly fine variable name), but these kinds of somewhat subtle cross-architecture definitions I feel matter. Linus ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 19:32 ` Linus Torvalds @ 2020-09-10 21:59 ` Jason Gunthorpe 2020-09-11 7:09 ` peterz 0 siblings, 1 reply; 62+ messages in thread From: Jason Gunthorpe @ 2020-09-10 21:59 UTC (permalink / raw) To: Linus Torvalds Cc: Gerald Schaefer, Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Thu, Sep 10, 2020 at 12:32:05PM -0700, Linus Torvalds wrote: > Yeah, I get hung up on naming sometimes. I don't tend to care much > about private local variables ("i" is a perfectly fine variable name), > but these kinds of somewhat subtle cross-architecture definitions I > feel matter. One of the first replies to this patch was to ask "when would I use _orig vs normal", so you are not alone. The name should convey it.. So, I suggest pXX_offset_unlocked() Since it is safe to call without the page table lock, while pXX_offset() requires the page table lock to be held as the internal *pXX is a data race otherwise. Patch 1 might be OK for a stable backport, but to get to a clear pXX_offset_unlocked() all the arches would want to be changed to implement that API and the generic code would provide the wrapper: #define pXX_offset(pXXp, address) pXX_offset_unlocked(pXXp, *(pXXp), address) Arches would not have a *pXX inside their code. Then we can talk about auditing call sites of pXX_offset and think about using the _unlocked version in places where the page table lock is not held. For instance mm/pagewalk.c should be changed.
So should huge_pte_offset() and probably other places. These places might already have existing data-race bugs. It is code-as-documentation indicating an unlocked page table walk. Now it is not just an S390 story but a change that makes the data concurrency much clearer, so I think I prefer this version to the addr_end one too. Regards, Jason ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 21:59 ` Jason Gunthorpe @ 2020-09-11 7:09 ` peterz 2020-09-11 11:19 ` Jason Gunthorpe 2020-09-11 19:03 ` [PATCH] " Vasily Gorbik 0 siblings, 2 replies; 62+ messages in thread From: peterz @ 2020-09-11 7:09 UTC (permalink / raw) To: Jason Gunthorpe Cc: Linus Torvalds, Gerald Schaefer, Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Thu, Sep 10, 2020 at 06:59:21PM -0300, Jason Gunthorpe wrote: > So, I suggest pXX_offset_unlocked() Urgh, no. Elsewhere in gup _unlocked() means it will take the lock itself (get_user_pages_unlocked()) -- although often it seems to mean the lock is already held (git grep _unlocked and marvel). What we want is _lockless(). ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-11 7:09 ` peterz @ 2020-09-11 11:19 ` Jason Gunthorpe 2020-09-11 19:03 ` [PATCH] " Vasily Gorbik 1 sibling, 0 replies; 62+ messages in thread From: Jason Gunthorpe @ 2020-09-11 11:19 UTC (permalink / raw) To: peterz Cc: Linus Torvalds, Gerald Schaefer, Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Fri, Sep 11, 2020 at 09:09:39AM +0200, peterz@infradead.org wrote: > On Thu, Sep 10, 2020 at 06:59:21PM -0300, Jason Gunthorpe wrote: > > So, I suggest pXX_offset_unlocked() > > Urgh, no. Elsewhere in gup _unlocked() means it will take the lock > itself (get_user_pages_unlocked()) -- although often it seems to mean > the lock is already held (git grep _unlocked and marvel). > > What we want is _lockless(). This is clear to me! Thanks, Jason ^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH] mm/gup: fix gup_fast with dynamic page table folding 2020-09-11 7:09 ` peterz 2020-09-11 11:19 ` Jason Gunthorpe @ 2020-09-11 19:03 ` Vasily Gorbik 2020-09-11 19:09 ` Linus Torvalds 2020-09-11 19:40 ` Jason Gunthorpe 1 sibling, 2 replies; 62+ messages in thread From: Vasily Gorbik @ 2020-09-11 19:03 UTC (permalink / raw) To: Jason Gunthorpe, John Hubbard, Linus Torvalds Cc: Gerald Schaefer, Alexander Gordeev, Peter Zijlstra, Dave Hansen, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda Currently, to make sure that every page table entry is read just once, gup_fast walks perform READ_ONCE and pass the pXd value down to the next gup_pXd_range function by value, e.g.: static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) ... pudp = pud_offset(&p4d, addr); This function passes a reference on that local value copy to pXd_offset, and might get the very same pointer in return. This happens when the level is folded (on most arches), and that pointer should not be iterated. On s390, due to the fact that each task might have a different 5-, 4- or 3-level address translation and hence different levels folded, the logic is more complex, and a non-iterable pointer to a local copy leads to severe problems.
Here is an example of what happens with gup_fast on s390, for a task with 3-level paging, crossing a 2 GB pud boundary: // addr = 0x1007ffff000, end = 0x10080001000 static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; pud_t *pudp; // pud_offset returns &p4d itself (a pointer to a value on stack) pudp = pud_offset(&p4d, addr); do { // on second iteration reading "random" stack value pud_t pud = READ_ONCE(*pudp); // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390 next = pud_addr_end(addr, end); ... } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack return 1; } This happens since s390 moved to common gup code with commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust") and commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code"). s390 tried to mimic static level folding by changing pXd_offset primitives to always calculate the top level page table offset in pgd_offset and just return the value passed when pXd_offset has to act as folded. What is crucial for gup_fast, and what has been overlooked, is that PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly. And the latter is not possible with dynamic folding. To fix the issue, in addition to pXd values, pass the original pXdp pointers down to the gup_pXd_range functions. And introduce pXd_offset_lockless helpers, which take an additional pXd entry value parameter.
This has already been discussed in https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 Cc: <stable@vger.kernel.org> # 5.2+ Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> --- arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++---------- include/linux/pgtable.h | 10 ++++++++ mm/gup.c | 18 +++++++------- 3 files changed, 49 insertions(+), 21 deletions(-) diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 7eb01a5459cd..b55561cc8786 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address) #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address) -static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address) +static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address) { - if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1) - return (p4d_t *) pgd_deref(*pgd) + p4d_index(address); - return (p4d_t *) pgd; + if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1) + return (p4d_t *) pgd_deref(pgd) + p4d_index(address); + return (p4d_t *) pgdp; } +#define p4d_offset_lockless p4d_offset_lockless -static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address) +static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address) { - if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2) - return (pud_t *) p4d_deref(*p4d) + pud_index(address); - return (pud_t *) p4d; + return p4d_offset_lockless(pgdp, *pgdp, address); +} + +static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address) +{ + if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2) + return (pud_t *) 
p4d_deref(p4d) + pud_index(address); + return (pud_t *) p4dp; +} +#define pud_offset_lockless pud_offset_lockless + +static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address) +{ + return pud_offset_lockless(p4dp, *p4dp, address); } #define pud_offset pud_offset -static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address) +static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address) +{ + if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3) + return (pmd_t *) pud_deref(pud) + pmd_index(address); + return (pmd_t *) pudp; +} +#define pmd_offset_lockless pmd_offset_lockless + +static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address) { - if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3) - return (pmd_t *) pud_deref(*pud) + pmd_index(address); - return (pmd_t *) pud; + return pmd_offset_lockless(pudp, *pudp, address); } #define pmd_offset pmd_offset diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e8cbc2e795d5..e899d3506671 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask; #define mm_pmd_folded(mm) __is_defined(__PAGETABLE_PMD_FOLDED) #endif +#ifndef p4d_offset_lockless +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&pgd, address) +#endif +#ifndef pud_offset_lockless +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&p4d, address) +#endif +#ifndef pmd_offset_lockless +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&pud, address) +#endif + /* * p?d_leaf() - true if this entry is a final mapping to a physical address. 
* This differs from p?d_huge() by the fact that they are always available (if diff --git a/mm/gup.c b/mm/gup.c index e5739a1974d5..578bf5bd8bf8 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr, return 1; } -static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, +static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; pmd_t *pmdp; - pmdp = pmd_offset(&pud, addr); + pmdp = pmd_offset_lockless(pudp, pud, addr); do { pmd_t pmd = READ_ONCE(*pmdp); @@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, return 1; } -static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, +static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; pud_t *pudp; - pudp = pud_offset(&p4d, addr); + pudp = pud_offset_lockless(p4dp, p4d, addr); do { pud_t pud = READ_ONCE(*pudp); @@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, if (!gup_huge_pd(__hugepd(pud_val(pud)), addr, PUD_SHIFT, next, flags, pages, nr)) return 0; - } else if (!gup_pmd_range(pud, addr, next, flags, pages, nr)) + } else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr)) return 0; } while (pudp++, addr = next, addr != end); return 1; } -static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, +static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; p4d_t *p4dp; - p4dp = p4d_offset(&pgd, addr); + p4dp = p4d_offset_lockless(pgdp, pgd, addr); do { p4d_t p4d = READ_ONCE(*p4dp); @@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, if 
(!gup_huge_pd(__hugepd(p4d_val(p4d)), addr, P4D_SHIFT, next, flags, pages, nr)) return 0; - } else if (!gup_pud_range(p4d, addr, next, flags, pages, nr)) + } else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr)) return 0; } while (p4dp++, addr = next, addr != end); @@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end, if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr, PGDIR_SHIFT, next, flags, pages, nr)) return; - } else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr)) + } else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr)) return; } while (pgdp++, addr = next, addr != end); } -- ⣿⣿⣿⣿⢋⡀⣀⠹⣿⣿⣿⣿ ⣿⣿⣿⣿⠠⣶⡦⠀⣿⣿⣿⣿ ⣿⣿⣿⠏⣴⣮⣴⣧⠈⢿⣿⣿ ⣿⣿⡏⢰⣿⠖⣠⣿⡆⠈⣿⣿ ⣿⢛⣵⣄⠙⣶⣶⡟⣅⣠⠹⣿ ⣿⣜⣛⠻⢎⣉⣉⣀⠿⣫⣵⣿ ^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [PATCH] mm/gup: fix gup_fast with dynamic page table folding 2020-09-11 19:03 ` [PATCH] " Vasily Gorbik @ 2020-09-11 19:09 ` Linus Torvalds 2020-09-11 19:40 ` Jason Gunthorpe 1 sibling, 0 replies; 62+ messages in thread From: Linus Torvalds @ 2020-09-11 19:09 UTC (permalink / raw) To: Vasily Gorbik Cc: Jason Gunthorpe, John Hubbard, Gerald Schaefer, Alexander Gordeev, Peter Zijlstra, Dave Hansen, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Fri, Sep 11, 2020 at 12:04 PM Vasily Gorbik <gor@linux.ibm.com> wrote: > > Currently to make sure that every page table entry is read just once > gup_fast walks perform READ_ONCE and pass pXd value down to the next > gup_pXd_range function by value e.g.: [ ... ] Ack, this looks sane to me. I was going to ask how horrible it would be to convert all the other users, but a quick grep convinced me that yeah, it's only GUP that is this special, and we don't want to make this interface be the real one for everything else too.. Linus ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH] mm/gup: fix gup_fast with dynamic page table folding 2020-09-11 19:03 ` [PATCH] " Vasily Gorbik 2020-09-11 19:09 ` Linus Torvalds @ 2020-09-11 19:40 ` Jason Gunthorpe 2020-09-11 20:05 ` Jason Gunthorpe 1 sibling, 1 reply; 62+ messages in thread From: Jason Gunthorpe @ 2020-09-11 19:40 UTC (permalink / raw) To: Vasily Gorbik Cc: John Hubbard, Linus Torvalds, Gerald Schaefer, Alexander Gordeev, Peter Zijlstra, Dave Hansen, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Fri, Sep 11, 2020 at 09:03:06PM +0200, Vasily Gorbik wrote: > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index e8cbc2e795d5..e899d3506671 100644 > +++ b/include/linux/pgtable.h > @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask; > #define mm_pmd_folded(mm) __is_defined(__PAGETABLE_PMD_FOLDED) > #endif > > +#ifndef p4d_offset_lockless > +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&pgd, address) > +#endif > +#ifndef pud_offset_lockless > +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&p4d, address) > +#endif > +#ifndef pmd_offset_lockless > +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&pud, address) Needs brackets: &(pgd) These would probably be better as static inlines though, as only s390 compiles will type check pudp like this. Jason ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH] mm/gup: fix gup_fast with dynamic page table folding 2020-09-11 19:40 ` Jason Gunthorpe @ 2020-09-11 20:05 ` Jason Gunthorpe 2020-09-11 20:36 ` [PATCH v2] " Vasily Gorbik 0 siblings, 1 reply; 62+ messages in thread From: Jason Gunthorpe @ 2020-09-11 20:05 UTC (permalink / raw) To: Vasily Gorbik Cc: John Hubbard, Linus Torvalds, Gerald Schaefer, Alexander Gordeev, Peter Zijlstra, Dave Hansen, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Fri, Sep 11, 2020 at 04:40:00PM -0300, Jason Gunthorpe wrote: > These would probably be better as static inlines though, as only s390 > compiles will type check pudp like this. Never mind, it must be a macro - still need brackets though Jason ^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding 2020-09-11 20:05 ` Jason Gunthorpe @ 2020-09-11 20:36 ` Vasily Gorbik 2020-09-15 17:09 ` Vasily Gorbik ` (3 more replies) 0 siblings, 4 replies; 62+ messages in thread From: Vasily Gorbik @ 2020-09-11 20:36 UTC (permalink / raw) To: Jason Gunthorpe, John Hubbard Cc: Linus Torvalds, Gerald Schaefer, Alexander Gordeev, Peter Zijlstra, Dave Hansen, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda Currently, to make sure that every page table entry is read just once, gup_fast walks perform READ_ONCE and pass the pXd value down to the next gup_pXd_range function by value, e.g.: static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) ... pudp = pud_offset(&p4d, addr); This function passes a reference to that local value copy to pXd_offset, and might get the very same pointer in return. This happens when the level is folded (on most arches), and that pointer should not be iterated. On s390, due to the fact that each task might have a different 5-, 4- or 3-level address translation and hence different levels folded, the logic is more complex, and a non-iterable pointer to a local copy leads to severe problems. 
Here is an example of what happens with gup_fast on s390, for a task with 3-level paging, crossing a 2 GB pud boundary: // addr = 0x1007ffff000, end = 0x10080001000 static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; pud_t *pudp; // pud_offset returns &p4d itself (a pointer to a value on stack) pudp = pud_offset(&p4d, addr); do { // on second iteration reading "random" stack value pud_t pud = READ_ONCE(*pudp); // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390 next = pud_addr_end(addr, end); ... } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack return 1; } This happens since s390 moved to common gup code with commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust") and commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code"). s390 tried to mimic static level folding by changing the pXd_offset primitives to always calculate the top level page table offset in pgd_offset and just return the value passed in when pXd_offset has to act as folded. What is crucial for gup_fast, and what has been overlooked, is that PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly. And the latter is not possible with dynamic folding. To fix the issue, in addition to the pXd values, pass the original pXdp pointers down to the gup_pXd_range functions, and introduce pXd_offset_lockless helpers, which take an additional pXd entry value parameter. 
This has already been discussed in https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 Cc: <stable@vger.kernel.org> # 5.2+ Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> --- v2: added brackets &pgd -> &(pgd) arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++---------- include/linux/pgtable.h | 10 ++++++++ mm/gup.c | 18 +++++++------- 3 files changed, 49 insertions(+), 21 deletions(-) diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 7eb01a5459cd..b55561cc8786 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address) #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address) -static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address) +static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address) { - if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1) - return (p4d_t *) pgd_deref(*pgd) + p4d_index(address); - return (p4d_t *) pgd; + if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1) + return (p4d_t *) pgd_deref(pgd) + p4d_index(address); + return (p4d_t *) pgdp; } +#define p4d_offset_lockless p4d_offset_lockless -static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address) +static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address) { - if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2) - return (pud_t *) p4d_deref(*p4d) + pud_index(address); - return (pud_t *) p4d; + return p4d_offset_lockless(pgdp, *pgdp, address); +} + +static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address) +{ + if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= 
_REGION_ENTRY_TYPE_R2) + return (pud_t *) p4d_deref(p4d) + pud_index(address); + return (pud_t *) p4dp; +} +#define pud_offset_lockless pud_offset_lockless + +static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address) +{ + return pud_offset_lockless(p4dp, *p4dp, address); } #define pud_offset pud_offset -static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address) +static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address) +{ + if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3) + return (pmd_t *) pud_deref(pud) + pmd_index(address); + return (pmd_t *) pudp; +} +#define pmd_offset_lockless pmd_offset_lockless + +static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address) { - if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3) - return (pmd_t *) pud_deref(*pud) + pmd_index(address); - return (pmd_t *) pud; + return pmd_offset_lockless(pudp, *pudp, address); } #define pmd_offset pmd_offset diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e8cbc2e795d5..90654cb63e9e 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask; #define mm_pmd_folded(mm) __is_defined(__PAGETABLE_PMD_FOLDED) #endif +#ifndef p4d_offset_lockless +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address) +#endif +#ifndef pud_offset_lockless +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address) +#endif +#ifndef pmd_offset_lockless +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address) +#endif + /* * p?d_leaf() - true if this entry is a final mapping to a physical address. 
* This differs from p?d_huge() by the fact that they are always available (if diff --git a/mm/gup.c b/mm/gup.c index e5739a1974d5..578bf5bd8bf8 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr, return 1; } -static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, +static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; pmd_t *pmdp; - pmdp = pmd_offset(&pud, addr); + pmdp = pmd_offset_lockless(pudp, pud, addr); do { pmd_t pmd = READ_ONCE(*pmdp); @@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, return 1; } -static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, +static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; pud_t *pudp; - pudp = pud_offset(&p4d, addr); + pudp = pud_offset_lockless(p4dp, p4d, addr); do { pud_t pud = READ_ONCE(*pudp); @@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, if (!gup_huge_pd(__hugepd(pud_val(pud)), addr, PUD_SHIFT, next, flags, pages, nr)) return 0; - } else if (!gup_pmd_range(pud, addr, next, flags, pages, nr)) + } else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr)) return 0; } while (pudp++, addr = next, addr != end); return 1; } -static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, +static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; p4d_t *p4dp; - p4dp = p4d_offset(&pgd, addr); + p4dp = p4d_offset_lockless(pgdp, pgd, addr); do { p4d_t p4d = READ_ONCE(*p4dp); @@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, if 
(!gup_huge_pd(__hugepd(p4d_val(p4d)), addr, P4D_SHIFT, next, flags, pages, nr)) return 0; - } else if (!gup_pud_range(p4d, addr, next, flags, pages, nr)) + } else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr)) return 0; } while (p4dp++, addr = next, addr != end); @@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end, if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr, PGDIR_SHIFT, next, flags, pages, nr)) return; - } else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr)) + } else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr)) return; } while (pgdp++, addr = next, addr != end); } -- ⣿⣿⣿⣿⢋⡀⣀⠹⣿⣿⣿⣿ ⣿⣿⣿⣿⠠⣶⡦⠀⣿⣿⣿⣿ ⣿⣿⣿⠏⣴⣮⣴⣧⠈⢿⣿⣿ ⣿⣿⡏⢰⣿⠖⣠⣿⡆⠈⣿⣿ ⣿⢛⣵⣄⠙⣶⣶⡟⣅⣠⠹⣿ ⣿⣜⣛⠻⢎⣉⣉⣀⠿⣫⣵⣿ ^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding 2020-09-11 20:36 ` [PATCH v2] " Vasily Gorbik @ 2020-09-15 17:09 ` Vasily Gorbik 2020-09-15 17:14 ` Jason Gunthorpe ` (2 subsequent siblings) 3 siblings, 0 replies; 62+ messages in thread From: Vasily Gorbik @ 2020-09-15 17:09 UTC (permalink / raw) To: Andrew Morton, Jason Gunthorpe, John Hubbard Cc: Gerald Schaefer, Alexander Gordeev, Linus Torvalds, Peter Zijlstra, Dave Hansen, LKML, linux-mm, linux-arch, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Fri, Sep 11, 2020 at 10:36:43PM +0200, Vasily Gorbik wrote: > Currently to make sure that every page table entry is read just once > gup_fast walks perform READ_ONCE and pass pXd value down to the next > gup_pXd_range function by value e.g.: ...snip... > --- > v2: added brackets &pgd -> &(pgd) > > arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++---------- > include/linux/pgtable.h | 10 ++++++++ > mm/gup.c | 18 +++++++------- > 3 files changed, 49 insertions(+), 21 deletions(-) Andrew, any chance you would pick this up? There is an Ack from Linus. And I haven't seen any objections from Jason or John. This seems to be as safe for other architectures as possible. @Jason and John Any acks/nacks? Thank you, Vasily ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding 2020-09-11 20:36 ` [PATCH v2] " Vasily Gorbik 2020-09-15 17:09 ` Vasily Gorbik @ 2020-09-15 17:14 ` Jason Gunthorpe 2020-09-15 17:18 ` Mike Rapoport 2020-09-15 17:31 ` John Hubbard 3 siblings, 0 replies; 62+ messages in thread From: Jason Gunthorpe @ 2020-09-15 17:14 UTC (permalink / raw) To: Vasily Gorbik Cc: John Hubbard, Linus Torvalds, Gerald Schaefer, Alexander Gordeev, Peter Zijlstra, Dave Hansen, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Fri, Sep 11, 2020 at 10:36:43PM +0200, Vasily Gorbik wrote: > Currently to make sure that every page table entry is read just once > gup_fast walks perform READ_ONCE and pass pXd value down to the next > gup_pXd_range function by value e.g.: > > static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, > unsigned int flags, struct page **pages, int *nr) > ... > pudp = pud_offset(&p4d, addr); > > This function passes a reference to that local value copy to pXd_offset, > and might get the very same pointer in return. This happens when the > level is folded (on most arches), and that pointer should not be iterated. > > On s390 due to the fact that each task might have different 5,4 or > 3-level address translation and hence different levels folded the logic > is more complex and a non-iterable pointer to a local copy leads to > severe problems. 
> > Here is an example of what happens with gup_fast on s390, for a task > with 3-level paging, crossing a 2 GB pud boundary: > > // addr = 0x1007ffff000, end = 0x10080001000 > static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, > unsigned int flags, struct page **pages, int *nr) > { > unsigned long next; > pud_t *pudp; > > // pud_offset returns &p4d itself (a pointer to a value on stack) > pudp = pud_offset(&p4d, addr); > do { > // on second iteration reading "random" stack value > pud_t pud = READ_ONCE(*pudp); > > // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390 > next = pud_addr_end(addr, end); > ... > } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack > > return 1; > } > > This happens since s390 moved to common gup code with > commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust") > and commit 1a42010cdc26 ("s390/mm: convert to the generic > get_user_pages_fast code"). s390 tried to mimic static level folding by > changing pXd_offset primitives to always calculate top level page table > offset in pgd_offset and just return the value passed when pXd_offset > has to act as folded. > > What is crucial for gup_fast and what has been overlooked is > that PxD_SIZE/MASK and thus pXd_addr_end should also change > correspondingly. And the latter is not possible with dynamic folding. > > To fix the issue in addition to pXd values pass original > pXdp pointers down to gup_pXd_range functions. And introduce > pXd_offset_lockless helpers, which take an additional pXd > entry value parameter. 
This has already been discussed in > https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 > > Cc: <stable@vger.kernel.org> # 5.2+ > Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") > Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> > Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> > Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> > --- > v2: added brackets &pgd -> &(pgd) Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Regards, Jason ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding 2020-09-11 20:36 ` [PATCH v2] " Vasily Gorbik 2020-09-15 17:09 ` Vasily Gorbik 2020-09-15 17:14 ` Jason Gunthorpe @ 2020-09-15 17:18 ` Mike Rapoport 2020-09-15 17:31 ` John Hubbard 3 siblings, 0 replies; 62+ messages in thread From: Mike Rapoport @ 2020-09-15 17:18 UTC (permalink / raw) To: Vasily Gorbik Cc: Jason Gunthorpe, John Hubbard, Linus Torvalds, Gerald Schaefer, Alexander Gordeev, Peter Zijlstra, Dave Hansen, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Fri, Sep 11, 2020 at 10:36:43PM +0200, Vasily Gorbik wrote: > Currently to make sure that every page table entry is read just once > gup_fast walks perform READ_ONCE and pass pXd value down to the next > gup_pXd_range function by value e.g.: > > static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, > unsigned int flags, struct page **pages, int *nr) > ... > pudp = pud_offset(&p4d, addr); > > This function passes a reference to that local value copy to pXd_offset, > and might get the very same pointer in return. This happens when the > level is folded (on most arches), and that pointer should not be iterated. > > On s390 due to the fact that each task might have different 5,4 or > 3-level address translation and hence different levels folded the logic > is more complex and a non-iterable pointer to a local copy leads to > severe problems. 
> > Here is an example of what happens with gup_fast on s390, for a task > with 3-level paging, crossing a 2 GB pud boundary: > > // addr = 0x1007ffff000, end = 0x10080001000 > static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, > unsigned int flags, struct page **pages, int *nr) > { > unsigned long next; > pud_t *pudp; > > // pud_offset returns &p4d itself (a pointer to a value on stack) > pudp = pud_offset(&p4d, addr); > do { > // on second iteration reading "random" stack value > pud_t pud = READ_ONCE(*pudp); > > // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390 > next = pud_addr_end(addr, end); > ... > } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack > > return 1; > } > > This happens since s390 moved to common gup code with > commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust") > and commit 1a42010cdc26 ("s390/mm: convert to the generic > get_user_pages_fast code"). s390 tried to mimic static level folding by > changing pXd_offset primitives to always calculate top level page table > offset in pgd_offset and just return the value passed when pXd_offset > has to act as folded. > > What is crucial for gup_fast and what has been overlooked is > that PxD_SIZE/MASK and thus pXd_addr_end should also change > correspondingly. And the latter is not possible with dynamic folding. > > To fix the issue in addition to pXd values pass original > pXdp pointers down to gup_pXd_range functions. And introduce > pXd_offset_lockless helpers, which take an additional pXd > entry value parameter. 
This has already been discussed in > https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 > > Cc: <stable@vger.kernel.org> # 5.2+ > Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") > Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> > Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> > Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> > --- > v2: added brackets &pgd -> &(pgd) > > arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++---------- > include/linux/pgtable.h | 10 ++++++++ > mm/gup.c | 18 +++++++------- > 3 files changed, 49 insertions(+), 21 deletions(-) > > diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h > index 7eb01a5459cd..b55561cc8786 100644 > --- a/arch/s390/include/asm/pgtable.h > +++ b/arch/s390/include/asm/pgtable.h > @@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address) > > #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address) > > -static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address) > +static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address) > { > - if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1) > - return (p4d_t *) pgd_deref(*pgd) + p4d_index(address); > - return (p4d_t *) pgd; > + if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1) > + return (p4d_t *) pgd_deref(pgd) + p4d_index(address); > + return (p4d_t *) pgdp; > } > +#define p4d_offset_lockless p4d_offset_lockless > > -static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address) > +static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address) > { > - if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2) > - return (pud_t *) p4d_deref(*p4d) + pud_index(address); > - return (pud_t *) p4d; > + return p4d_offset_lockless(pgdp, *pgdp, address); > +} > + > +static inline 
pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address) > +{ > + if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2) > + return (pud_t *) p4d_deref(p4d) + pud_index(address); > + return (pud_t *) p4dp; > +} > +#define pud_offset_lockless pud_offset_lockless > + > +static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address) > +{ > + return pud_offset_lockless(p4dp, *p4dp, address); > } > #define pud_offset pud_offset > > -static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address) > +static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address) > +{ > + if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3) > + return (pmd_t *) pud_deref(pud) + pmd_index(address); > + return (pmd_t *) pudp; > +} > +#define pmd_offset_lockless pmd_offset_lockless > + > +static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address) > { > - if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3) > - return (pmd_t *) pud_deref(*pud) + pmd_index(address); > - return (pmd_t *) pud; > + return pmd_offset_lockless(pudp, *pudp, address); > } > #define pmd_offset pmd_offset > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index e8cbc2e795d5..90654cb63e9e 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask; > #define mm_pmd_folded(mm) __is_defined(__PAGETABLE_PMD_FOLDED) > #endif > > +#ifndef p4d_offset_lockless > +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address) > +#endif > +#ifndef pud_offset_lockless > +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address) > +#endif > +#ifndef pmd_offset_lockless > +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address) > +#endif > + > /* > * p?d_leaf() - true if this entry is a final mapping to a physical address. 
> * This differs from p?d_huge() by the fact that they are always available (if > diff --git a/mm/gup.c b/mm/gup.c > index e5739a1974d5..578bf5bd8bf8 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr, > return 1; > } > > -static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, > +static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end, > unsigned int flags, struct page **pages, int *nr) > { > unsigned long next; > pmd_t *pmdp; > > - pmdp = pmd_offset(&pud, addr); > + pmdp = pmd_offset_lockless(pudp, pud, addr); > do { > pmd_t pmd = READ_ONCE(*pmdp); > > @@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, > return 1; > } > > -static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, > +static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end, > unsigned int flags, struct page **pages, int *nr) > { > unsigned long next; > pud_t *pudp; > > - pudp = pud_offset(&p4d, addr); > + pudp = pud_offset_lockless(p4dp, p4d, addr); > do { > pud_t pud = READ_ONCE(*pudp); > > @@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, > if (!gup_huge_pd(__hugepd(pud_val(pud)), addr, > PUD_SHIFT, next, flags, pages, nr)) > return 0; > - } else if (!gup_pmd_range(pud, addr, next, flags, pages, nr)) > + } else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr)) > return 0; > } while (pudp++, addr = next, addr != end); > > return 1; > } > > -static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, > +static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end, > unsigned int flags, struct page **pages, int *nr) > { > unsigned long next; > p4d_t *p4dp; > > - p4dp = p4d_offset(&pgd, addr); > + p4dp = p4d_offset_lockless(pgdp, pgd, addr); > do { > p4d_t p4d = 
READ_ONCE(*p4dp); > > @@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, > if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr, > P4D_SHIFT, next, flags, pages, nr)) > return 0; > - } else if (!gup_pud_range(p4d, addr, next, flags, pages, nr)) > + } else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr)) > return 0; > } while (p4dp++, addr = next, addr != end); > > @@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end, > if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr, > PGDIR_SHIFT, next, flags, pages, nr)) > return; > - } else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr)) > + } else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr)) > return; > } while (pgdp++, addr = next, addr != end); > } > -- > ⣿⣿⣿⣿⢋⡀⣀⠹⣿⣿⣿⣿ > ⣿⣿⣿⣿⠠⣶⡦⠀⣿⣿⣿⣿ > ⣿⣿⣿⠏⣴⣮⣴⣧⠈⢿⣿⣿ > ⣿⣿⡏⢰⣿⠖⣠⣿⡆⠈⣿⣿ > ⣿⢛⣵⣄⠙⣶⣶⡟⣅⣠⠹⣿ > ⣿⣜⣛⠻⢎⣉⣉⣀⠿⣫⣵⣿ -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding 2020-09-11 20:36 ` [PATCH v2] " Vasily Gorbik ` (2 preceding siblings ...) 2020-09-15 17:18 ` Mike Rapoport @ 2020-09-15 17:31 ` John Hubbard 3 siblings, 0 replies; 62+ messages in thread From: John Hubbard @ 2020-09-15 17:31 UTC (permalink / raw) To: Vasily Gorbik, Jason Gunthorpe Cc: Linus Torvalds, Gerald Schaefer, Alexander Gordeev, Peter Zijlstra, Dave Hansen, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On 9/11/20 1:36 PM, Vasily Gorbik wrote: > Currently to make sure that every page table entry is read just once > gup_fast walks perform READ_ONCE and pass pXd value down to the next > gup_pXd_range function by value e.g.: > > static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, > unsigned int flags, struct page **pages, int *nr) > ... > pudp = pud_offset(&p4d, addr); > > This function passes a reference to that local value copy to pXd_offset, > and might get the very same pointer in return. This happens when the > level is folded (on most arches), and that pointer should not be iterated. > > On s390 due to the fact that each task might have different 5,4 or > 3-level address translation and hence different levels folded the logic > is more complex and a non-iterable pointer to a local copy leads to > severe problems. 
> > Here is an example of what happens with gup_fast on s390, for a task > with 3-level paging, crossing a 2 GB pud boundary: > > // addr = 0x1007ffff000, end = 0x10080001000 > static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, > unsigned int flags, struct page **pages, int *nr) > { > unsigned long next; > pud_t *pudp; > > // pud_offset returns &p4d itself (a pointer to a value on stack) > pudp = pud_offset(&p4d, addr); > do { > // on second iteration reading "random" stack value > pud_t pud = READ_ONCE(*pudp); > > // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390 > next = pud_addr_end(addr, end); > ... > } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack > > return 1; > } > > This happens since s390 moved to common gup code with > commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust") > and commit 1a42010cdc26 ("s390/mm: convert to the generic > get_user_pages_fast code"). s390 tried to mimic static level folding by > changing pXd_offset primitives to always calculate top level page table > offset in pgd_offset and just return the value passed when pXd_offset > has to act as folded. > > What is crucial for gup_fast and what has been overlooked is > that PxD_SIZE/MASK and thus pXd_addr_end should also change > correspondingly. And the latter is not possible with dynamic folding. > > To fix the issue in addition to pXd values pass original > pXdp pointers down to gup_pXd_range functions. And introduce > pXd_offset_lockless helpers, which take an additional pXd > entry value parameter. 
This has already been discussed in > https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 > > Cc: <stable@vger.kernel.org> # 5.2+ > Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") > Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> > Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> > Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> > --- Looks cleaner than I'd dared hope for. :) Reviewed-by: John Hubbard <jhubbard@nvidia.com> thanks, -- John Hubbard NVIDIA > v2: added brackets &pgd -> &(pgd) > > arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++---------- > include/linux/pgtable.h | 10 ++++++++ > mm/gup.c | 18 +++++++------- > 3 files changed, 49 insertions(+), 21 deletions(-) > > diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h > index 7eb01a5459cd..b55561cc8786 100644 > --- a/arch/s390/include/asm/pgtable.h > +++ b/arch/s390/include/asm/pgtable.h > @@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address) > > #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address) > > -static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address) > +static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address) > { > - if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1) > - return (p4d_t *) pgd_deref(*pgd) + p4d_index(address); > - return (p4d_t *) pgd; > + if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1) > + return (p4d_t *) pgd_deref(pgd) + p4d_index(address); > + return (p4d_t *) pgdp; > } > +#define p4d_offset_lockless p4d_offset_lockless > > -static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address) > +static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address) > { > - if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2) > - return (pud_t *) p4d_deref(*p4d) + pud_index(address); > - return (pud_t *) p4d; > + 
return p4d_offset_lockless(pgdp, *pgdp, address); > +} > + > +static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address) > +{ > + if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2) > + return (pud_t *) p4d_deref(p4d) + pud_index(address); > + return (pud_t *) p4dp; > +} > +#define pud_offset_lockless pud_offset_lockless > + > +static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address) > +{ > + return pud_offset_lockless(p4dp, *p4dp, address); > } > #define pud_offset pud_offset > > -static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address) > +static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address) > +{ > + if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3) > + return (pmd_t *) pud_deref(pud) + pmd_index(address); > + return (pmd_t *) pudp; > +} > +#define pmd_offset_lockless pmd_offset_lockless > + > +static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address) > { > - if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3) > - return (pmd_t *) pud_deref(*pud) + pmd_index(address); > - return (pmd_t *) pud; > + return pmd_offset_lockless(pudp, *pudp, address); > } > #define pmd_offset pmd_offset > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index e8cbc2e795d5..90654cb63e9e 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask; > #define mm_pmd_folded(mm) __is_defined(__PAGETABLE_PMD_FOLDED) > #endif > > +#ifndef p4d_offset_lockless > +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address) > +#endif > +#ifndef pud_offset_lockless > +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address) > +#endif > +#ifndef pmd_offset_lockless > +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address) > +#endif > + > /* > * p?d_leaf() - true if this entry is a final mapping to a 
physical address. > * This differs from p?d_huge() by the fact that they are always available (if > diff --git a/mm/gup.c b/mm/gup.c > index e5739a1974d5..578bf5bd8bf8 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr, > return 1; > } > > -static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, > +static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end, > unsigned int flags, struct page **pages, int *nr) > { > unsigned long next; > pmd_t *pmdp; > > - pmdp = pmd_offset(&pud, addr); > + pmdp = pmd_offset_lockless(pudp, pud, addr); > do { > pmd_t pmd = READ_ONCE(*pmdp); > > @@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, > return 1; > } > > -static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, > +static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end, > unsigned int flags, struct page **pages, int *nr) > { > unsigned long next; > pud_t *pudp; > > - pudp = pud_offset(&p4d, addr); > + pudp = pud_offset_lockless(p4dp, p4d, addr); > do { > pud_t pud = READ_ONCE(*pudp); > > @@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, > if (!gup_huge_pd(__hugepd(pud_val(pud)), addr, > PUD_SHIFT, next, flags, pages, nr)) > return 0; > - } else if (!gup_pmd_range(pud, addr, next, flags, pages, nr)) > + } else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr)) > return 0; > } while (pudp++, addr = next, addr != end); > > return 1; > } > > -static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, > +static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end, > unsigned int flags, struct page **pages, int *nr) > { > unsigned long next; > p4d_t *p4dp; > > - p4dp = p4d_offset(&pgd, addr); > + p4dp = p4d_offset_lockless(pgdp, pgd, addr); > do { > p4d_t 
p4d = READ_ONCE(*p4dp); > > @@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, > if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr, > P4D_SHIFT, next, flags, pages, nr)) > return 0; > - } else if (!gup_pud_range(p4d, addr, next, flags, pages, nr)) > + } else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr)) > return 0; > } while (p4dp++, addr = next, addr != end); > > @@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end, > if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr, > PGDIR_SHIFT, next, flags, pages, nr)) > return; > - } else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr)) > + } else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr)) > return; > } while (pgdp++, addr = next, addr != end); > } > ^ permalink raw reply [flat|nested] 62+ messages in thread
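The core of the s390 fix above is easy to miss in the diff: the new _lockless variants decide folding from the entry *value* the caller already read, and only fall back to the passed-in pointer when the level is folded. A toy userspace sketch of that shape (all names, types and bit layouts here are invented for illustration, not the real s390 encoding):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t toy_pgd_t;
typedef uint64_t toy_p4d_t;

#define TOY_TYPE_MASK  0x3UL      /* invented: low bits encode the region type */
#define TOY_TYPE_R1    0x3UL      /* invented: "region 1" => a real p4d level exists */
#define TOY_ADDR_MASK  (~0xfffUL) /* invented: the rest of the entry is the table origin */

/* A 4 KiB-aligned dummy "next level table", so the origin round-trips
 * cleanly through TOY_ADDR_MASK. */
static toy_p4d_t toy_table[512] __attribute__((aligned(4096)));

/* Like the patch's p4d_offset_lockless(): decide folding from the value
 * "pgd" the caller already loaded with READ_ONCE, never from *pgdp again. */
static toy_p4d_t *toy_p4d_offset_lockless(toy_pgd_t *pgdp, toy_pgd_t pgd,
					  unsigned long address)
{
	(void)address;	/* a real implementation would add p4d_index(address) */
	if ((pgd & TOY_TYPE_MASK) >= TOY_TYPE_R1)
		return (toy_p4d_t *)(uintptr_t)(pgd & TOY_ADDR_MASK);
	return (toy_p4d_t *)pgdp;	/* folded: keep walking the same table */
}
```

On architectures without dynamic folding, the generic fallback macros in the patch reduce to pXd_offset(&(pXd), address), i.e. exactly the old stack-copy behavior, so only s390 changes.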
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 18:13 ` Jason Gunthorpe 2020-09-10 18:33 ` Linus Torvalds @ 2020-09-10 21:22 ` John Hubbard 2020-09-10 22:11 ` Jason Gunthorpe 1 sibling, 1 reply; 62+ messages in thread From: John Hubbard @ 2020-09-10 21:22 UTC (permalink / raw) To: Jason Gunthorpe, Linus Torvalds Cc: Alexander Gordeev, Gerald Schaefer, Dave Hansen, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On 9/10/20 11:13 AM, Jason Gunthorpe wrote: > On Thu, Sep 10, 2020 at 10:35:38AM -0700, Linus Torvalds wrote: >> On Thu, Sep 10, 2020 at 2:40 AM Alexander Gordeev >> <agordeev@linux.ibm.com> wrote: >>> >>> It is only gup_fast case that exposes the issue. It hits because >>> pointers to stack copies are passed to gup_pXd_range iterators, not >>> pointers to real page tables itself. >> >> Can we possibly change fast-gup to not do the stack copies? >> >> I'd actually rather do something like that, than the "addr_end" thing. > >> As you say, none of the other page table walking code does what the >> GUP code does, and I don't think it's required. > > As I understand it, the requirement is because fast-gup walks without > the page table spinlock, or mmap_sem held so it must READ_ONCE the > *pXX. > > It then checks that it is a valid page table pointer, then calls > pXX_offset(). > > The arch implementation of pXX_offset() derefs again the passed pXX > pointer. So it defeats the READ_ONCE and the 2nd load could observe > something that is no longer a page table pointer and crash. 
Just to be clear, though, that makes it sound a little wilder and more reckless than it really is, right? Because actually, the page tables cannot be freed while gup_fast is walking them, due to either IPI blocking during the walk, or the moral equivalent (MMU_GATHER_RCU_TABLE_FREE) for non-IPI architectures. So the page tables can *change* underneath gup_fast, and for example pages can be unmapped. But they remain valid page tables, it's just that their contents are unstable. Even if pXd_none()==true. Or am I way off here, and it really is possible (aside from the current s390 situation) to observe something that "is no longer a page table"? thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 21:22 ` [RFC PATCH v2 1/3] " John Hubbard @ 2020-09-10 22:11 ` Jason Gunthorpe 2020-09-10 22:17 ` John Hubbard 2020-09-11 12:19 ` Alexander Gordeev 0 siblings, 2 replies; 62+ messages in thread From: Jason Gunthorpe @ 2020-09-10 22:11 UTC (permalink / raw) To: John Hubbard Cc: Linus Torvalds, Alexander Gordeev, Gerald Schaefer, Dave Hansen, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Thu, Sep 10, 2020 at 02:22:37PM -0700, John Hubbard wrote: > Or am I way off here, and it really is possible (aside from the current > s390 situation) to observe something that "is no longer a page table"? Yes, that is the issue. Remember there is no locking for GUP fast. While a page table cannot be freed there is nothing preventing the page table entry from being concurrently modified. Without the stack variable it looks like this: pud_t pud = READ_ONCE(*pudp); if (!pud_present(pud)) return pmd_offset(pudp, address); And pmd_offset() expands to return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address); Between the READ_ONCE(*pudp) and (*pud) inside pmd_offset() the value of *pud can change, eg to !pud_present. Then pud_page_vaddr(*pud) will crash. It is not use after free, it is using data that has not been validated. Jason ^ permalink raw reply [flat|nested] 62+ messages in thread
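The race window Jason describes can be modeled in a few lines of plain userspace C. This is only a sketch of the two access patterns, with invented names (fake_pud_t, ENTRY_PRESENT), not kernel code:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t fake_pud_t;

#define ENTRY_PRESENT 0x1UL

/* Unsafe shape: like pmd_offset() expanding to pud_page_vaddr(*pud),
 * this dereferences the pointer a second time, after validation. */
static uint64_t table_base_unsafe(volatile fake_pud_t *pudp)
{
	return *pudp & ~ENTRY_PRESENT;	/* second, unvalidated load */
}

/* Safe shape: derive everything from the snapshot that was checked. */
static uint64_t table_base_safe(fake_pud_t pud)
{
	return pud & ~ENTRY_PRESENT;
}

/* racy_clear simulates a concurrent pud_clear() landing exactly in the
 * window between the validated load and the second dereference. */
static uint64_t walk_one(volatile fake_pud_t *pudp, int racy_clear, int use_safe)
{
	fake_pud_t pud = *pudp;		/* stands in for READ_ONCE(*pudp) */

	if (!(pud & ENTRY_PRESENT))
		return 0;		/* would bail out: not present */

	if (racy_clear)
		*pudp = 0;		/* the concurrent modification */

	return use_safe ? table_base_safe(pud) : table_base_unsafe(pudp);
}
```

With the unsafe shape, the value actually used is the post-race one that was never checked for pud_present(); in the kernel that unvalidated value is what pud_page_vaddr() would then decode and crash on.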
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 22:11 ` Jason Gunthorpe @ 2020-09-10 22:17 ` John Hubbard 2020-09-11 12:19 ` Alexander Gordeev 1 sibling, 0 replies; 62+ messages in thread From: John Hubbard @ 2020-09-10 22:17 UTC (permalink / raw) To: Jason Gunthorpe Cc: Linus Torvalds, Alexander Gordeev, Gerald Schaefer, Dave Hansen, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On 9/10/20 3:11 PM, Jason Gunthorpe wrote: > On Thu, Sep 10, 2020 at 02:22:37PM -0700, John Hubbard wrote: > >> Or am I way off here, and it really is possible (aside from the current >> s390 situation) to observe something that "is no longer a page table"? > > Yes, that is the issue. Remember there is no locking for GUP > fast. While a page table cannot be freed there is nothing preventing > the page table entry from being concurrently modified. > OK, then we are saying the same thing after all, good. > Without the stack variable it looks like this: > > pud_t pud = READ_ONCE(*pudp); > if (!pud_present(pud)) > return > pmd_offset(pudp, address); > > And pmd_offset() expands to > > return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address); > > Between the READ_ONCE(*pudp) and (*pud) inside pmd_offset() the value > of *pud can change, eg to !pud_present. > > Then pud_page_vaddr(*pud) will crash. It is not use after free, it > is using data that has not been validated. > Right, that matches what I had in mind, too: you can still have a problem even though you're in the same page table. 
I just wanted to confirm that there's not some odd way to launch out into completely non-page-table memory. thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-10 22:11 ` Jason Gunthorpe 2020-09-10 22:17 ` John Hubbard @ 2020-09-11 12:19 ` Alexander Gordeev 2020-09-11 16:45 ` Linus Torvalds 1 sibling, 1 reply; 62+ messages in thread From: Alexander Gordeev @ 2020-09-11 12:19 UTC (permalink / raw) To: Jason Gunthorpe Cc: John Hubbard, Linus Torvalds, Gerald Schaefer, Dave Hansen, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Thu, Sep 10, 2020 at 07:11:16PM -0300, Jason Gunthorpe wrote: > On Thu, Sep 10, 2020 at 02:22:37PM -0700, John Hubbard wrote: > > > Or am I way off here, and it really is possible (aside from the current > > s390 situation) to observe something that "is no longer a page table"? > > Yes, that is the issue. Remember there is no locking for GUP > fast. While a page table cannot be freed there is nothing preventing > the page table entry from being concurrently modified. > > Without the stack variable it looks like this: > > pud_t pud = READ_ONCE(*pudp); > if (!pud_present(pud)) > return > pmd_offset(pudp, address); > > And pmd_offset() expands to > > return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address); > > Between the READ_ONCE(*pudp) and (*pud) inside pmd_offset() the value > of *pud can change, eg to !pud_present. > > Then pud_page_vaddr(*pud) will crash. It is not use after free, it > is using data that has not been validated. One thing I keep asking myself, and this is probably a good moment to wonder: what if the entry is still pud_present, but got remapped after READ_ONCE(*pudp)? 
IOW, it is still valid, but points elsewhere? > Jason ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-11 12:19 ` Alexander Gordeev @ 2020-09-11 16:45 ` Linus Torvalds 0 siblings, 0 replies; 62+ messages in thread From: Linus Torvalds @ 2020-09-11 16:45 UTC (permalink / raw) To: Alexander Gordeev Cc: Jason Gunthorpe, John Hubbard, Gerald Schaefer, Dave Hansen, LKML, linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Fri, Sep 11, 2020 at 5:20 AM Alexander Gordeev <agordeev@linux.ibm.com> wrote: > > What if the entry is still pud_present, but got remapped after > READ_ONCE(*pudp)? IOW, it is still valid, but points elsewhere? That can't happen. The GUP walk doesn't hold any locks, but it *is* done with interrupts disabled, and anybody who is modifying the page tables needs to do the TLB flush, and/or RCU-free them. The interrupt disable means that on architectures where the TLB flush involves an IPI, it will be delayed until afterwards, but it also acts as a big RCU read lock hammer. So the page tables can get modified under us, but the old pages won't be released and re-used. Linus ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-09 18:03 ` Jason Gunthorpe 2020-09-10 9:39 ` Alexander Gordeev @ 2020-09-10 13:11 ` Gerald Schaefer 1 sibling, 0 replies; 62+ messages in thread From: Gerald Schaefer @ 2020-09-10 13:11 UTC (permalink / raw) To: Jason Gunthorpe Cc: Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Wed, 9 Sep 2020 15:03:24 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Wed, Sep 09, 2020 at 07:25:34PM +0200, Gerald Schaefer wrote: > > I actually had to draw myself a picture to get some hold of > > this, or rather a walk-through with a certain pud-crossing > > range in a folded 3-level scenario. Not sure if I would have > > understood my explanation above w/o that, but I hope you can > > make some sense out of it. Or draw yourself a picture :-) > > What I don't understand is how does anything work with S390 today? That is totally comprehensible :-) > If the fix is only to change pxx_addr_end() then than generic code > like mm/pagewalk.c will iterate over a *different list* of page table > entries. > > It's choice of entries to look at is entirely driven by pxx_addr_end(). > > Which suggest to me that mm/pagewalk.c also doesn't work properly > today on S390 and this issue is not really about stack variables? I guess you are confused by the fact that the generic change will indeed change the logic for _all_ pagetable walkers on s390, not just for the gup_fast case. 
But that doesn't mean that they were doing it wrong before, we simply can do it both ways. However, we probably should make that (in theory useless) change more explicit. Let's compare before and after for mm/pagewalk.c on s390, with 3-level pagetables, range crossing 2 GB pud boundary. * Before (with pXd_addr_end always using static 5-level PxD_SIZE): walk_pgd_range() -> pgd_addr_end() will use static 2^53 PGDIR_SIZE, range is not cropped, no iterations needed, passed over to next level walk_p4d_range() -> p4d_addr_end() will use static 2^42 P4D_SIZE, range still not cropped walk_pud_range() -> pud_addr_end() now we're cropping, with 2^31 PUD_SIZE, need two iterations for range crossing pud boundary, doing that right here on a pudp which is actually the previously passed-through pgdp/p4dp (pointing to correct pagetable entry) * After (with dynamic pXd_addr_end using "correct" PxD_SIZE boundaries, should be similar to other archs static "top-level folding"): walk_pgd_range() -> pgd_addr_end() will now determine "correct" boundary based on pgd value, i.e. 2^31 PUD_SIZE, do cropping now, iteration will now happen here walk_p4d/pud_range() -> operate on cropped range, will not iterate, instead return to pgd level, which will then use the same pointer for iteration as in the "Before" case, but not on the same level. IMHO, our "Before" logic is more efficient, and also feels more natural. After all, it is not really necessary to return to pgd level, and it will surely cost some extra instructions. We are willing to take that cost for the sake of doing it in a more generic way, hoping that will reduce future issues. E.g. you already mentioned that you have plans for using the READ_ONCE logic also in other places, and that would be such a "future issue". > Fundamentally if pXX_offset() and pXX_addr_end() must be consistent > together, if pXX_offset() is folded then pXX_addr_end() must cause a > single iteration of that level. 
well, that sounds correct in theory, but I guess it depends on "how you fold it". E.g. what does "if pXX_offset() is folded" mean? Take pgd_offset() for the 3-level case above. From our previous "middle-level folding/iteration" perspective, I would say that pgd/p4d are folded into pud, so if you say "if pgd_offset() is folded then pgd_addr_end() must cause a single iteration of that level", we were doing it all correctly, i.e only having single iteration on pgd/p4d level. You could even say that all others are doing / using it wrong :-) Now take pgd_offset() from the "top-level folding/iteration". Here you would say that p4d/pud are folded into pgd, which again does not sound like the natural / most efficient way to me, but IIUC this has to be how it works for all other archs with (static) pagetable folding. Now you'd say "if pud/p4d_offset() is folded then pud/p4d_addr_end() must cause a single iteration of that level", and that would sound correct. At least until you look more closely, because e.g. p4d_addr_end() in include/asm-generic/pgtable-nop4d.h is simply this: #define p4d_addr_end(addr, end) (end) How can that cause a single iteration? It clearly won't, it only works because the previous pgd_addr_end already cropped the range so that there will be only single iterations for p4d/pud. The more I think of it, the more it sounds like s390 "middle-level folding/iteration" was doing it "the right way", and everybody else was wrong, or at least not in an optimally efficient way :-) Might also be that only we could do this because we can determine the pagetable level from a pagetable entry value. Anyway, if you are not yet confused enough, I recommend looking at the other option we had in mind, for fixing the gup_fast issue. See "Patch 1" from here: https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/ That would actually have kept that "middle-level iteration" also for gup_fast, by additionally passing through the pXd pointers. 
However, it also needed a gup-specific version of pXd_offset(), in order to keep the READ_ONCE semantics. For s390, that would have actually been the best solution, but a generic version of that might not have been so easy. And doing it like everybody else can not be so bad, at least I really hope so. Of course, at some point in time, we might come up with some fancy fundamental change that would "do it the right middle-level way for everybody". At least I think I overheard Vasily and Alexander discussing some wild ideas, but that is certainly beyond this scope here... ^ permalink raw reply [flat|nested] 62+ messages in thread
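The difference Gerald walks through between cropping at the top level and cropping at a folded-in middle level can be seen in miniature with two addr_end flavors. The sizes below are invented (0x100 instead of 2^31) just to keep the numbers small; folded_addr_end mirrors the generic `#define p4d_addr_end(addr, end) (end)`:

```c
#include <assert.h>

/* A folded level has no boundary of its own and never crops the range,
 * so the walker does a single pass at that level. */
static unsigned long folded_addr_end(unsigned long addr, unsigned long end)
{
	(void)addr;
	return end;		/* like p4d_addr_end() in pgtable-nop4d.h */
}

#define FAKE_PUD_SIZE 0x100UL
#define FAKE_PUD_MASK (~(FAKE_PUD_SIZE - 1))

/* A real level crops the range to its next boundary, forcing one
 * iteration per crossed boundary. */
static unsigned long pud_level_addr_end(unsigned long addr, unsigned long end)
{
	unsigned long next = (addr + FAKE_PUD_SIZE) & FAKE_PUD_MASK;

	return next < end ? next : end;
}

/* Count how many iterations a walker does at one level for [addr, end). */
static int count_iterations(unsigned long addr, unsigned long end,
			    unsigned long (*addr_end)(unsigned long, unsigned long))
{
	int n = 0;

	do {
		addr = addr_end(addr, end);
		n++;
	} while (addr != end);
	return n;
}
```

For a range crossing one "pud" boundary, the folding decision only moves *where* the two iterations happen: either the cropping level iterates twice, or an upper level that returns (end) passes through once and the cropping happens above it, as in the "Before"/"After" comparison.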
* [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware 2020-09-07 18:00 [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding Gerald Schaefer 2020-09-07 18:00 ` [RFC PATCH v2 1/3] " Gerald Schaefer @ 2020-09-07 18:00 ` Gerald Schaefer 2020-09-08 5:14 ` Christophe Leroy ` (2 more replies) 2020-09-07 18:00 ` [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions Gerald Schaefer ` (2 subsequent siblings) 4 siblings, 3 replies; 62+ messages in thread From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw) To: Jason Gunthorpe, John Hubbard Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda From: Alexander Gordeev <agordeev@linux.ibm.com> Unlike all other page-table abstractions, the pXd_addr_end() functions do not take into account the particular table entry in whose context they are called. On architectures with dynamic page-table folding, that can lead to a lack of necessary information that is difficult to obtain other than from the table entry itself. That already led to a subtle memory corruption issue on s390. By letting the pXd_addr_end() functions know about the page-table entry, we allow archs not only to make extra checks, but also optimizations. As a result of this change, the pXd_addr_end_folded() functions used in the gup_fast traversal code become unnecessary and get replaced with universal pXd_addr_end() variants. 
The arch-specific updates not only add dereferencing of page-table entry pointers, but also small changes to the code flow to make those dereferences possible, at least for x86 and powerpc. Also for arm64, but in a way that should not have any impact. So, even though the dereferenced page-table entries are not used on archs other than s390, and are optimized out by the compiler, there is a small change in kernel size and this is what bloat-o-meter reports: x86: add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10) Function old new delta vmemmap_populate 587 592 +5 munlock_vma_pages_range 556 561 +5 Total: Before=15534694, After=15534704, chg +0.00% powerpc: add/remove: 0/0 grow/shrink: 1/0 up/down: 4/0 (4) Function old new delta .remove_pagetable 1648 1652 +4 Total: Before=21478240, After=21478244, chg +0.00% arm64: add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0) Function old new delta Total: Before=20240851, After=20240851, chg +0.00% sparc: add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0) Function old new delta Total: Before=4907262, After=4907262, chg +0.00% Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> --- arch/arm/include/asm/pgtable-2level.h | 2 +- arch/arm/mm/idmap.c | 6 ++-- arch/arm/mm/mmu.c | 8 ++--- arch/arm64/kernel/hibernate.c | 16 ++++++---- arch/arm64/kvm/mmu.c | 16 +++++----- arch/arm64/mm/kasan_init.c | 8 ++--- arch/arm64/mm/mmu.c | 25 +++++++-------- arch/powerpc/mm/book3s64/radix_pgtable.c | 7 ++--- arch/powerpc/mm/hugetlbpage.c | 6 ++-- arch/s390/include/asm/pgtable.h | 8 ++--- arch/s390/mm/page-states.c | 8 ++--- arch/s390/mm/pageattr.c | 8 ++--- arch/s390/mm/vmem.c | 8 ++--- arch/sparc/mm/hugetlbpage.c | 6 ++-- arch/um/kernel/tlb.c | 8 ++--- arch/x86/mm/init_64.c | 15 ++++----- arch/x86/mm/kasan_init_64.c | 16 +++++----- include/asm-generic/pgtable-nop4d.h | 2 +- include/asm-generic/pgtable-nopmd.h | 2 +- include/asm-generic/pgtable-nopud.h | 2 +- include/linux/pgtable.h
| 26 ++++----------- mm/gup.c | 8 ++--- mm/ioremap.c | 8 ++--- mm/kasan/init.c | 17 +++++----- mm/madvise.c | 4 +-- mm/memory.c | 40 ++++++++++++------------ mm/mlock.c | 18 ++++++++--- mm/mprotect.c | 8 ++--- mm/pagewalk.c | 8 ++--- mm/swapfile.c | 8 ++--- mm/vmalloc.c | 16 +++++----- 31 files changed, 165 insertions(+), 173 deletions(-) diff --git a/arch/arm/include/asm/pgtable-2level.h b/arch/arm/include/asm/pgtable-2level.h index 3502c2f746ca..5e6416b339f4 100644 --- a/arch/arm/include/asm/pgtable-2level.h +++ b/arch/arm/include/asm/pgtable-2level.h @@ -209,7 +209,7 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr) } while (0) /* we don't need complex calculations here as the pmd is folded into the pgd */ -#define pmd_addr_end(addr,end) (end) +#define pmd_addr_end(pmd,addr,end) (end) #define set_pte_ext(ptep,pte,ext) cpu_set_pte_ext(ptep,pte,ext) diff --git a/arch/arm/mm/idmap.c b/arch/arm/mm/idmap.c index 448e57c6f653..5437f943ca8b 100644 --- a/arch/arm/mm/idmap.c +++ b/arch/arm/mm/idmap.c @@ -46,7 +46,7 @@ static void idmap_add_pmd(pud_t *pud, unsigned long addr, unsigned long end, pmd = pmd_offset(pud, addr); do { - next = pmd_addr_end(addr, end); + next = pmd_addr_end(*pmd, addr, end); *pmd = __pmd((addr & PMD_MASK) | prot); flush_pmd_entry(pmd); } while (pmd++, addr = next, addr != end); @@ -73,7 +73,7 @@ static void idmap_add_pud(pgd_t *pgd, unsigned long addr, unsigned long end, unsigned long next; do { - next = pud_addr_end(addr, end); + next = pud_addr_end(*pud, addr, end); idmap_add_pmd(pud, addr, next, prot); } while (pud++, addr = next, addr != end); } @@ -95,7 +95,7 @@ static void identity_mapping_add(pgd_t *pgd, const char *text_start, pgd += pgd_index(addr); do { - next = pgd_addr_end(addr, end); + next = pgd_addr_end(*pgd, addr, end); idmap_add_pud(pgd, addr, next, prot); } while (pgd++, addr = next, addr != end); } diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c index 698cc740c6b8..4013746e4c75 100644 --- a/arch/arm/mm/mmu.c 
+++ b/arch/arm/mm/mmu.c @@ -777,7 +777,7 @@ static void __init alloc_init_pmd(pud_t *pud, unsigned long addr, * With LPAE, we must loop over to map * all the pmds for the given range. */ - next = pmd_addr_end(addr, end); + next = pmd_addr_end(*pmd, addr, end); /* * Try a section mapping - addr, next and phys must all be @@ -805,7 +805,7 @@ static void __init alloc_init_pud(p4d_t *p4d, unsigned long addr, unsigned long next; do { - next = pud_addr_end(addr, end); + next = pud_addr_end(*pud, addr, end); alloc_init_pmd(pud, addr, next, phys, type, alloc, ng); phys += next - addr; } while (pud++, addr = next, addr != end); @@ -820,7 +820,7 @@ static void __init alloc_init_p4d(pgd_t *pgd, unsigned long addr, unsigned long next; do { - next = p4d_addr_end(addr, end); + next = p4d_addr_end(*p4d, addr, end); alloc_init_pud(p4d, addr, next, phys, type, alloc, ng); phys += next - addr; } while (p4d++, addr = next, addr != end); @@ -923,7 +923,7 @@ static void __init __create_mapping(struct mm_struct *mm, struct map_desc *md, pgd = pgd_offset(mm, addr); end = addr + length; do { - unsigned long next = pgd_addr_end(addr, end); + unsigned long next = pgd_addr_end(*pgd, addr, end); alloc_init_p4d(pgd, addr, next, phys, type, alloc, ng); diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c index 68e14152d6e9..7be8c9cdc5c8 100644 --- a/arch/arm64/kernel/hibernate.c +++ b/arch/arm64/kernel/hibernate.c @@ -412,7 +412,7 @@ static int copy_pmd(pud_t *dst_pudp, pud_t *src_pudp, unsigned long start, do { pmd_t pmd = READ_ONCE(*src_pmdp); - next = pmd_addr_end(addr, end); + next = pmd_addr_end(pmd, addr, end); if (pmd_none(pmd)) continue; if (pmd_table(pmd)) { @@ -447,7 +447,7 @@ static int copy_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp, unsigned long start, do { pud_t pud = READ_ONCE(*src_pudp); - next = pud_addr_end(addr, end); + next = pud_addr_end(pud, addr, end); if (pud_none(pud)) continue; if (pud_table(pud)) { @@ -473,8 +473,10 @@ static int copy_p4d(pgd_t 
*dst_pgdp, pgd_t *src_pgdp, unsigned long start, dst_p4dp = p4d_offset(dst_pgdp, start); src_p4dp = p4d_offset(src_pgdp, start); do { - next = p4d_addr_end(addr, end); - if (p4d_none(READ_ONCE(*src_p4dp))) + p4d_t p4d = READ_ONCE(*src_p4dp); + + next = p4d_addr_end(p4d, addr, end); + if (p4d_none(p4d)) continue; if (copy_pud(dst_p4dp, src_p4dp, addr, next)) return -ENOMEM; @@ -492,8 +494,10 @@ static int copy_page_tables(pgd_t *dst_pgdp, unsigned long start, dst_pgdp = pgd_offset_pgd(dst_pgdp, start); do { - next = pgd_addr_end(addr, end); - if (pgd_none(READ_ONCE(*src_pgdp))) + pgd_t pgd = READ_ONCE(*src_pgdp); + + next = pgd_addr_end(pgd, addr, end); + if (pgd_none(pgd)) continue; if (copy_p4d(dst_pgdp, src_pgdp, addr, next)) return -ENOMEM; diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index ba00bcc0c884..8f470f93a8e9 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -547,7 +547,7 @@ static void unmap_hyp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end) start_pmd = pmd = pmd_offset(pud, addr); do { - next = pmd_addr_end(addr, end); + next = pmd_addr_end(*pmd, addr, end); /* Hyp doesn't use huge pmds */ if (!pmd_none(*pmd)) unmap_hyp_ptes(pmd, addr, next); @@ -564,7 +564,7 @@ static void unmap_hyp_puds(p4d_t *p4d, phys_addr_t addr, phys_addr_t end) start_pud = pud = pud_offset(p4d, addr); do { - next = pud_addr_end(addr, end); + next = pud_addr_end(*pud, addr, end); /* Hyp doesn't use huge puds */ if (!pud_none(*pud)) unmap_hyp_pmds(pud, addr, next); @@ -581,7 +581,7 @@ static void unmap_hyp_p4ds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end) start_p4d = p4d = p4d_offset(pgd, addr); do { - next = p4d_addr_end(addr, end); + next = p4d_addr_end(*p4d, addr, end); /* Hyp doesn't use huge p4ds */ if (!p4d_none(*p4d)) unmap_hyp_puds(p4d, addr, next); @@ -609,7 +609,7 @@ static void __unmap_hyp_range(pgd_t *pgdp, unsigned long ptrs_per_pgd, */ pgd = pgdp + kvm_pgd_index(addr, ptrs_per_pgd); do { - next = pgd_addr_end(addr, end); + next = 
pgd_addr_end(*pgd, addr, end); if (!pgd_none(*pgd)) unmap_hyp_p4ds(pgd, addr, next); } while (pgd++, addr = next, addr != end); @@ -712,7 +712,7 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start, get_page(virt_to_page(pmd)); } - next = pmd_addr_end(addr, end); + next = pmd_addr_end(*pmd, addr, end); create_hyp_pte_mappings(pmd, addr, next, pfn, prot); pfn += (next - addr) >> PAGE_SHIFT; @@ -744,7 +744,7 @@ static int create_hyp_pud_mappings(p4d_t *p4d, unsigned long start, get_page(virt_to_page(pud)); } - next = pud_addr_end(addr, end); + next = pud_addr_end(*pud, addr, end); ret = create_hyp_pmd_mappings(pud, addr, next, pfn, prot); if (ret) return ret; @@ -777,7 +777,7 @@ static int create_hyp_p4d_mappings(pgd_t *pgd, unsigned long start, get_page(virt_to_page(p4d)); } - next = p4d_addr_end(addr, end); + next = p4d_addr_end(*p4d, addr, end); ret = create_hyp_pud_mappings(p4d, addr, next, pfn, prot); if (ret) return ret; @@ -813,7 +813,7 @@ static int __create_hyp_mappings(pgd_t *pgdp, unsigned long ptrs_per_pgd, get_page(virt_to_page(pgd)); } - next = pgd_addr_end(addr, end); + next = pgd_addr_end(*pgd, addr, end); err = create_hyp_p4d_mappings(pgd, addr, next, pfn, prot); if (err) goto out; diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c index b24e43d20667..8d1c811fd59e 100644 --- a/arch/arm64/mm/kasan_init.c +++ b/arch/arm64/mm/kasan_init.c @@ -120,7 +120,7 @@ static void __init kasan_pmd_populate(pud_t *pudp, unsigned long addr, pmd_t *pmdp = kasan_pmd_offset(pudp, addr, node, early); do { - next = pmd_addr_end(addr, end); + next = pmd_addr_end(*pmdp, addr, end); kasan_pte_populate(pmdp, addr, next, node, early); } while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp))); } @@ -132,7 +132,7 @@ static void __init kasan_pud_populate(p4d_t *p4dp, unsigned long addr, pud_t *pudp = kasan_pud_offset(p4dp, addr, node, early); do { - next = pud_addr_end(addr, end); + next = pud_addr_end(*pudp, addr, end); 
 		kasan_pmd_populate(pudp, addr, next, node, early);
 	} while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)));
 }
@@ -144,7 +144,7 @@ static void __init kasan_p4d_populate(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp = p4d_offset(pgdp, addr);
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		kasan_pud_populate(p4dp, addr, next, node, early);
 	} while (p4dp++, addr = next, addr != end);
 }
@@ -157,7 +157,7 @@ static void __init kasan_pgd_populate(unsigned long addr, unsigned long end,
 	pgdp = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		kasan_p4d_populate(pgdp, addr, next, node, early);
 	} while (pgdp++, addr = next, addr != end);
 }
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 64211436629d..d679cf024bc8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -209,7 +209,7 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		pmd_t old_pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(old_pmd, addr, end);
 
 		/* try section mapping first */
 		if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
@@ -307,7 +307,7 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
 	do {
 		pud_t old_pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(old_pud, addr, end);
 
 		/*
 		 * For 4K granule only, attempt to put down a 1GB block
@@ -356,7 +356,7 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 	end = PAGE_ALIGN(virt + size);
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		alloc_init_pud(pgdp, addr, next, phys, prot, pgtable_alloc,
 			       flags);
 		phys += next - addr;
@@ -820,9 +820,9 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp, pmd;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
@@ -853,9 +853,9 @@ static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp, pud;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
@@ -886,9 +886,9 @@ static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
@@ -912,9 +912,9 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end,
 	WARN_ON(!free_mapped && altmap);
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
@@ -968,9 +968,9 @@ static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
@@ -1008,9 +1008,9 @@ static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
@@ -1048,9 +1048,9 @@ static void free_empty_p4d_table(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
@@ -1066,9 +1066,9 @@ static void free_empty_tables(unsigned long addr, unsigned long end,
 	pgd_t *pgdp, pgd;
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
@@ -1097,8 +1097,6 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	pmd_t *pmdp;
 
 	do {
-		next = pmd_addr_end(addr, end);
-
 		pgdp = vmemmap_pgd_populate(addr, node);
 		if (!pgdp)
 			return -ENOMEM;
@@ -1112,6 +1110,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 			return -ENOMEM;
 
 		pmdp = pmd_offset(pudp, addr);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_none(READ_ONCE(*pmdp))) {
 			void *p = NULL;
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index cc72666e891a..816e218df285 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -817,7 +817,7 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -847,7 +847,7 @@ static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -878,10 +878,9 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
 	spin_lock(&init_mm.page_table_lock);
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!p4d_present(*p4d))
 			continue;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26292544630f..f0606d6774a4 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -352,7 +352,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 		unsigned long more;
 
 		pmd = pmd_offset(pud, addr);
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!is_hugepd(__hugepd(pmd_val(*pmd)))) {
 			if (pmd_none_or_clear_bad(pmd))
 				continue;
@@ -409,7 +409,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	do {
 		pud = pud_offset(p4d, addr);
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!is_hugepd(__hugepd(pud_val(*pud)))) {
 			if (pud_none_or_clear_bad(pud))
 				continue;
@@ -478,9 +478,9 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	 */
 	do {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset(tlb->mm, addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!is_hugepd(__hugepd(pgd_val(*pgd)))) {
 			if (p4d_none_or_clear_bad(p4d))
 				continue;
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 027206e4959d..6fb17ac413be 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -542,14 +542,14 @@ static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned lo
 	return (boundary - 1) < (end - 1) ? boundary : end;
 }
 
-#define pgd_addr_end_folded pgd_addr_end_folded
-static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
+#define pgd_addr_end pgd_addr_end
+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(pgd_val(pgd), addr, end);
 }
 
-#define p4d_addr_end_folded p4d_addr_end_folded
-static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
+#define p4d_addr_end p4d_addr_end
+static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(p4d_val(p4d), addr, end);
 }
diff --git a/arch/s390/mm/page-states.c b/arch/s390/mm/page-states.c
index 567c69f3069e..4aba634b4b26 100644
--- a/arch/s390/mm/page-states.c
+++ b/arch/s390/mm/page-states.c
@@ -109,7 +109,7 @@ static void mark_kernel_pmd(pud_t *pud, unsigned long addr, unsigned long end)
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || pmd_large(*pmd))
 			continue;
 		page = virt_to_page(pmd_val(*pmd));
@@ -126,7 +126,7 @@ static void mark_kernel_pud(p4d_t *p4d, unsigned long addr, unsigned long end)
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || pud_large(*pud))
 			continue;
 		if (!pud_folded(*pud)) {
@@ -147,7 +147,7 @@ static void mark_kernel_p4d(pgd_t *pgd, unsigned long addr, unsigned long end)
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none(*p4d))
 			continue;
 		if (!p4d_folded(*p4d)) {
@@ -169,7 +169,7 @@ static void mark_kernel_pgd(void)
 	addr = 0;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, MODULES_END);
+		next = pgd_addr_end(*pgd, addr, MODULES_END);
 		if (pgd_none(*pgd))
 			continue;
 		if (!pgd_folded(*pgd)) {
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index c5c52ec2b46f..b827d758a17a 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -162,7 +162,7 @@ static int walk_pmd_level(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		if (pmd_none(*pmdp))
 			return -EINVAL;
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_large(*pmdp)) {
 			if (addr & ~PMD_MASK || addr + PMD_SIZE > next) {
 				rc = split_pmd_page(pmdp, addr);
@@ -239,7 +239,7 @@ static int walk_pud_level(p4d_t *p4d, unsigned long addr, unsigned long end,
 	do {
 		if (pud_none(*pudp))
 			return -EINVAL;
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		if (pud_large(*pudp)) {
 			if (addr & ~PUD_MASK || addr + PUD_SIZE > next) {
 				rc = split_pud_page(pudp, addr);
@@ -269,7 +269,7 @@ static int walk_p4d_level(pgd_t *pgd, unsigned long addr, unsigned long end,
 	do {
 		if (p4d_none(*p4dp))
 			return -EINVAL;
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		rc = walk_pud_level(p4dp, addr, next, flags);
 		p4dp++;
 		addr = next;
@@ -296,7 +296,7 @@ static int change_page_attr(unsigned long addr, unsigned long end,
 	do {
 		if (pgd_none(*pgdp))
 			break;
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		rc = walk_p4d_level(pgdp, addr, next, flags);
 		if (rc)
 			break;
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index b239f2ba93b0..672bc89f13e7 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -219,7 +219,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 	pmd = pmd_offset(pud, addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!add) {
 			if (pmd_none(*pmd))
 				continue;
@@ -320,7 +320,7 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 		prot &= ~_REGION_ENTRY_NOEXEC;
 	pud = pud_offset(p4d, addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!add) {
 			if (pud_none(*pud))
 				continue;
@@ -394,7 +394,7 @@ static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
 	p4d = p4d_offset(pgd, addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!add) {
 			if (p4d_none(*p4d))
 				continue;
@@ -449,8 +449,8 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
 	if (WARN_ON_ONCE(!PAGE_ALIGNED(start | end)))
 		return -EINVAL;
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (!add) {
 			if (pgd_none(*pgd))
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ec423b5f17dd..341c2ff8d31a 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -428,7 +428,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd))
 			continue;
 		if (is_hugetlb_pmd(*pmd))
@@ -465,7 +465,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		if (is_hugetlb_pud(*pud))
@@ -519,7 +519,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	pgd = pgd_offset(tlb->mm, addr);
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		hugetlb_free_pud_range(tlb, p4d, addr, next, floor, ceiling);
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index 61776790cd67..7b4fe31c8df2 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -264,7 +264,7 @@ static inline int update_pmd_range(pud_t *pud, unsigned long addr,
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_present(*pmd)) {
 			if (hvc->force || pmd_newpage(*pmd)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -286,7 +286,7 @@ static inline int update_pud_range(p4d_t *p4d, unsigned long addr,
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_present(*pud)) {
 			if (hvc->force || pud_newpage(*pud)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -308,7 +308,7 @@ static inline int update_p4d_range(pgd_t *pgd, unsigned long addr,
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!p4d_present(*p4d)) {
 			if (hvc->force || p4d_newpage(*p4d)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -331,7 +331,7 @@ void fix_range_common(struct mm_struct *mm, unsigned long start_addr,
 	hvc = INIT_HVC(mm, force, userspace);
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end_addr);
+		next = pgd_addr_end(*pgd, addr, end_addr);
 		if (!pgd_present(*pgd)) {
 			if (force || pgd_newpage(*pgd)) {
 				ret = add_munmap(addr, next - addr, &hvc);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a4ac13cc3fdc..e2cb9316a104 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1043,7 +1043,7 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -1099,7 +1099,7 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -1153,7 +1153,7 @@ remove_p4d_table(p4d_t *p4d_start, unsigned long addr, unsigned long end,
 	p4d = p4d_start + p4d_index(addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -1186,9 +1186,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct,
 	p4d_t *p4d;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
@@ -1500,8 +1499,6 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 	pmd_t *pmd;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pmd_addr_end(addr, end);
-
 		pgd = vmemmap_pgd_populate(addr, node);
 		if (!pgd)
 			return -ENOMEM;
@@ -1515,6 +1512,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 			return -ENOMEM;
 
 		pmd = pmd_offset(pud, addr);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd)) {
 			void *p;
@@ -1623,9 +1621,8 @@ void register_page_bootmem_memmap(unsigned long section_nr,
 			get_page_bootmem(section_nr, pte_page(*pte),
 					 SECTION_INFO);
 		} else {
-			next = pmd_addr_end(addr, end);
-
 			pmd = pmd_offset(pud, addr);
+			next = pmd_addr_end(*pmd, addr, end);
 			if (pmd_none(*pmd))
 				continue;
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 1a50434c8a4d..2c105b5154ba 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -96,7 +96,7 @@ static void __init kasan_populate_pud(pud_t *pud, unsigned long addr,
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_large(*pmd))
 			kasan_populate_pmd(pmd, addr, next, nid);
 	} while (pmd++, addr = next, addr != end);
@@ -116,7 +116,7 @@ static void __init kasan_populate_p4d(p4d_t *p4d, unsigned long addr,
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_large(*pud))
 			kasan_populate_pud(pud, addr, next, nid);
 	} while (pud++, addr = next, addr != end);
@@ -136,7 +136,7 @@ static void __init kasan_populate_pgd(pgd_t *pgd, unsigned long addr,
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		kasan_populate_p4d(p4d, addr, next, nid);
 	} while (p4d++, addr = next, addr != end);
 }
@@ -151,7 +151,7 @@ static void __init kasan_populate_shadow(unsigned long addr, unsigned long end,
 	end = round_up(end, PAGE_SIZE);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_populate_pgd(pgd, addr, next, nid);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -219,7 +219,7 @@ static void __init kasan_early_p4d_populate(pgd_t *pgd,
 	p4d = early_p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_none(*p4d))
 			continue;
@@ -239,7 +239,7 @@ static void __init kasan_map_early_shadow(pgd_t *pgd)
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_early_p4d_populate(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -254,7 +254,7 @@ static void __init kasan_shallow_populate_p4ds(pgd_t *pgd,
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (p4d_none(*p4d)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
@@ -272,7 +272,7 @@ static void __init kasan_shallow_populate_pgds(void *start, void *end)
 	addr = (unsigned long)start;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, (unsigned long)end);
+		next = pgd_addr_end(*pgd, addr, (unsigned long)end);
 
 		if (pgd_none(*pgd)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
diff --git a/include/asm-generic/pgtable-nop4d.h b/include/asm-generic/pgtable-nop4d.h
index ce2cbb3c380f..156b42e51424 100644
--- a/include/asm-generic/pgtable-nop4d.h
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -53,7 +53,7 @@ static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
 #define p4d_free_tlb(tlb, x, a)		do { } while (0)
 
 #undef  p4d_addr_end
-#define p4d_addr_end(addr, end)			(end)
+#define p4d_addr_end(p4d, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOP4D_H */
diff --git a/include/asm-generic/pgtable-nopmd.h b/include/asm-generic/pgtable-nopmd.h
index 3e13acd019ae..e988384de1c7 100644
--- a/include/asm-generic/pgtable-nopmd.h
+++ b/include/asm-generic/pgtable-nopmd.h
@@ -64,7 +64,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 #define pmd_free_tlb(tlb, x, a)		do { } while (0)
 
 #undef  pmd_addr_end
-#define pmd_addr_end(addr, end)		(end)
+#define pmd_addr_end(pmd, addr, end)	(end)
 
 #endif /* __ASSEMBLY__ */
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index a9d751fbda9e..57a28bade9f9 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -60,7 +60,7 @@ static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
 #define pud_free_tlb(tlb, x, a)		do { } while (0)
 
 #undef  pud_addr_end
-#define pud_addr_end(addr, end)		(end)
+#define pud_addr_end(pud, addr, end)	(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOPUD_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 981c4c2a31fe..67ebc22cf83d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -655,48 +655,34 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
  */
 
-#define pgd_addr_end(addr, end)						\
+#ifndef pgd_addr_end
+#define pgd_addr_end(pgd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
+#endif
 
 #ifndef p4d_addr_end
-#define p4d_addr_end(addr, end)						\
+#define p4d_addr_end(p4d, addr, end)					\
 ({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pud_addr_end
-#define pud_addr_end(addr, end)						\
+#define pud_addr_end(pud, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pmd_addr_end
-#define pmd_addr_end(addr, end)						\
+#define pmd_addr_end(pmd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
-#ifndef pgd_addr_end_folded
-#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
-#endif
-
-#ifndef p4d_addr_end_folded
-#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
-#endif
-
-#ifndef pud_addr_end_folded
-#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
-#endif
-
-#ifndef pmd_addr_end_folded
-#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
-#endif
-
 /*
  * When walking page tables, we usually want to skip any p?d_none entries;
  * and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/gup.c b/mm/gup.c
index ba4aace5d0f4..7826876ae7e0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end_folded(pmd, addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (!pmd_present(pmd))
 			return 0;
@@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end_folded(pud, addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
@@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
-		next = p4d_addr_end_folded(p4d, addr, end);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			return 0;
 		BUILD_BUG_ON(p4d_huge(p4d));
@@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		pgd_t pgd = READ_ONCE(*pgdp);
 
-		next = pgd_addr_end_folded(pgd, addr, end);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			return;
 		if (unlikely(pgd_huge(pgd))) {
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..400fa119c09d 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -114,7 +114,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PMD_MODIFIED;
@@ -160,7 +160,7 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PUD_MODIFIED;
@@ -206,7 +206,7 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_P4D_MODIFIED;
@@ -234,7 +234,7 @@ int ioremap_page_range(unsigned long addr,
 	start = addr;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot,
 					&mask);
 		if (err)
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index fe6be0be1f76..829627a92763 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -117,7 +117,7 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (IS_ALIGNED(addr, PMD_SIZE) && end - addr >= PMD_SIZE) {
 			pmd_populate_kernel(&init_mm, pmd,
@@ -150,7 +150,7 @@ static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (IS_ALIGNED(addr, PUD_SIZE) && end - addr >= PUD_SIZE) {
 			pmd_t *pmd;
@@ -187,7 +187,7 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (IS_ALIGNED(addr, P4D_SIZE) && end - addr >= P4D_SIZE) {
 			pud_t *pud;
 			pmd_t *pmd;
@@ -236,7 +236,7 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
 	unsigned long next;
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (IS_ALIGNED(addr, PGDIR_SIZE) && end - addr >= PGDIR_SIZE) {
 			p4d_t *p4d;
@@ -370,7 +370,7 @@ static void kasan_remove_pmd_table(pmd_t *pmd, unsigned long addr,
 	for (; addr < end; addr = next, pmd++) {
 		pte_t *pte;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -395,7 +395,7 @@ static void kasan_remove_pud_table(pud_t *pud, unsigned long addr,
 	for (; addr < end; addr = next, pud++) {
 		pmd_t *pmd, *pmd_base;
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -421,7 +421,7 @@ static void kasan_remove_p4d_table(p4d_t *p4d, unsigned long addr,
 	for (; addr < end; addr = next, p4d++) {
 		pud_t *pud;
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -454,9 +454,8 @@ void kasan_remove_zero_shadow(void *start, unsigned long size)
 	for (; addr < end; addr = next) {
 		p4d_t *p4d;
 
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
diff --git a/mm/madvise.c b/mm/madvise.c
index e32e7efbba0f..acfb3441d97e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -326,7 +326,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (pmd_trans_huge(*pmd)) {
 		pmd_t orig_pmd;
-		unsigned long next = pmd_addr_end(addr, end);
+		unsigned long next = pmd_addr_end(*pmd, addr, end);
 
 		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 		ptl = pmd_trans_huge_lock(pmd, vma);
@@ -587,7 +587,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	int nr_swap = 0;
 	unsigned long next;
 
-	next = pmd_addr_end(addr, end);
+	next = pmd_addr_end(*pmd, addr, end);
 	if (pmd_trans_huge(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
diff --git a/mm/memory.c b/mm/memory.c
index fb5463153351..f95424946b0d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -233,7 +233,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		free_pte_range(tlb, pmd, addr);
@@ -267,7 +267,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		free_pmd_range(tlb, pud, addr, next, floor, ceiling);
@@ -301,7 +301,7 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
 	start = addr;
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		free_pud_range(tlb, p4d, addr, next, floor, ceiling);
@@ -381,7 +381,7 @@ void free_pgd_range(struct mmu_gather *tlb,
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	pgd = pgd_offset(tlb->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		free_p4d_range(tlb, pgd, addr, next, floor, ceiling);
@@ -887,7 +887,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*src_pmd, addr, end);
 		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
 			|| pmd_devmap(*src_pmd)) {
 			int err;
@@ -921,7 +921,7 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pud = pud_offset(src_p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*src_pud, addr, end);
 		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
 			int err;
@@ -955,7 +955,7 @@ static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_p4d = p4d_offset(src_pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*src_p4d, addr, end);
 		if (p4d_none_or_clear_bad(src_p4d))
 			continue;
 		if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
@@ -1017,7 +1017,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*src_pgd, addr, end);
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
@@ -1177,7 +1177,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
@@ -1212,7 +1212,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
 			if (next - addr != HPAGE_PUD_SIZE) {
 				mmap_assert_locked(tlb->mm);
@@ -1241,7 +1241,7 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		next = zap_pud_range(tlb, vma, p4d, addr, next, details);
@@ -1262,7 +1262,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
@@ -2030,7 +2030,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 		return -ENOMEM;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		err = remap_pte_range(mm, pmd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2052,7 +2052,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		err = remap_pmd_range(mm, pud, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2074,7 +2074,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		err = remap_pud_range(mm, p4d, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2143,7 +2143,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = remap_p4d_range(mm, pgd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2266,7 +2266,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 		pmd = pmd_offset(pud, addr);
 	}
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (create || !pmd_none_or_clear_bad(pmd)) {
 			err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
 						 create, mask);
@@ -2294,7 +2294,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 		pud = pud_offset(p4d, addr);
 	}
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (create || !pud_none_or_clear_bad(pud)) {
 			err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
 						 create, mask);
@@ -2322,7 +2322,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 		p4d = p4d_offset(pgd, addr);
 	}
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (create || !p4d_none_or_clear_bad(p4d)) {
 			err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
 						 create, mask);
@@ -2348,7 +2348,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!create && pgd_none_or_clear_bad(pgd))
 			continue;
 		err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create, &mask);
diff --git a/mm/mlock.c b/mm/mlock.c
index 93ca2bf30b4f..5898e8fe2288 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -374,8 +374,12 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 		struct vm_area_struct *vma, struct zone *zone,
 		unsigned long start, unsigned long end)
 {
-	pte_t *pte;
 	spinlock_t *ptl;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
 
 	/*
 	 * Initialize pte walk starting at the already pinned page where we
@@ -384,10 +388,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 	 */
 	pte = get_locked_pte(vma->vm_mm, start, &ptl);
 	/* Make sure we do not cross the page table boundary */
-	end = pgd_addr_end(start, end);
-	end = p4d_addr_end(start, end);
-	end = pud_addr_end(start, end);
-	end = pmd_addr_end(start, end);
+	pgd = pgd_offset(vma->vm_mm, start);
+	end = pgd_addr_end(*pgd, start, end);
+	p4d = p4d_offset(pgd, start);
+	end = p4d_addr_end(*p4d, start, end);
+	pud = pud_offset(p4d, start);
+	end = pud_addr_end(*pud, start, end);
+	pmd = pmd_offset(pud, start);
+	end = pmd_addr_end(*pmd, start, end);
 
 	/* The page next to the pinned page is the first we will try to get */
 	start += PAGE_SIZE;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..278f2dbd1f20 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -225,7 +225,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	do {
 		unsigned long this_pages;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
@@ -291,7 +291,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
@@ -311,7 +311,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
@@ -336,7 +336,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	flush_cache_range(vma, addr, end);
 	inc_tlb_flush_pending(mm);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..a5b9f61b5d45 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -70,7 +70,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	pmd = pmd_offset(pud, addr);
 	do {
 again:
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -128,7 +128,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 	pud = pud_offset(p4d, addr);
 	do {
 again:
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -176,7 +176,7 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -211,7 +211,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
 	else
 		pgd = pgd_offset(walk->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, 0, walk);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 20012c0c0252..b1dd815aee6b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2018,7 +2018,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	pmd = pmd_offset(pud, addr);
 	do {
 		cond_resched();
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type,
@@ -2040,7 +2040,7 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		ret = unuse_pmd_range(vma, pud, addr, next, type,
@@ -2062,7 +2062,7 @@ static inline int unuse_p4d_range(struct vm_area_struct *vma, pgd_t *pgd,
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		ret = unuse_pud_range(vma, p4d, addr, next, type,
@@ -2085,7 +2085,7 @@ static int unuse_vma(struct vm_area_struct *vma, unsigned int type,
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		ret = unuse_p4d_range(vma, pgd, addr, next, type,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index be4724b916b3..09ff0d5ecbc1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -93,7 +93,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		cleared = pmd_clear_huge(pmd);
 		if (cleared || pmd_bad(*pmd))
@@ -118,7 +118,7 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		cleared = pud_clear_huge(pud);
 		if (cleared || pud_bad(*pud))
@@ -141,7 +141,7 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		cleared = p4d_clear_huge(p4d);
 		if (cleared || p4d_bad(*p4d))
@@ -179,7 +179,7 @@ void
unmap_kernel_range_noflush(unsigned long start, unsigned long size) BUG_ON(addr >= end); pgd = pgd_offset_k(addr); do { - next = pgd_addr_end(addr, end); + next = pgd_addr_end(*pgd, addr, end); if (pgd_bad(*pgd)) mask |= PGTBL_PGD_MODIFIED; if (pgd_none_or_clear_bad(pgd)) @@ -230,7 +230,7 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr, if (!pmd) return -ENOMEM; do { - next = pmd_addr_end(addr, end); + next = pmd_addr_end(*pmd, addr, end); if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask)) return -ENOMEM; } while (pmd++, addr = next, addr != end); @@ -248,7 +248,7 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr, if (!pud) return -ENOMEM; do { - next = pud_addr_end(addr, end); + next = pud_addr_end(*pud, addr, end); if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask)) return -ENOMEM; } while (pud++, addr = next, addr != end); @@ -266,7 +266,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, if (!p4d) return -ENOMEM; do { - next = p4d_addr_end(addr, end); + next = p4d_addr_end(*p4d, addr, end); if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask)) return -ENOMEM; } while (p4d++, addr = next, addr != end); @@ -305,7 +305,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size, BUG_ON(addr >= end); pgd = pgd_offset_k(addr); do { - next = pgd_addr_end(addr, end); + next = pgd_addr_end(*pgd, addr, end); if (pgd_bad(*pgd)) mask |= PGTBL_PGD_MODIFIED; err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask); -- 2.17.1 ^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware 2020-09-07 18:00 ` [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware Gerald Schaefer @ 2020-09-08 5:14 ` Christophe Leroy 2020-09-08 7:46 ` Alexander Gordeev 2020-09-08 14:25 ` Alexander Gordeev 2020-09-08 13:26 ` Jason Gunthorpe 2020-09-08 14:33 ` Dave Hansen 2 siblings, 2 replies; 62+ messages in thread From: Christophe Leroy @ 2020-09-08 5:14 UTC (permalink / raw) To: Gerald Schaefer, Jason Gunthorpe, John Hubbard Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon, linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86, Russell King, Christian Borntraeger, Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds, Mike Rapoport Le 07/09/2020 à 20:00, Gerald Schaefer a écrit : > From: Alexander Gordeev <agordeev@linux.ibm.com> > > Unlike all other page-table abstractions pXd_addr_end() do not take > into account a particular table entry in which context the functions > are called. On architectures with dynamic page-tables folding that > might lead to lack of necessary information that is difficult to > obtain other than from the table entry itself. That already led to > a subtle memory corruption issue on s390. > > By letting pXd_addr_end() functions know about the page-table entry > we allow archs not only make extra checks, but also optimizations. > > As result of this change the pXd_addr_end_folded() functions used > in gup_fast traversal code become unnecessary and get replaced with > universal pXd_addr_end() variants. > > The arch-specific updates not only add dereferencing of page-table > entry pointers, but also small changes to the code flow to make those > dereferences possible, at least for x86 and powerpc. 
Also for arm64, > but in way that should not have any impact. > [...] > > Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> > Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> > --- > arch/arm/include/asm/pgtable-2level.h | 2 +- > arch/arm/mm/idmap.c | 6 ++-- > arch/arm/mm/mmu.c | 8 ++--- > arch/arm64/kernel/hibernate.c | 16 ++++++---- > arch/arm64/kvm/mmu.c | 16 +++++----- > arch/arm64/mm/kasan_init.c | 8 ++--- > arch/arm64/mm/mmu.c | 25 +++++++-------- > arch/powerpc/mm/book3s64/radix_pgtable.c | 7 ++--- > arch/powerpc/mm/hugetlbpage.c | 6 ++-- You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems. > arch/s390/include/asm/pgtable.h | 8 ++--- > arch/s390/mm/page-states.c | 8 ++--- > arch/s390/mm/pageattr.c | 8 ++--- > arch/s390/mm/vmem.c | 8 ++--- > arch/sparc/mm/hugetlbpage.c | 6 ++-- > arch/um/kernel/tlb.c | 8 ++--- > arch/x86/mm/init_64.c | 15 ++++----- > arch/x86/mm/kasan_init_64.c | 16 +++++----- > include/asm-generic/pgtable-nop4d.h | 2 +- > include/asm-generic/pgtable-nopmd.h | 2 +- > include/asm-generic/pgtable-nopud.h | 2 +- > include/linux/pgtable.h | 26 ++++----------- > mm/gup.c | 8 ++--- > mm/ioremap.c | 8 ++--- > mm/kasan/init.c | 17 +++++----- > mm/madvise.c | 4 +-- > mm/memory.c | 40 ++++++++++++------------ > mm/mlock.c | 18 ++++++++--- > mm/mprotect.c | 8 ++--- > mm/pagewalk.c | 8 ++--- > mm/swapfile.c | 8 ++--- > mm/vmalloc.c | 16 +++++----- > 31 files changed, 165 insertions(+), 173 deletions(-) Christophe ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware 2020-09-08 5:14 ` Christophe Leroy @ 2020-09-08 7:46 ` Alexander Gordeev 2020-09-08 8:16 ` Christophe Leroy 2020-09-08 14:25 ` Alexander Gordeev 1 sibling, 1 reply; 62+ messages in thread From: Alexander Gordeev @ 2020-09-08 7:46 UTC (permalink / raw) To: Christophe Leroy, Michael Ellerman Cc: Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras, linux-sparc, Claudio Imbrenda, Will Deacon, linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86, Russell King, Christian Borntraeger, Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds, Mike Rapoport On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote: > You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems. Yes, and also two more sources :/ arch/powerpc/mm/kasan/8xx.c arch/powerpc/mm/kasan/kasan_init_32.c But these two are not quite obvious wrt pgd_addr_end() used while traversing pmds. Could you please clarify a bit? 
diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c index 2784224..89c5053 100644 --- a/arch/powerpc/mm/kasan/8xx.c +++ b/arch/powerpc/mm/kasan/8xx.c @@ -15,8 +15,8 @@ for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) { pte_basic_t *new; - k_next = pgd_addr_end(k_cur, k_end); - k_next = pgd_addr_end(k_next, k_end); + k_next = pmd_addr_end(k_cur, k_end); + k_next = pmd_addr_end(k_next, k_end); if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte) continue; diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c index fb29404..3f7d6dc6 100644 --- a/arch/powerpc/mm/kasan/kasan_init_32.c +++ b/arch/powerpc/mm/kasan/kasan_init_32.c @@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_ for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) { pte_t *new; - k_next = pgd_addr_end(k_cur, k_end); + k_next = pmd_addr_end(k_cur, k_end); if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte) continue; @@ -196,7 +196,7 @@ void __init kasan_early_init(void) kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL); do { - next = pgd_addr_end(addr, end); + next = pmd_addr_end(addr, end); pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte); } while (pmd++, addr = next, addr != end); > Christophe ^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware 2020-09-08 7:46 ` Alexander Gordeev @ 2020-09-08 8:16 ` Christophe Leroy 2020-09-08 14:15 ` Alexander Gordeev 0 siblings, 1 reply; 62+ messages in thread From: Christophe Leroy @ 2020-09-08 8:16 UTC (permalink / raw) To: Alexander Gordeev, Michael Ellerman Cc: Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras, linux-sparc, Claudio Imbrenda, Will Deacon, linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86, Russell King, Christian Borntraeger, Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds, Mike Rapoport Le 08/09/2020 à 09:46, Alexander Gordeev a écrit : > On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote: >> You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems. > > Yes, and also two more sources :/ > arch/powerpc/mm/kasan/8xx.c > arch/powerpc/mm/kasan/kasan_init_32.c > > But these two are not quite obvious wrt pgd_addr_end() used > while traversing pmds. Could you please clarify a bit? > > > diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c > index 2784224..89c5053 100644 > --- a/arch/powerpc/mm/kasan/8xx.c > +++ b/arch/powerpc/mm/kasan/8xx.c > @@ -15,8 +15,8 @@ > for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) { > pte_basic_t *new; > > - k_next = pgd_addr_end(k_cur, k_end); > - k_next = pgd_addr_end(k_next, k_end); > + k_next = pmd_addr_end(k_cur, k_end); > + k_next = pmd_addr_end(k_next, k_end); No, I don't think so. On powerpc32 we have only two levels, so pgd and pmd are more or less the same. But pmd_addr_end() as defined in include/asm-generic/pgtable-nopmd.h is a no-op, so I don't think it will work. 
It is likely that this function should iterate on pgd, then you get pmd = pmd_offset(pud_offset(p4d_offset(pgd))); > if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte) > continue; > > diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c > index fb29404..3f7d6dc6 100644 > --- a/arch/powerpc/mm/kasan/kasan_init_32.c > +++ b/arch/powerpc/mm/kasan/kasan_init_32.c > @@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_ > for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) { > pte_t *new; > > - k_next = pgd_addr_end(k_cur, k_end); > + k_next = pmd_addr_end(k_cur, k_end); Same here I get, iterate on pgd then get pmd = pmd_offset(pud_offset(p4d_offset(pgd))); > if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte) > continue; > > @@ -196,7 +196,7 @@ void __init kasan_early_init(void) > kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL); > > do { > - next = pgd_addr_end(addr, end); > + next = pmd_addr_end(addr, end); > pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte); > } while (pmd++, addr = next, addr != end); > > Christophe ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware 2020-09-08 8:16 ` Christophe Leroy @ 2020-09-08 14:15 ` Alexander Gordeev 2020-09-09 8:38 ` Christophe Leroy 0 siblings, 1 reply; 62+ messages in thread From: Alexander Gordeev @ 2020-09-08 14:15 UTC (permalink / raw) To: Christophe Leroy Cc: Michael Ellerman, Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras, linux-sparc, Claudio Imbrenda, Will Deacon, linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86, Russell King, Christian Borntraeger, Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds, Mike Rapoport On Tue, Sep 08, 2020 at 10:16:49AM +0200, Christophe Leroy wrote: > >Yes, and also two more sources :/ > > arch/powerpc/mm/kasan/8xx.c > > arch/powerpc/mm/kasan/kasan_init_32.c > > > >But these two are not quite obvious wrt pgd_addr_end() used > >while traversing pmds. Could you please clarify a bit? > > > > > >diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c > >index 2784224..89c5053 100644 > >--- a/arch/powerpc/mm/kasan/8xx.c > >+++ b/arch/powerpc/mm/kasan/8xx.c > >@@ -15,8 +15,8 @@ > > for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) { > > pte_basic_t *new; > >- k_next = pgd_addr_end(k_cur, k_end); > >- k_next = pgd_addr_end(k_next, k_end); > >+ k_next = pmd_addr_end(k_cur, k_end); > >+ k_next = pmd_addr_end(k_next, k_end); > > No, I don't think so. > On powerpc32 we have only two levels, so pgd and pmd are more or > less the same. > But pmd_addr_end() as defined in include/asm-generic/pgtable-nopmd.h > is a no-op, so I don't think it will work. 
> > It is likely that this function should iterate on pgd, then you get > pmd = pmd_offset(pud_offset(p4d_offset(pgd))); It looks like the code iterates over a single pmd table while using pgd_addr_end() only to skip all the middle levels and bail out from the loop. I would be wary of switching from pmds to pgds, since we are trying to minimize impact (especially functional) and the rework does not seem that obvious. Assuming pmd and pgd are the same, would such an approach actually work for now? diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c index 2784224..94466cc 100644 --- a/arch/powerpc/mm/kasan/8xx.c +++ b/arch/powerpc/mm/kasan/8xx.c @@ -15,8 +15,8 @@ for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) { pte_basic_t *new; - k_next = pgd_addr_end(k_cur, k_end); - k_next = pgd_addr_end(k_next, k_end); + k_next = pgd_addr_end(__pgd(pmd_val(*pmd)), k_cur, k_end); + k_next = pgd_addr_end(__pgd(pmd_val(*(pmd + 1))), k_next, k_end); if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte) continue; diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c index fb29404..c0bcd64 100644 --- a/arch/powerpc/mm/kasan/kasan_init_32.c +++ b/arch/powerpc/mm/kasan/kasan_init_32.c @@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_ for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) { pte_t *new; - k_next = pgd_addr_end(k_cur, k_end); + k_next = pgd_addr_end(__pgd(pmd_val(*pmd)), k_cur, k_end); if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte) continue; @@ -196,7 +196,7 @@ void __init kasan_early_init(void) kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL); do { - next = pgd_addr_end(addr, end); + next = pgd_addr_end(__pgd(pmd_val(*pmd)), addr, end); pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte); } while (pmd++, addr = next, addr != end); Alternatively we could pass an invalid pgd to keep the code structure intact,
but that of course is less nice. Thanks! > Christophe ^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware 2020-09-08 14:15 ` Alexander Gordeev @ 2020-09-09 8:38 ` Christophe Leroy 0 siblings, 0 replies; 62+ messages in thread From: Christophe Leroy @ 2020-09-09 8:38 UTC (permalink / raw) To: Alexander Gordeev Cc: Michael Ellerman, Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras, linux-sparc, Claudio Imbrenda, Will Deacon, linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86, Russell King, Christian Borntraeger, Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds, Mike Rapoport On Tue, 2020-09-08 at 16:15 +0200, Alexander Gordeev wrote: > On Tue, Sep 08, 2020 at 10:16:49AM +0200, Christophe Leroy wrote: > > >Yes, and also two more sources :/ > > > arch/powerpc/mm/kasan/8xx.c > > > arch/powerpc/mm/kasan/kasan_init_32.c > > > > > >But these two are not quite obvious wrt pgd_addr_end() used > > >while traversing pmds. Could you please clarify a bit? > > > > > > > > >diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c > > >index 2784224..89c5053 100644 > > >--- a/arch/powerpc/mm/kasan/8xx.c > > >+++ b/arch/powerpc/mm/kasan/8xx.c > > >@@ -15,8 +15,8 @@ > > > for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) { > > > pte_basic_t *new; > > >- k_next = pgd_addr_end(k_cur, k_end); > > >- k_next = pgd_addr_end(k_next, k_end); > > >+ k_next = pmd_addr_end(k_cur, k_end); > > >+ k_next = pmd_addr_end(k_next, k_end); > > > > No, I don't think so. > > On powerpc32 we have only two levels, so pgd and pmd are more or > > less the same. > > But pmd_addr_end() as defined in include/asm-generic/pgtable-nopmd.h > > is a no-op, so I don't think it will work. 
> > > > It is likely that this function should iterate on pgd, then you get > > pmd = pmd_offset(pud_offset(p4d_offset(pgd))); > > It looks like the code iterates over single pmd table while using > pgd_addr_end() only to skip all the middle levels and bail out > from the loop. > > I would be wary for switching from pmds to pgds, since we are > trying to minimize impact (especially functional) and the > rework does not seem that obvious. > I've just tested the following change, it works and should fix the oddity: diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c index 2784224054f8..8e53ddf57b84 100644 --- a/arch/powerpc/mm/kasan/8xx.c +++ b/arch/powerpc/mm/kasan/8xx.c @@ -9,11 +9,12 @@ static int __init kasan_init_shadow_8M(unsigned long k_start, unsigned long k_end, void *block) { - pmd_t *pmd = pmd_off_k(k_start); + pgd_t *pgd = pgd_offset_k(k_start); unsigned long k_cur, k_next; - for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) { + for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pgd += 2, block += SZ_8M) { pte_basic_t *new; + pmd_t *pmd = pmd_offset(pud_offset(p4d_offset(pgd, k_cur), k_cur), k_cur); k_next = pgd_addr_end(k_cur, k_end); k_next = pgd_addr_end(k_next, k_end); diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c index fb294046e00e..e5f524fa71a7 100644 --- a/arch/powerpc/mm/kasan/kasan_init_32.c +++ b/arch/powerpc/mm/kasan/kasan_init_32.c @@ -30,13 +30,12 @@ static void __init kasan_populate_pte(pte_t *ptep, pgprot_t prot) int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_end) { - pmd_t *pmd; + pgd_t *pgd = pgd_offset_k(k_start); unsigned long k_cur, k_next; - pmd = pmd_off_k(k_start); - - for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) { + for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pgd++) { pte_t *new; + pmd_t *pmd = pmd_offset(pud_offset(p4d_offset(pgd, k_cur), k_cur), k_cur); k_next = 
pgd_addr_end(k_cur, k_end); if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte) @@ -189,16 +188,18 @@ void __init kasan_early_init(void) unsigned long addr = KASAN_SHADOW_START; unsigned long end = KASAN_SHADOW_END; unsigned long next; - pmd_t *pmd = pmd_off_k(addr); + pgd_t *pgd = pgd_offset_k(addr); BUILD_BUG_ON(KASAN_SHADOW_START & ~PGDIR_MASK); kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL); do { + pmd_t *pmd = pmd_offset(pud_offset(p4d_offset(pgd, addr), addr), addr); + next = pgd_addr_end(addr, end); pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte); - } while (pmd++, addr = next, addr != end); + } while (pgd++, addr = next, addr != end); if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE)) kasan_early_hash_table(); --- Christophe ^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware 2020-09-08 5:14 ` Christophe Leroy 2020-09-08 7:46 ` Alexander Gordeev @ 2020-09-08 14:25 ` Alexander Gordeev 1 sibling, 0 replies; 62+ messages in thread From: Alexander Gordeev @ 2020-09-08 14:25 UTC (permalink / raw) To: Christophe Leroy, Michael Ellerman Cc: Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras, linux-sparc, Claudio Imbrenda, Will Deacon, linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86, Russell King, Christian Borntraeger, Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds, Mike Rapoport On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote: [...] > You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems. If this one would be okay? diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c index 60c6ea16..3690d22 100644 --- a/arch/powerpc/mm/book3s64/subpage_prot.c +++ b/arch/powerpc/mm/book3s64/subpage_prot.c @@ -88,6 +88,7 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned long addr, static void subpage_prot_clear(unsigned long addr, unsigned long len) { struct mm_struct *mm = current->mm; + pmd_t *pmd = pmd_off(mm, addr); struct subpage_prot_table *spt; u32 **spm, *spp; unsigned long i; @@ -103,8 +104,8 @@ static void subpage_prot_clear(unsigned long addr, unsigned long len) limit = addr + len; if (limit > spt->maxaddr) limit = spt->maxaddr; - for (; addr < limit; addr = next) { - next = pmd_addr_end(addr, limit); + for (; addr < limit; addr = next, pmd++) { + next = pmd_addr_end(*pmd, addr, limit); if (addr < 0x100000000UL) { spm = spt->low_prot; } else { @@ -191,6 +192,7 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr, unsigned long, len, u32 
__user *, map) { struct mm_struct *mm = current->mm; + pmd_t *pmd = pmd_off(mm, addr); struct subpage_prot_table *spt; u32 **spm, *spp; unsigned long i; @@ -236,8 +238,8 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr, } subpage_mark_vma_nohuge(mm, addr, len); - for (limit = addr + len; addr < limit; addr = next) { - next = pmd_addr_end(addr, limit); + for (limit = addr + len; addr < limit; addr = next, pmd++) { + next = pmd_addr_end(*pmd, addr, limit); err = -ENOMEM; if (addr < 0x100000000UL) { spm = spt->low_prot; Thanks! > Christophe ^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware 2020-09-07 18:00 ` [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware Gerald Schaefer 2020-09-08 5:14 ` Christophe Leroy @ 2020-09-08 13:26 ` Jason Gunthorpe 2020-09-08 14:33 ` Dave Hansen 2 siblings, 0 replies; 62+ messages in thread From: Jason Gunthorpe @ 2020-09-08 13:26 UTC (permalink / raw) To: Gerald Schaefer Cc: John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Mon, Sep 07, 2020 at 08:00:57PM +0200, Gerald Schaefer wrote: > From: Alexander Gordeev <agordeev@linux.ibm.com> > > Unlike all other page-table abstractions pXd_addr_end() do not take > into account a particular table entry in which context the functions > are called. On architectures with dynamic page-tables folding that > might lead to lack of necessary information that is difficult to > obtain other than from the table entry itself. That already led to > a subtle memory corruption issue on s390. > > By letting pXd_addr_end() functions know about the page-table entry > we allow archs not only make extra checks, but also optimizations. > > As result of this change the pXd_addr_end_folded() functions used > in gup_fast traversal code become unnecessary and get replaced with > universal pXd_addr_end() variants. > > The arch-specific updates not only add dereferencing of page-table > entry pointers, but also small changes to the code flow to make those > dereferences possible, at least for x86 and powerpc. 
Also for arm64, > but in way that should not have any impact. > > So, even though the dereferenced page-table entries are not used on > archs other than s390, and are optimized out by the compiler, there > is a small change in kernel size and this is what bloat-o-meter reports: This looks pretty clean and straightforward, only __munlock_pagevec_fill() had any real increased complexity. Thanks, Jason ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware 2020-09-07 18:00 ` [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware Gerald Schaefer 2020-09-08 5:14 ` Christophe Leroy 2020-09-08 13:26 ` Jason Gunthorpe @ 2020-09-08 14:33 ` Dave Hansen 2 siblings, 0 replies; 62+ messages in thread From: Dave Hansen @ 2020-09-08 14:33 UTC (permalink / raw) To: Gerald Schaefer, Jason Gunthorpe, John Hubbard Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On 9/7/20 11:00 AM, Gerald Schaefer wrote: > x86: > add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10) > Function old new delta > vmemmap_populate 587 592 +5 > munlock_vma_pages_range 556 561 +5 > Total: Before=15534694, After=15534704, chg +0.00% ... > arch/x86/mm/init_64.c | 15 ++++----- > arch/x86/mm/kasan_init_64.c | 16 +++++----- I didn't do a super thorough review on this, but it generally looks OK and the benefits of sharing more code between arches certainly outweigh a few bytes of binary growth. For the x86 bits at least, feel free to add my ack. ^ permalink raw reply [flat|nested] 62+ messages in thread
* [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions 2020-09-07 18:00 [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding Gerald Schaefer 2020-09-07 18:00 ` [RFC PATCH v2 1/3] " Gerald Schaefer 2020-09-07 18:00 ` [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware Gerald Schaefer @ 2020-09-07 18:00 ` Gerald Schaefer 2020-09-07 20:15 ` Mike Rapoport 2020-09-08 5:19 ` Christophe Leroy 2020-09-07 20:12 ` [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding Mike Rapoport 2020-09-08 4:42 ` Christophe Leroy 4 siblings, 2 replies; 62+ messages in thread From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw) To: Jason Gunthorpe, John Hubbard Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda From: Alexander Gordeev <agordeev@linux.ibm.com> Since pXd_addr_end() macros take pXd page-table entry as a parameter it makes sense to check the entry type on compile. Even though most archs do not make use of page-table entries in pXd_addr_end() calls, checking the type in traversal code paths could help to avoid subtle bugs. 
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 include/linux/pgtable.h | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 67ebc22cf83d..d9e7d16c2263 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  */
 
 #ifndef pgd_addr_end
-#define pgd_addr_end(pgd, addr, end)					\
-({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
-})
+#define pgd_addr_end pgd_addr_end
+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
+{	unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
+	return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef p4d_addr_end
-#define p4d_addr_end(p4d, addr, end)					\
-({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
-})
+#define p4d_addr_end p4d_addr_end
+static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
+{	unsigned long __boundary = (addr + P4D_SIZE) & P4D_MASK;
+	return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef pud_addr_end
-#define pud_addr_end(pud, addr, end)					\
-({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
-})
+#define pud_addr_end pud_addr_end
+static inline unsigned long pud_addr_end(pud_t pud, unsigned long addr, unsigned long end)
+{	unsigned long __boundary = (addr + PUD_SIZE) & PUD_MASK;
+	return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef pmd_addr_end
-#define pmd_addr_end(pmd, addr, end)					\
-({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
-})
+#define pmd_addr_end pmd_addr_end
+static inline unsigned long pmd_addr_end(pmd_t pmd, unsigned long addr, unsigned long end)
+{	unsigned long __boundary = (addr + PMD_SIZE) & PMD_MASK;
+	return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 /*
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions 2020-09-07 18:00 ` [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions Gerald Schaefer @ 2020-09-07 20:15 ` Mike Rapoport 2020-09-08 5:19 ` Christophe Leroy 1 sibling, 0 replies; 62+ messages in thread From: Mike Rapoport @ 2020-09-07 20:15 UTC (permalink / raw) To: Gerald Schaefer Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda Hi, Some style comments below. On Mon, Sep 07, 2020 at 08:00:58PM +0200, Gerald Schaefer wrote: > From: Alexander Gordeev <agordeev@linux.ibm.com> > > Since pXd_addr_end() macros take pXd page-table entry as a > parameter it makes sense to check the entry type on compile. > Even though most archs do not make use of page-table entries > in pXd_addr_end() calls, checking the type in traversal code > paths could help to avoid subtle bugs. 
> > Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> > Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> > --- > include/linux/pgtable.h | 36 ++++++++++++++++++++---------------- > 1 file changed, 20 insertions(+), 16 deletions(-) > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index 67ebc22cf83d..d9e7d16c2263 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm, > */ > > #ifndef pgd_addr_end > -#define pgd_addr_end(pgd, addr, end) \ > -({ unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK; \ > - (__boundary - 1 < (end) - 1)? __boundary: (end); \ > -}) > +#define pgd_addr_end pgd_addr_end > +static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end) > +{ unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK; The code should be on a separate line from the curly brace. Besides, since this is not a macro anymore, I think it would be nicer to use 'boundary' without underscores. This applies to the changes below as well. > + return (__boundary - 1 < end - 1) ? __boundary : end; > +} > #endif > > #ifndef p4d_addr_end > -#define p4d_addr_end(p4d, addr, end) \ > -({ unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK; \ > - (__boundary - 1 < (end) - 1)? __boundary: (end); \ > -}) > +#define p4d_addr_end p4d_addr_end > +static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end) > +{ unsigned long __boundary = (addr + P4D_SIZE) & P4D_MASK; > + return (__boundary - 1 < end - 1) ? __boundary : end; > +} > #endif > > #ifndef pud_addr_end > -#define pud_addr_end(pud, addr, end) \ > -({ unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK; \ > - (__boundary - 1 < (end) - 1)? 
__boundary: (end); \ > -}) > +#define pud_addr_end pud_addr_end > +static inline unsigned long pud_addr_end(pud_t pud, unsigned long addr, unsigned long end) > +{ unsigned long __boundary = (addr + PUD_SIZE) & PUD_MASK; > + return (__boundary - 1 < end - 1) ? __boundary : end; > +} > #endif > > #ifndef pmd_addr_end > -#define pmd_addr_end(pmd, addr, end) \ > -({ unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK; \ > - (__boundary - 1 < (end) - 1)? __boundary: (end); \ > -}) > +#define pmd_addr_end pmd_addr_end > +static inline unsigned long pmd_addr_end(pmd_t pmd, unsigned long addr, unsigned long end) > +{ unsigned long __boundary = (addr + PMD_SIZE) & PMD_MASK; > + return (__boundary - 1 < end - 1) ? __boundary : end; > +} > #endif > > /* > -- > 2.17.1 > -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions 2020-09-07 18:00 ` [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions Gerald Schaefer 2020-09-07 20:15 ` Mike Rapoport @ 2020-09-08 5:19 ` Christophe Leroy 2020-09-08 15:48 ` Alexander Gordeev 1 sibling, 1 reply; 62+ messages in thread From: Christophe Leroy @ 2020-09-08 5:19 UTC (permalink / raw) To: Gerald Schaefer, Jason Gunthorpe, John Hubbard Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon, linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86, Russell King, Christian Borntraeger, Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds, Mike Rapoport Le 07/09/2020 à 20:00, Gerald Schaefer a écrit : > From: Alexander Gordeev <agordeev@linux.ibm.com> > > Since pXd_addr_end() macros take pXd page-table entry as a > parameter it makes sense to check the entry type on compile. > Even though most archs do not make use of page-table entries > in pXd_addr_end() calls, checking the type in traversal code > paths could help to avoid subtle bugs. > > Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> > Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> > --- > include/linux/pgtable.h | 36 ++++++++++++++++++++---------------- > 1 file changed, 20 insertions(+), 16 deletions(-) > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index 67ebc22cf83d..d9e7d16c2263 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm, > */ > > #ifndef pgd_addr_end > -#define pgd_addr_end(pgd, addr, end) \ > -({ unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK; \ > - (__boundary - 1 < (end) - 1)? 
__boundary: (end); \ > -}) > +#define pgd_addr_end pgd_addr_end I think that #define is pointless, usually there is no such #define for the default case. > +static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end) > +{ unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK; > + return (__boundary - 1 < end - 1) ? __boundary : end; > +} Please use the standard layout, ie entry { and exit } alone on their line, and space between local vars declaration and the rest. Also remove the leading __ in front of var names as it's not needed once it is not macros anymore. f_name() { some_local_var; do_something(); } > #endif > > #ifndef p4d_addr_end > -#define p4d_addr_end(p4d, addr, end) \ > -({ unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK; \ > - (__boundary - 1 < (end) - 1)? __boundary: (end); \ > -}) > +#define p4d_addr_end p4d_addr_end > +static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end) > +{ unsigned long __boundary = (addr + P4D_SIZE) & P4D_MASK; > + return (__boundary - 1 < end - 1) ? __boundary : end; > +} > #endif > > #ifndef pud_addr_end > -#define pud_addr_end(pud, addr, end) \ > -({ unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK; \ > - (__boundary - 1 < (end) - 1)? __boundary: (end); \ > -}) > +#define pud_addr_end pud_addr_end > +static inline unsigned long pud_addr_end(pud_t pud, unsigned long addr, unsigned long end) > +{ unsigned long __boundary = (addr + PUD_SIZE) & PUD_MASK; > + return (__boundary - 1 < end - 1) ? __boundary : end; > +} > #endif > > #ifndef pmd_addr_end > -#define pmd_addr_end(pmd, addr, end) \ > -({ unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK; \ > - (__boundary - 1 < (end) - 1)? 
__boundary: (end); \ > -}) > +#define pmd_addr_end pmd_addr_end > +static inline unsigned long pmd_addr_end(pmd_t pmd, unsigned long addr, unsigned long end) > +{ unsigned long __boundary = (addr + PMD_SIZE) & PMD_MASK; > + return (__boundary - 1 < end - 1) ? __boundary : end; > +} > #endif > > /* > ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions 2020-09-08 5:19 ` Christophe Leroy @ 2020-09-08 15:48 ` Alexander Gordeev 2020-09-08 17:20 ` Christophe Leroy 0 siblings, 1 reply; 62+ messages in thread From: Alexander Gordeev @ 2020-09-08 15:48 UTC (permalink / raw) To: Christophe Leroy Cc: Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras, linux-sparc, Claudio Imbrenda, Will Deacon, linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86, Russell King, Christian Borntraeger, Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds, Mike Rapoport On Tue, Sep 08, 2020 at 07:19:38AM +0200, Christophe Leroy wrote: [...] > >diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > >index 67ebc22cf83d..d9e7d16c2263 100644 > >--- a/include/linux/pgtable.h > >+++ b/include/linux/pgtable.h > >@@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm, > > */ > > #ifndef pgd_addr_end > >-#define pgd_addr_end(pgd, addr, end) \ > >-({ unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK; \ > >- (__boundary - 1 < (end) - 1)? __boundary: (end); \ > >-}) > >+#define pgd_addr_end pgd_addr_end > > I think that #define is pointless, usually there is no such #define > for the default case. Default pgd_addr_end() gets overriden on s390 (arch/s390/include/asm/pgtable.h): #define pgd_addr_end pgd_addr_end static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end) { return rste_addr_end_folded(pgd_val(pgd), addr, end); } > >+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end) > >+{ unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK; > >+ return (__boundary - 1 < end - 1) ? 
__boundary : end; > >+} ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions 2020-09-08 15:48 ` Alexander Gordeev @ 2020-09-08 17:20 ` Christophe Leroy 0 siblings, 0 replies; 62+ messages in thread From: Christophe Leroy @ 2020-09-08 17:20 UTC (permalink / raw) To: Alexander Gordeev Cc: Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras, linux-sparc, Claudio Imbrenda, Will Deacon, linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86, Russell King, Christian Borntraeger, Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds, Mike Rapoport Le 08/09/2020 à 17:48, Alexander Gordeev a écrit : > On Tue, Sep 08, 2020 at 07:19:38AM +0200, Christophe Leroy wrote: > > [...] > >>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >>> index 67ebc22cf83d..d9e7d16c2263 100644 >>> --- a/include/linux/pgtable.h >>> +++ b/include/linux/pgtable.h >>> @@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm, >>> */ >>> #ifndef pgd_addr_end >>> -#define pgd_addr_end(pgd, addr, end) \ >>> -({ unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK; \ >>> - (__boundary - 1 < (end) - 1)? __boundary: (end); \ >>> -}) >>> +#define pgd_addr_end pgd_addr_end >> >> I think that #define is pointless, usually there is no such #define >> for the default case. 
> > Default pgd_addr_end() gets overriden on s390 (arch/s390/include/asm/pgtable.h): > > #define pgd_addr_end pgd_addr_end > static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end) > { > return rste_addr_end_folded(pgd_val(pgd), addr, end); > } Yes, there in s390 the #define is needed to hit the #ifndef pgd_addr_end that's in include/linux/pgtable.h But in include/linux/pgtable.h, there is no need of an #define pgd_addr_end pgd_addr_end I think > >>> +static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end) >>> +{ unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK; >>> + return (__boundary - 1 < end - 1) ? __boundary : end; >>> +} Christophe ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-07 18:00 [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding Gerald Schaefer ` (2 preceding siblings ...) 2020-09-07 18:00 ` [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions Gerald Schaefer @ 2020-09-07 20:12 ` Mike Rapoport 2020-09-08 5:22 ` Christophe Leroy 2020-09-08 4:42 ` Christophe Leroy 4 siblings, 1 reply; 62+ messages in thread From: Mike Rapoport @ 2020-09-07 20:12 UTC (permalink / raw) To: Gerald Schaefer Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds, Russell King, Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda On Mon, Sep 07, 2020 at 08:00:55PM +0200, Gerald Schaefer wrote: > This is v2 of an RFC previously discussed here: > https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/ > > Patch 1 is a fix for a regression in gup_fast on s390, after our conversion > to common gup_fast code. It will introduce special helper functions > pXd_addr_end_folded(), which have to be used in places where pagetable walk > is done w/o lock and with READ_ONCE, so currently only in gup_fast. > > Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end() > themselves by adding an extra pXd value parameter. That was suggested by > Jason during v1 discussion, because he is already thinking of some other > places where he might want to switch to the READ_ONCE logic for pagetable > walks. 
In general, that would be the cleanest / safest solution, but there > is some impact on other architectures and common code, hence the new and > greatly enlarged recipient list. > > Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline > functions instead of #defines, so that we get some type checking for the > new pXd value parameter. > > Not sure about Fixes/stable tags for the generic solution. Only patch 1 > fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might > still be nice to have in stable, to ease future backports, but I guess > "nice to have" does not really qualify for stable backports. I also think that adding pXd parameter to pXd_addr_end() is a cleaner way and with this patch 1 is not really required. I would even merge patches 2 and 3 into a single patch and use only it as the fix. [ /me apologises to stable@ team :-) ] > Changes in v2: > - Pick option 2 from v1 discussion (pXd_addr_end_folded helpers) > - Add patch 2 + 3 for more generic approach > > Alexander Gordeev (3): > mm/gup: fix gup_fast with dynamic page table folding > mm: make pXd_addr_end() functions page-table entry aware > mm: make generic pXd_addr_end() macros inline functions > > arch/arm/include/asm/pgtable-2level.h | 2 +- > arch/arm/mm/idmap.c | 6 ++-- > arch/arm/mm/mmu.c | 8 ++--- > arch/arm64/kernel/hibernate.c | 16 +++++---- > arch/arm64/kvm/mmu.c | 16 ++++----- > arch/arm64/mm/kasan_init.c | 8 ++--- > arch/arm64/mm/mmu.c | 25 +++++++------- > arch/powerpc/mm/book3s64/radix_pgtable.c | 7 ++-- > arch/powerpc/mm/hugetlbpage.c | 6 ++-- > arch/s390/include/asm/pgtable.h | 42 ++++++++++++++++++++++++ > arch/s390/mm/page-states.c | 8 ++--- > arch/s390/mm/pageattr.c | 8 ++--- > arch/s390/mm/vmem.c | 8 ++--- > arch/sparc/mm/hugetlbpage.c | 6 ++-- > arch/um/kernel/tlb.c | 8 ++--- > arch/x86/mm/init_64.c | 15 ++++----- > arch/x86/mm/kasan_init_64.c | 16 ++++----- > include/asm-generic/pgtable-nop4d.h | 2 +- > include/asm-generic/pgtable-nopmd.h | 2 +- > 
include/asm-generic/pgtable-nopud.h | 2 +- > include/linux/pgtable.h | 38 ++++++++++++--------- > mm/gup.c | 8 ++--- > mm/ioremap.c | 8 ++--- > mm/kasan/init.c | 17 +++++----- > mm/madvise.c | 4 +-- > mm/memory.c | 40 +++++++++++----------- > mm/mlock.c | 18 +++++++--- > mm/mprotect.c | 8 ++--- > mm/pagewalk.c | 8 ++--- > mm/swapfile.c | 8 ++--- > mm/vmalloc.c | 16 ++++----- > 31 files changed, 219 insertions(+), 165 deletions(-) > > -- > 2.17.1 > -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-07 20:12 ` [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding Mike Rapoport @ 2020-09-08 5:22 ` Christophe Leroy 2020-09-08 17:36 ` Gerald Schaefer 0 siblings, 1 reply; 62+ messages in thread From: Christophe Leroy @ 2020-09-08 5:22 UTC (permalink / raw) To: Mike Rapoport, Gerald Schaefer Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon, linux-arch, linux-s390, Vasily Gorbik, Christian Borntraeger, Richard Weinberger, linux-x86, Russell King, Jason Gunthorpe, Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, John Hubbard, Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds Le 07/09/2020 à 22:12, Mike Rapoport a écrit : > On Mon, Sep 07, 2020 at 08:00:55PM +0200, Gerald Schaefer wrote: >> This is v2 of an RFC previously discussed here: >> https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/ >> >> Patch 1 is a fix for a regression in gup_fast on s390, after our conversion >> to common gup_fast code. It will introduce special helper functions >> pXd_addr_end_folded(), which have to be used in places where pagetable walk >> is done w/o lock and with READ_ONCE, so currently only in gup_fast. >> >> Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end() >> themselves by adding an extra pXd value parameter. That was suggested by >> Jason during v1 discussion, because he is already thinking of some other >> places where he might want to switch to the READ_ONCE logic for pagetable >> walks. In general, that would be the cleanest / safest solution, but there >> is some impact on other architectures and common code, hence the new and >> greatly enlarged recipient list. 
>> >> Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline >> functions instead of #defines, so that we get some type checking for the >> new pXd value parameter. >> >> Not sure about Fixes/stable tags for the generic solution. Only patch 1 >> fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might >> still be nice to have in stable, to ease future backports, but I guess >> "nice to have" does not really qualify for stable backports. > > I also think that adding pXd parameter to pXd_addr_end() is a cleaner > way and with this patch 1 is not really required. I would even merge > patches 2 and 3 into a single patch and use only it as the fix. Why not merging patches 2 and 3, but I would keep patch 1 separate but after the generic changes, so that we first do the generic changes, then we do the specific S390 use of it. Christophe ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-08 5:22 ` Christophe Leroy @ 2020-09-08 17:36 ` Gerald Schaefer 2020-09-09 16:12 ` Gerald Schaefer 0 siblings, 1 reply; 62+ messages in thread From: Gerald Schaefer @ 2020-09-08 17:36 UTC (permalink / raw) To: Christophe Leroy Cc: Mike Rapoport, Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon, linux-arch, linux-s390, Vasily Gorbik, Christian Borntraeger, Richard Weinberger, linux-x86, Russell King, Jason Gunthorpe, Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, John Hubbard, Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds On Tue, 8 Sep 2020 07:22:39 +0200 Christophe Leroy <christophe.leroy@csgroup.eu> wrote: > > > Le 07/09/2020 à 22:12, Mike Rapoport a écrit : > > On Mon, Sep 07, 2020 at 08:00:55PM +0200, Gerald Schaefer wrote: > >> This is v2 of an RFC previously discussed here: > >> https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/ > >> > >> Patch 1 is a fix for a regression in gup_fast on s390, after our conversion > >> to common gup_fast code. It will introduce special helper functions > >> pXd_addr_end_folded(), which have to be used in places where pagetable walk > >> is done w/o lock and with READ_ONCE, so currently only in gup_fast. > >> > >> Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end() > >> themselves by adding an extra pXd value parameter. That was suggested by > >> Jason during v1 discussion, because he is already thinking of some other > >> places where he might want to switch to the READ_ONCE logic for pagetable > >> walks. In general, that would be the cleanest / safest solution, but there > >> is some impact on other architectures and common code, hence the new and > >> greatly enlarged recipient list. 
> >> > >> Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline > >> functions instead of #defines, so that we get some type checking for the > >> new pXd value parameter. > >> > >> Not sure about Fixes/stable tags for the generic solution. Only patch 1 > >> fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might > >> still be nice to have in stable, to ease future backports, but I guess > >> "nice to have" does not really qualify for stable backports. > > > > I also think that adding pXd parameter to pXd_addr_end() is a cleaner > > way and with this patch 1 is not really required. I would even merge > > patches 2 and 3 into a single patch and use only it as the fix. > > Why not merging patches 2 and 3, but I would keep patch 1 separate but > after the generic changes, so that we first do the generic changes, then > we do the specific S390 use of it. Yes, we thought about that approach too. It would at least allow to get all into stable, more or less nicely, as prerequisite for the s390 fix. Two concerns kept us from going that way. For once, it might not be the nicest way to get it all in stable, and we would not want to risk further objections due to the imminent and rather scary data corruption issue that we want to fix asap. For the same reason, we thought that the generalization part might need more time and agreement from various people, so that we could at least get the first patch as short-term solution. It seems now that the generalization is very well accepted so far, apart from some apparent issues on arm. Also, merging 2 + 3 and putting them first seems to be acceptable, so we could do that for v3, if there are no objections. Of course, we first need to address the few remaining issues for arm(32?), which do look quite confusing to me so far. BTW, sorry for the compile error with patch 3, I guess we did the cross-compile only for 1 + 2 applied, to see the bloat-o-meter changes. 
But I guess patch 3 already proved its usefulness by that :-) ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-08 17:36 ` Gerald Schaefer @ 2020-09-09 16:12 ` Gerald Schaefer 0 siblings, 0 replies; 62+ messages in thread From: Gerald Schaefer @ 2020-09-09 16:12 UTC (permalink / raw) To: Christophe Leroy Cc: Mike Rapoport, Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon, linux-arch, linux-s390, Vasily Gorbik, Christian Borntraeger, Richard Weinberger, linux-x86, Russell King, Jason Gunthorpe, Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, John Hubbard, Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds On Tue, 8 Sep 2020 19:36:50 +0200 Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: [..] > > It seems now that the generalization is very well accepted so far, > apart from some apparent issues on arm. Also, merging 2 + 3 and > putting them first seems to be acceptable, so we could do that for > v3, if there are no objections. > > Of course, we first need to address the few remaining issues for > arm(32?), which do look quite confusing to me so far. BTW, sorry for > the compile error with patch 3, I guess we did the cross-compile only > for 1 + 2 applied, to see the bloat-o-meter changes. But I guess > patch 3 already proved its usefulness by that :-) Umm, replace "arm" with "power", sorry. No issues on arm so far, but also no ack I think. Thanks to Christophe for the power change, and to Mike for volunteering for some cross compilation and cross-arch testing. Will send v3 with merged and re-ordered patches after some more testing. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding 2020-09-07 18:00 [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding Gerald Schaefer ` (3 preceding siblings ...) 2020-09-07 20:12 ` [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding Mike Rapoport @ 2020-09-08 4:42 ` Christophe Leroy 4 siblings, 0 replies; 62+ messages in thread From: Christophe Leroy @ 2020-09-08 4:42 UTC (permalink / raw) To: Gerald Schaefer, Jason Gunthorpe, John Hubbard Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon, linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86, Russell King, Christian Borntraeger, Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds, Mike Rapoport Le 07/09/2020 à 20:00, Gerald Schaefer a écrit : > This is v2 of an RFC previously discussed here: > https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/ > > Patch 1 is a fix for a regression in gup_fast on s390, after our conversion > to common gup_fast code. It will introduce special helper functions > pXd_addr_end_folded(), which have to be used in places where pagetable walk > is done w/o lock and with READ_ONCE, so currently only in gup_fast. > > Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end() > themselves by adding an extra pXd value parameter. That was suggested by > Jason during v1 discussion, because he is already thinking of some other > places where he might want to switch to the READ_ONCE logic for pagetable > walks. In general, that would be the cleanest / safest solution, but there > is some impact on other architectures and common code, hence the new and > greatly enlarged recipient list. 
> > Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline > functions instead of #defines, so that we get some type checking for the > new pXd value parameter. > > Not sure about Fixes/stable tags for the generic solution. Only patch 1 > fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might > still be nice to have in stable, to ease future backports, but I guess > "nice to have" does not really qualify for stable backports. If one day you have to backport a fix that requires patch 2 and/or 3, just mark it "depends-on:" and the patches will go in stable at the relevant time. Christophe ^ permalink raw reply [flat|nested] 62+ messages in thread
end of thread, other threads:[~2020-09-15 17:31 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-07 18:00 [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding Gerald Schaefer
2020-09-07 18:00 ` [RFC PATCH v2 1/3] " Gerald Schaefer
2020-09-08  5:06   ` Christophe Leroy
2020-09-08 12:09     ` Christian Borntraeger
2020-09-08 12:40       ` Christophe Leroy
2020-09-08 13:38         ` Gerald Schaefer
2020-09-08 14:30   ` Dave Hansen
2020-09-08 17:59     ` Gerald Schaefer
2020-09-09 12:29       ` Gerald Schaefer
2020-09-09 16:18         ` Dave Hansen
2020-09-09 17:25           ` Gerald Schaefer
2020-09-09 18:03             ` Jason Gunthorpe
2020-09-10  9:39               ` Alexander Gordeev
2020-09-10 13:02                 ` Jason Gunthorpe
2020-09-10 13:28                   ` Gerald Schaefer
2020-09-10 15:10                     ` Jason Gunthorpe
2020-09-10 17:07                       ` Gerald Schaefer
2020-09-10 17:19                         ` Jason Gunthorpe
2020-09-10 17:57                           ` Gerald Schaefer
2020-09-10 23:21                             ` Jason Gunthorpe
2020-09-10 17:35                       ` Linus Torvalds
2020-09-10 18:13                         ` Jason Gunthorpe
2020-09-10 18:33                           ` Linus Torvalds
2020-09-10 19:10                             ` Gerald Schaefer
2020-09-10 19:32                               ` Linus Torvalds
2020-09-10 21:59                                 ` Jason Gunthorpe
2020-09-11  7:09                                   ` peterz
2020-09-11 11:19                                     ` Jason Gunthorpe
2020-09-11 19:03                                       ` [PATCH] " Vasily Gorbik
2020-09-11 19:09                                         ` Linus Torvalds
2020-09-11 19:40                                           ` Jason Gunthorpe
2020-09-11 20:05                                             ` Jason Gunthorpe
2020-09-11 20:36                                               ` [PATCH v2] " Vasily Gorbik
2020-09-15 17:09                                                 ` Vasily Gorbik
2020-09-15 17:14                                                   ` Jason Gunthorpe
2020-09-15 17:18                                                     ` Mike Rapoport
2020-09-15 17:31                                                   ` John Hubbard
2020-09-10 21:22                               ` [RFC PATCH v2 1/3] " John Hubbard
2020-09-10 22:11                                 ` Jason Gunthorpe
2020-09-10 22:17                                   ` John Hubbard
2020-09-11 12:19                                     ` Alexander Gordeev
2020-09-11 16:45                               ` Linus Torvalds
2020-09-10 13:11                 ` Gerald Schaefer
2020-09-07 18:00 ` [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware Gerald Schaefer
2020-09-08  5:14   ` Christophe Leroy
2020-09-08  7:46     ` Alexander Gordeev
2020-09-08  8:16       ` Christophe Leroy
2020-09-08 14:15         ` Alexander Gordeev
2020-09-09  8:38           ` Christophe Leroy
2020-09-08 14:25     ` Alexander Gordeev
2020-09-08 13:26   ` Jason Gunthorpe
2020-09-08 14:33   ` Dave Hansen
2020-09-07 18:00 ` [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions Gerald Schaefer
2020-09-07 20:15   ` Mike Rapoport
2020-09-08  5:19   ` Christophe Leroy
2020-09-08 15:48     ` Alexander Gordeev
2020-09-08 17:20       ` Christophe Leroy
2020-09-07 20:12 ` [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding Mike Rapoport
2020-09-08  5:22   ` Christophe Leroy
2020-09-08 17:36     ` Gerald Schaefer
2020-09-09 16:12       ` Gerald Schaefer
2020-09-08  4:42 ` Christophe Leroy