* [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-07 18:00 ` Gerald Schaefer
  0 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

This is v2 of an RFC previously discussed here:
https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/

Patch 1 is a fix for a regression in gup_fast on s390, after our conversion
to common gup_fast code. It will introduce special helper functions
pXd_addr_end_folded(), which have to be used in places where pagetable walk
is done w/o lock and with READ_ONCE, so currently only in gup_fast.

Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end()
themselves by adding an extra pXd value parameter. That was suggested by
Jason during v1 discussion, because he is already thinking of some other
places where he might want to switch to the READ_ONCE logic for pagetable
walks. In general, that would be the cleanest / safest solution, but there
is some impact on other architectures and common code, hence the new and
greatly enlarged recipient list.

Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline
functions instead of #defines, so that we get some type checking for the
new pXd value parameter.

Not sure about Fixes/stable tags for the generic solution. Only patch 1
fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might
still be nice to have in stable, to ease future backports, but I guess
"nice to have" does not really qualify for stable backports.

Changes in v2:
- Pick option 2 from v1 discussion (pXd_addr_end_folded helpers)
- Add patch 2 + 3 for more generic approach

Alexander Gordeev (3):
  mm/gup: fix gup_fast with dynamic page table folding
  mm: make pXd_addr_end() functions page-table entry aware
  mm: make generic pXd_addr_end() macros inline functions

 arch/arm/include/asm/pgtable-2level.h    |  2 +-
 arch/arm/mm/idmap.c                      |  6 ++--
 arch/arm/mm/mmu.c                        |  8 ++---
 arch/arm64/kernel/hibernate.c            | 16 +++++----
 arch/arm64/kvm/mmu.c                     | 16 ++++-----
 arch/arm64/mm/kasan_init.c               |  8 ++---
 arch/arm64/mm/mmu.c                      | 25 +++++++-------
 arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++--
 arch/powerpc/mm/hugetlbpage.c            |  6 ++--
 arch/s390/include/asm/pgtable.h          | 42 ++++++++++++++++++++++++
 arch/s390/mm/page-states.c               |  8 ++---
 arch/s390/mm/pageattr.c                  |  8 ++---
 arch/s390/mm/vmem.c                      |  8 ++---
 arch/sparc/mm/hugetlbpage.c              |  6 ++--
 arch/um/kernel/tlb.c                     |  8 ++---
 arch/x86/mm/init_64.c                    | 15 ++++-----
 arch/x86/mm/kasan_init_64.c              | 16 ++++-----
 include/asm-generic/pgtable-nop4d.h      |  2 +-
 include/asm-generic/pgtable-nopmd.h      |  2 +-
 include/asm-generic/pgtable-nopud.h      |  2 +-
 include/linux/pgtable.h                  | 38 ++++++++++++---------
 mm/gup.c                                 |  8 ++---
 mm/ioremap.c                             |  8 ++---
 mm/kasan/init.c                          | 17 +++++-----
 mm/madvise.c                             |  4 +--
 mm/memory.c                              | 40 +++++++++++-----------
 mm/mlock.c                               | 18 +++++++---
 mm/mprotect.c                            |  8 ++---
 mm/pagewalk.c                            |  8 ++---
 mm/swapfile.c                            |  8 ++---
 mm/vmalloc.c                             | 16 ++++-----
 31 files changed, 219 insertions(+), 165 deletions(-)

-- 
2.17.1


* [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-07 18:00 ` Gerald Schaefer
  (?)
  (?)
@ 2020-09-07 18:00   ` Gerald Schaefer
  -1 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

From: Alexander Gordeev <agordeev@linux.ibm.com>

Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
code") introduced a subtle but severe bug on s390 with gup_fast, due to
dynamic page table folding.

The question "What would it require for the generic code to work for s390"
has already been discussed here
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
and ended with a promising approach here
https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
which unfortunately turned out not to work completely.

We tried to mimic static level folding by changing pgd_offset to always
calculate the top-level page table offset, and to do nothing in folded
pXd_offset. What has been overlooked is that PXD_SIZE/MASK, and thus
pXd_addr_end, do not reflect this dynamic behaviour, and still act like
static 5-level page tables.

Here is an example of what happens with gup_fast on s390, for a task with
3-level paging, crossing a 2 GB pud boundary:

// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
{
        unsigned long next;
        pud_t *pudp;

        // pud_offset returns &p4d itself (a pointer to a value on stack)
        pudp = pud_offset(&p4d, addr);
        do {
                // on second iteration, reading "random" stack value
                pud_t pud = READ_ONCE(*pudp);

                // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                next = pud_addr_end(addr, end);
                ...
        } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

        return 1;
}

pud_addr_end = 0x10080000000 is correct, but the previous pgd/p4d_addr_end
should also have returned that limit, instead of the 5-level static
pgd/p4d limits with PUD_SIZE/MASK != PGDIR_SIZE/MASK. Then the "end"
parameter for gup_pud_range would also have been 0x10080000000, and we
would not iterate further in gup_pud_range, but rather go back and
(correctly) do it in gup_pgd_range.
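
The boundary arithmetic above can be reproduced in userspace. This is
only a sketch: addr_end() is a hypothetical helper that mirrors the logic
of the generic pXd_addr_end() macros, and the shift values are the static
s390 ones (2 GB pud, 4 TB p4d, 8 PB pgd), restated here as assumptions
rather than pulled from the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Static s390 shifts, restated as assumptions for this sketch. */
#define PUD_SHIFT	31	/* 2 GB */
#define P4D_SHIFT	42	/* 4 TB */
#define PGDIR_SHIFT	53	/* 8 PB */

/*
 * Hypothetical userspace copy of the generic pXd_addr_end() logic:
 * return the next boundary of size (1 << shift) after addr, clamped
 * to end.
 */
static uint64_t addr_end(uint64_t addr, uint64_t end, unsigned int shift)
{
	uint64_t size, boundary;

	assert(shift < 64);
	size = UINT64_C(1) << shift;
	boundary = (addr + size) & ~(size - 1);

	/* The "- 1" keeps the comparison valid if boundary wraps to 0. */
	return (boundary - 1 < end - 1) ? boundary : end;
}
```

For addr = 0x1007ffff000 and end = 0x10080001000, the pgd and p4d shifts
both hand back end unchanged, so gup_pud_range is entered with a range
that crosses the 2 GB boundary. Only the pud shift yields 0x10080000000.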

So, for the second iteration in gup_pud_range, we will increment pudp,
which pointed to a stack value and not the real pud table. This new pudp
will then point to whatever follows the p4d value on the stack. In general,
this happens to be the previously read pgd, but it could also be
something different, depending on compiler decisions.

Most unfortunately, if it happens to be the pgd value, which is the
same as the p4d / pud due to folding, it is a valid and present entry.
So after the increment, we would still point to the same pud entry.
The addr however has been increased in the second iteration, so that we
now have different pmd/pte_index values, which will result in very wrong
behaviour for the remaining gup_pmd/pte_range calls. We will effectively
operate on an address minus 2 GB, due to the missing pudp increment.
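
The "minus 2 GB" effect can be illustrated with a small model. This is a
sketch under assumptions, not kernel code: effective_addr() is a
hypothetical helper, and PUD_SHIFT = 31 is the static s390 value restated
here. It models a walk that keeps using the pud entry selected for an
earlier address, while taking the lower (pmd/pte/offset) bits from the
current address.

```c
#include <assert.h>
#include <stdint.h>

#define PUD_SHIFT	31	/* 2 GB per pud entry, assumed s390 value */
#define PUD_SIZE	(UINT64_C(1) << PUD_SHIFT)
#define PUD_MASK	(~(PUD_SIZE - 1))

/*
 * Hypothetical helper: pud_addr is the address whose pud entry is
 * (wrongly) still in use, addr is the address the walk believes it is
 * handling. The result is the address that is actually resolved.
 */
static uint64_t effective_addr(uint64_t pud_addr, uint64_t addr)
{
	/* The stale entry belongs to an earlier loop iteration. */
	assert(pud_addr <= addr);

	return (pud_addr & PUD_MASK) | (addr & ~PUD_MASK);
}
```

With the stale entry from 0x1007ffff000 and the second-iteration address
0x10080000000, the model resolves 0x10000000000, which is exactly
PUD_SIZE (2 GB) below the intended address.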

In the "good case", if nothing is mapped there, we will fall back to
the slow gup path. But if something is mapped there, and valid
for gup_fast, we will end up (silently) getting references on the wrong
pages and also add the wrong pages to the **pages result array. This
can cause data corruption.

Fix this by introducing new pXd_addr_end_folded helpers, which take an
additional pXd entry value parameter. On s390, that value is used to
determine the correct page table level and return the corresponding
end / boundary. With that, the pointer iteration will always happen in
gup_pgd_range for s390. No change is introduced for other architectures.
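
As a rough userspace model of this idea (a sketch, not the patch itself):
rste_addr_end_model() is a hypothetical name, and the constants restate
what are assumed to be the s390 values of _REGION_ENTRY_TYPE_MASK (0x0c),
_REGION_ENTRY_TYPE_R3 (0x04) and _SEGMENT_SHIFT (20). The table type
stored in the entry selects the boundary size, with each level adding
11 address bits.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed s390 values, restated for this sketch. */
#define REGION_ENTRY_TYPE_MASK	UINT64_C(0x0c)
#define REGION_ENTRY_TYPE_R3	UINT64_C(0x04)
#define SEGMENT_SHIFT		20

/*
 * Hypothetical model: derive the boundary size from the type bits of a
 * region table entry instead of from a static PXD_SIZE.
 */
static uint64_t rste_addr_end_model(uint64_t rste, uint64_t addr, uint64_t end)
{
	uint64_t type = (rste & REGION_ENTRY_TYPE_MASK) >> 2;
	uint64_t size, boundary;

	/* Only region table entries (type 1..3) are expected here. */
	assert(type >= (REGION_ENTRY_TYPE_R3 >> 2));

	/* type 1: 2^31 (2 GB), type 2: 2^42, type 3: 2^53 */
	size = UINT64_C(1) << (SEGMENT_SHIFT + type * 11);
	boundary = (addr + size) & ~(size - 1);

	return (boundary - 1 < end - 1) ? boundary : end;
}
```

For a 3-level task the top-level entry is of region-third type, so the
model returns 0x10080000000 for the example range already at pgd level,
and the lower levels never see a range that crosses the 2 GB boundary.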

Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Cc: <stable@vger.kernel.org> # 5.2+
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++
 include/linux/pgtable.h         | 16 +++++++++++++
 mm/gup.c                        |  8 +++----
 3 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7eb01a5459cd..027206e4959d 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
 }
 #define mm_pmd_folded(mm) mm_pmd_folded(mm)
 
+/*
+ * With dynamic page table levels on s390, the static pXd_addr_end() functions
+ * will not return corresponding dynamic boundaries. This is no problem as long
+ * as only pXd pointers are passed down during page table walk, because
+ * pXd_offset() will simply return the given pointer for folded levels, and the
+ * pointer iteration over a range simply happens at the correct page table
+ * level.
+ * It is however a problem with gup_fast, or other places walking the page
+ * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
+ * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
+ * a stack variable, which cannot be used for pointer iteration at the correct
+ * level. Instead, the iteration then has to happen by going up to pgd level
+ * again. To allow this, provide pXd_addr_end_folded() functions with an
+ * additional pXd value parameter, which can be used on s390 to determine the
+ * folding level and return the corresponding boundary.
+ */
+static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end)
+{
+	unsigned long type = (rste & _REGION_ENTRY_TYPE_MASK) >> 2;
+	unsigned long size = 1UL << (_SEGMENT_SHIFT + type * 11);
+	unsigned long boundary = (addr + size) & ~(size - 1);
+
+	/*
+	 * FIXME The below check is for internal testing only, to be removed
+	 */
+	VM_BUG_ON(type < (_REGION_ENTRY_TYPE_R3 >> 2));
+
+	return (boundary - 1) < (end - 1) ? boundary : end;
+}
+
+#define pgd_addr_end_folded pgd_addr_end_folded
+static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
+{
+	return rste_addr_end_folded(pgd_val(pgd), addr, end);
+}
+
+#define p4d_addr_end_folded p4d_addr_end_folded
+static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
+{
+	return rste_addr_end_folded(p4d_val(p4d), addr, end);
+}
+
 static inline int mm_has_pgste(struct mm_struct *mm)
 {
 #ifdef CONFIG_PGSTE
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8cbc2e795d5..981c4c2a31fe 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -681,6 +681,22 @@ static inline int arch_unmap_one(struct mm_struct *mm,
 })
 #endif
 
+#ifndef pgd_addr_end_folded
+#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
+#endif
+
+#ifndef p4d_addr_end_folded
+#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
+#endif
+
+#ifndef pud_addr_end_folded
+#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
+#endif
+
+#ifndef pmd_addr_end_folded
+#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
+#endif
+
 /*
  * When walking page tables, we usually want to skip any p?d_none entries;
  * and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/gup.c b/mm/gup.c
index bd883a112724..ba4aace5d0f4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end_folded(pmd, addr, end);
 		if (!pmd_present(pmd))
 			return 0;
 
@@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end_folded(pud, addr, end);
 		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
@@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end_folded(p4d, addr, end);
 		if (p4d_none(p4d))
 			return 0;
 		BUILD_BUG_ON(p4d_huge(p4d));
@@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		pgd_t pgd = READ_ONCE(*pgdp);
 
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end_folded(pgd, addr, end);
 		if (pgd_none(pgd))
 			return;
 		if (unlikely(pgd_huge(pgd))) {
-- 
2.17.1


--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -681,6 +681,22 @@ static inline int arch_unmap_one(struct mm_struct *mm,
 })
 #endif
 
+#ifndef pgd_addr_end_folded
+#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
+#endif
+
+#ifndef p4d_addr_end_folded
+#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
+#endif
+
+#ifndef pud_addr_end_folded
+#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
+#endif
+
+#ifndef pmd_addr_end_folded
+#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
+#endif
+
 /*
  * When walking page tables, we usually want to skip any p?d_none entries;
  * and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/gup.c b/mm/gup.c
index bd883a112724..ba4aace5d0f4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end_folded(pmd, addr, end);
 		if (!pmd_present(pmd))
 			return 0;
 
@@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end_folded(pud, addr, end);
 		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
@@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end_folded(p4d, addr, end);
 		if (p4d_none(p4d))
 			return 0;
 		BUILD_BUG_ON(p4d_huge(p4d));
@@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		pgd_t pgd = READ_ONCE(*pgdp);
 
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end_folded(pgd, addr, end);
 		if (pgd_none(pgd))
 			return;
 		if (unlikely(pgd_huge(pgd))) {
-- 
2.17.1

^ permalink raw reply	[flat|nested] 254+ messages in thread

* [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-07 18:00 ` Gerald Schaefer
  (?)
  (?)
@ 2020-09-07 18:00   ` Gerald Schaefer
  -1 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

From: Alexander Gordeev <agordeev@linux.ibm.com>

Unlike all other page-table abstractions, the pXd_addr_end() functions
do not take into account the particular table entry in whose context
they are called. On architectures with dynamic page-table folding, this
can leave them without necessary information that is difficult to
obtain other than from the table entry itself. That already led to
a subtle memory corruption issue on s390.

By letting the pXd_addr_end() functions know about the page-table
entry, we allow archs not only to make extra checks, but also to apply
optimizations.

As a result of this change, the pXd_addr_end_folded() functions used
in the gup_fast traversal code become unnecessary and are replaced with
the universal pXd_addr_end() variants.
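
As a rough illustration of the new calling convention, consider the
user-space sketch below. It is not taken from the patch: the pmd_t
stand-in, the SKETCH_PMD_* constants (a 2 MB pmd, as on x86-64) and both
function names are assumptions made for this example. The walker reads
the entry first and passes it to the boundary helper, which is free to
ignore it on architectures without dynamic folding:

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel type and constants. */
typedef struct { unsigned long pmd; } pmd_t;
#define SKETCH_PMD_SIZE	(1UL << 21)
#define SKETCH_PMD_MASK	(~(SKETCH_PMD_SIZE - 1))

/* Entry-aware signature; this generic variant does not use the entry. */
static inline unsigned long sketch_pmd_addr_end(pmd_t pmd, unsigned long addr,
						unsigned long end)
{
	unsigned long boundary = (addr + SKETCH_PMD_SIZE) & SKETCH_PMD_MASK;

	(void)pmd;
	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Walker pattern: read the entry, then ask for the next boundary. */
static unsigned long sketch_count_pmd_steps(const pmd_t *pmdp,
					    unsigned long addr,
					    unsigned long end)
{
	unsigned long next, steps = 0;

	assert(addr != end);	/* the do/while loop needs a non-empty range */
	do {
		pmd_t pmd = *pmdp;	/* lockless kernel walkers use READ_ONCE() */

		next = sketch_pmd_addr_end(pmd, addr, end);
		steps++;
	} while (pmdp++, addr = next, addr != end);

	return steps;
}
```

An architecture with dynamic folding could override the helper and
derive the step size from the entry instead of a compile-time constant.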

The arch-specific updates not only add dereferencing of page-table
entry pointers, but also small changes to the code flow to make those
dereferences possible, at least for x86 and powerpc. The same applies
to arm64, but in a way that should not have any impact.

So, even though the dereferenced page-table entries are not used on
archs other than s390, and are optimized out by the compiler, there
is a small change in kernel size and this is what bloat-o-meter reports:

x86:
add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10)
Function                                     old     new   delta
vmemmap_populate                             587     592      +5
munlock_vma_pages_range                      556     561      +5
Total: Before=15534694, After=15534704, chg +0.00%

powerpc:
add/remove: 0/0 grow/shrink: 1/0 up/down: 4/0 (4)
Function                                     old     new   delta
.remove_pagetable                           1648    1652      +4
Total: Before=21478240, After=21478244, chg +0.00%

arm64:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=20240851, After=20240851, chg +0.00%

sparc:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=4907262, After=4907262, chg +0.00%

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/arm/include/asm/pgtable-2level.h    |  2 +-
 arch/arm/mm/idmap.c                      |  6 ++--
 arch/arm/mm/mmu.c                        |  8 ++---
 arch/arm64/kernel/hibernate.c            | 16 ++++++----
 arch/arm64/kvm/mmu.c                     | 16 +++++-----
 arch/arm64/mm/kasan_init.c               |  8 ++---
 arch/arm64/mm/mmu.c                      | 25 +++++++--------
 arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++---
 arch/powerpc/mm/hugetlbpage.c            |  6 ++--
 arch/s390/include/asm/pgtable.h          |  8 ++---
 arch/s390/mm/page-states.c               |  8 ++---
 arch/s390/mm/pageattr.c                  |  8 ++---
 arch/s390/mm/vmem.c                      |  8 ++---
 arch/sparc/mm/hugetlbpage.c              |  6 ++--
 arch/um/kernel/tlb.c                     |  8 ++---
 arch/x86/mm/init_64.c                    | 15 ++++-----
 arch/x86/mm/kasan_init_64.c              | 16 +++++-----
 include/asm-generic/pgtable-nop4d.h      |  2 +-
 include/asm-generic/pgtable-nopmd.h      |  2 +-
 include/asm-generic/pgtable-nopud.h      |  2 +-
 include/linux/pgtable.h                  | 26 ++++-----------
 mm/gup.c                                 |  8 ++---
 mm/ioremap.c                             |  8 ++---
 mm/kasan/init.c                          | 17 +++++-----
 mm/madvise.c                             |  4 +--
 mm/memory.c                              | 40 ++++++++++++------------
 mm/mlock.c                               | 18 ++++++++---
 mm/mprotect.c                            |  8 ++---
 mm/pagewalk.c                            |  8 ++---
 mm/swapfile.c                            |  8 ++---
 mm/vmalloc.c                             | 16 +++++-----
 31 files changed, 165 insertions(+), 173 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-2level.h b/arch/arm/include/asm/pgtable-2level.h
index 3502c2f746ca..5e6416b339f4 100644
--- a/arch/arm/include/asm/pgtable-2level.h
+++ b/arch/arm/include/asm/pgtable-2level.h
@@ -209,7 +209,7 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
 	} while (0)
 
 /* we don't need complex calculations here as the pmd is folded into the pgd */
-#define pmd_addr_end(addr,end) (end)
+#define pmd_addr_end(pmd,addr,end) (end)
 
 #define set_pte_ext(ptep,pte,ext) cpu_set_pte_ext(ptep,pte,ext)
 
diff --git a/arch/arm/mm/idmap.c b/arch/arm/mm/idmap.c
index 448e57c6f653..5437f943ca8b 100644
--- a/arch/arm/mm/idmap.c
+++ b/arch/arm/mm/idmap.c
@@ -46,7 +46,7 @@ static void idmap_add_pmd(pud_t *pud, unsigned long addr, unsigned long end,
 		pmd = pmd_offset(pud, addr);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		*pmd = __pmd((addr & PMD_MASK) | prot);
 		flush_pmd_entry(pmd);
 	} while (pmd++, addr = next, addr != end);
@@ -73,7 +73,7 @@ static void idmap_add_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		idmap_add_pmd(pud, addr, next, prot);
 	} while (pud++, addr = next, addr != end);
 }
@@ -95,7 +95,7 @@ static void identity_mapping_add(pgd_t *pgd, const char *text_start,
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		idmap_add_pud(pgd, addr, next, prot);
 	} while (pgd++, addr = next, addr != end);
 }
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 698cc740c6b8..4013746e4c75 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -777,7 +777,7 @@ static void __init alloc_init_pmd(pud_t *pud, unsigned long addr,
 		 * With LPAE, we must loop over to map
 		 * all the pmds for the given range.
 		 */
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Try a section mapping - addr, next and phys must all be
@@ -805,7 +805,7 @@ static void __init alloc_init_pud(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		alloc_init_pmd(pud, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (pud++, addr = next, addr != end);
@@ -820,7 +820,7 @@ static void __init alloc_init_p4d(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		alloc_init_pud(p4d, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (p4d++, addr = next, addr != end);
@@ -923,7 +923,7 @@ static void __init __create_mapping(struct mm_struct *mm, struct map_desc *md,
 	pgd = pgd_offset(mm, addr);
 	end = addr + length;
 	do {
-		unsigned long next = pgd_addr_end(addr, end);
+		unsigned long next = pgd_addr_end(*pgd, addr, end);
 
 		alloc_init_p4d(pgd, addr, next, phys, type, alloc, ng);
 
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 68e14152d6e9..7be8c9cdc5c8 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -412,7 +412,7 @@ static int copy_pmd(pud_t *dst_pudp, pud_t *src_pudp, unsigned long start,
 	do {
 		pmd_t pmd = READ_ONCE(*src_pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 		if (pmd_table(pmd)) {
@@ -447,7 +447,7 @@ static int copy_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp, unsigned long start,
 	do {
 		pud_t pud = READ_ONCE(*src_pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 		if (pud_table(pud)) {
@@ -473,8 +473,10 @@ static int copy_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp, unsigned long start,
 	dst_p4dp = p4d_offset(dst_pgdp, start);
 	src_p4dp = p4d_offset(src_pgdp, start);
 	do {
-		next = p4d_addr_end(addr, end);
-		if (p4d_none(READ_ONCE(*src_p4dp)))
+		p4d_t p4d = READ_ONCE(*src_p4dp);
+
+		next = p4d_addr_end(p4d, addr, end);
+		if (p4d_none(p4d))
 			continue;
 		if (copy_pud(dst_p4dp, src_p4dp, addr, next))
 			return -ENOMEM;
@@ -492,8 +494,10 @@ static int copy_page_tables(pgd_t *dst_pgdp, unsigned long start,
 
 	dst_pgdp = pgd_offset_pgd(dst_pgdp, start);
 	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none(READ_ONCE(*src_pgdp)))
+		pgd_t pgd = READ_ONCE(*src_pgdp);
+
+		next = pgd_addr_end(pgd, addr, end);
+		if (pgd_none(pgd))
 			continue;
 		if (copy_p4d(dst_pgdp, src_pgdp, addr, next))
 			return -ENOMEM;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ba00bcc0c884..8f470f93a8e9 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -547,7 +547,7 @@ static void unmap_hyp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
 
 	start_pmd = pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		/* Hyp doesn't use huge pmds */
 		if (!pmd_none(*pmd))
 			unmap_hyp_ptes(pmd, addr, next);
@@ -564,7 +564,7 @@ static void unmap_hyp_puds(p4d_t *p4d, phys_addr_t addr, phys_addr_t end)
 
 	start_pud = pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		/* Hyp doesn't use huge puds */
 		if (!pud_none(*pud))
 			unmap_hyp_pmds(pud, addr, next);
@@ -581,7 +581,7 @@ static void unmap_hyp_p4ds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
 
 	start_p4d = p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		/* Hyp doesn't use huge p4ds */
 		if (!p4d_none(*p4d))
 			unmap_hyp_puds(p4d, addr, next);
@@ -609,7 +609,7 @@ static void __unmap_hyp_range(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 	 */
 	pgd = pgdp + kvm_pgd_index(addr, ptrs_per_pgd);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_none(*pgd))
 			unmap_hyp_p4ds(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
@@ -712,7 +712,7 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
 			get_page(virt_to_page(pmd));
 		}
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		create_hyp_pte_mappings(pmd, addr, next, pfn, prot);
 		pfn += (next - addr) >> PAGE_SHIFT;
@@ -744,7 +744,7 @@ static int create_hyp_pud_mappings(p4d_t *p4d, unsigned long start,
 			get_page(virt_to_page(pud));
 		}
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		ret = create_hyp_pmd_mappings(pud, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -777,7 +777,7 @@ static int create_hyp_p4d_mappings(pgd_t *pgd, unsigned long start,
 			get_page(virt_to_page(p4d));
 		}
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		ret = create_hyp_pud_mappings(p4d, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -813,7 +813,7 @@ static int __create_hyp_mappings(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 			get_page(virt_to_page(pgd));
 		}
 
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = create_hyp_p4d_mappings(pgd, addr, next, pfn, prot);
 		if (err)
 			goto out;
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index b24e43d20667..8d1c811fd59e 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -120,7 +120,7 @@ static void __init kasan_pmd_populate(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp = kasan_pmd_offset(pudp, addr, node, early);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		kasan_pte_populate(pmdp, addr, next, node, early);
 	} while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)));
 }
@@ -132,7 +132,7 @@ static void __init kasan_pud_populate(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp = kasan_pud_offset(p4dp, addr, node, early);
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		kasan_pmd_populate(pudp, addr, next, node, early);
 	} while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)));
 }
@@ -144,7 +144,7 @@ static void __init kasan_p4d_populate(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp = p4d_offset(pgdp, addr);
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		kasan_pud_populate(p4dp, addr, next, node, early);
 	} while (p4dp++, addr = next, addr != end);
 }
@@ -157,7 +157,7 @@ static void __init kasan_pgd_populate(unsigned long addr, unsigned long end,
 
 	pgdp = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		kasan_p4d_populate(pgdp, addr, next, node, early);
 	} while (pgdp++, addr = next, addr != end);
 }
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 64211436629d..d679cf024bc8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -209,7 +209,7 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		pmd_t old_pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(old_pmd, addr, end);
 
 		/* try section mapping first */
 		if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
@@ -307,7 +307,7 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
 	do {
 		pud_t old_pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(old_pud, addr, end);
 
 		/*
 		 * For 4K granule only, attempt to put down a 1GB block
@@ -356,7 +356,7 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 	end = PAGE_ALIGN(virt + size);
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		alloc_init_pud(pgdp, addr, next, phys, prot, pgtable_alloc,
 			       flags);
 		phys += next - addr;
@@ -820,9 +820,9 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp, pmd;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -853,9 +853,9 @@ static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp, pud;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -886,9 +886,9 @@ static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -912,9 +912,9 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end,
 	WARN_ON(!free_mapped && altmap);
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -968,9 +968,9 @@ static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -1008,9 +1008,9 @@ static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -1048,9 +1048,9 @@ static void free_empty_p4d_table(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -1066,9 +1066,9 @@ static void free_empty_tables(unsigned long addr, unsigned long end,
 	pgd_t *pgdp, pgd;
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -1097,8 +1097,6 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	pmd_t *pmdp;
 
 	do {
-		next = pmd_addr_end(addr, end);
-
 		pgdp = vmemmap_pgd_populate(addr, node);
 		if (!pgdp)
 			return -ENOMEM;
@@ -1112,6 +1110,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 			return -ENOMEM;
 
 		pmdp = pmd_offset(pudp, addr);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_none(READ_ONCE(*pmdp))) {
 			void *p = NULL;
 
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index cc72666e891a..816e218df285 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -817,7 +817,7 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -847,7 +847,7 @@ static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -878,10 +878,9 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
 	spin_lock(&init_mm.page_table_lock);
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!p4d_present(*p4d))
 			continue;
 
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26292544630f..f0606d6774a4 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -352,7 +352,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 		unsigned long more;
 
 		pmd = pmd_offset(pud, addr);
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!is_hugepd(__hugepd(pmd_val(*pmd)))) {
 			if (pmd_none_or_clear_bad(pmd))
 				continue;
@@ -409,7 +409,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	do {
 		pud = pud_offset(p4d, addr);
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!is_hugepd(__hugepd(pud_val(*pud)))) {
 			if (pud_none_or_clear_bad(pud))
 				continue;
@@ -478,9 +478,9 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	 */
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset(tlb->mm, addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!is_hugepd(__hugepd(pgd_val(*pgd)))) {
 			if (p4d_none_or_clear_bad(p4d))
 				continue;
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 027206e4959d..6fb17ac413be 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -542,14 +542,14 @@ static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned lo
 	return (boundary - 1) < (end - 1) ? boundary : end;
 }
 
-#define pgd_addr_end_folded pgd_addr_end_folded
-static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
+#define pgd_addr_end pgd_addr_end
+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(pgd_val(pgd), addr, end);
 }
 
-#define p4d_addr_end_folded p4d_addr_end_folded
-static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
+#define p4d_addr_end p4d_addr_end
+static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(p4d_val(p4d), addr, end);
 }
diff --git a/arch/s390/mm/page-states.c b/arch/s390/mm/page-states.c
index 567c69f3069e..4aba634b4b26 100644
--- a/arch/s390/mm/page-states.c
+++ b/arch/s390/mm/page-states.c
@@ -109,7 +109,7 @@ static void mark_kernel_pmd(pud_t *pud, unsigned long addr, unsigned long end)
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || pmd_large(*pmd))
 			continue;
 		page = virt_to_page(pmd_val(*pmd));
@@ -126,7 +126,7 @@ static void mark_kernel_pud(p4d_t *p4d, unsigned long addr, unsigned long end)
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || pud_large(*pud))
 			continue;
 		if (!pud_folded(*pud)) {
@@ -147,7 +147,7 @@ static void mark_kernel_p4d(pgd_t *pgd, unsigned long addr, unsigned long end)
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none(*p4d))
 			continue;
 		if (!p4d_folded(*p4d)) {
@@ -169,7 +169,7 @@ static void mark_kernel_pgd(void)
 	addr = 0;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, MODULES_END);
+		next = pgd_addr_end(*pgd, addr, MODULES_END);
 		if (pgd_none(*pgd))
 			continue;
 		if (!pgd_folded(*pgd)) {
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index c5c52ec2b46f..b827d758a17a 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -162,7 +162,7 @@ static int walk_pmd_level(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		if (pmd_none(*pmdp))
 			return -EINVAL;
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_large(*pmdp)) {
 			if (addr & ~PMD_MASK || addr + PMD_SIZE > next) {
 				rc = split_pmd_page(pmdp, addr);
@@ -239,7 +239,7 @@ static int walk_pud_level(p4d_t *p4d, unsigned long addr, unsigned long end,
 	do {
 		if (pud_none(*pudp))
 			return -EINVAL;
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		if (pud_large(*pudp)) {
 			if (addr & ~PUD_MASK || addr + PUD_SIZE > next) {
 				rc = split_pud_page(pudp, addr);
@@ -269,7 +269,7 @@ static int walk_p4d_level(pgd_t *pgd, unsigned long addr, unsigned long end,
 	do {
 		if (p4d_none(*p4dp))
 			return -EINVAL;
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		rc = walk_pud_level(p4dp, addr, next, flags);
 		p4dp++;
 		addr = next;
@@ -296,7 +296,7 @@ static int change_page_attr(unsigned long addr, unsigned long end,
 	do {
 		if (pgd_none(*pgdp))
 			break;
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		rc = walk_p4d_level(pgdp, addr, next, flags);
 		if (rc)
 			break;
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index b239f2ba93b0..672bc89f13e7 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -219,7 +219,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!add) {
 			if (pmd_none(*pmd))
 				continue;
@@ -320,7 +320,7 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 		prot &= ~_REGION_ENTRY_NOEXEC;
 	pud = pud_offset(p4d, addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!add) {
 			if (pud_none(*pud))
 				continue;
@@ -394,7 +394,7 @@ static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!add) {
 			if (p4d_none(*p4d))
 				continue;
@@ -449,8 +449,8 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
 	if (WARN_ON_ONCE(!PAGE_ALIGNED(start | end)))
 		return -EINVAL;
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (!add) {
 			if (pgd_none(*pgd))
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ec423b5f17dd..341c2ff8d31a 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -428,7 +428,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd))
 			continue;
 		if (is_hugetlb_pmd(*pmd))
@@ -465,7 +465,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		if (is_hugetlb_pud(*pud))
@@ -519,7 +519,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	pgd = pgd_offset(tlb->mm, addr);
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		hugetlb_free_pud_range(tlb, p4d, addr, next, floor, ceiling);
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index 61776790cd67..7b4fe31c8df2 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -264,7 +264,7 @@ static inline int update_pmd_range(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_present(*pmd)) {
 			if (hvc->force || pmd_newpage(*pmd)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -286,7 +286,7 @@ static inline int update_pud_range(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_present(*pud)) {
 			if (hvc->force || pud_newpage(*pud)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -308,7 +308,7 @@ static inline int update_p4d_range(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!p4d_present(*p4d)) {
 			if (hvc->force || p4d_newpage(*p4d)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -331,7 +331,7 @@ void fix_range_common(struct mm_struct *mm, unsigned long start_addr,
 	hvc = INIT_HVC(mm, force, userspace);
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end_addr);
+		next = pgd_addr_end(*pgd, addr, end_addr);
 		if (!pgd_present(*pgd)) {
 			if (force || pgd_newpage(*pgd)) {
 				ret = add_munmap(addr, next - addr, &hvc);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a4ac13cc3fdc..e2cb9316a104 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1043,7 +1043,7 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -1099,7 +1099,7 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -1153,7 +1153,7 @@ remove_p4d_table(p4d_t *p4d_start, unsigned long addr, unsigned long end,
 
 	p4d = p4d_start + p4d_index(addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -1186,9 +1186,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct,
 	p4d_t *p4d;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
@@ -1500,8 +1499,6 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 	pmd_t *pmd;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pmd_addr_end(addr, end);
-
 		pgd = vmemmap_pgd_populate(addr, node);
 		if (!pgd)
 			return -ENOMEM;
@@ -1515,6 +1512,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 			return -ENOMEM;
 
 		pmd = pmd_offset(pud, addr);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd)) {
 			void *p;
 
@@ -1623,9 +1621,8 @@ void register_page_bootmem_memmap(unsigned long section_nr,
 			get_page_bootmem(section_nr, pte_page(*pte),
 					 SECTION_INFO);
 		} else {
-			next = pmd_addr_end(addr, end);
-
 			pmd = pmd_offset(pud, addr);
+			next = pmd_addr_end(*pmd, addr, end);
 			if (pmd_none(*pmd))
 				continue;
 
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 1a50434c8a4d..2c105b5154ba 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -96,7 +96,7 @@ static void __init kasan_populate_pud(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_large(*pmd))
 			kasan_populate_pmd(pmd, addr, next, nid);
 	} while (pmd++, addr = next, addr != end);
@@ -116,7 +116,7 @@ static void __init kasan_populate_p4d(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_large(*pud))
 			kasan_populate_pud(pud, addr, next, nid);
 	} while (pud++, addr = next, addr != end);
@@ -136,7 +136,7 @@ static void __init kasan_populate_pgd(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		kasan_populate_p4d(p4d, addr, next, nid);
 	} while (p4d++, addr = next, addr != end);
 }
@@ -151,7 +151,7 @@ static void __init kasan_populate_shadow(unsigned long addr, unsigned long end,
 	end = round_up(end, PAGE_SIZE);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_populate_pgd(pgd, addr, next, nid);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -219,7 +219,7 @@ static void __init kasan_early_p4d_populate(pgd_t *pgd,
 
 	p4d = early_p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_none(*p4d))
 			continue;
@@ -239,7 +239,7 @@ static void __init kasan_map_early_shadow(pgd_t *pgd)
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_early_p4d_populate(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -254,7 +254,7 @@ static void __init kasan_shallow_populate_p4ds(pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (p4d_none(*p4d)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
@@ -272,7 +272,7 @@ static void __init kasan_shallow_populate_pgds(void *start, void *end)
 	addr = (unsigned long)start;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, (unsigned long)end);
+		next = pgd_addr_end(*pgd, addr, (unsigned long)end);
 
 		if (pgd_none(*pgd)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
diff --git a/include/asm-generic/pgtable-nop4d.h b/include/asm-generic/pgtable-nop4d.h
index ce2cbb3c380f..156b42e51424 100644
--- a/include/asm-generic/pgtable-nop4d.h
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -53,7 +53,7 @@ static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
 #define p4d_free_tlb(tlb, x, a)			do { } while (0)
 
 #undef  p4d_addr_end
-#define p4d_addr_end(addr, end)			(end)
+#define p4d_addr_end(p4d, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOP4D_H */
diff --git a/include/asm-generic/pgtable-nopmd.h b/include/asm-generic/pgtable-nopmd.h
index 3e13acd019ae..e988384de1c7 100644
--- a/include/asm-generic/pgtable-nopmd.h
+++ b/include/asm-generic/pgtable-nopmd.h
@@ -64,7 +64,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 #define pmd_free_tlb(tlb, x, a)		do { } while (0)
 
 #undef  pmd_addr_end
-#define pmd_addr_end(addr, end)			(end)
+#define pmd_addr_end(pmd, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index a9d751fbda9e..57a28bade9f9 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -60,7 +60,7 @@ static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
 #define pud_free_tlb(tlb, x, a)		        do { } while (0)
 
 #undef  pud_addr_end
-#define pud_addr_end(addr, end)			(end)
+#define pud_addr_end(pud, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOPUD_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 981c4c2a31fe..67ebc22cf83d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -655,48 +655,34 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
  */
 
-#define pgd_addr_end(addr, end)						\
+#ifndef pgd_addr_end
+#define pgd_addr_end(pgd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
+#endif
 
 #ifndef p4d_addr_end
-#define p4d_addr_end(addr, end)						\
+#define p4d_addr_end(p4d, addr, end)					\
 ({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pud_addr_end
-#define pud_addr_end(addr, end)						\
+#define pud_addr_end(pud, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pmd_addr_end
-#define pmd_addr_end(addr, end)						\
+#define pmd_addr_end(pmd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
-#ifndef pgd_addr_end_folded
-#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
-#endif
-
-#ifndef p4d_addr_end_folded
-#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
-#endif
-
-#ifndef pud_addr_end_folded
-#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
-#endif
-
-#ifndef pmd_addr_end_folded
-#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
-#endif
-
 /*
  * When walking page tables, we usually want to skip any p?d_none entries;
  * and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/gup.c b/mm/gup.c
index ba4aace5d0f4..7826876ae7e0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end_folded(pmd, addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (!pmd_present(pmd))
 			return 0;
 
@@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end_folded(pud, addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
@@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
-		next = p4d_addr_end_folded(p4d, addr, end);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			return 0;
 		BUILD_BUG_ON(p4d_huge(p4d));
@@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		pgd_t pgd = READ_ONCE(*pgdp);
 
-		next = pgd_addr_end_folded(pgd, addr, end);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			return;
 		if (unlikely(pgd_huge(pgd))) {
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..400fa119c09d 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -114,7 +114,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PMD_MODIFIED;
@@ -160,7 +160,7 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PUD_MODIFIED;
@@ -206,7 +206,7 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_P4D_MODIFIED;
@@ -234,7 +234,7 @@ int ioremap_page_range(unsigned long addr,
 	start = addr;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot,
 					&mask);
 		if (err)
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index fe6be0be1f76..829627a92763 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -117,7 +117,7 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (IS_ALIGNED(addr, PMD_SIZE) && end - addr >= PMD_SIZE) {
 			pmd_populate_kernel(&init_mm, pmd,
@@ -150,7 +150,7 @@ static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (IS_ALIGNED(addr, PUD_SIZE) && end - addr >= PUD_SIZE) {
 			pmd_t *pmd;
 
@@ -187,7 +187,7 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (IS_ALIGNED(addr, P4D_SIZE) && end - addr >= P4D_SIZE) {
 			pud_t *pud;
 			pmd_t *pmd;
@@ -236,7 +236,7 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
 	unsigned long next;
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (IS_ALIGNED(addr, PGDIR_SIZE) && end - addr >= PGDIR_SIZE) {
 			p4d_t *p4d;
@@ -370,7 +370,7 @@ static void kasan_remove_pmd_table(pmd_t *pmd, unsigned long addr,
 	for (; addr < end; addr = next, pmd++) {
 		pte_t *pte;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -395,7 +395,7 @@ static void kasan_remove_pud_table(pud_t *pud, unsigned long addr,
 	for (; addr < end; addr = next, pud++) {
 		pmd_t *pmd, *pmd_base;
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -421,7 +421,7 @@ static void kasan_remove_p4d_table(p4d_t *p4d, unsigned long addr,
 	for (; addr < end; addr = next, p4d++) {
 		pud_t *pud;
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -454,9 +454,8 @@ void kasan_remove_zero_shadow(void *start, unsigned long size)
 	for (; addr < end; addr = next) {
 		p4d_t *p4d;
 
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
diff --git a/mm/madvise.c b/mm/madvise.c
index e32e7efbba0f..acfb3441d97e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -326,7 +326,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (pmd_trans_huge(*pmd)) {
 		pmd_t orig_pmd;
-		unsigned long next = pmd_addr_end(addr, end);
+		unsigned long next = pmd_addr_end(*pmd, addr, end);
 
 		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 		ptl = pmd_trans_huge_lock(pmd, vma);
@@ -587,7 +587,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	int nr_swap = 0;
 	unsigned long next;
 
-	next = pmd_addr_end(addr, end);
+	next = pmd_addr_end(*pmd, addr, end);
 	if (pmd_trans_huge(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
diff --git a/mm/memory.c b/mm/memory.c
index fb5463153351..f95424946b0d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -233,7 +233,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		free_pte_range(tlb, pmd, addr);
@@ -267,7 +267,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		free_pmd_range(tlb, pud, addr, next, floor, ceiling);
@@ -301,7 +301,7 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
 	start = addr;
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		free_pud_range(tlb, p4d, addr, next, floor, ceiling);
@@ -381,7 +381,7 @@ void free_pgd_range(struct mmu_gather *tlb,
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	pgd = pgd_offset(tlb->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		free_p4d_range(tlb, pgd, addr, next, floor, ceiling);
@@ -887,7 +887,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*src_pmd, addr, end);
 		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
 			|| pmd_devmap(*src_pmd)) {
 			int err;
@@ -921,7 +921,7 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pud = pud_offset(src_p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*src_pud, addr, end);
 		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
 			int err;
 
@@ -955,7 +955,7 @@ static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_p4d = p4d_offset(src_pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*src_p4d, addr, end);
 		if (p4d_none_or_clear_bad(src_p4d))
 			continue;
 		if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
@@ -1017,7 +1017,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*src_pgd, addr, end);
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
@@ -1177,7 +1177,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
@@ -1212,7 +1212,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
 			if (next - addr != HPAGE_PUD_SIZE) {
 				mmap_assert_locked(tlb->mm);
@@ -1241,7 +1241,7 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		next = zap_pud_range(tlb, vma, p4d, addr, next, details);
@@ -1262,7 +1262,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
@@ -2030,7 +2030,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 		return -ENOMEM;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		err = remap_pte_range(mm, pmd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2052,7 +2052,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		err = remap_pmd_range(mm, pud, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2074,7 +2074,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		err = remap_pud_range(mm, p4d, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2143,7 +2143,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = remap_p4d_range(mm, pgd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2266,7 +2266,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 		pmd = pmd_offset(pud, addr);
 	}
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (create || !pmd_none_or_clear_bad(pmd)) {
 			err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
 						 create, mask);
@@ -2294,7 +2294,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 		pud = pud_offset(p4d, addr);
 	}
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (create || !pud_none_or_clear_bad(pud)) {
 			err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
 						 create, mask);
@@ -2322,7 +2322,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 		p4d = p4d_offset(pgd, addr);
 	}
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (create || !p4d_none_or_clear_bad(p4d)) {
 			err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
 						 create, mask);
@@ -2348,7 +2348,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!create && pgd_none_or_clear_bad(pgd))
 			continue;
 		err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create, &mask);
diff --git a/mm/mlock.c b/mm/mlock.c
index 93ca2bf30b4f..5898e8fe2288 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -374,8 +374,12 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 			struct vm_area_struct *vma, struct zone *zone,
 			unsigned long start, unsigned long end)
 {
-	pte_t *pte;
 	spinlock_t *ptl;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
 
 	/*
 	 * Initialize pte walk starting at the already pinned page where we
@@ -384,10 +388,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 	 */
 	pte = get_locked_pte(vma->vm_mm, start,	&ptl);
 	/* Make sure we do not cross the page table boundary */
-	end = pgd_addr_end(start, end);
-	end = p4d_addr_end(start, end);
-	end = pud_addr_end(start, end);
-	end = pmd_addr_end(start, end);
+	pgd = pgd_offset(vma->vm_mm, start);
+	end = pgd_addr_end(*pgd, start, end);
+	p4d = p4d_offset(pgd, start);
+	end = p4d_addr_end(*p4d, start, end);
+	pud = pud_offset(p4d, start);
+	end = pud_addr_end(*pud, start, end);
+	pmd = pmd_offset(pud, start);
+	end = pmd_addr_end(*pmd, start, end);
 
 	/* The page next to the pinned page is the first we will try to get */
 	start += PAGE_SIZE;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..278f2dbd1f20 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -225,7 +225,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	do {
 		unsigned long this_pages;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
@@ -291,7 +291,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
@@ -311,7 +311,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
@@ -336,7 +336,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	flush_cache_range(vma, addr, end);
 	inc_tlb_flush_pending(mm);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..a5b9f61b5d45 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -70,7 +70,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	pmd = pmd_offset(pud, addr);
 	do {
 again:
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -128,7 +128,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 	pud = pud_offset(p4d, addr);
 	do {
  again:
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -176,7 +176,7 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -211,7 +211,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
 	else
 		pgd = pgd_offset(walk->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, 0, walk);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 20012c0c0252..b1dd815aee6b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2018,7 +2018,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	pmd = pmd_offset(pud, addr);
 	do {
 		cond_resched();
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type,
@@ -2040,7 +2040,7 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		ret = unuse_pmd_range(vma, pud, addr, next, type,
@@ -2062,7 +2062,7 @@ static inline int unuse_p4d_range(struct vm_area_struct *vma, pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		ret = unuse_pud_range(vma, p4d, addr, next, type,
@@ -2085,7 +2085,7 @@ static int unuse_vma(struct vm_area_struct *vma, unsigned int type,
 
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		ret = unuse_p4d_range(vma, pgd, addr, next, type,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index be4724b916b3..09ff0d5ecbc1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -93,7 +93,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		cleared = pmd_clear_huge(pmd);
 		if (cleared || pmd_bad(*pmd))
@@ -118,7 +118,7 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		cleared = pud_clear_huge(pud);
 		if (cleared || pud_bad(*pud))
@@ -141,7 +141,7 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		cleared = p4d_clear_huge(p4d);
 		if (cleared || p4d_bad(*p4d))
@@ -179,7 +179,7 @@ void unmap_kernel_range_noflush(unsigned long start, unsigned long size)
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		if (pgd_none_or_clear_bad(pgd))
@@ -230,7 +230,7 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
@@ -248,7 +248,7 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
@@ -266,7 +266,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
@@ -305,7 +305,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 254+ messages in thread

* [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
@ 2020-09-07 18:00   ` Gerald Schaefer
  0 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

From: Alexander Gordeev <agordeev@linux.ibm.com>

Unlike all other page-table abstractions, the pXd_addr_end() functions
do not take into account the particular table entry in whose context
they are called. On architectures with dynamic page-table folding, that
can leave them without necessary information that is difficult to
obtain other than from the table entry itself. That already led to
a subtle memory corruption issue on s390.

By letting the pXd_addr_end() functions know about the page-table entry,
we allow archs not only to make extra checks, but also to optimize.
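
As an illustration only, here is a minimal user-space sketch of an
entry-aware helper. It is not the kernel or s390 implementation: the toy_
names, the 1 MB table span and the TOY_FOLDED bit are assumptions made up
for this example. Only the boundary arithmetic follows the usual
pXd_addr_end() pattern.

```c
/* Hypothetical stand-ins for a page-table level; not kernel definitions. */
#define TOY_P4D_SHIFT	20
#define TOY_P4D_SIZE	(1UL << TOY_P4D_SHIFT)
#define TOY_P4D_MASK	(~(TOY_P4D_SIZE - 1))

/* Assumed encoding: bit 0 set means this level is folded for this mm. */
#define TOY_FOLDED	0x1UL

typedef struct { unsigned long val; } toy_p4d_t;

/*
 * Entry-aware variant: the entry value is passed in, so the helper can
 * decide from the entry itself whether a table boundary applies.
 */
static unsigned long toy_p4d_addr_end(toy_p4d_t p4d, unsigned long addr,
				      unsigned long end)
{
	unsigned long boundary;

	/* Folded level: there is no boundary to stop at, walk up to end. */
	if (p4d.val & TOY_FOLDED)
		return end;

	/* Regular level: stop at the next table boundary or at end. */
	boundary = (addr + TOY_P4D_SIZE) & TOY_P4D_MASK;
	return (boundary - 1 < end - 1) ? boundary : end;
}
```

With a regular entry the helper clamps the range to the next 1 MB
boundary; with the hypothetical folded bit set it returns end unchanged,
which is the kind of decision the old two-argument form could not make.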

As a result of this change, the pXd_addr_end_folded() functions used
in the gup_fast traversal code become unnecessary and get replaced with
the universal pXd_addr_end() variants.

The arch-specific updates not only add dereferencing of page-table
entry pointers, but also make small changes to the code flow to make
those dereferences possible, at least for x86 and powerpc. The same
applies to arm64, but in a way that should not have any impact.
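
The kind of code flow change meant here can be sketched in user space as
follows. This is a toy model under stated assumptions, not kernel code:
the toy_ names, the 16-entry table and the "bit 0 means present" encoding
are hypothetical. The point is only the ordering inside the loop: the
entry pointer is looked up first, so that its value can be handed to the
addr_end helper.

```c
/* Hypothetical top-level table geometry; not kernel definitions. */
#define TOY_PGDIR_SHIFT		12
#define TOY_PGDIR_SIZE		(1UL << TOY_PGDIR_SHIFT)
#define TOY_PGDIR_MASK		(~(TOY_PGDIR_SIZE - 1))
#define TOY_PTRS_PER_PGD	16

typedef struct { unsigned long val; } toy_pgd_t;

static toy_pgd_t toy_pgd_table[TOY_PTRS_PER_PGD];

/* Return the table slot covering addr. */
static toy_pgd_t *toy_pgd_offset(unsigned long addr)
{
	return &toy_pgd_table[(addr >> TOY_PGDIR_SHIFT) % TOY_PTRS_PER_PGD];
}

/* Three-argument form; this toy ignores the entry, like non-s390 archs. */
static unsigned long toy_pgd_addr_end(toy_pgd_t pgd, unsigned long addr,
				      unsigned long end)
{
	unsigned long boundary = (addr + TOY_PGDIR_SIZE) & TOY_PGDIR_MASK;

	(void)pgd;
	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Count present entries (assumed: bit 0 set) covering [addr, end). */
static unsigned int toy_count_present(unsigned long addr, unsigned long end)
{
	unsigned long next;
	unsigned int present = 0;

	for (; addr < end; addr = next) {
		/* Look up the entry first ... */
		toy_pgd_t *pgd = toy_pgd_offset(addr);

		/* ... so that it can be passed to the addr_end helper. */
		next = toy_pgd_addr_end(*pgd, addr, end);
		if (!(pgd->val & 1UL))
			continue;
		present++;
	}
	return present;
}
```

In the old two-argument form the order of the lookup and the addr_end
call did not matter; with the entry as a parameter the lookup has to
come first.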

So, even though the dereferenced page-table entries are not used on
archs other than s390, and are optimized out by the compiler, there
is a small change in kernel size and this is what bloat-o-meter reports:

x86:
add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10)
Function                                     old     new   delta
vmemmap_populate                             587     592      +5
munlock_vma_pages_range                      556     561      +5
Total: Before=15534694, After=15534704, chg +0.00%

powerpc:
add/remove: 0/0 grow/shrink: 1/0 up/down: 4/0 (4)
Function                                     old     new   delta
.remove_pagetable                           1648    1652      +4
Total: Before=21478240, After=21478244, chg +0.00%

arm64:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=20240851, After=20240851, chg +0.00%

sparc:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=4907262, After=4907262, chg +0.00%

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/arm/include/asm/pgtable-2level.h    |  2 +-
 arch/arm/mm/idmap.c                      |  6 ++--
 arch/arm/mm/mmu.c                        |  8 ++---
 arch/arm64/kernel/hibernate.c            | 16 ++++++----
 arch/arm64/kvm/mmu.c                     | 16 +++++-----
 arch/arm64/mm/kasan_init.c               |  8 ++---
 arch/arm64/mm/mmu.c                      | 25 +++++++--------
 arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++---
 arch/powerpc/mm/hugetlbpage.c            |  6 ++--
 arch/s390/include/asm/pgtable.h          |  8 ++---
 arch/s390/mm/page-states.c               |  8 ++---
 arch/s390/mm/pageattr.c                  |  8 ++---
 arch/s390/mm/vmem.c                      |  8 ++---
 arch/sparc/mm/hugetlbpage.c              |  6 ++--
 arch/um/kernel/tlb.c                     |  8 ++---
 arch/x86/mm/init_64.c                    | 15 ++++-----
 arch/x86/mm/kasan_init_64.c              | 16 +++++-----
 include/asm-generic/pgtable-nop4d.h      |  2 +-
 include/asm-generic/pgtable-nopmd.h      |  2 +-
 include/asm-generic/pgtable-nopud.h      |  2 +-
 include/linux/pgtable.h                  | 26 ++++-----------
 mm/gup.c                                 |  8 ++---
 mm/ioremap.c                             |  8 ++---
 mm/kasan/init.c                          | 17 +++++-----
 mm/madvise.c                             |  4 +--
 mm/memory.c                              | 40 ++++++++++++------------
 mm/mlock.c                               | 18 ++++++++---
 mm/mprotect.c                            |  8 ++---
 mm/pagewalk.c                            |  8 ++---
 mm/swapfile.c                            |  8 ++---
 mm/vmalloc.c                             | 16 +++++-----
 31 files changed, 165 insertions(+), 173 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-2level.h b/arch/arm/include/asm/pgtable-2level.h
index 3502c2f746ca..5e6416b339f4 100644
--- a/arch/arm/include/asm/pgtable-2level.h
+++ b/arch/arm/include/asm/pgtable-2level.h
@@ -209,7 +209,7 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
 	} while (0)
 
 /* we don't need complex calculations here as the pmd is folded into the pgd */
-#define pmd_addr_end(addr,end) (end)
+#define pmd_addr_end(pmd,addr,end) (end)
 
 #define set_pte_ext(ptep,pte,ext) cpu_set_pte_ext(ptep,pte,ext)
 
diff --git a/arch/arm/mm/idmap.c b/arch/arm/mm/idmap.c
index 448e57c6f653..5437f943ca8b 100644
--- a/arch/arm/mm/idmap.c
+++ b/arch/arm/mm/idmap.c
@@ -46,7 +46,7 @@ static void idmap_add_pmd(pud_t *pud, unsigned long addr, unsigned long end,
 		pmd = pmd_offset(pud, addr);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		*pmd = __pmd((addr & PMD_MASK) | prot);
 		flush_pmd_entry(pmd);
 	} while (pmd++, addr = next, addr != end);
@@ -73,7 +73,7 @@ static void idmap_add_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		idmap_add_pmd(pud, addr, next, prot);
 	} while (pud++, addr = next, addr != end);
 }
@@ -95,7 +95,7 @@ static void identity_mapping_add(pgd_t *pgd, const char *text_start,
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		idmap_add_pud(pgd, addr, next, prot);
 	} while (pgd++, addr = next, addr != end);
 }
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 698cc740c6b8..4013746e4c75 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -777,7 +777,7 @@ static void __init alloc_init_pmd(pud_t *pud, unsigned long addr,
 		 * With LPAE, we must loop over to map
 		 * all the pmds for the given range.
 		 */
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Try a section mapping - addr, next and phys must all be
@@ -805,7 +805,7 @@ static void __init alloc_init_pud(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		alloc_init_pmd(pud, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (pud++, addr = next, addr != end);
@@ -820,7 +820,7 @@ static void __init alloc_init_p4d(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		alloc_init_pud(p4d, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (p4d++, addr = next, addr != end);
@@ -923,7 +923,7 @@ static void __init __create_mapping(struct mm_struct *mm, struct map_desc *md,
 	pgd = pgd_offset(mm, addr);
 	end = addr + length;
 	do {
-		unsigned long next = pgd_addr_end(addr, end);
+		unsigned long next = pgd_addr_end(*pgd, addr, end);
 
 		alloc_init_p4d(pgd, addr, next, phys, type, alloc, ng);
 
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 68e14152d6e9..7be8c9cdc5c8 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -412,7 +412,7 @@ static int copy_pmd(pud_t *dst_pudp, pud_t *src_pudp, unsigned long start,
 	do {
 		pmd_t pmd = READ_ONCE(*src_pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 		if (pmd_table(pmd)) {
@@ -447,7 +447,7 @@ static int copy_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp, unsigned long start,
 	do {
 		pud_t pud = READ_ONCE(*src_pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 		if (pud_table(pud)) {
@@ -473,8 +473,10 @@ static int copy_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp, unsigned long start,
 	dst_p4dp = p4d_offset(dst_pgdp, start);
 	src_p4dp = p4d_offset(src_pgdp, start);
 	do {
-		next = p4d_addr_end(addr, end);
-		if (p4d_none(READ_ONCE(*src_p4dp)))
+		p4d_t p4d = READ_ONCE(*src_p4dp);
+
+		next = p4d_addr_end(p4d, addr, end);
+		if (p4d_none(p4d))
 			continue;
 		if (copy_pud(dst_p4dp, src_p4dp, addr, next))
 			return -ENOMEM;
@@ -492,8 +494,10 @@ static int copy_page_tables(pgd_t *dst_pgdp, unsigned long start,
 
 	dst_pgdp = pgd_offset_pgd(dst_pgdp, start);
 	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none(READ_ONCE(*src_pgdp)))
+		pgd_t pgd = READ_ONCE(*src_pgdp);
+
+		next = pgd_addr_end(pgd, addr, end);
+		if (pgd_none(pgd))
 			continue;
 		if (copy_p4d(dst_pgdp, src_pgdp, addr, next))
 			return -ENOMEM;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ba00bcc0c884..8f470f93a8e9 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -547,7 +547,7 @@ static void unmap_hyp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
 
 	start_pmd = pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		/* Hyp doesn't use huge pmds */
 		if (!pmd_none(*pmd))
 			unmap_hyp_ptes(pmd, addr, next);
@@ -564,7 +564,7 @@ static void unmap_hyp_puds(p4d_t *p4d, phys_addr_t addr, phys_addr_t end)
 
 	start_pud = pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		/* Hyp doesn't use huge puds */
 		if (!pud_none(*pud))
 			unmap_hyp_pmds(pud, addr, next);
@@ -581,7 +581,7 @@ static void unmap_hyp_p4ds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
 
 	start_p4d = p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		/* Hyp doesn't use huge p4ds */
 		if (!p4d_none(*p4d))
 			unmap_hyp_puds(p4d, addr, next);
@@ -609,7 +609,7 @@ static void __unmap_hyp_range(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 	 */
 	pgd = pgdp + kvm_pgd_index(addr, ptrs_per_pgd);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_none(*pgd))
 			unmap_hyp_p4ds(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
@@ -712,7 +712,7 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
 			get_page(virt_to_page(pmd));
 		}
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		create_hyp_pte_mappings(pmd, addr, next, pfn, prot);
 		pfn += (next - addr) >> PAGE_SHIFT;
@@ -744,7 +744,7 @@ static int create_hyp_pud_mappings(p4d_t *p4d, unsigned long start,
 			get_page(virt_to_page(pud));
 		}
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		ret = create_hyp_pmd_mappings(pud, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -777,7 +777,7 @@ static int create_hyp_p4d_mappings(pgd_t *pgd, unsigned long start,
 			get_page(virt_to_page(p4d));
 		}
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		ret = create_hyp_pud_mappings(p4d, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -813,7 +813,7 @@ static int __create_hyp_mappings(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 			get_page(virt_to_page(pgd));
 		}
 
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = create_hyp_p4d_mappings(pgd, addr, next, pfn, prot);
 		if (err)
 			goto out;
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index b24e43d20667..8d1c811fd59e 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -120,7 +120,7 @@ static void __init kasan_pmd_populate(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp = kasan_pmd_offset(pudp, addr, node, early);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		kasan_pte_populate(pmdp, addr, next, node, early);
 	} while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)));
 }
@@ -132,7 +132,7 @@ static void __init kasan_pud_populate(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp = kasan_pud_offset(p4dp, addr, node, early);
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		kasan_pmd_populate(pudp, addr, next, node, early);
 	} while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)));
 }
@@ -144,7 +144,7 @@ static void __init kasan_p4d_populate(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp = p4d_offset(pgdp, addr);
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		kasan_pud_populate(p4dp, addr, next, node, early);
 	} while (p4dp++, addr = next, addr != end);
 }
@@ -157,7 +157,7 @@ static void __init kasan_pgd_populate(unsigned long addr, unsigned long end,
 
 	pgdp = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		kasan_p4d_populate(pgdp, addr, next, node, early);
 	} while (pgdp++, addr = next, addr != end);
 }
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 64211436629d..d679cf024bc8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -209,7 +209,7 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		pmd_t old_pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(old_pmd, addr, end);
 
 		/* try section mapping first */
 		if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
@@ -307,7 +307,7 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
 	do {
 		pud_t old_pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(old_pud, addr, end);
 
 		/*
 		 * For 4K granule only, attempt to put down a 1GB block
@@ -356,7 +356,7 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 	end = PAGE_ALIGN(virt + size);
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		alloc_init_pud(pgdp, addr, next, phys, prot, pgtable_alloc,
 			       flags);
 		phys += next - addr;
@@ -820,9 +820,9 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp, pmd;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -853,9 +853,9 @@ static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp, pud;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -886,9 +886,9 @@ static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -912,9 +912,9 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end,
 	WARN_ON(!free_mapped && altmap);
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -968,9 +968,9 @@ static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -1008,9 +1008,9 @@ static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -1048,9 +1048,9 @@ static void free_empty_p4d_table(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -1066,9 +1066,9 @@ static void free_empty_tables(unsigned long addr, unsigned long end,
 	pgd_t *pgdp, pgd;
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -1097,8 +1097,6 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	pmd_t *pmdp;
 
 	do {
-		next = pmd_addr_end(addr, end);
-
 		pgdp = vmemmap_pgd_populate(addr, node);
 		if (!pgdp)
 			return -ENOMEM;
@@ -1112,6 +1110,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 			return -ENOMEM;
 
 		pmdp = pmd_offset(pudp, addr);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_none(READ_ONCE(*pmdp))) {
 			void *p = NULL;
 
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index cc72666e891a..816e218df285 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -817,7 +817,7 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -847,7 +847,7 @@ static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -878,10 +878,9 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
 	spin_lock(&init_mm.page_table_lock);
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!p4d_present(*p4d))
 			continue;
 
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26292544630f..f0606d6774a4 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -352,7 +352,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 		unsigned long more;
 
 		pmd = pmd_offset(pud, addr);
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!is_hugepd(__hugepd(pmd_val(*pmd)))) {
 			if (pmd_none_or_clear_bad(pmd))
 				continue;
@@ -409,7 +409,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	do {
 		pud = pud_offset(p4d, addr);
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!is_hugepd(__hugepd(pud_val(*pud)))) {
 			if (pud_none_or_clear_bad(pud))
 				continue;
@@ -478,9 +478,9 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	 */
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset(tlb->mm, addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!is_hugepd(__hugepd(pgd_val(*pgd)))) {
 			if (p4d_none_or_clear_bad(p4d))
 				continue;
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 027206e4959d..6fb17ac413be 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -542,14 +542,14 @@ static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned lo
 	return (boundary - 1) < (end - 1) ? boundary : end;
 }
 
-#define pgd_addr_end_folded pgd_addr_end_folded
-static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
+#define pgd_addr_end pgd_addr_end
+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(pgd_val(pgd), addr, end);
 }
 
-#define p4d_addr_end_folded p4d_addr_end_folded
-static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
+#define p4d_addr_end p4d_addr_end
+static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(p4d_val(p4d), addr, end);
 }
diff --git a/arch/s390/mm/page-states.c b/arch/s390/mm/page-states.c
index 567c69f3069e..4aba634b4b26 100644
--- a/arch/s390/mm/page-states.c
+++ b/arch/s390/mm/page-states.c
@@ -109,7 +109,7 @@ static void mark_kernel_pmd(pud_t *pud, unsigned long addr, unsigned long end)
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || pmd_large(*pmd))
 			continue;
 		page = virt_to_page(pmd_val(*pmd));
@@ -126,7 +126,7 @@ static void mark_kernel_pud(p4d_t *p4d, unsigned long addr, unsigned long end)
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || pud_large(*pud))
 			continue;
 		if (!pud_folded(*pud)) {
@@ -147,7 +147,7 @@ static void mark_kernel_p4d(pgd_t *pgd, unsigned long addr, unsigned long end)
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none(*p4d))
 			continue;
 		if (!p4d_folded(*p4d)) {
@@ -169,7 +169,7 @@ static void mark_kernel_pgd(void)
 	addr = 0;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, MODULES_END);
+		next = pgd_addr_end(*pgd, addr, MODULES_END);
 		if (pgd_none(*pgd))
 			continue;
 		if (!pgd_folded(*pgd)) {
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index c5c52ec2b46f..b827d758a17a 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -162,7 +162,7 @@ static int walk_pmd_level(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		if (pmd_none(*pmdp))
 			return -EINVAL;
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_large(*pmdp)) {
 			if (addr & ~PMD_MASK || addr + PMD_SIZE > next) {
 				rc = split_pmd_page(pmdp, addr);
@@ -239,7 +239,7 @@ static int walk_pud_level(p4d_t *p4d, unsigned long addr, unsigned long end,
 	do {
 		if (pud_none(*pudp))
 			return -EINVAL;
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		if (pud_large(*pudp)) {
 			if (addr & ~PUD_MASK || addr + PUD_SIZE > next) {
 				rc = split_pud_page(pudp, addr);
@@ -269,7 +269,7 @@ static int walk_p4d_level(pgd_t *pgd, unsigned long addr, unsigned long end,
 	do {
 		if (p4d_none(*p4dp))
 			return -EINVAL;
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		rc = walk_pud_level(p4dp, addr, next, flags);
 		p4dp++;
 		addr = next;
@@ -296,7 +296,7 @@ static int change_page_attr(unsigned long addr, unsigned long end,
 	do {
 		if (pgd_none(*pgdp))
 			break;
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		rc = walk_p4d_level(pgdp, addr, next, flags);
 		if (rc)
 			break;
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index b239f2ba93b0..672bc89f13e7 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -219,7 +219,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!add) {
 			if (pmd_none(*pmd))
 				continue;
@@ -320,7 +320,7 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 		prot &= ~_REGION_ENTRY_NOEXEC;
 	pud = pud_offset(p4d, addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!add) {
 			if (pud_none(*pud))
 				continue;
@@ -394,7 +394,7 @@ static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!add) {
 			if (p4d_none(*p4d))
 				continue;
@@ -449,8 +449,8 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
 	if (WARN_ON_ONCE(!PAGE_ALIGNED(start | end)))
 		return -EINVAL;
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (!add) {
 			if (pgd_none(*pgd))
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ec423b5f17dd..341c2ff8d31a 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -428,7 +428,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd))
 			continue;
 		if (is_hugetlb_pmd(*pmd))
@@ -465,7 +465,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		if (is_hugetlb_pud(*pud))
@@ -519,7 +519,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	pgd = pgd_offset(tlb->mm, addr);
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		hugetlb_free_pud_range(tlb, p4d, addr, next, floor, ceiling);
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index 61776790cd67..7b4fe31c8df2 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -264,7 +264,7 @@ static inline int update_pmd_range(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_present(*pmd)) {
 			if (hvc->force || pmd_newpage(*pmd)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -286,7 +286,7 @@ static inline int update_pud_range(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_present(*pud)) {
 			if (hvc->force || pud_newpage(*pud)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -308,7 +308,7 @@ static inline int update_p4d_range(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!p4d_present(*p4d)) {
 			if (hvc->force || p4d_newpage(*p4d)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -331,7 +331,7 @@ void fix_range_common(struct mm_struct *mm, unsigned long start_addr,
 	hvc = INIT_HVC(mm, force, userspace);
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end_addr);
+		next = pgd_addr_end(*pgd, addr, end_addr);
 		if (!pgd_present(*pgd)) {
 			if (force || pgd_newpage(*pgd)) {
 				ret = add_munmap(addr, next - addr, &hvc);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a4ac13cc3fdc..e2cb9316a104 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1043,7 +1043,7 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -1099,7 +1099,7 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -1153,7 +1153,7 @@ remove_p4d_table(p4d_t *p4d_start, unsigned long addr, unsigned long end,
 
 	p4d = p4d_start + p4d_index(addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -1186,9 +1186,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct,
 	p4d_t *p4d;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
@@ -1500,8 +1499,6 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 	pmd_t *pmd;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pmd_addr_end(addr, end);
-
 		pgd = vmemmap_pgd_populate(addr, node);
 		if (!pgd)
 			return -ENOMEM;
@@ -1515,6 +1512,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 			return -ENOMEM;
 
 		pmd = pmd_offset(pud, addr);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd)) {
 			void *p;
 
@@ -1623,9 +1621,8 @@ void register_page_bootmem_memmap(unsigned long section_nr,
 			get_page_bootmem(section_nr, pte_page(*pte),
 					 SECTION_INFO);
 		} else {
-			next = pmd_addr_end(addr, end);
-
 			pmd = pmd_offset(pud, addr);
+			next = pmd_addr_end(*pmd, addr, end);
 			if (pmd_none(*pmd))
 				continue;
 
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 1a50434c8a4d..2c105b5154ba 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -96,7 +96,7 @@ static void __init kasan_populate_pud(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_large(*pmd))
 			kasan_populate_pmd(pmd, addr, next, nid);
 	} while (pmd++, addr = next, addr != end);
@@ -116,7 +116,7 @@ static void __init kasan_populate_p4d(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_large(*pud))
 			kasan_populate_pud(pud, addr, next, nid);
 	} while (pud++, addr = next, addr != end);
@@ -136,7 +136,7 @@ static void __init kasan_populate_pgd(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		kasan_populate_p4d(p4d, addr, next, nid);
 	} while (p4d++, addr = next, addr != end);
 }
@@ -151,7 +151,7 @@ static void __init kasan_populate_shadow(unsigned long addr, unsigned long end,
 	end = round_up(end, PAGE_SIZE);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_populate_pgd(pgd, addr, next, nid);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -219,7 +219,7 @@ static void __init kasan_early_p4d_populate(pgd_t *pgd,
 
 	p4d = early_p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_none(*p4d))
 			continue;
@@ -239,7 +239,7 @@ static void __init kasan_map_early_shadow(pgd_t *pgd)
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_early_p4d_populate(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -254,7 +254,7 @@ static void __init kasan_shallow_populate_p4ds(pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (p4d_none(*p4d)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
@@ -272,7 +272,7 @@ static void __init kasan_shallow_populate_pgds(void *start, void *end)
 	addr = (unsigned long)start;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, (unsigned long)end);
+		next = pgd_addr_end(*pgd, addr, (unsigned long)end);
 
 		if (pgd_none(*pgd)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
diff --git a/include/asm-generic/pgtable-nop4d.h b/include/asm-generic/pgtable-nop4d.h
index ce2cbb3c380f..156b42e51424 100644
--- a/include/asm-generic/pgtable-nop4d.h
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -53,7 +53,7 @@ static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
 #define p4d_free_tlb(tlb, x, a)			do { } while (0)
 
 #undef  p4d_addr_end
-#define p4d_addr_end(addr, end)			(end)
+#define p4d_addr_end(p4d, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOP4D_H */
diff --git a/include/asm-generic/pgtable-nopmd.h b/include/asm-generic/pgtable-nopmd.h
index 3e13acd019ae..e988384de1c7 100644
--- a/include/asm-generic/pgtable-nopmd.h
+++ b/include/asm-generic/pgtable-nopmd.h
@@ -64,7 +64,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 #define pmd_free_tlb(tlb, x, a)		do { } while (0)
 
 #undef  pmd_addr_end
-#define pmd_addr_end(addr, end)			(end)
+#define pmd_addr_end(pmd, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index a9d751fbda9e..57a28bade9f9 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -60,7 +60,7 @@ static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
 #define pud_free_tlb(tlb, x, a)		        do { } while (0)
 
 #undef  pud_addr_end
-#define pud_addr_end(addr, end)			(end)
+#define pud_addr_end(pud, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOPUD_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 981c4c2a31fe..67ebc22cf83d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -655,48 +655,34 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
  */
 
-#define pgd_addr_end(addr, end)						\
+#ifndef pgd_addr_end
+#define pgd_addr_end(pgd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
+#endif
 
 #ifndef p4d_addr_end
-#define p4d_addr_end(addr, end)						\
+#define p4d_addr_end(p4d, addr, end)					\
 ({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pud_addr_end
-#define pud_addr_end(addr, end)						\
+#define pud_addr_end(pud, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pmd_addr_end
-#define pmd_addr_end(addr, end)						\
+#define pmd_addr_end(pmd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
-#ifndef pgd_addr_end_folded
-#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
-#endif
-
-#ifndef p4d_addr_end_folded
-#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
-#endif
-
-#ifndef pud_addr_end_folded
-#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
-#endif
-
-#ifndef pmd_addr_end_folded
-#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
-#endif
-
 /*
  * When walking page tables, we usually want to skip any p?d_none entries;
  * and any p?d_bad entries - reporting the error before resetting to none.
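
Note on the generic macros in the hunk above: a minimal userspace sketch,
not part of the patch, of the boundary computation they keep after gaining
the extra entry parameter. All demo_* names and the shift value of 21 are
assumptions made for illustration. The entry argument is accepted and
ignored, as in the generic fallback, and the "- 1" on both sides of the
comparison is what keeps the result correct when end or the rounded-up
boundary wraps to 0.

```c
#include <assert.h>

/* Hypothetical stand-ins for PMD_SHIFT/PMD_SIZE/PMD_MASK (assumed values). */
#define DEMO_PMD_SHIFT	21
#define DEMO_PMD_SIZE	(1UL << DEMO_PMD_SHIFT)
#define DEMO_PMD_MASK	(~(DEMO_PMD_SIZE - 1))

/* Hypothetical stand-in for pmd_t. */
typedef struct { unsigned long val; } demo_pmd_t;

/*
 * Same arithmetic as the generic pmd_addr_end(pmd, addr, end) above: the
 * entry value is unused here; only an architecture override would look at it.
 */
static inline unsigned long demo_pmd_addr_end(demo_pmd_t pmd,
					      unsigned long addr,
					      unsigned long end)
{
	unsigned long boundary = (addr + DEMO_PMD_SIZE) & DEMO_PMD_MASK;

	(void)pmd;
	/* Comparing "x - 1" values treats a wrapped 0 as the highest address. */
	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Count the iterations a do/while walker makes over [addr, end). */
static unsigned int demo_count_pmd_steps(unsigned long addr, unsigned long end)
{
	demo_pmd_t entry = { 0 };
	unsigned long next;
	unsigned int steps = 0;

	do {
		next = demo_pmd_addr_end(entry, addr, end);
		/* Each step either reaches end or moves forward. */
		assert(next == end || next > addr);
		steps++;
	} while (addr = next, addr != end);

	return steps;
}
```

With these assumed sizes, a range from 0x1ff000 to 0x401000 is walked in
three steps (up to 0x200000, up to 0x400000, then up to end), and a range
whose end wraps to 0 yields 0 as the next boundary rather than overshooting.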
diff --git a/mm/gup.c b/mm/gup.c
index ba4aace5d0f4..7826876ae7e0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end_folded(pmd, addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (!pmd_present(pmd))
 			return 0;
 
@@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end_folded(pud, addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
@@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
-		next = p4d_addr_end_folded(p4d, addr, end);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			return 0;
 		BUILD_BUG_ON(p4d_huge(p4d));
@@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		pgd_t pgd = READ_ONCE(*pgdp);
 
-		next = pgd_addr_end_folded(pgd, addr, end);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			return;
 		if (unlikely(pgd_huge(pgd))) {
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..400fa119c09d 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -114,7 +114,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PMD_MODIFIED;
@@ -160,7 +160,7 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PUD_MODIFIED;
@@ -206,7 +206,7 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_P4D_MODIFIED;
@@ -234,7 +234,7 @@ int ioremap_page_range(unsigned long addr,
 	start = addr;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot,
 					&mask);
 		if (err)
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index fe6be0be1f76..829627a92763 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -117,7 +117,7 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (IS_ALIGNED(addr, PMD_SIZE) && end - addr >= PMD_SIZE) {
 			pmd_populate_kernel(&init_mm, pmd,
@@ -150,7 +150,7 @@ static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (IS_ALIGNED(addr, PUD_SIZE) && end - addr >= PUD_SIZE) {
 			pmd_t *pmd;
 
@@ -187,7 +187,7 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (IS_ALIGNED(addr, P4D_SIZE) && end - addr >= P4D_SIZE) {
 			pud_t *pud;
 			pmd_t *pmd;
@@ -236,7 +236,7 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
 	unsigned long next;
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (IS_ALIGNED(addr, PGDIR_SIZE) && end - addr >= PGDIR_SIZE) {
 			p4d_t *p4d;
@@ -370,7 +370,7 @@ static void kasan_remove_pmd_table(pmd_t *pmd, unsigned long addr,
 	for (; addr < end; addr = next, pmd++) {
 		pte_t *pte;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -395,7 +395,7 @@ static void kasan_remove_pud_table(pud_t *pud, unsigned long addr,
 	for (; addr < end; addr = next, pud++) {
 		pmd_t *pmd, *pmd_base;
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -421,7 +421,7 @@ static void kasan_remove_p4d_table(p4d_t *p4d, unsigned long addr,
 	for (; addr < end; addr = next, p4d++) {
 		pud_t *pud;
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -454,9 +454,8 @@ void kasan_remove_zero_shadow(void *start, unsigned long size)
 	for (; addr < end; addr = next) {
 		p4d_t *p4d;
 
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
diff --git a/mm/madvise.c b/mm/madvise.c
index e32e7efbba0f..acfb3441d97e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -326,7 +326,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (pmd_trans_huge(*pmd)) {
 		pmd_t orig_pmd;
-		unsigned long next = pmd_addr_end(addr, end);
+		unsigned long next = pmd_addr_end(*pmd, addr, end);
 
 		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 		ptl = pmd_trans_huge_lock(pmd, vma);
@@ -587,7 +587,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	int nr_swap = 0;
 	unsigned long next;
 
-	next = pmd_addr_end(addr, end);
+	next = pmd_addr_end(*pmd, addr, end);
 	if (pmd_trans_huge(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
diff --git a/mm/memory.c b/mm/memory.c
index fb5463153351..f95424946b0d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -233,7 +233,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		free_pte_range(tlb, pmd, addr);
@@ -267,7 +267,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		free_pmd_range(tlb, pud, addr, next, floor, ceiling);
@@ -301,7 +301,7 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
 	start = addr;
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		free_pud_range(tlb, p4d, addr, next, floor, ceiling);
@@ -381,7 +381,7 @@ void free_pgd_range(struct mmu_gather *tlb,
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	pgd = pgd_offset(tlb->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		free_p4d_range(tlb, pgd, addr, next, floor, ceiling);
@@ -887,7 +887,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*src_pmd, addr, end);
 		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
 			|| pmd_devmap(*src_pmd)) {
 			int err;
@@ -921,7 +921,7 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pud = pud_offset(src_p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*src_pud, addr, end);
 		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
 			int err;
 
@@ -955,7 +955,7 @@ static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_p4d = p4d_offset(src_pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*src_p4d, addr, end);
 		if (p4d_none_or_clear_bad(src_p4d))
 			continue;
 		if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
@@ -1017,7 +1017,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*src_pgd, addr, end);
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
@@ -1177,7 +1177,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
@@ -1212,7 +1212,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
 			if (next - addr != HPAGE_PUD_SIZE) {
 				mmap_assert_locked(tlb->mm);
@@ -1241,7 +1241,7 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		next = zap_pud_range(tlb, vma, p4d, addr, next, details);
@@ -1262,7 +1262,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
@@ -2030,7 +2030,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 		return -ENOMEM;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		err = remap_pte_range(mm, pmd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2052,7 +2052,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		err = remap_pmd_range(mm, pud, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2074,7 +2074,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		err = remap_pud_range(mm, p4d, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2143,7 +2143,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = remap_p4d_range(mm, pgd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2266,7 +2266,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 		pmd = pmd_offset(pud, addr);
 	}
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (create || !pmd_none_or_clear_bad(pmd)) {
 			err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
 						 create, mask);
@@ -2294,7 +2294,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 		pud = pud_offset(p4d, addr);
 	}
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (create || !pud_none_or_clear_bad(pud)) {
 			err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
 						 create, mask);
@@ -2322,7 +2322,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 		p4d = p4d_offset(pgd, addr);
 	}
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (create || !p4d_none_or_clear_bad(p4d)) {
 			err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
 						 create, mask);
@@ -2348,7 +2348,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!create && pgd_none_or_clear_bad(pgd))
 			continue;
 		err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create, &mask);
diff --git a/mm/mlock.c b/mm/mlock.c
index 93ca2bf30b4f..5898e8fe2288 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -374,8 +374,12 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 			struct vm_area_struct *vma, struct zone *zone,
 			unsigned long start, unsigned long end)
 {
-	pte_t *pte;
 	spinlock_t *ptl;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
 
 	/*
 	 * Initialize pte walk starting at the already pinned page where we
@@ -384,10 +388,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 	 */
 	pte = get_locked_pte(vma->vm_mm, start,	&ptl);
 	/* Make sure we do not cross the page table boundary */
-	end = pgd_addr_end(start, end);
-	end = p4d_addr_end(start, end);
-	end = pud_addr_end(start, end);
-	end = pmd_addr_end(start, end);
+	pgd = pgd_offset(vma->vm_mm, start);
+	end = pgd_addr_end(*pgd, start, end);
+	p4d = p4d_offset(pgd, start);
+	end = p4d_addr_end(*p4d, start, end);
+	pud = pud_offset(p4d, start);
+	end = pud_addr_end(*pud, start, end);
+	pmd = pmd_offset(pud, start);
+	end = pmd_addr_end(*pmd, start, end);
 
 	/* The page next to the pinned page is the first we will try to get */
 	start += PAGE_SIZE;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..278f2dbd1f20 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -225,7 +225,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	do {
 		unsigned long this_pages;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
@@ -291,7 +291,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
@@ -311,7 +311,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
@@ -336,7 +336,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	flush_cache_range(vma, addr, end);
 	inc_tlb_flush_pending(mm);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..a5b9f61b5d45 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -70,7 +70,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	pmd = pmd_offset(pud, addr);
 	do {
 again:
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -128,7 +128,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 	pud = pud_offset(p4d, addr);
 	do {
  again:
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -176,7 +176,7 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -211,7 +211,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
 	else
 		pgd = pgd_offset(walk->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, 0, walk);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 20012c0c0252..b1dd815aee6b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2018,7 +2018,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	pmd = pmd_offset(pud, addr);
 	do {
 		cond_resched();
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type,
@@ -2040,7 +2040,7 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		ret = unuse_pmd_range(vma, pud, addr, next, type,
@@ -2062,7 +2062,7 @@ static inline int unuse_p4d_range(struct vm_area_struct *vma, pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		ret = unuse_pud_range(vma, p4d, addr, next, type,
@@ -2085,7 +2085,7 @@ static int unuse_vma(struct vm_area_struct *vma, unsigned int type,
 
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		ret = unuse_p4d_range(vma, pgd, addr, next, type,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index be4724b916b3..09ff0d5ecbc1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -93,7 +93,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		cleared = pmd_clear_huge(pmd);
 		if (cleared || pmd_bad(*pmd))
@@ -118,7 +118,7 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		cleared = pud_clear_huge(pud);
 		if (cleared || pud_bad(*pud))
@@ -141,7 +141,7 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		cleared = p4d_clear_huge(p4d);
 		if (cleared || p4d_bad(*p4d))
@@ -179,7 +179,7 @@ void unmap_kernel_range_noflush(unsigned long start, unsigned long size)
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		if (pgd_none_or_clear_bad(pgd))
@@ -230,7 +230,7 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
@@ -248,7 +248,7 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
@@ -266,7 +266,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
@@ -305,7 +305,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
-- 
2.17.1

* [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
@ 2020-09-07 18:00   ` Gerald Schaefer
  0 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport

From: Alexander Gordeev <agordeev@linux.ibm.com>

Unlike all other page-table abstractions, the pXd_addr_end() functions
do not take into account the particular table entry in whose context
they are called. On architectures with dynamic page-table folding, that
can leave them without necessary information that is difficult to
obtain other than from the table entry itself. That already led to
a subtle memory corruption issue on s390.

By letting the pXd_addr_end() functions know about the page-table entry,
we allow archs to make not only extra checks, but also optimizations.

As a result of this change, the pXd_addr_end_folded() functions used
in the gup_fast traversal code become unnecessary and get replaced with
the universal pXd_addr_end() variants.
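
To illustrate the idea outside the kernel, here is a minimal userspace
sketch. Everything prefixed with toy_/TOY_ is hypothetical, including
the assumption that the low bits of an entry record which level it
really is; it is not the s390 encoding and not code from this series.
It only contrasts a helper whose step size is fixed at compile time
with one that derives the step size from the entry it is handed.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: none of these names or encodings are kernel definitions. */
typedef struct { uint64_t val; } toy_pgd_t;

#define TOY_LEVEL_MASK   0x3ULL	/* hypothetical: low bits hold the real level */
#define TOY_LEVEL0_SHIFT 20	/* hypothetical: a level-0 entry maps 1 MiB */
#define TOY_LEVEL_BITS   11	/* hypothetical: each level adds 11 address bits */
#define TOY_STATIC_LEVEL 2ULL	/* level assumed at compile time */

static uint64_t toy_level_size(uint64_t level)
{
	return 1ULL << (TOY_LEVEL0_SHIFT + level * TOY_LEVEL_BITS);
}

/* Same clamping as the generic helpers: next boundary, capped at end. */
static uint64_t toy_addr_end(uint64_t size, uint64_t addr, uint64_t end)
{
	uint64_t boundary;

	assert(size && (size & (size - 1)) == 0);	/* power of two */
	boundary = (addr + size) & ~(size - 1);
	/* The "- 1" keeps the comparison correct if boundary wraps to 0. */
	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Old style: step size fixed at compile time, entry ignored. */
static uint64_t toy_pgd_addr_end_static(uint64_t addr, uint64_t end)
{
	return toy_addr_end(toy_level_size(TOY_STATIC_LEVEL), addr, end);
}

/* New style: step size derived from the entry being walked. */
static uint64_t toy_pgd_addr_end(toy_pgd_t pgd, uint64_t addr, uint64_t end)
{
	return toy_addr_end(toy_level_size(pgd.val & TOY_LEVEL_MASK), addr, end);
}
```

In this model, a range that crosses a 2 GiB boundary is returned
unsplit by the static helper, while the entry-aware helper stops at the
boundary when the entry says it is one level lower than assumed.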

The arch-specific updates not only add dereferencing of page-table
entry pointers, but also make small changes to the code flow to make
those dereferences possible, at least for x86 and powerpc. The same
applies to arm64, but in a way that should not have any impact.
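
The code-flow change can be pictured with another userspace sketch,
again using only hypothetical toy_ names and a made-up single-level
table. The entry pointer has to be looked up, and the entry read,
before the addr_end helper is called, so the helper call moves below
the offset lookup. In the kernel the snapshot would be taken with
READ_ONCE() where the walk is lockless; a plain read stands in here.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy single-level table: hypothetical sizes, not kernel definitions. */
#define TOY_SHIFT 21
#define TOY_SIZE  (1ULL << TOY_SHIFT)
#define TOY_PTRS  8

typedef struct { uint64_t val; } toy_pmd_t;

/* Entry-aware signature; this toy variant does not need the entry. */
static uint64_t toy_pmd_addr_end(toy_pmd_t pmd, uint64_t addr, uint64_t end)
{
	uint64_t boundary = (addr + TOY_SIZE) & ~(TOY_SIZE - 1);

	(void)pmd;
	return (boundary - 1 < end - 1) ? boundary : end;
}

static toy_pmd_t *toy_pmd_offset(toy_pmd_t *table, uint64_t addr)
{
	return &table[(addr >> TOY_SHIFT) & (TOY_PTRS - 1)];
}

/* Count the non-empty entries that cover [addr, end). */
static size_t toy_count_present(toy_pmd_t *table, uint64_t addr, uint64_t end)
{
	size_t present = 0;
	uint64_t next;

	do {
		/* Look up and snapshot the entry first ... */
		toy_pmd_t *pmdp = toy_pmd_offset(table, addr);
		toy_pmd_t pmd = *pmdp;

		/* ... so that it can be handed to the helper. */
		next = toy_pmd_addr_end(pmd, addr, end);
		if (pmd.val == 0)
			continue;
		present++;
	} while (addr = next, addr != end);

	return present;
}
```

Calling toy_count_present() on a table with entries 0 and 2 populated,
over a range that spans the first three entries, visits three slots and
counts two of them.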

So, even though the dereferenced page-table entries are not used on
archs other than s390, and are optimized out by the compiler, there
is a small change in kernel size. This is what bloat-o-meter reports:

x86:
add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10)
Function                                     old     new   delta
vmemmap_populate                             587     592      +5
munlock_vma_pages_range                      556     561      +5
Total: Before=15534694, After=15534704, chg +0.00%

powerpc:
add/remove: 0/0 grow/shrink: 1/0 up/down: 4/0 (4)
Function                                     old     new   delta
.remove_pagetable                           1648    1652      +4
Total: Before=21478240, After=21478244, chg +0.00%

arm64:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=20240851, After=20240851, chg +0.00%

sparc:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=4907262, After=4907262, chg +0.00%

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/arm/include/asm/pgtable-2level.h    |  2 +-
 arch/arm/mm/idmap.c                      |  6 ++--
 arch/arm/mm/mmu.c                        |  8 ++---
 arch/arm64/kernel/hibernate.c            | 16 ++++++----
 arch/arm64/kvm/mmu.c                     | 16 +++++-----
 arch/arm64/mm/kasan_init.c               |  8 ++---
 arch/arm64/mm/mmu.c                      | 25 +++++++--------
 arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++---
 arch/powerpc/mm/hugetlbpage.c            |  6 ++--
 arch/s390/include/asm/pgtable.h          |  8 ++---
 arch/s390/mm/page-states.c               |  8 ++---
 arch/s390/mm/pageattr.c                  |  8 ++---
 arch/s390/mm/vmem.c                      |  8 ++---
 arch/sparc/mm/hugetlbpage.c              |  6 ++--
 arch/um/kernel/tlb.c                     |  8 ++---
 arch/x86/mm/init_64.c                    | 15 ++++-----
 arch/x86/mm/kasan_init_64.c              | 16 +++++-----
 include/asm-generic/pgtable-nop4d.h      |  2 +-
 include/asm-generic/pgtable-nopmd.h      |  2 +-
 include/asm-generic/pgtable-nopud.h      |  2 +-
 include/linux/pgtable.h                  | 26 ++++-----------
 mm/gup.c                                 |  8 ++---
 mm/ioremap.c                             |  8 ++---
 mm/kasan/init.c                          | 17 +++++-----
 mm/madvise.c                             |  4 +--
 mm/memory.c                              | 40 ++++++++++++------------
 mm/mlock.c                               | 18 ++++++++---
 mm/mprotect.c                            |  8 ++---
 mm/pagewalk.c                            |  8 ++---
 mm/swapfile.c                            |  8 ++---
 mm/vmalloc.c                             | 16 +++++-----
 31 files changed, 165 insertions(+), 173 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-2level.h b/arch/arm/include/asm/pgtable-2level.h
index 3502c2f746ca..5e6416b339f4 100644
--- a/arch/arm/include/asm/pgtable-2level.h
+++ b/arch/arm/include/asm/pgtable-2level.h
@@ -209,7 +209,7 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
 	} while (0)
 
 /* we don't need complex calculations here as the pmd is folded into the pgd */
-#define pmd_addr_end(addr,end) (end)
+#define pmd_addr_end(pmd,addr,end) (end)
 
 #define set_pte_ext(ptep,pte,ext) cpu_set_pte_ext(ptep,pte,ext)
 
diff --git a/arch/arm/mm/idmap.c b/arch/arm/mm/idmap.c
index 448e57c6f653..5437f943ca8b 100644
--- a/arch/arm/mm/idmap.c
+++ b/arch/arm/mm/idmap.c
@@ -46,7 +46,7 @@ static void idmap_add_pmd(pud_t *pud, unsigned long addr, unsigned long end,
 		pmd = pmd_offset(pud, addr);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		*pmd = __pmd((addr & PMD_MASK) | prot);
 		flush_pmd_entry(pmd);
 	} while (pmd++, addr = next, addr != end);
@@ -73,7 +73,7 @@ static void idmap_add_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		idmap_add_pmd(pud, addr, next, prot);
 	} while (pud++, addr = next, addr != end);
 }
@@ -95,7 +95,7 @@ static void identity_mapping_add(pgd_t *pgd, const char *text_start,
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		idmap_add_pud(pgd, addr, next, prot);
 	} while (pgd++, addr = next, addr != end);
 }
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 698cc740c6b8..4013746e4c75 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -777,7 +777,7 @@ static void __init alloc_init_pmd(pud_t *pud, unsigned long addr,
 		 * With LPAE, we must loop over to map
 		 * all the pmds for the given range.
 		 */
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Try a section mapping - addr, next and phys must all be
@@ -805,7 +805,7 @@ static void __init alloc_init_pud(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		alloc_init_pmd(pud, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (pud++, addr = next, addr != end);
@@ -820,7 +820,7 @@ static void __init alloc_init_p4d(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		alloc_init_pud(p4d, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (p4d++, addr = next, addr != end);
@@ -923,7 +923,7 @@ static void __init __create_mapping(struct mm_struct *mm, struct map_desc *md,
 	pgd = pgd_offset(mm, addr);
 	end = addr + length;
 	do {
-		unsigned long next = pgd_addr_end(addr, end);
+		unsigned long next = pgd_addr_end(*pgd, addr, end);
 
 		alloc_init_p4d(pgd, addr, next, phys, type, alloc, ng);
 
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 68e14152d6e9..7be8c9cdc5c8 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -412,7 +412,7 @@ static int copy_pmd(pud_t *dst_pudp, pud_t *src_pudp, unsigned long start,
 	do {
 		pmd_t pmd = READ_ONCE(*src_pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 		if (pmd_table(pmd)) {
@@ -447,7 +447,7 @@ static int copy_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp, unsigned long start,
 	do {
 		pud_t pud = READ_ONCE(*src_pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 		if (pud_table(pud)) {
@@ -473,8 +473,10 @@ static int copy_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp, unsigned long start,
 	dst_p4dp = p4d_offset(dst_pgdp, start);
 	src_p4dp = p4d_offset(src_pgdp, start);
 	do {
-		next = p4d_addr_end(addr, end);
-		if (p4d_none(READ_ONCE(*src_p4dp)))
+		p4d_t p4d = READ_ONCE(*src_p4dp);
+
+		next = p4d_addr_end(p4d, addr, end);
+		if (p4d_none(p4d))
 			continue;
 		if (copy_pud(dst_p4dp, src_p4dp, addr, next))
 			return -ENOMEM;
@@ -492,8 +494,10 @@ static int copy_page_tables(pgd_t *dst_pgdp, unsigned long start,
 
 	dst_pgdp = pgd_offset_pgd(dst_pgdp, start);
 	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none(READ_ONCE(*src_pgdp)))
+		pgd_t pgd = READ_ONCE(*src_pgdp);
+
+		next = pgd_addr_end(pgd, addr, end);
+		if (pgd_none(pgd))
 			continue;
 		if (copy_p4d(dst_pgdp, src_pgdp, addr, next))
 			return -ENOMEM;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ba00bcc0c884..8f470f93a8e9 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -547,7 +547,7 @@ static void unmap_hyp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
 
 	start_pmd = pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		/* Hyp doesn't use huge pmds */
 		if (!pmd_none(*pmd))
 			unmap_hyp_ptes(pmd, addr, next);
@@ -564,7 +564,7 @@ static void unmap_hyp_puds(p4d_t *p4d, phys_addr_t addr, phys_addr_t end)
 
 	start_pud = pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		/* Hyp doesn't use huge puds */
 		if (!pud_none(*pud))
 			unmap_hyp_pmds(pud, addr, next);
@@ -581,7 +581,7 @@ static void unmap_hyp_p4ds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
 
 	start_p4d = p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		/* Hyp doesn't use huge p4ds */
 		if (!p4d_none(*p4d))
 			unmap_hyp_puds(p4d, addr, next);
@@ -609,7 +609,7 @@ static void __unmap_hyp_range(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 	 */
 	pgd = pgdp + kvm_pgd_index(addr, ptrs_per_pgd);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_none(*pgd))
 			unmap_hyp_p4ds(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
@@ -712,7 +712,7 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
 			get_page(virt_to_page(pmd));
 		}
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		create_hyp_pte_mappings(pmd, addr, next, pfn, prot);
 		pfn += (next - addr) >> PAGE_SHIFT;
@@ -744,7 +744,7 @@ static int create_hyp_pud_mappings(p4d_t *p4d, unsigned long start,
 			get_page(virt_to_page(pud));
 		}
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		ret = create_hyp_pmd_mappings(pud, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -777,7 +777,7 @@ static int create_hyp_p4d_mappings(pgd_t *pgd, unsigned long start,
 			get_page(virt_to_page(p4d));
 		}
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		ret = create_hyp_pud_mappings(p4d, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -813,7 +813,7 @@ static int __create_hyp_mappings(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 			get_page(virt_to_page(pgd));
 		}
 
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = create_hyp_p4d_mappings(pgd, addr, next, pfn, prot);
 		if (err)
 			goto out;
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index b24e43d20667..8d1c811fd59e 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -120,7 +120,7 @@ static void __init kasan_pmd_populate(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp = kasan_pmd_offset(pudp, addr, node, early);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		kasan_pte_populate(pmdp, addr, next, node, early);
 	} while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)));
 }
@@ -132,7 +132,7 @@ static void __init kasan_pud_populate(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp = kasan_pud_offset(p4dp, addr, node, early);
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		kasan_pmd_populate(pudp, addr, next, node, early);
 	} while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)));
 }
@@ -144,7 +144,7 @@ static void __init kasan_p4d_populate(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp = p4d_offset(pgdp, addr);
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		kasan_pud_populate(p4dp, addr, next, node, early);
 	} while (p4dp++, addr = next, addr != end);
 }
@@ -157,7 +157,7 @@ static void __init kasan_pgd_populate(unsigned long addr, unsigned long end,
 
 	pgdp = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		kasan_p4d_populate(pgdp, addr, next, node, early);
 	} while (pgdp++, addr = next, addr != end);
 }
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 64211436629d..d679cf024bc8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -209,7 +209,7 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		pmd_t old_pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(old_pmd, addr, end);
 
 		/* try section mapping first */
 		if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
@@ -307,7 +307,7 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
 	do {
 		pud_t old_pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(old_pud, addr, end);
 
 		/*
 		 * For 4K granule only, attempt to put down a 1GB block
@@ -356,7 +356,7 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 	end = PAGE_ALIGN(virt + size);
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		alloc_init_pud(pgdp, addr, next, phys, prot, pgtable_alloc,
 			       flags);
 		phys += next - addr;
@@ -820,9 +820,9 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp, pmd;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -853,9 +853,9 @@ static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp, pud;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -886,9 +886,9 @@ static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -912,9 +912,9 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end,
 	WARN_ON(!free_mapped && altmap);
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -968,9 +968,9 @@ static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -1008,9 +1008,9 @@ static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -1048,9 +1048,9 @@ static void free_empty_p4d_table(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -1066,9 +1066,9 @@ static void free_empty_tables(unsigned long addr, unsigned long end,
 	pgd_t *pgdp, pgd;
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -1097,8 +1097,6 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	pmd_t *pmdp;
 
 	do {
-		next = pmd_addr_end(addr, end);
-
 		pgdp = vmemmap_pgd_populate(addr, node);
 		if (!pgdp)
 			return -ENOMEM;
@@ -1112,6 +1110,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 			return -ENOMEM;
 
 		pmdp = pmd_offset(pudp, addr);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_none(READ_ONCE(*pmdp))) {
 			void *p = NULL;
 
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index cc72666e891a..816e218df285 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -817,7 +817,7 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -847,7 +847,7 @@ static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -878,10 +878,9 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
 	spin_lock(&init_mm.page_table_lock);
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!p4d_present(*p4d))
 			continue;
 
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26292544630f..f0606d6774a4 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -352,7 +352,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 		unsigned long more;
 
 		pmd = pmd_offset(pud, addr);
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!is_hugepd(__hugepd(pmd_val(*pmd)))) {
 			if (pmd_none_or_clear_bad(pmd))
 				continue;
@@ -409,7 +409,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	do {
 		pud = pud_offset(p4d, addr);
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!is_hugepd(__hugepd(pud_val(*pud)))) {
 			if (pud_none_or_clear_bad(pud))
 				continue;
@@ -478,9 +478,9 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	 */
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset(tlb->mm, addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!is_hugepd(__hugepd(pgd_val(*pgd)))) {
 			if (p4d_none_or_clear_bad(p4d))
 				continue;
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 027206e4959d..6fb17ac413be 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -542,14 +542,14 @@ static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned lo
 	return (boundary - 1) < (end - 1) ? boundary : end;
 }
 
-#define pgd_addr_end_folded pgd_addr_end_folded
-static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
+#define pgd_addr_end pgd_addr_end
+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(pgd_val(pgd), addr, end);
 }
 
-#define p4d_addr_end_folded p4d_addr_end_folded
-static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
+#define p4d_addr_end p4d_addr_end
+static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(p4d_val(p4d), addr, end);
 }
diff --git a/arch/s390/mm/page-states.c b/arch/s390/mm/page-states.c
index 567c69f3069e..4aba634b4b26 100644
--- a/arch/s390/mm/page-states.c
+++ b/arch/s390/mm/page-states.c
@@ -109,7 +109,7 @@ static void mark_kernel_pmd(pud_t *pud, unsigned long addr, unsigned long end)
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || pmd_large(*pmd))
 			continue;
 		page = virt_to_page(pmd_val(*pmd));
@@ -126,7 +126,7 @@ static void mark_kernel_pud(p4d_t *p4d, unsigned long addr, unsigned long end)
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || pud_large(*pud))
 			continue;
 		if (!pud_folded(*pud)) {
@@ -147,7 +147,7 @@ static void mark_kernel_p4d(pgd_t *pgd, unsigned long addr, unsigned long end)
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none(*p4d))
 			continue;
 		if (!p4d_folded(*p4d)) {
@@ -169,7 +169,7 @@ static void mark_kernel_pgd(void)
 	addr = 0;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, MODULES_END);
+		next = pgd_addr_end(*pgd, addr, MODULES_END);
 		if (pgd_none(*pgd))
 			continue;
 		if (!pgd_folded(*pgd)) {
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index c5c52ec2b46f..b827d758a17a 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -162,7 +162,7 @@ static int walk_pmd_level(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		if (pmd_none(*pmdp))
 			return -EINVAL;
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_large(*pmdp)) {
 			if (addr & ~PMD_MASK || addr + PMD_SIZE > next) {
 				rc = split_pmd_page(pmdp, addr);
@@ -239,7 +239,7 @@ static int walk_pud_level(p4d_t *p4d, unsigned long addr, unsigned long end,
 	do {
 		if (pud_none(*pudp))
 			return -EINVAL;
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		if (pud_large(*pudp)) {
 			if (addr & ~PUD_MASK || addr + PUD_SIZE > next) {
 				rc = split_pud_page(pudp, addr);
@@ -269,7 +269,7 @@ static int walk_p4d_level(pgd_t *pgd, unsigned long addr, unsigned long end,
 	do {
 		if (p4d_none(*p4dp))
 			return -EINVAL;
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		rc = walk_pud_level(p4dp, addr, next, flags);
 		p4dp++;
 		addr = next;
@@ -296,7 +296,7 @@ static int change_page_attr(unsigned long addr, unsigned long end,
 	do {
 		if (pgd_none(*pgdp))
 			break;
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		rc = walk_p4d_level(pgdp, addr, next, flags);
 		if (rc)
 			break;
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index b239f2ba93b0..672bc89f13e7 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -219,7 +219,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!add) {
 			if (pmd_none(*pmd))
 				continue;
@@ -320,7 +320,7 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 		prot &= ~_REGION_ENTRY_NOEXEC;
 	pud = pud_offset(p4d, addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!add) {
 			if (pud_none(*pud))
 				continue;
@@ -394,7 +394,7 @@ static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!add) {
 			if (p4d_none(*p4d))
 				continue;
@@ -449,8 +449,8 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
 	if (WARN_ON_ONCE(!PAGE_ALIGNED(start | end)))
 		return -EINVAL;
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (!add) {
 			if (pgd_none(*pgd))
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ec423b5f17dd..341c2ff8d31a 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -428,7 +428,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd))
 			continue;
 		if (is_hugetlb_pmd(*pmd))
@@ -465,7 +465,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		if (is_hugetlb_pud(*pud))
@@ -519,7 +519,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	pgd = pgd_offset(tlb->mm, addr);
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		hugetlb_free_pud_range(tlb, p4d, addr, next, floor, ceiling);
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index 61776790cd67..7b4fe31c8df2 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -264,7 +264,7 @@ static inline int update_pmd_range(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_present(*pmd)) {
 			if (hvc->force || pmd_newpage(*pmd)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -286,7 +286,7 @@ static inline int update_pud_range(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_present(*pud)) {
 			if (hvc->force || pud_newpage(*pud)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -308,7 +308,7 @@ static inline int update_p4d_range(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!p4d_present(*p4d)) {
 			if (hvc->force || p4d_newpage(*p4d)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -331,7 +331,7 @@ void fix_range_common(struct mm_struct *mm, unsigned long start_addr,
 	hvc = INIT_HVC(mm, force, userspace);
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end_addr);
+		next = pgd_addr_end(*pgd, addr, end_addr);
 		if (!pgd_present(*pgd)) {
 			if (force || pgd_newpage(*pgd)) {
 				ret = add_munmap(addr, next - addr, &hvc);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a4ac13cc3fdc..e2cb9316a104 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1043,7 +1043,7 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -1099,7 +1099,7 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -1153,7 +1153,7 @@ remove_p4d_table(p4d_t *p4d_start, unsigned long addr, unsigned long end,
 
 	p4d = p4d_start + p4d_index(addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -1186,9 +1186,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct,
 	p4d_t *p4d;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
@@ -1500,8 +1499,6 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 	pmd_t *pmd;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pmd_addr_end(addr, end);
-
 		pgd = vmemmap_pgd_populate(addr, node);
 		if (!pgd)
 			return -ENOMEM;
@@ -1515,6 +1512,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 			return -ENOMEM;
 
 		pmd = pmd_offset(pud, addr);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd)) {
 			void *p;
 
@@ -1623,9 +1621,8 @@ void register_page_bootmem_memmap(unsigned long section_nr,
 			get_page_bootmem(section_nr, pte_page(*pte),
 					 SECTION_INFO);
 		} else {
-			next = pmd_addr_end(addr, end);
-
 			pmd = pmd_offset(pud, addr);
+			next = pmd_addr_end(*pmd, addr, end);
 			if (pmd_none(*pmd))
 				continue;
 
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 1a50434c8a4d..2c105b5154ba 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -96,7 +96,7 @@ static void __init kasan_populate_pud(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_large(*pmd))
 			kasan_populate_pmd(pmd, addr, next, nid);
 	} while (pmd++, addr = next, addr != end);
@@ -116,7 +116,7 @@ static void __init kasan_populate_p4d(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_large(*pud))
 			kasan_populate_pud(pud, addr, next, nid);
 	} while (pud++, addr = next, addr != end);
@@ -136,7 +136,7 @@ static void __init kasan_populate_pgd(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		kasan_populate_p4d(p4d, addr, next, nid);
 	} while (p4d++, addr = next, addr != end);
 }
@@ -151,7 +151,7 @@ static void __init kasan_populate_shadow(unsigned long addr, unsigned long end,
 	end = round_up(end, PAGE_SIZE);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_populate_pgd(pgd, addr, next, nid);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -219,7 +219,7 @@ static void __init kasan_early_p4d_populate(pgd_t *pgd,
 
 	p4d = early_p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_none(*p4d))
 			continue;
@@ -239,7 +239,7 @@ static void __init kasan_map_early_shadow(pgd_t *pgd)
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_early_p4d_populate(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -254,7 +254,7 @@ static void __init kasan_shallow_populate_p4ds(pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (p4d_none(*p4d)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
@@ -272,7 +272,7 @@ static void __init kasan_shallow_populate_pgds(void *start, void *end)
 	addr = (unsigned long)start;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, (unsigned long)end);
+		next = pgd_addr_end(*pgd, addr, (unsigned long)end);
 
 		if (pgd_none(*pgd)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
diff --git a/include/asm-generic/pgtable-nop4d.h b/include/asm-generic/pgtable-nop4d.h
index ce2cbb3c380f..156b42e51424 100644
--- a/include/asm-generic/pgtable-nop4d.h
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -53,7 +53,7 @@ static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
 #define p4d_free_tlb(tlb, x, a)			do { } while (0)
 
 #undef  p4d_addr_end
-#define p4d_addr_end(addr, end)			(end)
+#define p4d_addr_end(p4d, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOP4D_H */
diff --git a/include/asm-generic/pgtable-nopmd.h b/include/asm-generic/pgtable-nopmd.h
index 3e13acd019ae..e988384de1c7 100644
--- a/include/asm-generic/pgtable-nopmd.h
+++ b/include/asm-generic/pgtable-nopmd.h
@@ -64,7 +64,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 #define pmd_free_tlb(tlb, x, a)		do { } while (0)
 
 #undef  pmd_addr_end
-#define pmd_addr_end(addr, end)			(end)
+#define pmd_addr_end(pmd, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index a9d751fbda9e..57a28bade9f9 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -60,7 +60,7 @@ static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
 #define pud_free_tlb(tlb, x, a)		        do { } while (0)
 
 #undef  pud_addr_end
-#define pud_addr_end(addr, end)			(end)
+#define pud_addr_end(pud, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOPUD_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 981c4c2a31fe..67ebc22cf83d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -655,48 +655,34 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
  */
 
-#define pgd_addr_end(addr, end)						\
+#ifndef pgd_addr_end
+#define pgd_addr_end(pgd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
+#endif
 
 #ifndef p4d_addr_end
-#define p4d_addr_end(addr, end)						\
+#define p4d_addr_end(p4d, addr, end)					\
 ({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pud_addr_end
-#define pud_addr_end(addr, end)						\
+#define pud_addr_end(pud, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pmd_addr_end
-#define pmd_addr_end(addr, end)						\
+#define pmd_addr_end(pmd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
-#ifndef pgd_addr_end_folded
-#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
-#endif
-
-#ifndef p4d_addr_end_folded
-#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
-#endif
-
-#ifndef pud_addr_end_folded
-#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
-#endif
-
-#ifndef pmd_addr_end_folded
-#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
-#endif
-
 /*
  * When walking page tables, we usually want to skip any p?d_none entries;
  * and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/gup.c b/mm/gup.c
index ba4aace5d0f4..7826876ae7e0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end_folded(pmd, addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (!pmd_present(pmd))
 			return 0;
 
@@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end_folded(pud, addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
@@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
-		next = p4d_addr_end_folded(p4d, addr, end);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			return 0;
 		BUILD_BUG_ON(p4d_huge(p4d));
@@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		pgd_t pgd = READ_ONCE(*pgdp);
 
-		next = pgd_addr_end_folded(pgd, addr, end);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			return;
 		if (unlikely(pgd_huge(pgd))) {
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..400fa119c09d 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -114,7 +114,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PMD_MODIFIED;
@@ -160,7 +160,7 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PUD_MODIFIED;
@@ -206,7 +206,7 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_P4D_MODIFIED;
@@ -234,7 +234,7 @@ int ioremap_page_range(unsigned long addr,
 	start = addr;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot,
 					&mask);
 		if (err)
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index fe6be0be1f76..829627a92763 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -117,7 +117,7 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (IS_ALIGNED(addr, PMD_SIZE) && end - addr >= PMD_SIZE) {
 			pmd_populate_kernel(&init_mm, pmd,
@@ -150,7 +150,7 @@ static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (IS_ALIGNED(addr, PUD_SIZE) && end - addr >= PUD_SIZE) {
 			pmd_t *pmd;
 
@@ -187,7 +187,7 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (IS_ALIGNED(addr, P4D_SIZE) && end - addr >= P4D_SIZE) {
 			pud_t *pud;
 			pmd_t *pmd;
@@ -236,7 +236,7 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
 	unsigned long next;
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (IS_ALIGNED(addr, PGDIR_SIZE) && end - addr >= PGDIR_SIZE) {
 			p4d_t *p4d;
@@ -370,7 +370,7 @@ static void kasan_remove_pmd_table(pmd_t *pmd, unsigned long addr,
 	for (; addr < end; addr = next, pmd++) {
 		pte_t *pte;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -395,7 +395,7 @@ static void kasan_remove_pud_table(pud_t *pud, unsigned long addr,
 	for (; addr < end; addr = next, pud++) {
 		pmd_t *pmd, *pmd_base;
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -421,7 +421,7 @@ static void kasan_remove_p4d_table(p4d_t *p4d, unsigned long addr,
 	for (; addr < end; addr = next, p4d++) {
 		pud_t *pud;
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -454,9 +454,8 @@ void kasan_remove_zero_shadow(void *start, unsigned long size)
 	for (; addr < end; addr = next) {
 		p4d_t *p4d;
 
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
diff --git a/mm/madvise.c b/mm/madvise.c
index e32e7efbba0f..acfb3441d97e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -326,7 +326,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (pmd_trans_huge(*pmd)) {
 		pmd_t orig_pmd;
-		unsigned long next = pmd_addr_end(addr, end);
+		unsigned long next = pmd_addr_end(*pmd, addr, end);
 
 		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 		ptl = pmd_trans_huge_lock(pmd, vma);
@@ -587,7 +587,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	int nr_swap = 0;
 	unsigned long next;
 
-	next = pmd_addr_end(addr, end);
+	next = pmd_addr_end(*pmd, addr, end);
 	if (pmd_trans_huge(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
diff --git a/mm/memory.c b/mm/memory.c
index fb5463153351..f95424946b0d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -233,7 +233,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		free_pte_range(tlb, pmd, addr);
@@ -267,7 +267,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		free_pmd_range(tlb, pud, addr, next, floor, ceiling);
@@ -301,7 +301,7 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
 	start = addr;
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		free_pud_range(tlb, p4d, addr, next, floor, ceiling);
@@ -381,7 +381,7 @@ void free_pgd_range(struct mmu_gather *tlb,
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	pgd = pgd_offset(tlb->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		free_p4d_range(tlb, pgd, addr, next, floor, ceiling);
@@ -887,7 +887,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*src_pmd, addr, end);
 		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
 			|| pmd_devmap(*src_pmd)) {
 			int err;
@@ -921,7 +921,7 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pud = pud_offset(src_p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*src_pud, addr, end);
 		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
 			int err;
 
@@ -955,7 +955,7 @@ static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_p4d = p4d_offset(src_pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*src_p4d, addr, end);
 		if (p4d_none_or_clear_bad(src_p4d))
 			continue;
 		if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
@@ -1017,7 +1017,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*src_pgd, addr, end);
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
@@ -1177,7 +1177,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
@@ -1212,7 +1212,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
 			if (next - addr != HPAGE_PUD_SIZE) {
 				mmap_assert_locked(tlb->mm);
@@ -1241,7 +1241,7 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		next = zap_pud_range(tlb, vma, p4d, addr, next, details);
@@ -1262,7 +1262,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
@@ -2030,7 +2030,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 		return -ENOMEM;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		err = remap_pte_range(mm, pmd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2052,7 +2052,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		err = remap_pmd_range(mm, pud, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2074,7 +2074,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		err = remap_pud_range(mm, p4d, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2143,7 +2143,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = remap_p4d_range(mm, pgd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2266,7 +2266,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 		pmd = pmd_offset(pud, addr);
 	}
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (create || !pmd_none_or_clear_bad(pmd)) {
 			err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
 						 create, mask);
@@ -2294,7 +2294,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 		pud = pud_offset(p4d, addr);
 	}
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (create || !pud_none_or_clear_bad(pud)) {
 			err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
 						 create, mask);
@@ -2322,7 +2322,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 		p4d = p4d_offset(pgd, addr);
 	}
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (create || !p4d_none_or_clear_bad(p4d)) {
 			err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
 						 create, mask);
@@ -2348,7 +2348,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!create && pgd_none_or_clear_bad(pgd))
 			continue;
 		err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create, &mask);
diff --git a/mm/mlock.c b/mm/mlock.c
index 93ca2bf30b4f..5898e8fe2288 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -374,8 +374,12 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 			struct vm_area_struct *vma, struct zone *zone,
 			unsigned long start, unsigned long end)
 {
-	pte_t *pte;
 	spinlock_t *ptl;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
 
 	/*
 	 * Initialize pte walk starting at the already pinned page where we
@@ -384,10 +388,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 	 */
 	pte = get_locked_pte(vma->vm_mm, start,	&ptl);
 	/* Make sure we do not cross the page table boundary */
-	end = pgd_addr_end(start, end);
-	end = p4d_addr_end(start, end);
-	end = pud_addr_end(start, end);
-	end = pmd_addr_end(start, end);
+	pgd = pgd_offset(vma->vm_mm, start);
+	end = pgd_addr_end(*pgd, start, end);
+	p4d = p4d_offset(pgd, start);
+	end = p4d_addr_end(*p4d, start, end);
+	pud = pud_offset(p4d, start);
+	end = pud_addr_end(*pud, start, end);
+	pmd = pmd_offset(pud, start);
+	end = pmd_addr_end(*pmd, start, end);
 
 	/* The page next to the pinned page is the first we will try to get */
 	start += PAGE_SIZE;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..278f2dbd1f20 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -225,7 +225,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	do {
 		unsigned long this_pages;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
@@ -291,7 +291,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
@@ -311,7 +311,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
@@ -336,7 +336,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	flush_cache_range(vma, addr, end);
 	inc_tlb_flush_pending(mm);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..a5b9f61b5d45 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -70,7 +70,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	pmd = pmd_offset(pud, addr);
 	do {
 again:
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -128,7 +128,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 	pud = pud_offset(p4d, addr);
 	do {
  again:
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -176,7 +176,7 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -211,7 +211,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
 	else
 		pgd = pgd_offset(walk->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, 0, walk);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 20012c0c0252..b1dd815aee6b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2018,7 +2018,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	pmd = pmd_offset(pud, addr);
 	do {
 		cond_resched();
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type,
@@ -2040,7 +2040,7 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		ret = unuse_pmd_range(vma, pud, addr, next, type,
@@ -2062,7 +2062,7 @@ static inline int unuse_p4d_range(struct vm_area_struct *vma, pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		ret = unuse_pud_range(vma, p4d, addr, next, type,
@@ -2085,7 +2085,7 @@ static int unuse_vma(struct vm_area_struct *vma, unsigned int type,
 
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		ret = unuse_p4d_range(vma, pgd, addr, next, type,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index be4724b916b3..09ff0d5ecbc1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -93,7 +93,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		cleared = pmd_clear_huge(pmd);
 		if (cleared || pmd_bad(*pmd))
@@ -118,7 +118,7 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		cleared = pud_clear_huge(pud);
 		if (cleared || pud_bad(*pud))
@@ -141,7 +141,7 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		cleared = p4d_clear_huge(p4d);
 		if (cleared || p4d_bad(*p4d))
@@ -179,7 +179,7 @@ void unmap_kernel_range_noflush(unsigned long start, unsigned long size)
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		if (pgd_none_or_clear_bad(pgd))
@@ -230,7 +230,7 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
@@ -248,7 +248,7 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
@@ -266,7 +266,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
@@ -305,7 +305,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
-- 
2.17.1


* [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
@ 2020-09-07 18:00   ` Gerald Schaefer
  0 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Benjamin Herrenschmidt, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Michael Ellerman, Andrew Morton,
	Linus Torvalds, Mike Rapoport

From: Alexander Gordeev <agordeev@linux.ibm.com>

Unlike all other page-table abstractions, the pXd_addr_end() functions do
not take into account the particular table entry in whose context they
are called. On architectures with dynamic page-table folding, this can
leave them without necessary information that is difficult to obtain
other than from the table entry itself. That already led to a subtle
memory corruption issue on s390.

By letting the pXd_addr_end() functions know about the page-table entry,
we allow archs to make not only extra checks, but also optimizations.
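
To illustrate the kind of information an entry can carry, here is a
minimal user-space sketch. It is a toy model only: the TOY_* constants,
the type-bit encoding and both helper names are hypothetical and do not
reproduce any real architecture's layout. It contrasts a helper that
uses a fixed span with one that derives the span from the entry value.

```c
#include <stdint.h>

/* Hypothetical toy encoding, NOT a real architecture layout. */
#define TOY_BASE_SHIFT	31	/* span of a type-0 entry: 2 GiB */
#define TOY_LEVEL_BITS	11	/* each higher type covers 2^11 times more */
#define TOY_TYPE_MASK	0x3ULL	/* low bits of the entry hold the type */

/* Fixed-span helper: the span is chosen by the caller, not by the entry. */
static uint64_t toy_addr_end(uint64_t addr, uint64_t end, unsigned int shift)
{
	uint64_t size = 1ULL << shift;
	uint64_t boundary = (addr + size) & ~(size - 1);

	/* The "- 1" on both sides keeps the comparison valid if end wraps to 0. */
	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Entry-aware helper: the span is derived from the type bits of the entry. */
static uint64_t toy_addr_end_entry(uint64_t entry, uint64_t addr, uint64_t end)
{
	unsigned int type = (unsigned int)(entry & TOY_TYPE_MASK);

	return toy_addr_end(addr, end, TOY_BASE_SHIFT + TOY_LEVEL_BITS * type);
}
```

In this model, a caller that assumes the widest span treats the whole
range as one entry, while the entry-aware helper stops at the boundary
that the entry actually describes.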

As a result of this change, the pXd_addr_end_folded() functions used
in the gup_fast traversal code become unnecessary and are replaced with
the universal pXd_addr_end() variants.

The arch-specific updates not only add dereferencing of page-table
entry pointers, but also make small changes to the code flow to make
those dereferences possible, at least for x86 and powerpc. The same
applies to arm64, but in a way that should not have any impact.
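
The code-flow change amounts to reading the entry before computing the
next boundary. A minimal user-space sketch of that loop shape follows.
The toy_* names and TOY_* constants are hypothetical, a plain load stands
in for the kernel's READ_ONCE(), and the entry argument is deliberately
unused, as on archs without dynamic folding.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_SHIFT	21
#define TOY_SIZE	(1ULL << TOY_SHIFT)

/* Takes the entry value for interface symmetry; this toy ignores it. */
static uint64_t toy_pmd_addr_end(uint64_t entry, uint64_t addr, uint64_t end)
{
	uint64_t boundary = (addr + TOY_SIZE) & ~(TOY_SIZE - 1);

	(void)entry;
	return (boundary - 1 < end - 1) ? boundary : end;
}

/*
 * Count non-empty entries covering [addr, end). table[0] is the entry
 * for the slot that contains addr; end must be greater than addr.
 */
static size_t toy_count_present(const uint64_t *table, uint64_t addr,
				uint64_t end)
{
	size_t present = 0;
	uint64_t next;

	do {
		uint64_t entry = *table;	/* read the entry first ... */

		/* ... then compute the boundary, before any "continue" */
		next = toy_pmd_addr_end(entry, addr, end);
		if (entry == 0)
			continue;
		present++;
	} while (table++, addr = next, addr != end);

	return present;
}
```

Because "continue" in a do-while loop jumps to the loop condition, next
has to be assigned before the first possible "continue", which is why
the entry load moves ahead of the boundary computation.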

So, even though the dereferenced page-table entries are not used on
archs other than s390, and are optimized out by the compiler, there
is a small change in kernel size and this is what bloat-o-meter reports:

x86:
add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10)
Function                                     old     new   delta
vmemmap_populate                             587     592      +5
munlock_vma_pages_range                      556     561      +5
Total: Before=15534694, After=15534704, chg +0.00%

powerpc:
add/remove: 0/0 grow/shrink: 1/0 up/down: 4/0 (4)
Function                                     old     new   delta
.remove_pagetable                           1648    1652      +4
Total: Before=21478240, After=21478244, chg +0.00%

arm64:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=20240851, After=20240851, chg +0.00%

sparc:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=4907262, After=4907262, chg +0.00%

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/arm/include/asm/pgtable-2level.h    |  2 +-
 arch/arm/mm/idmap.c                      |  6 ++--
 arch/arm/mm/mmu.c                        |  8 ++---
 arch/arm64/kernel/hibernate.c            | 16 ++++++----
 arch/arm64/kvm/mmu.c                     | 16 +++++-----
 arch/arm64/mm/kasan_init.c               |  8 ++---
 arch/arm64/mm/mmu.c                      | 25 +++++++--------
 arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++---
 arch/powerpc/mm/hugetlbpage.c            |  6 ++--
 arch/s390/include/asm/pgtable.h          |  8 ++---
 arch/s390/mm/page-states.c               |  8 ++---
 arch/s390/mm/pageattr.c                  |  8 ++---
 arch/s390/mm/vmem.c                      |  8 ++---
 arch/sparc/mm/hugetlbpage.c              |  6 ++--
 arch/um/kernel/tlb.c                     |  8 ++---
 arch/x86/mm/init_64.c                    | 15 ++++-----
 arch/x86/mm/kasan_init_64.c              | 16 +++++-----
 include/asm-generic/pgtable-nop4d.h      |  2 +-
 include/asm-generic/pgtable-nopmd.h      |  2 +-
 include/asm-generic/pgtable-nopud.h      |  2 +-
 include/linux/pgtable.h                  | 26 ++++-----------
 mm/gup.c                                 |  8 ++---
 mm/ioremap.c                             |  8 ++---
 mm/kasan/init.c                          | 17 +++++-----
 mm/madvise.c                             |  4 +--
 mm/memory.c                              | 40 ++++++++++++------------
 mm/mlock.c                               | 18 ++++++++---
 mm/mprotect.c                            |  8 ++---
 mm/pagewalk.c                            |  8 ++---
 mm/swapfile.c                            |  8 ++---
 mm/vmalloc.c                             | 16 +++++-----
 31 files changed, 165 insertions(+), 173 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-2level.h b/arch/arm/include/asm/pgtable-2level.h
index 3502c2f746ca..5e6416b339f4 100644
--- a/arch/arm/include/asm/pgtable-2level.h
+++ b/arch/arm/include/asm/pgtable-2level.h
@@ -209,7 +209,7 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
 	} while (0)
 
 /* we don't need complex calculations here as the pmd is folded into the pgd */
-#define pmd_addr_end(addr,end) (end)
+#define pmd_addr_end(pmd,addr,end) (end)
 
 #define set_pte_ext(ptep,pte,ext) cpu_set_pte_ext(ptep,pte,ext)
 
diff --git a/arch/arm/mm/idmap.c b/arch/arm/mm/idmap.c
index 448e57c6f653..5437f943ca8b 100644
--- a/arch/arm/mm/idmap.c
+++ b/arch/arm/mm/idmap.c
@@ -46,7 +46,7 @@ static void idmap_add_pmd(pud_t *pud, unsigned long addr, unsigned long end,
 		pmd = pmd_offset(pud, addr);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		*pmd = __pmd((addr & PMD_MASK) | prot);
 		flush_pmd_entry(pmd);
 	} while (pmd++, addr = next, addr != end);
@@ -73,7 +73,7 @@ static void idmap_add_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		idmap_add_pmd(pud, addr, next, prot);
 	} while (pud++, addr = next, addr != end);
 }
@@ -95,7 +95,7 @@ static void identity_mapping_add(pgd_t *pgd, const char *text_start,
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		idmap_add_pud(pgd, addr, next, prot);
 	} while (pgd++, addr = next, addr != end);
 }
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 698cc740c6b8..4013746e4c75 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -777,7 +777,7 @@ static void __init alloc_init_pmd(pud_t *pud, unsigned long addr,
 		 * With LPAE, we must loop over to map
 		 * all the pmds for the given range.
 		 */
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Try a section mapping - addr, next and phys must all be
@@ -805,7 +805,7 @@ static void __init alloc_init_pud(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		alloc_init_pmd(pud, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (pud++, addr = next, addr != end);
@@ -820,7 +820,7 @@ static void __init alloc_init_p4d(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		alloc_init_pud(p4d, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (p4d++, addr = next, addr != end);
@@ -923,7 +923,7 @@ static void __init __create_mapping(struct mm_struct *mm, struct map_desc *md,
 	pgd = pgd_offset(mm, addr);
 	end = addr + length;
 	do {
-		unsigned long next = pgd_addr_end(addr, end);
+		unsigned long next = pgd_addr_end(*pgd, addr, end);
 
 		alloc_init_p4d(pgd, addr, next, phys, type, alloc, ng);
 
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 68e14152d6e9..7be8c9cdc5c8 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -412,7 +412,7 @@ static int copy_pmd(pud_t *dst_pudp, pud_t *src_pudp, unsigned long start,
 	do {
 		pmd_t pmd = READ_ONCE(*src_pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 		if (pmd_table(pmd)) {
@@ -447,7 +447,7 @@ static int copy_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp, unsigned long start,
 	do {
 		pud_t pud = READ_ONCE(*src_pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 		if (pud_table(pud)) {
@@ -473,8 +473,10 @@ static int copy_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp, unsigned long start,
 	dst_p4dp = p4d_offset(dst_pgdp, start);
 	src_p4dp = p4d_offset(src_pgdp, start);
 	do {
-		next = p4d_addr_end(addr, end);
-		if (p4d_none(READ_ONCE(*src_p4dp)))
+		p4d_t p4d = READ_ONCE(*src_p4dp);
+
+		next = p4d_addr_end(p4d, addr, end);
+		if (p4d_none(p4d))
 			continue;
 		if (copy_pud(dst_p4dp, src_p4dp, addr, next))
 			return -ENOMEM;
@@ -492,8 +494,10 @@ static int copy_page_tables(pgd_t *dst_pgdp, unsigned long start,
 
 	dst_pgdp = pgd_offset_pgd(dst_pgdp, start);
 	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none(READ_ONCE(*src_pgdp)))
+		pgd_t pgd = READ_ONCE(*src_pgdp);
+
+		next = pgd_addr_end(pgd, addr, end);
+		if (pgd_none(pgd))
 			continue;
 		if (copy_p4d(dst_pgdp, src_pgdp, addr, next))
 			return -ENOMEM;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ba00bcc0c884..8f470f93a8e9 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -547,7 +547,7 @@ static void unmap_hyp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
 
 	start_pmd = pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		/* Hyp doesn't use huge pmds */
 		if (!pmd_none(*pmd))
 			unmap_hyp_ptes(pmd, addr, next);
@@ -564,7 +564,7 @@ static void unmap_hyp_puds(p4d_t *p4d, phys_addr_t addr, phys_addr_t end)
 
 	start_pud = pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		/* Hyp doesn't use huge puds */
 		if (!pud_none(*pud))
 			unmap_hyp_pmds(pud, addr, next);
@@ -581,7 +581,7 @@ static void unmap_hyp_p4ds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
 
 	start_p4d = p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		/* Hyp doesn't use huge p4ds */
 		if (!p4d_none(*p4d))
 			unmap_hyp_puds(p4d, addr, next);
@@ -609,7 +609,7 @@ static void __unmap_hyp_range(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 	 */
 	pgd = pgdp + kvm_pgd_index(addr, ptrs_per_pgd);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_none(*pgd))
 			unmap_hyp_p4ds(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
@@ -712,7 +712,7 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
 			get_page(virt_to_page(pmd));
 		}
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		create_hyp_pte_mappings(pmd, addr, next, pfn, prot);
 		pfn += (next - addr) >> PAGE_SHIFT;
@@ -744,7 +744,7 @@ static int create_hyp_pud_mappings(p4d_t *p4d, unsigned long start,
 			get_page(virt_to_page(pud));
 		}
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		ret = create_hyp_pmd_mappings(pud, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -777,7 +777,7 @@ static int create_hyp_p4d_mappings(pgd_t *pgd, unsigned long start,
 			get_page(virt_to_page(p4d));
 		}
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		ret = create_hyp_pud_mappings(p4d, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -813,7 +813,7 @@ static int __create_hyp_mappings(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 			get_page(virt_to_page(pgd));
 		}
 
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = create_hyp_p4d_mappings(pgd, addr, next, pfn, prot);
 		if (err)
 			goto out;
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index b24e43d20667..8d1c811fd59e 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -120,7 +120,7 @@ static void __init kasan_pmd_populate(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp = kasan_pmd_offset(pudp, addr, node, early);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		kasan_pte_populate(pmdp, addr, next, node, early);
 	} while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)));
 }
@@ -132,7 +132,7 @@ static void __init kasan_pud_populate(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp = kasan_pud_offset(p4dp, addr, node, early);
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		kasan_pmd_populate(pudp, addr, next, node, early);
 	} while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)));
 }
@@ -144,7 +144,7 @@ static void __init kasan_p4d_populate(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp = p4d_offset(pgdp, addr);
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		kasan_pud_populate(p4dp, addr, next, node, early);
 	} while (p4dp++, addr = next, addr != end);
 }
@@ -157,7 +157,7 @@ static void __init kasan_pgd_populate(unsigned long addr, unsigned long end,
 
 	pgdp = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		kasan_p4d_populate(pgdp, addr, next, node, early);
 	} while (pgdp++, addr = next, addr != end);
 }
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 64211436629d..d679cf024bc8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -209,7 +209,7 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		pmd_t old_pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(old_pmd, addr, end);
 
 		/* try section mapping first */
 		if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
@@ -307,7 +307,7 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
 	do {
 		pud_t old_pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(old_pud, addr, end);
 
 		/*
 		 * For 4K granule only, attempt to put down a 1GB block
@@ -356,7 +356,7 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 	end = PAGE_ALIGN(virt + size);
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		alloc_init_pud(pgdp, addr, next, phys, prot, pgtable_alloc,
 			       flags);
 		phys += next - addr;
@@ -820,9 +820,9 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp, pmd;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -853,9 +853,9 @@ static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp, pud;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -886,9 +886,9 @@ static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -912,9 +912,9 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end,
 	WARN_ON(!free_mapped && altmap);
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -968,9 +968,9 @@ static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -1008,9 +1008,9 @@ static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -1048,9 +1048,9 @@ static void free_empty_p4d_table(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -1066,9 +1066,9 @@ static void free_empty_tables(unsigned long addr, unsigned long end,
 	pgd_t *pgdp, pgd;
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -1097,8 +1097,6 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	pmd_t *pmdp;
 
 	do {
-		next = pmd_addr_end(addr, end);
-
 		pgdp = vmemmap_pgd_populate(addr, node);
 		if (!pgdp)
 			return -ENOMEM;
@@ -1112,6 +1110,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 			return -ENOMEM;
 
 		pmdp = pmd_offset(pudp, addr);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_none(READ_ONCE(*pmdp))) {
 			void *p = NULL;
 
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index cc72666e891a..816e218df285 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -817,7 +817,7 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -847,7 +847,7 @@ static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -878,10 +878,9 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
 	spin_lock(&init_mm.page_table_lock);
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!p4d_present(*p4d))
 			continue;
 
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26292544630f..f0606d6774a4 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -352,7 +352,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 		unsigned long more;
 
 		pmd = pmd_offset(pud, addr);
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!is_hugepd(__hugepd(pmd_val(*pmd)))) {
 			if (pmd_none_or_clear_bad(pmd))
 				continue;
@@ -409,7 +409,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	do {
 		pud = pud_offset(p4d, addr);
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!is_hugepd(__hugepd(pud_val(*pud)))) {
 			if (pud_none_or_clear_bad(pud))
 				continue;
@@ -478,9 +478,9 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	 */
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset(tlb->mm, addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!is_hugepd(__hugepd(pgd_val(*pgd)))) {
 			if (p4d_none_or_clear_bad(p4d))
 				continue;
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 027206e4959d..6fb17ac413be 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -542,14 +542,14 @@ static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned lo
 	return (boundary - 1) < (end - 1) ? boundary : end;
 }
 
-#define pgd_addr_end_folded pgd_addr_end_folded
-static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
+#define pgd_addr_end pgd_addr_end
+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(pgd_val(pgd), addr, end);
 }
 
-#define p4d_addr_end_folded p4d_addr_end_folded
-static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
+#define p4d_addr_end p4d_addr_end
+static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(p4d_val(p4d), addr, end);
 }
diff --git a/arch/s390/mm/page-states.c b/arch/s390/mm/page-states.c
index 567c69f3069e..4aba634b4b26 100644
--- a/arch/s390/mm/page-states.c
+++ b/arch/s390/mm/page-states.c
@@ -109,7 +109,7 @@ static void mark_kernel_pmd(pud_t *pud, unsigned long addr, unsigned long end)
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || pmd_large(*pmd))
 			continue;
 		page = virt_to_page(pmd_val(*pmd));
@@ -126,7 +126,7 @@ static void mark_kernel_pud(p4d_t *p4d, unsigned long addr, unsigned long end)
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || pud_large(*pud))
 			continue;
 		if (!pud_folded(*pud)) {
@@ -147,7 +147,7 @@ static void mark_kernel_p4d(pgd_t *pgd, unsigned long addr, unsigned long end)
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none(*p4d))
 			continue;
 		if (!p4d_folded(*p4d)) {
@@ -169,7 +169,7 @@ static void mark_kernel_pgd(void)
 	addr = 0;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, MODULES_END);
+		next = pgd_addr_end(*pgd, addr, MODULES_END);
 		if (pgd_none(*pgd))
 			continue;
 		if (!pgd_folded(*pgd)) {
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index c5c52ec2b46f..b827d758a17a 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -162,7 +162,7 @@ static int walk_pmd_level(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		if (pmd_none(*pmdp))
 			return -EINVAL;
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_large(*pmdp)) {
 			if (addr & ~PMD_MASK || addr + PMD_SIZE > next) {
 				rc = split_pmd_page(pmdp, addr);
@@ -239,7 +239,7 @@ static int walk_pud_level(p4d_t *p4d, unsigned long addr, unsigned long end,
 	do {
 		if (pud_none(*pudp))
 			return -EINVAL;
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		if (pud_large(*pudp)) {
 			if (addr & ~PUD_MASK || addr + PUD_SIZE > next) {
 				rc = split_pud_page(pudp, addr);
@@ -269,7 +269,7 @@ static int walk_p4d_level(pgd_t *pgd, unsigned long addr, unsigned long end,
 	do {
 		if (p4d_none(*p4dp))
 			return -EINVAL;
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		rc = walk_pud_level(p4dp, addr, next, flags);
 		p4dp++;
 		addr = next;
@@ -296,7 +296,7 @@ static int change_page_attr(unsigned long addr, unsigned long end,
 	do {
 		if (pgd_none(*pgdp))
 			break;
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		rc = walk_p4d_level(pgdp, addr, next, flags);
 		if (rc)
 			break;
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index b239f2ba93b0..672bc89f13e7 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -219,7 +219,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!add) {
 			if (pmd_none(*pmd))
 				continue;
@@ -320,7 +320,7 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 		prot &= ~_REGION_ENTRY_NOEXEC;
 	pud = pud_offset(p4d, addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!add) {
 			if (pud_none(*pud))
 				continue;
@@ -394,7 +394,7 @@ static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!add) {
 			if (p4d_none(*p4d))
 				continue;
@@ -449,8 +449,8 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
 	if (WARN_ON_ONCE(!PAGE_ALIGNED(start | end)))
 		return -EINVAL;
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (!add) {
 			if (pgd_none(*pgd))
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ec423b5f17dd..341c2ff8d31a 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -428,7 +428,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd))
 			continue;
 		if (is_hugetlb_pmd(*pmd))
@@ -465,7 +465,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		if (is_hugetlb_pud(*pud))
@@ -519,7 +519,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	pgd = pgd_offset(tlb->mm, addr);
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		hugetlb_free_pud_range(tlb, p4d, addr, next, floor, ceiling);
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index 61776790cd67..7b4fe31c8df2 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -264,7 +264,7 @@ static inline int update_pmd_range(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_present(*pmd)) {
 			if (hvc->force || pmd_newpage(*pmd)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -286,7 +286,7 @@ static inline int update_pud_range(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_present(*pud)) {
 			if (hvc->force || pud_newpage(*pud)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -308,7 +308,7 @@ static inline int update_p4d_range(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!p4d_present(*p4d)) {
 			if (hvc->force || p4d_newpage(*p4d)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -331,7 +331,7 @@ void fix_range_common(struct mm_struct *mm, unsigned long start_addr,
 	hvc = INIT_HVC(mm, force, userspace);
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end_addr);
+		next = pgd_addr_end(*pgd, addr, end_addr);
 		if (!pgd_present(*pgd)) {
 			if (force || pgd_newpage(*pgd)) {
 				ret = add_munmap(addr, next - addr, &hvc);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a4ac13cc3fdc..e2cb9316a104 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1043,7 +1043,7 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -1099,7 +1099,7 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -1153,7 +1153,7 @@ remove_p4d_table(p4d_t *p4d_start, unsigned long addr, unsigned long end,
 
 	p4d = p4d_start + p4d_index(addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -1186,9 +1186,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct,
 	p4d_t *p4d;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
@@ -1500,8 +1499,6 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 	pmd_t *pmd;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pmd_addr_end(addr, end);
-
 		pgd = vmemmap_pgd_populate(addr, node);
 		if (!pgd)
 			return -ENOMEM;
@@ -1515,6 +1512,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 			return -ENOMEM;
 
 		pmd = pmd_offset(pud, addr);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd)) {
 			void *p;
 
@@ -1623,9 +1621,8 @@ void register_page_bootmem_memmap(unsigned long section_nr,
 			get_page_bootmem(section_nr, pte_page(*pte),
 					 SECTION_INFO);
 		} else {
-			next = pmd_addr_end(addr, end);
-
 			pmd = pmd_offset(pud, addr);
+			next = pmd_addr_end(*pmd, addr, end);
 			if (pmd_none(*pmd))
 				continue;
 
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 1a50434c8a4d..2c105b5154ba 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -96,7 +96,7 @@ static void __init kasan_populate_pud(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_large(*pmd))
 			kasan_populate_pmd(pmd, addr, next, nid);
 	} while (pmd++, addr = next, addr != end);
@@ -116,7 +116,7 @@ static void __init kasan_populate_p4d(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_large(*pud))
 			kasan_populate_pud(pud, addr, next, nid);
 	} while (pud++, addr = next, addr != end);
@@ -136,7 +136,7 @@ static void __init kasan_populate_pgd(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		kasan_populate_p4d(p4d, addr, next, nid);
 	} while (p4d++, addr = next, addr != end);
 }
@@ -151,7 +151,7 @@ static void __init kasan_populate_shadow(unsigned long addr, unsigned long end,
 	end = round_up(end, PAGE_SIZE);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_populate_pgd(pgd, addr, next, nid);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -219,7 +219,7 @@ static void __init kasan_early_p4d_populate(pgd_t *pgd,
 
 	p4d = early_p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_none(*p4d))
 			continue;
@@ -239,7 +239,7 @@ static void __init kasan_map_early_shadow(pgd_t *pgd)
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_early_p4d_populate(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -254,7 +254,7 @@ static void __init kasan_shallow_populate_p4ds(pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (p4d_none(*p4d)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
@@ -272,7 +272,7 @@ static void __init kasan_shallow_populate_pgds(void *start, void *end)
 	addr = (unsigned long)start;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, (unsigned long)end);
+		next = pgd_addr_end(*pgd, addr, (unsigned long)end);
 
 		if (pgd_none(*pgd)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
diff --git a/include/asm-generic/pgtable-nop4d.h b/include/asm-generic/pgtable-nop4d.h
index ce2cbb3c380f..156b42e51424 100644
--- a/include/asm-generic/pgtable-nop4d.h
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -53,7 +53,7 @@ static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
 #define p4d_free_tlb(tlb, x, a)			do { } while (0)
 
 #undef  p4d_addr_end
-#define p4d_addr_end(addr, end)			(end)
+#define p4d_addr_end(p4d, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOP4D_H */
diff --git a/include/asm-generic/pgtable-nopmd.h b/include/asm-generic/pgtable-nopmd.h
index 3e13acd019ae..e988384de1c7 100644
--- a/include/asm-generic/pgtable-nopmd.h
+++ b/include/asm-generic/pgtable-nopmd.h
@@ -64,7 +64,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 #define pmd_free_tlb(tlb, x, a)		do { } while (0)
 
 #undef  pmd_addr_end
-#define pmd_addr_end(addr, end)			(end)
+#define pmd_addr_end(pmd, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index a9d751fbda9e..57a28bade9f9 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -60,7 +60,7 @@ static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
 #define pud_free_tlb(tlb, x, a)		        do { } while (0)
 
 #undef  pud_addr_end
-#define pud_addr_end(addr, end)			(end)
+#define pud_addr_end(pud, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOPUD_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 981c4c2a31fe..67ebc22cf83d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -655,48 +655,34 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
  */
 
-#define pgd_addr_end(addr, end)						\
+#ifndef pgd_addr_end
+#define pgd_addr_end(pgd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
+#endif
 
 #ifndef p4d_addr_end
-#define p4d_addr_end(addr, end)						\
+#define p4d_addr_end(p4d, addr, end)					\
 ({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pud_addr_end
-#define pud_addr_end(addr, end)						\
+#define pud_addr_end(pud, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pmd_addr_end
-#define pmd_addr_end(addr, end)						\
+#define pmd_addr_end(pmd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
-#ifndef pgd_addr_end_folded
-#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
-#endif
-
-#ifndef p4d_addr_end_folded
-#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
-#endif
-
-#ifndef pud_addr_end_folded
-#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
-#endif
-
-#ifndef pmd_addr_end_folded
-#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
-#endif
-
 /*
  * When walking page tables, we usually want to skip any p?d_none entries;
  * and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/gup.c b/mm/gup.c
index ba4aace5d0f4..7826876ae7e0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end_folded(pmd, addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (!pmd_present(pmd))
 			return 0;
 
@@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end_folded(pud, addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
@@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
-		next = p4d_addr_end_folded(p4d, addr, end);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			return 0;
 		BUILD_BUG_ON(p4d_huge(p4d));
@@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		pgd_t pgd = READ_ONCE(*pgdp);
 
-		next = pgd_addr_end_folded(pgd, addr, end);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			return;
 		if (unlikely(pgd_huge(pgd))) {
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..400fa119c09d 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -114,7 +114,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PMD_MODIFIED;
@@ -160,7 +160,7 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PUD_MODIFIED;
@@ -206,7 +206,7 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_P4D_MODIFIED;
@@ -234,7 +234,7 @@ int ioremap_page_range(unsigned long addr,
 	start = addr;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot,
 					&mask);
 		if (err)
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index fe6be0be1f76..829627a92763 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -117,7 +117,7 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (IS_ALIGNED(addr, PMD_SIZE) && end - addr >= PMD_SIZE) {
 			pmd_populate_kernel(&init_mm, pmd,
@@ -150,7 +150,7 @@ static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (IS_ALIGNED(addr, PUD_SIZE) && end - addr >= PUD_SIZE) {
 			pmd_t *pmd;
 
@@ -187,7 +187,7 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (IS_ALIGNED(addr, P4D_SIZE) && end - addr >= P4D_SIZE) {
 			pud_t *pud;
 			pmd_t *pmd;
@@ -236,7 +236,7 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
 	unsigned long next;
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (IS_ALIGNED(addr, PGDIR_SIZE) && end - addr >= PGDIR_SIZE) {
 			p4d_t *p4d;
@@ -370,7 +370,7 @@ static void kasan_remove_pmd_table(pmd_t *pmd, unsigned long addr,
 	for (; addr < end; addr = next, pmd++) {
 		pte_t *pte;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -395,7 +395,7 @@ static void kasan_remove_pud_table(pud_t *pud, unsigned long addr,
 	for (; addr < end; addr = next, pud++) {
 		pmd_t *pmd, *pmd_base;
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -421,7 +421,7 @@ static void kasan_remove_p4d_table(p4d_t *p4d, unsigned long addr,
 	for (; addr < end; addr = next, p4d++) {
 		pud_t *pud;
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -454,9 +454,8 @@ void kasan_remove_zero_shadow(void *start, unsigned long size)
 	for (; addr < end; addr = next) {
 		p4d_t *p4d;
 
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
diff --git a/mm/madvise.c b/mm/madvise.c
index e32e7efbba0f..acfb3441d97e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -326,7 +326,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (pmd_trans_huge(*pmd)) {
 		pmd_t orig_pmd;
-		unsigned long next = pmd_addr_end(addr, end);
+		unsigned long next = pmd_addr_end(*pmd, addr, end);
 
 		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 		ptl = pmd_trans_huge_lock(pmd, vma);
@@ -587,7 +587,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	int nr_swap = 0;
 	unsigned long next;
 
-	next = pmd_addr_end(addr, end);
+	next = pmd_addr_end(*pmd, addr, end);
 	if (pmd_trans_huge(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
diff --git a/mm/memory.c b/mm/memory.c
index fb5463153351..f95424946b0d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -233,7 +233,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		free_pte_range(tlb, pmd, addr);
@@ -267,7 +267,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		free_pmd_range(tlb, pud, addr, next, floor, ceiling);
@@ -301,7 +301,7 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
 	start = addr;
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		free_pud_range(tlb, p4d, addr, next, floor, ceiling);
@@ -381,7 +381,7 @@ void free_pgd_range(struct mmu_gather *tlb,
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	pgd = pgd_offset(tlb->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		free_p4d_range(tlb, pgd, addr, next, floor, ceiling);
@@ -887,7 +887,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*src_pmd, addr, end);
 		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
 			|| pmd_devmap(*src_pmd)) {
 			int err;
@@ -921,7 +921,7 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pud = pud_offset(src_p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*src_pud, addr, end);
 		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
 			int err;
 
@@ -955,7 +955,7 @@ static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_p4d = p4d_offset(src_pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*src_p4d, addr, end);
 		if (p4d_none_or_clear_bad(src_p4d))
 			continue;
 		if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
@@ -1017,7 +1017,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*src_pgd, addr, end);
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
@@ -1177,7 +1177,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
@@ -1212,7 +1212,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
 			if (next - addr != HPAGE_PUD_SIZE) {
 				mmap_assert_locked(tlb->mm);
@@ -1241,7 +1241,7 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		next = zap_pud_range(tlb, vma, p4d, addr, next, details);
@@ -1262,7 +1262,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
@@ -2030,7 +2030,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 		return -ENOMEM;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		err = remap_pte_range(mm, pmd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2052,7 +2052,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		err = remap_pmd_range(mm, pud, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2074,7 +2074,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		err = remap_pud_range(mm, p4d, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2143,7 +2143,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = remap_p4d_range(mm, pgd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2266,7 +2266,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 		pmd = pmd_offset(pud, addr);
 	}
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (create || !pmd_none_or_clear_bad(pmd)) {
 			err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
 						 create, mask);
@@ -2294,7 +2294,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 		pud = pud_offset(p4d, addr);
 	}
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (create || !pud_none_or_clear_bad(pud)) {
 			err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
 						 create, mask);
@@ -2322,7 +2322,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 		p4d = p4d_offset(pgd, addr);
 	}
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (create || !p4d_none_or_clear_bad(p4d)) {
 			err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
 						 create, mask);
@@ -2348,7 +2348,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!create && pgd_none_or_clear_bad(pgd))
 			continue;
 		err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create, &mask);
diff --git a/mm/mlock.c b/mm/mlock.c
index 93ca2bf30b4f..5898e8fe2288 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -374,8 +374,12 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 			struct vm_area_struct *vma, struct zone *zone,
 			unsigned long start, unsigned long end)
 {
-	pte_t *pte;
 	spinlock_t *ptl;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
 
 	/*
 	 * Initialize pte walk starting at the already pinned page where we
@@ -384,10 +388,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 	 */
 	pte = get_locked_pte(vma->vm_mm, start,	&ptl);
 	/* Make sure we do not cross the page table boundary */
-	end = pgd_addr_end(start, end);
-	end = p4d_addr_end(start, end);
-	end = pud_addr_end(start, end);
-	end = pmd_addr_end(start, end);
+	pgd = pgd_offset(vma->vm_mm, start);
+	end = pgd_addr_end(*pgd, start, end);
+	p4d = p4d_offset(pgd, start);
+	end = p4d_addr_end(*p4d, start, end);
+	pud = pud_offset(p4d, start);
+	end = pud_addr_end(*pud, start, end);
+	pmd = pmd_offset(pud, start);
+	end = pmd_addr_end(*pmd, start, end);
 
 	/* The page next to the pinned page is the first we will try to get */
 	start += PAGE_SIZE;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..278f2dbd1f20 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -225,7 +225,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	do {
 		unsigned long this_pages;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
@@ -291,7 +291,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
@@ -311,7 +311,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
@@ -336,7 +336,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	flush_cache_range(vma, addr, end);
 	inc_tlb_flush_pending(mm);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..a5b9f61b5d45 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -70,7 +70,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	pmd = pmd_offset(pud, addr);
 	do {
 again:
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -128,7 +128,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 	pud = pud_offset(p4d, addr);
 	do {
  again:
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -176,7 +176,7 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -211,7 +211,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
 	else
 		pgd = pgd_offset(walk->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, 0, walk);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 20012c0c0252..b1dd815aee6b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2018,7 +2018,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	pmd = pmd_offset(pud, addr);
 	do {
 		cond_resched();
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type,
@@ -2040,7 +2040,7 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		ret = unuse_pmd_range(vma, pud, addr, next, type,
@@ -2062,7 +2062,7 @@ static inline int unuse_p4d_range(struct vm_area_struct *vma, pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		ret = unuse_pud_range(vma, p4d, addr, next, type,
@@ -2085,7 +2085,7 @@ static int unuse_vma(struct vm_area_struct *vma, unsigned int type,
 
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		ret = unuse_p4d_range(vma, pgd, addr, next, type,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index be4724b916b3..09ff0d5ecbc1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -93,7 +93,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		cleared = pmd_clear_huge(pmd);
 		if (cleared || pmd_bad(*pmd))
@@ -118,7 +118,7 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		cleared = pud_clear_huge(pud);
 		if (cleared || pud_bad(*pud))
@@ -141,7 +141,7 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		cleared = p4d_clear_huge(p4d);
 		if (cleared || p4d_bad(*p4d))
@@ -179,7 +179,7 @@ void unmap_kernel_range_noflush(unsigned long start, unsigned long size)
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		if (pgd_none_or_clear_bad(pgd))
@@ -230,7 +230,7 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
@@ -248,7 +248,7 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
@@ -266,7 +266,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
@@ -305,7 +305,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
-- 
2.17.1
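
For readers new to the generic helpers touched in include/linux/pgtable.h
above, the boundary arithmetic can be tried in user space. This is only a
sketch: DEMO_SIZE, DEMO_MASK and demo_addr_end() are made-up names, and the
2 MiB span is an arbitrary stand-in for one page-table level. It shows why
both sides of the comparison subtract 1: the check stays correct when "end"
or the rounded-up boundary has wrapped to 0.

```c
/* Hypothetical stand-ins for the span covered by one page-table entry. */
#define DEMO_SIZE (1UL << 21)
#define DEMO_MASK (~(DEMO_SIZE - 1))

/*
 * Return the next DEMO_SIZE-aligned boundary above addr, clamped to end.
 * Subtracting 1 on both sides turns a wrapped value of 0 into ULONG_MAX,
 * so a wrapped end compares as "highest" instead of "lowest".
 */
static inline unsigned long demo_addr_end(unsigned long addr, unsigned long end)
{
	unsigned long boundary = (addr + DEMO_SIZE) & DEMO_MASK;

	return (boundary - 1 < end - 1) ? boundary : end;
}
```

With a plain "boundary < end" comparison, a range ending at the top of the
address space (end == 0) would always return end and skip every
intermediate boundary.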


^ permalink raw reply	[flat|nested] 254+ messages in thread

* [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions
  2020-09-07 18:00 ` Gerald Schaefer
  (?)
  (?)
@ 2020-09-07 18:00   ` Gerald Schaefer
  -1 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

From: Alexander Gordeev <agordeev@linux.ibm.com>

Since the pXd_addr_end() macros now take a pXd page-table entry as a
parameter, it makes sense to check the entry type at compile time.
Even though most architectures do not use the page-table entry in
their pXd_addr_end() implementations, type checking in the traversal
code paths could help to avoid subtle bugs.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
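
The effect described in the commit message can be reproduced in user space.
The following is a minimal sketch, not kernel code: all demo_* names and the
2 MiB span are assumptions made for illustration, and the struct wrappers
only imitate how pmd_t and pud_t are distinct types on many configurations.
The macro never expands its entry argument, so an entry of the wrong level
passes unnoticed, while the inline function rejects it at compile time.

```c
/* User-space sketch; every demo_* name is hypothetical, not a kernel symbol. */
#define DEMO_PMD_SIZE (1UL << 21)
#define DEMO_PMD_MASK (~(DEMO_PMD_SIZE - 1))

/* Distinct struct wrappers, so the compiler can tell the levels apart. */
typedef struct { unsigned long val; } demo_pmd_t;
typedef struct { unsigned long val; } demo_pud_t;

/* Macro form: the entry argument is never expanded, so anything is accepted. */
#define demo_pmd_addr_end_macro(pmd, addr, end)				\
	((((addr) + DEMO_PMD_SIZE) & DEMO_PMD_MASK) - 1 < (end) - 1 ?	\
	 (((addr) + DEMO_PMD_SIZE) & DEMO_PMD_MASK) : (end))

/* Inline form, with the "#define name name" idiom so #ifndef still sees it. */
#ifndef demo_pmd_addr_end
#define demo_pmd_addr_end demo_pmd_addr_end
static inline unsigned long demo_pmd_addr_end(demo_pmd_t pmd,
					      unsigned long addr,
					      unsigned long end)
{
	unsigned long boundary = (addr + DEMO_PMD_SIZE) & DEMO_PMD_MASK;

	(void)pmd;	/* unused here, as in the generic helper */
	return (boundary - 1 < end - 1) ? boundary : end;
}
#endif

static inline unsigned long demo_walk(demo_pmd_t pmd, demo_pud_t pud,
				      unsigned long addr, unsigned long end)
{
	/* Wrong level, but the macro accepts it silently. */
	unsigned long a = demo_pmd_addr_end_macro(pud, addr, end);
	/* demo_pmd_addr_end(pud, addr, end) would fail to compile. */
	unsigned long b = demo_pmd_addr_end(pmd, addr, end);

	(void)pud;
	return a == b ? b : 0;
}
```

Defining the function name as a macro of the same name keeps the existing
"#ifndef pXd_addr_end" override scheme working for architectures that
provide their own version.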
 include/linux/pgtable.h | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 67ebc22cf83d..d9e7d16c2263 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  */
 
 #ifndef pgd_addr_end
-#define pgd_addr_end(pgd, addr, end)					\
-({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
-})
+#define pgd_addr_end pgd_addr_end
+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
+{	unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
+	return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef p4d_addr_end
-#define p4d_addr_end(p4d, addr, end)					\
-({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
-})
+#define p4d_addr_end p4d_addr_end
+static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
+{	unsigned long __boundary = (addr + P4D_SIZE) & P4D_MASK;
+	return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef pud_addr_end
-#define pud_addr_end(pud, addr, end)					\
-({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
-})
+#define pud_addr_end pud_addr_end
+static inline unsigned long pud_addr_end(pud_t pud, unsigned long addr, unsigned long end)
+{	unsigned long __boundary = (addr + PUD_SIZE) & PUD_MASK;
+	return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef pmd_addr_end
-#define pmd_addr_end(pmd, addr, end)					\
-({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
-})
+#define pmd_addr_end pmd_addr_end
+static inline unsigned long pmd_addr_end(pmd_t pmd, unsigned long addr, unsigned long end)
+{	unsigned long __boundary = (addr + PMD_SIZE) & PMD_MASK;
+	return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 /*
-- 
2.17.1


^ permalink raw reply	[flat|nested] 254+ messages in thread

* [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions
@ 2020-09-07 18:00   ` Gerald Schaefer
  0 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport

From: Alexander Gordeev <agordeev@linux.ibm.com>

Since the pXd_addr_end() macros take a pXd page-table entry as a
parameter, it makes sense to check the entry type at compile time.
Even though most archs do not make use of page-table entries
in pXd_addr_end() calls, checking the type in traversal code
paths could help to avoid subtle bugs.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 include/linux/pgtable.h | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 67ebc22cf83d..d9e7d16c2263 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  */
 
 #ifndef pgd_addr_end
-#define pgd_addr_end(pgd, addr, end)					\
-({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
-})
+#define pgd_addr_end pgd_addr_end
+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
+{	unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
+	return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef p4d_addr_end
-#define p4d_addr_end(p4d, addr, end)					\
-({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
-})
+#define p4d_addr_end p4d_addr_end
+static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
+{	unsigned long __boundary = (addr + P4D_SIZE) & P4D_MASK;
+	return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef pud_addr_end
-#define pud_addr_end(pud, addr, end)					\
-({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
-})
+#define pud_addr_end pud_addr_end
+static inline unsigned long pud_addr_end(pud_t pud, unsigned long addr, unsigned long end)
+{	unsigned long __boundary = (addr + PUD_SIZE) & PUD_MASK;
+	return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef pmd_addr_end
-#define pmd_addr_end(pmd, addr, end)					\
-({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
-})
+#define pmd_addr_end pmd_addr_end
+static inline unsigned long pmd_addr_end(pmd_t pmd, unsigned long addr, unsigned long end)
+{	unsigned long __boundary = (addr + PMD_SIZE) & PMD_MASK;
+	return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 /*
-- 
2.17.1


^ permalink raw reply	[flat|nested] 254+ messages in thread
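
The type-checking argument of patch 3/3 can be illustrated with a minimal
user-space sketch. Everything here is an assumption made for illustration:
pgd_t and pmd_t are simplified stand-in structs, PGDIR_SHIFT is an arbitrary
value, and pgd_addr_end_macro is a hypothetical name for the old macro form
so that both variants can coexist in one file. The macro relies on a GNU C
statement expression, as the original does.

```c
/* Illustrative stand-ins, not the kernel's definitions. */
#define PGDIR_SHIFT	30
#define PGDIR_SIZE	(1UL << PGDIR_SHIFT)
#define PGDIR_MASK	(~(PGDIR_SIZE - 1))

typedef struct { unsigned long pgd; } pgd_t;
typedef struct { unsigned long pmd; } pmd_t;

/*
 * Old form (renamed here): the first argument is never expanded, so an
 * expression of any type is silently accepted.
 */
#define pgd_addr_end_macro(pgd, addr, end)				\
({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
	(__boundary - 1 < (end) - 1) ? __boundary : (end);		\
})

/* New form: the compiler verifies that a pgd_t is passed. */
static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr,
					 unsigned long end)
{
	unsigned long boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;

	(void)pgd;	/* unused by this generic version */
	return (boundary - 1 < end - 1) ? boundary : end;
}

/*
 * With "pmd_t pmd;" in scope:
 *   pgd_addr_end_macro(pmd, addr, end);   compiles, wrong level unnoticed
 *   pgd_addr_end(pmd, addr, end);         rejected: incompatible type
 */
```

Both forms compute the same boundary; only the inline function turns a
wrong-level entry into a build error.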

* Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-07 18:00 ` Gerald Schaefer
  (?)
  (?)
@ 2020-09-07 20:12   ` Mike Rapoport
  -1 siblings, 0 replies; 254+ messages in thread
From: Mike Rapoport @ 2020-09-07 20:12 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86,
	linux-arm, linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Mon, Sep 07, 2020 at 08:00:55PM +0200, Gerald Schaefer wrote:
> This is v2 of an RFC previously discussed here:
> https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/
> 
> Patch 1 is a fix for a regression in gup_fast on s390, after our conversion
> to common gup_fast code. It will introduce special helper functions
> pXd_addr_end_folded(), which have to be used in places where pagetable walk
> is done w/o lock and with READ_ONCE, so currently only in gup_fast.
> 
> Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end()
> themselves by adding an extra pXd value parameter. That was suggested by
> Jason during v1 discussion, because he is already thinking of some other
> places where he might want to switch to the READ_ONCE logic for pagetable
> walks. In general, that would be the cleanest / safest solution, but there
> is some impact on other architectures and common code, hence the new and
> greatly enlarged recipient list.
> 
> Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline
> functions instead of #defines, so that we get some type checking for the
> new pXd value parameter.
> 
> Not sure about Fixes/stable tags for the generic solution. Only patch 1
> fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might
> still be nice to have in stable, to ease future backports, but I guess
> "nice to have" does not really qualify for stable backports.

I also think that adding a pXd parameter to pXd_addr_end() is the cleaner
way, and with it patch 1 is not really required. I would even merge
patches 2 and 3 into a single patch and use only that as the fix.

[ /me apologises to stable@ team :-) ]

> Changes in v2:
> - Pick option 2 from v1 discussion (pXd_addr_end_folded helpers)
> - Add patch 2 + 3 for more generic approach
> 
> Alexander Gordeev (3):
>   mm/gup: fix gup_fast with dynamic page table folding
>   mm: make pXd_addr_end() functions page-table entry aware
>   mm: make generic pXd_addr_end() macros inline functions
> 
>  arch/arm/include/asm/pgtable-2level.h    |  2 +-
>  arch/arm/mm/idmap.c                      |  6 ++--
>  arch/arm/mm/mmu.c                        |  8 ++---
>  arch/arm64/kernel/hibernate.c            | 16 +++++----
>  arch/arm64/kvm/mmu.c                     | 16 ++++-----
>  arch/arm64/mm/kasan_init.c               |  8 ++---
>  arch/arm64/mm/mmu.c                      | 25 +++++++-------
>  arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++--
>  arch/powerpc/mm/hugetlbpage.c            |  6 ++--
>  arch/s390/include/asm/pgtable.h          | 42 ++++++++++++++++++++++++
>  arch/s390/mm/page-states.c               |  8 ++---
>  arch/s390/mm/pageattr.c                  |  8 ++---
>  arch/s390/mm/vmem.c                      |  8 ++---
>  arch/sparc/mm/hugetlbpage.c              |  6 ++--
>  arch/um/kernel/tlb.c                     |  8 ++---
>  arch/x86/mm/init_64.c                    | 15 ++++-----
>  arch/x86/mm/kasan_init_64.c              | 16 ++++-----
>  include/asm-generic/pgtable-nop4d.h      |  2 +-
>  include/asm-generic/pgtable-nopmd.h      |  2 +-
>  include/asm-generic/pgtable-nopud.h      |  2 +-
>  include/linux/pgtable.h                  | 38 ++++++++++++---------
>  mm/gup.c                                 |  8 ++---
>  mm/ioremap.c                             |  8 ++---
>  mm/kasan/init.c                          | 17 +++++-----
>  mm/madvise.c                             |  4 +--
>  mm/memory.c                              | 40 +++++++++++-----------
>  mm/mlock.c                               | 18 +++++++---
>  mm/mprotect.c                            |  8 ++---
>  mm/pagewalk.c                            |  8 ++---
>  mm/swapfile.c                            |  8 ++---
>  mm/vmalloc.c                             | 16 ++++-----
>  31 files changed, 219 insertions(+), 165 deletions(-)
> 
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 254+ messages in thread
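
As a rough sketch of why an entry-aware pXd_addr_end() can help with
dynamic page table folding: if the level an entry belongs to is only known
at run time, the range it covers can be derived from the entry value itself
rather than from a compile-time constant. The encoding below is purely
hypothetical (entry_t, ENTRY_TYPE_MASK, level_shift() and entry_addr_end()
are invented names, and the shift values are assumptions); it is not the
s390 implementation from patch 1 or 2.

```c
#include <stdint.h>

/* Hypothetical: the low two bits of an entry name its table level. */
#define ENTRY_TYPE_MASK	UINT64_C(0x3)

typedef struct { uint64_t val; } entry_t;

/* Hypothetical mapping from level type to the covered address bits. */
static inline unsigned int level_shift(uint64_t type)
{
	return 20 + 11 * (unsigned int)type;	/* 20, 31, 42, 53 */
}

/*
 * Range end computed from the entry that was actually read, so a
 * lockless walker working on a local copy of the entry gets a size
 * that matches that entry's level.
 */
static inline uint64_t entry_addr_end(entry_t e, uint64_t addr, uint64_t end)
{
	uint64_t size = UINT64_C(1) << level_shift(e.val & ENTRY_TYPE_MASK);
	uint64_t boundary = (addr + size) & ~(size - 1);

	return (boundary - 1 < end - 1) ? boundary : end;
}
```

The same address range yields different step sizes depending on the entry:
a type-0 entry stops at the next 1 MiB boundary, while a type-1 entry
covers 2 GiB, so the step is clamped to end instead.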

* Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions
  2020-09-07 18:00   ` Gerald Schaefer
  (?)
  (?)
@ 2020-09-07 20:15     ` Mike Rapoport
  -1 siblings, 0 replies; 254+ messages in thread
From: Mike Rapoport @ 2020-09-07 20:15 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86,
	linux-arm, linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

Hi,

Some style comments below.

On Mon, Sep 07, 2020 at 08:00:58PM +0200, Gerald Schaefer wrote:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Since pXd_addr_end() macros take pXd page-table entry as a
> parameter it makes sense to check the entry type on compile.
> Even though most archs do not make use of page-table entries
> in pXd_addr_end() calls, checking the type in traversal code
> paths could help to avoid subtle bugs.
> 
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> ---
>  include/linux/pgtable.h | 36 ++++++++++++++++++++----------------
>  1 file changed, 20 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 67ebc22cf83d..d9e7d16c2263 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>   */
>  
>  #ifndef pgd_addr_end
> -#define pgd_addr_end(pgd, addr, end)					\
> -({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> -})
> +#define pgd_addr_end pgd_addr_end
> +static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
> +{	unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;

The code should be on a separate line from the curly brace.
Besides, since this is not a macro anymore, I think it would be nicer to
use 'boundary' without underscores.
This applies to the changes below as well.

> +	return (__boundary - 1 < end - 1) ? __boundary : end;
> +}
>  #endif
>  
>  #ifndef p4d_addr_end
> -#define p4d_addr_end(p4d, addr, end)					\
> -({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> -})
> +#define p4d_addr_end p4d_addr_end
> +static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
> +{	unsigned long __boundary = (addr + P4D_SIZE) & P4D_MASK;
> +	return (__boundary - 1 < end - 1) ? __boundary : end;
> +}
>  #endif
>  
>  #ifndef pud_addr_end
> -#define pud_addr_end(pud, addr, end)					\
> -({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> -})
> +#define pud_addr_end pud_addr_end
> +static inline unsigned long pud_addr_end(pud_t pud, unsigned long addr, unsigned long end)
> +{	unsigned long __boundary = (addr + PUD_SIZE) & PUD_MASK;
> +	return (__boundary - 1 < end - 1) ? __boundary : end;
> +}
>  #endif
>  
>  #ifndef pmd_addr_end
> -#define pmd_addr_end(pmd, addr, end)					\
> -({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> -})
> +#define pmd_addr_end pmd_addr_end
> +static inline unsigned long pmd_addr_end(pmd_t pmd, unsigned long addr, unsigned long end)
> +{	unsigned long __boundary = (addr + PMD_SIZE) & PMD_MASK;
> +	return (__boundary - 1 < end - 1) ? __boundary : end;
> +}
>  #endif
>  
>  /*
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 254+ messages in thread
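
The two style comments in the review above (opening brace on its own line,
'boundary' without underscores) could look as follows. This is a user-space
sketch, not a revised patch: pmd_t is a stand-in struct, PMD_SHIFT is an
assumed value, and count_pmd_steps() is a hypothetical helper added only to
show the usual do/while walking pattern.

```c
/* Illustrative stand-ins, not the kernel's definitions. */
#define PMD_SHIFT	21
#define PMD_SIZE	(1UL << PMD_SHIFT)
#define PMD_MASK	(~(PMD_SIZE - 1))

typedef struct { unsigned long pmd; } pmd_t;

static inline unsigned long pmd_addr_end(pmd_t pmd, unsigned long addr,
					 unsigned long end)
{
	unsigned long boundary = (addr + PMD_SIZE) & PMD_MASK;

	(void)pmd;	/* unused by this generic version */
	/*
	 * Comparing the "- 1" values keeps the result correct when
	 * boundary wraps to 0 at the top of the address space: 0 - 1
	 * becomes the largest unsigned long, so end is returned.
	 */
	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Hypothetical helper: count PMD-sized steps over [addr, end), addr < end. */
static unsigned int count_pmd_steps(unsigned long addr, unsigned long end)
{
	pmd_t pmd = { 0 };	/* placeholder; a real walker reads the entry */
	unsigned long next;
	unsigned int steps = 0;

	do {
		next = pmd_addr_end(pmd, addr, end);
		steps++;
	} while (addr = next, addr != end);

	return steps;
}
```

A range that straddles one PMD boundary is visited in two steps, and a
range starting in the last PMD-sized region ends at end rather than at the
wrapped boundary.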

* Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-07 18:00 ` Gerald Schaefer
  (?)
  (?)
@ 2020-09-08  4:42   ` Christophe Leroy
  -1 siblings, 0 replies; 254+ messages in thread
From: Christophe Leroy @ 2020-09-08  4:42 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport



On 07/09/2020 at 20:00, Gerald Schaefer wrote:
> This is v2 of an RFC previously discussed here:
> https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/
> 
> Patch 1 is a fix for a regression in gup_fast on s390, after our conversion
> to common gup_fast code. It will introduce special helper functions
> pXd_addr_end_folded(), which have to be used in places where pagetable walk
> is done w/o lock and with READ_ONCE, so currently only in gup_fast.
> 
> Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end()
> themselves by adding an extra pXd value parameter. That was suggested by
> Jason during v1 discussion, because he is already thinking of some other
> places where he might want to switch to the READ_ONCE logic for pagetable
> walks. In general, that would be the cleanest / safest solution, but there
> is some impact on other architectures and common code, hence the new and
> greatly enlarged recipient list.
> 
> Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline
> functions instead of #defines, so that we get some type checking for the
> new pXd value parameter.
> 
> Not sure about Fixes/stable tags for the generic solution. Only patch 1
> fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might
> still be nice to have in stable, to ease future backports, but I guess
> "nice to have" does not really qualify for stable backports.

If one day you have to backport a fix that requires patch 2 and/or 3, 
just mark it "depends-on:" and the patches will go in stable at the 
relevant time.

Christophe

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-07 18:00   ` Gerald Schaefer
  (?)
  (?)
@ 2020-09-08  5:06     ` Christophe Leroy
  -1 siblings, 0 replies; 254+ messages in thread
From: Christophe Leroy @ 2020-09-08  5:06 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport



On 07/09/2020 at 20:00, Gerald Schaefer wrote:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> dynamic page table folding.
> 
> The question "What would it require for the generic code to work for s390"
> has already been discussed here
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> and ended with a promising approach here
> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
> which in the end unfortunately didn't quite work completely.
> 
> We tried to mimic static level folding by changing pgd_offset to always
> calculate top level page table offset, and do nothing in folded pXd_offset.
> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
> not reflect this dynamic behaviour, and still act like static 5-level
> page tables.
> 

[...]

> 
> Fix this by introducing new pXd_addr_end_folded helpers, which take an
> additional pXd entry value parameter, that can be used on s390
> to determine the correct page table level and return corresponding
> end / boundary. With that, the pointer iteration will always
> happen in gup_pgd_range for s390. No change for other architectures
> introduced.

I'm not sure pXd_addr_end_folded() is the most understandable name,
although I don't have a clearly better suggestion at the moment.
Maybe something like pXd_addr_end_fixup(), as it will disappear
in the next patch, or pXd_addr_end_gup()?

Also, if it is acceptable to get patch 2 into stable, I think
you should swap patch 1 and patch 2 to avoid the intermediate step
through pXd_addr_end_folded().


> 
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Cc: <stable@vger.kernel.org> # 5.2+
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> ---
>   arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++
>   include/linux/pgtable.h         | 16 +++++++++++++
>   mm/gup.c                        |  8 +++----
>   3 files changed, 62 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 7eb01a5459cd..027206e4959d 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
>   }
>   #define mm_pmd_folded(mm) mm_pmd_folded(mm)
>   
> +/*
> + * With dynamic page table levels on s390, the static pXd_addr_end() functions
> + * will not return corresponding dynamic boundaries. This is no problem as long
> + * as only pXd pointers are passed down during page table walk, because
> + * pXd_offset() will simply return the given pointer for folded levels, and the
> + * pointer iteration over a range simply happens at the correct page table
> + * level.
> + * It is however a problem with gup_fast, or other places walking the page
> + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
> + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
> + * a stack variable, which cannot be used for pointer iteration at the correct
> + * level. Instead, the iteration then has to happen by going up to pgd level
> + * again. To allow this, provide pXd_addr_end_folded() functions with an
> + * additional pXd value parameter, which can be used on s390 to determine the
> + * folding level and return the corresponding boundary.
> + */
> +static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end)

What does 'rste' stand for?

Isn't this line a bit long?

> +{
> +	unsigned long type = (rste & _REGION_ENTRY_TYPE_MASK) >> 2;
> +	unsigned long size = 1UL << (_SEGMENT_SHIFT + type * 11);
> +	unsigned long boundary = (addr + size) & ~(size - 1);
> +
> +	/*
> +	 * FIXME The below check is for internal testing only, to be removed
> +	 */
> +	VM_BUG_ON(type < (_REGION_ENTRY_TYPE_R3 >> 2));
> +
> +	return (boundary - 1) < (end - 1) ? boundary : end;
> +}
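
To make the arithmetic above concrete, here is a standalone sketch of the
size calculation. The constants are assumptions: they are meant to mirror
the underscore-prefixed s390 definitions (_SEGMENT_SHIFT and the
_REGION_ENTRY_TYPE_* values) and should be checked against
arch/s390/include/asm/pgtable.h. The helper names rste_size() and
rste_addr_end() are made up for this example.

```c
/* Assumed values, intended to mirror the s390 definitions. */
#define SEGMENT_SHIFT		20
#define REGION_ENTRY_TYPE_MASK	0x0cULL
#define REGION_ENTRY_TYPE_R1	0x0cULL
#define REGION_ENTRY_TYPE_R2	0x08ULL
#define REGION_ENTRY_TYPE_R3	0x04ULL

/* Hypothetical helper: size of the range covered by one table entry. */
static inline unsigned long long rste_size(unsigned long long rste)
{
	/* The type field is 1 for R3, 2 for R2 and 3 for R1 entries. */
	unsigned long long type = (rste & REGION_ENTRY_TYPE_MASK) >> 2;

	/* Each level up adds 11 address bits: 2^31, 2^42, 2^53. */
	return 1ULL << (SEGMENT_SHIFT + type * 11);
}

/* Same boundary logic as in the patch, built on rste_size(). */
static inline unsigned long long rste_addr_end(unsigned long long rste,
					       unsigned long long addr,
					       unsigned long long end)
{
	unsigned long long size = rste_size(rste);
	unsigned long long boundary = (addr + size) & ~(size - 1);

	return (boundary - 1 < end - 1) ? boundary : end;
}
```

Under these assumptions an R3 entry covers 2 GiB, an R2 entry 4 TiB and
an R1 entry 8 PiB, so the boundary follows the level actually found in
the entry rather than the static PxD_SIZE.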
> +
> +#define pgd_addr_end_folded pgd_addr_end_folded
> +static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
> +{
> +	return rste_addr_end_folded(pgd_val(pgd), addr, end);
> +}
> +
> +#define p4d_addr_end_folded p4d_addr_end_folded
> +static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
> +{
> +	return rste_addr_end_folded(p4d_val(p4d), addr, end);
> +}
> +
>   static inline int mm_has_pgste(struct mm_struct *mm)
>   {
>   #ifdef CONFIG_PGSTE
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..981c4c2a31fe 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -681,6 +681,22 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>   })
>   #endif
>   
> +#ifndef pgd_addr_end_folded
> +#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
> +#endif
> +
> +#ifndef p4d_addr_end_folded
> +#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
> +#endif
> +
> +#ifndef pud_addr_end_folded
> +#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
> +#endif
> +
> +#ifndef pmd_addr_end_folded
> +#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
> +#endif
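
The override pattern used above can be tried outside the kernel. A minimal
sketch, assuming made-up sizes and a made-up "arch" override that simply
treats the entry value as the mapping size (neither is from the patch):
the arch side defines the function and a same-named macro, so the generic
#ifndef fallback is skipped for that level only.

```c
typedef struct { unsigned long val; } pgd_t;
typedef struct { unsigned long val; } pmd_t;

/* "Arch" side (hypothetical): override only the pgd-level helper. */
#define pgd_addr_end_folded pgd_addr_end_folded
static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr,
						unsigned long end)
{
	/* Placeholder policy: the entry value is the (power of two) size. */
	unsigned long size = pgd.val;
	unsigned long boundary = (addr + size) & ~(size - 1);

	return (boundary - 1 < end - 1) ? boundary : end;
}

/* "Generic" side: static boundaries, placeholder sizes. */
#define PMD_SIZE	(1UL << 21)
#define PGDIR_SIZE	(1UL << 30)

static inline unsigned long addr_end(unsigned long size, unsigned long addr,
				     unsigned long end)
{
	unsigned long boundary = (addr + size) & ~(size - 1);

	return (boundary - 1 < end - 1) ? boundary : end;
}
#define pgd_addr_end(addr, end)	addr_end(PGDIR_SIZE, addr, end)
#define pmd_addr_end(addr, end)	addr_end(PMD_SIZE, addr, end)

/* Fallbacks: only defined when the arch did not provide its own. */
#ifndef pgd_addr_end_folded
#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
#endif

#ifndef pmd_addr_end_folded
#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
#endif
```

Note that the macro fallback drops its first argument without evaluating
or type-checking it, which is the gap that inline functions would close.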
> +
>   /*
>    * When walking page tables, we usually want to skip any p?d_none entries;
>    * and any p?d_bad entries - reporting the error before resetting to none.
> diff --git a/mm/gup.c b/mm/gup.c
> index bd883a112724..ba4aace5d0f4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>   	do {
>   		pmd_t pmd = READ_ONCE(*pmdp);
>   
> -		next = pmd_addr_end(addr, end);
> +		next = pmd_addr_end_folded(pmd, addr, end);
>   		if (!pmd_present(pmd))
>   			return 0;
>   
> @@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>   	do {
>   		pud_t pud = READ_ONCE(*pudp);
>   
> -		next = pud_addr_end(addr, end);
> +		next = pud_addr_end_folded(pud, addr, end);
>   		if (unlikely(!pud_present(pud)))
>   			return 0;
>   		if (unlikely(pud_huge(pud))) {
> @@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>   	do {
>   		p4d_t p4d = READ_ONCE(*p4dp);
>   
> -		next = p4d_addr_end(addr, end);
> +		next = p4d_addr_end_folded(p4d, addr, end);
>   		if (p4d_none(p4d))
>   			return 0;
>   		BUILD_BUG_ON(p4d_huge(p4d));
> @@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>   	do {
>   		pgd_t pgd = READ_ONCE(*pgdp);
>   
> -		next = pgd_addr_end(addr, end);
> +		next = pgd_addr_end_folded(pgd, addr, end);
>   		if (pgd_none(pgd))
>   			return;
>   		if (unlikely(pgd_huge(pgd))) {
> 

Christophe

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-08  5:06     ` Christophe Leroy
  0 siblings, 0 replies; 254+ messages in thread
From: Christophe Leroy @ 2020-09-08  5:06 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport



On 07/09/2020 at 20:00, Gerald Schaefer wrote:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> dynamic page table folding.
> 
> The question "What would it require for the generic code to work for s390"
> has already been discussed here
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> and ended with a promising approach here
> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
> which in the end unfortunately didn't quite work completely.
> 
> We tried to mimic static level folding by changing pgd_offset to always
> calculate top level page table offset, and do nothing in folded pXd_offset.
> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
> not reflect this dynamic behaviour, and still act like static 5-level
> page tables.
> 

[...]

> 
> Fix this by introducing new pXd_addr_end_folded helpers, which take an
> additional pXd entry value parameter, that can be used on s390
> to determine the correct page table level and return corresponding
> end / boundary. With that, the pointer iteration will always
> happen in gup_pgd_range for s390. No change for other architectures
> introduced.

I'm not sure pXd_addr_end_folded() is the most understandable name,
although I don't have a clearly better suggestion at the moment.
Maybe something like pXd_addr_end_fixup(), as it will disappear
in the next patch, or pXd_addr_end_gup()?

Also, if it is acceptable to get patch 2 into stable, I think
you should swap patch 1 and patch 2 to avoid the intermediate step
through pXd_addr_end_folded().


> 
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Cc: <stable@vger.kernel.org> # 5.2+
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> ---
>   arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++
>   include/linux/pgtable.h         | 16 +++++++++++++
>   mm/gup.c                        |  8 +++----
>   3 files changed, 62 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 7eb01a5459cd..027206e4959d 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
>   }
>   #define mm_pmd_folded(mm) mm_pmd_folded(mm)
>   
> +/*
> + * With dynamic page table levels on s390, the static pXd_addr_end() functions
> + * will not return corresponding dynamic boundaries. This is no problem as long
> + * as only pXd pointers are passed down during page table walk, because
> + * pXd_offset() will simply return the given pointer for folded levels, and the
> + * pointer iteration over a range simply happens at the correct page table
> + * level.
> + * It is however a problem with gup_fast, or other places walking the page
> + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
> + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
> + * a stack variable, which cannot be used for pointer iteration at the correct
> + * level. Instead, the iteration then has to happen by going up to pgd level
> + * again. To allow this, provide pXd_addr_end_folded() functions with an
> + * additional pXd value parameter, which can be used on s390 to determine the
> + * folding level and return the corresponding boundary.
> + */
> +static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end)

What does 'rste' stand for?

Isn't this line a bit long?

> +{
> +	unsigned long type = (rste & _REGION_ENTRY_TYPE_MASK) >> 2;
> +	unsigned long size = 1UL << (_SEGMENT_SHIFT + type * 11);
> +	unsigned long boundary = (addr + size) & ~(size - 1);
> +
> +	/*
> +	 * FIXME The below check is for internal testing only, to be removed
> +	 */
> +	VM_BUG_ON(type < (_REGION_ENTRY_TYPE_R3 >> 2));
> +
> +	return (boundary - 1) < (end - 1) ? boundary : end;
> +}
> +
> +#define pgd_addr_end_folded pgd_addr_end_folded
> +static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
> +{
> +	return rste_addr_end_folded(pgd_val(pgd), addr, end);
> +}
> +
> +#define p4d_addr_end_folded p4d_addr_end_folded
> +static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
> +{
> +	return rste_addr_end_folded(p4d_val(p4d), addr, end);
> +}
> +
>   static inline int mm_has_pgste(struct mm_struct *mm)
>   {
>   #ifdef CONFIG_PGSTE
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..981c4c2a31fe 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -681,6 +681,22 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>   })
>   #endif
>   
> +#ifndef pgd_addr_end_folded
> +#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
> +#endif
> +
> +#ifndef p4d_addr_end_folded
> +#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
> +#endif
> +
> +#ifndef pud_addr_end_folded
> +#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
> +#endif
> +
> +#ifndef pmd_addr_end_folded
> +#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
> +#endif
> +
>   /*
>    * When walking page tables, we usually want to skip any p?d_none entries;
>    * and any p?d_bad entries - reporting the error before resetting to none.
> diff --git a/mm/gup.c b/mm/gup.c
> index bd883a112724..ba4aace5d0f4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>   	do {
>   		pmd_t pmd = READ_ONCE(*pmdp);
>   
> -		next = pmd_addr_end(addr, end);
> +		next = pmd_addr_end_folded(pmd, addr, end);
>   		if (!pmd_present(pmd))
>   			return 0;
>   
> @@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>   	do {
>   		pud_t pud = READ_ONCE(*pudp);
>   
> -		next = pud_addr_end(addr, end);
> +		next = pud_addr_end_folded(pud, addr, end);
>   		if (unlikely(!pud_present(pud)))
>   			return 0;
>   		if (unlikely(pud_huge(pud))) {
> @@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>   	do {
>   		p4d_t p4d = READ_ONCE(*p4dp);
>   
> -		next = p4d_addr_end(addr, end);
> +		next = p4d_addr_end_folded(p4d, addr, end);
>   		if (p4d_none(p4d))
>   			return 0;
>   		BUILD_BUG_ON(p4d_huge(p4d));
> @@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>   	do {
>   		pgd_t pgd = READ_ONCE(*pgdp);
>   
> -		next = pgd_addr_end(addr, end);
> +		next = pgd_addr_end_folded(pgd, addr, end);
>   		if (pgd_none(pgd))
>   			return;
>   		if (unlikely(pgd_huge(pgd))) {
> 

Christophe
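
For readers following the boundary arithmetic in the quoted
rste_addr_end_folded(), here is a minimal userspace sketch of the same
calculation. It is not the patch itself: the name rste_addr_end_sketch() is
hypothetical, uint64_t stands in for the kernel's unsigned long, and the
constant values are assumptions about what the s390 headers define (type bits
0x0c/0x08/0x04 for region-first/second/third tables and a segment shift of 20).
The assert() takes the place of the VM_BUG_ON() marked FIXME in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; the s390 headers are believed to define these similarly. */
#define REGION_ENTRY_TYPE_MASK UINT64_C(0x0c)
#define REGION_ENTRY_TYPE_R1   UINT64_C(0x0c)
#define REGION_ENTRY_TYPE_R2   UINT64_C(0x08)
#define REGION_ENTRY_TYPE_R3   UINT64_C(0x04)
#define SEGMENT_SHIFT          20

/*
 * Derive the range covered by one entry from the table type stored in the
 * entry value itself, then return the next boundary, clamped to 'end'.
 */
static inline uint64_t rste_addr_end_sketch(uint64_t rste, uint64_t addr,
					    uint64_t end)
{
	/* Table type: 1 = region-third, 2 = region-second, 3 = region-first */
	uint64_t type = (rste & REGION_ENTRY_TYPE_MASK) >> 2;
	/* Each level up adds 11 address bits: 2^31, 2^42, 2^53 bytes */
	uint64_t size = UINT64_C(1) << (SEGMENT_SHIFT + type * 11);
	/* Next multiple of 'size' above addr; may wrap around to 0 */
	uint64_t boundary = (addr + size) & ~(size - 1);

	/* Segment-table entries (type 0) are not expected here */
	assert(type >= (REGION_ENTRY_TYPE_R3 >> 2));

	/* Comparing (x - 1) keeps the result correct when boundary wraps to 0 */
	return (boundary - 1) < (end - 1) ? boundary : end;
}
```

With these assumed values, a region-third entry at addr 0x1000 with end at
4 GiB yields the 2 GiB boundary, while a region-second entry for the same
range yields end, because one such entry would cover 4 TiB.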

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-07 18:00   ` Gerald Schaefer
  (?)
  (?)
@ 2020-09-08  5:14     ` Christophe Leroy
  -1 siblings, 0 replies; 254+ messages in thread
From: Christophe Leroy @ 2020-09-08  5:14 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport



On 07/09/2020 at 20:00, Gerald Schaefer wrote:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Unlike all other page-table abstractions pXd_addr_end() do not take
> into account a particular table entry in which context the functions
> are called. On architectures with dynamic page-tables folding that
> might lead to lack of necessary information that is difficult to
> obtain other than from the table entry itself. That already led to
> a subtle memory corruption issue on s390.
> 
> By letting pXd_addr_end() functions know about the page-table entry
> we allow archs not only make extra checks, but also optimizations.
> 
> As result of this change the pXd_addr_end_folded() functions used
> in gup_fast traversal code become unnecessary and get replaced with
> universal pXd_addr_end() variants.
> 
> The arch-specific updates not only add dereferencing of page-table
> entry pointers, but also small changes to the code flow to make those
> dereferences possible, at least for x86 and powerpc. Also for arm64,
> but in way that should not have any impact.
> 

[...]

> 
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> ---
>   arch/arm/include/asm/pgtable-2level.h    |  2 +-
>   arch/arm/mm/idmap.c                      |  6 ++--
>   arch/arm/mm/mmu.c                        |  8 ++---
>   arch/arm64/kernel/hibernate.c            | 16 ++++++----
>   arch/arm64/kvm/mmu.c                     | 16 +++++-----
>   arch/arm64/mm/kasan_init.c               |  8 ++---
>   arch/arm64/mm/mmu.c                      | 25 +++++++--------
>   arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++---
>   arch/powerpc/mm/hugetlbpage.c            |  6 ++--

It seems you forgot arch/powerpc/mm/book3s64/subpage_prot.c.

>   arch/s390/include/asm/pgtable.h          |  8 ++---
>   arch/s390/mm/page-states.c               |  8 ++---
>   arch/s390/mm/pageattr.c                  |  8 ++---
>   arch/s390/mm/vmem.c                      |  8 ++---
>   arch/sparc/mm/hugetlbpage.c              |  6 ++--
>   arch/um/kernel/tlb.c                     |  8 ++---
>   arch/x86/mm/init_64.c                    | 15 ++++-----
>   arch/x86/mm/kasan_init_64.c              | 16 +++++-----
>   include/asm-generic/pgtable-nop4d.h      |  2 +-
>   include/asm-generic/pgtable-nopmd.h      |  2 +-
>   include/asm-generic/pgtable-nopud.h      |  2 +-
>   include/linux/pgtable.h                  | 26 ++++-----------
>   mm/gup.c                                 |  8 ++---
>   mm/ioremap.c                             |  8 ++---
>   mm/kasan/init.c                          | 17 +++++-----
>   mm/madvise.c                             |  4 +--
>   mm/memory.c                              | 40 ++++++++++++------------
>   mm/mlock.c                               | 18 ++++++++---
>   mm/mprotect.c                            |  8 ++---
>   mm/pagewalk.c                            |  8 ++---
>   mm/swapfile.c                            |  8 ++---
>   mm/vmalloc.c                             | 16 +++++-----
>   31 files changed, 165 insertions(+), 173 deletions(-)

Christophe
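
To make the interface change in the quoted commit message concrete, here is a
rough userspace sketch of an entry-aware pmd_addr_end() written as an inline
function, so the compiler checks the type of the entry argument. Everything in
it is an illustrative assumption rather than code from the patch: the pmd_t
stand-in, the _SKETCH constants (2 MiB per entry, 4 entries per table) and the
helper count_pmd_steps(). The generic variant ignores the entry; an
architecture with dynamic folding could inspect it instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel types and constants */
typedef struct { uint64_t val; } pmd_t;
#define PMD_SHIFT_SKETCH    21
#define PMD_SIZE_SKETCH     (UINT64_C(1) << PMD_SHIFT_SKETCH)
#define PMD_MASK_SKETCH     (~(PMD_SIZE_SKETCH - 1))
#define PTRS_PER_PMD_SKETCH 4

/* Generic variant: same arithmetic as before, the entry is accepted but unused */
static inline uint64_t pmd_addr_end_sketch(pmd_t pmd, uint64_t addr,
					   uint64_t end)
{
	uint64_t boundary = (addr + PMD_SIZE_SKETCH) & PMD_MASK_SKETCH;

	(void)pmd;
	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Walk [addr, end) the way gup_pmd_range() steps, counting visited entries */
static inline unsigned int count_pmd_steps(const pmd_t *table, uint64_t addr,
					   uint64_t end)
{
	unsigned int steps = 0;
	uint64_t next;

	assert(addr < end);
	do {
		/* Read the entry once and pass the value down */
		pmd_t pmd = table[(addr >> PMD_SHIFT_SKETCH) &
				  (PTRS_PER_PMD_SKETCH - 1)];

		next = pmd_addr_end_sketch(pmd, addr, end);
		steps++;
		addr = next;
	} while (addr != end);

	return steps;
}
```

Passing anything other than a pmd_t as the first argument fails to compile,
which is the type checking a plain #define would not provide.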

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
@ 2020-09-08  5:14     ` Christophe Leroy
  0 siblings, 0 replies; 254+ messages in thread
From: Christophe Leroy @ 2020-09-08  5:14 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Andrey Ryabinin, Jeff Dike,
	Arnd Bergmann, Heiko Carstens, linux-um, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, linux-arm, Linus Torvalds,
	LKML, Andrew Morton, linux-power, Mike Rapoport



On 07/09/2020 at 20:00, Gerald Schaefer wrote:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Unlike all other page-table abstractions pXd_addr_end() do not take
> into account a particular table entry in which context the functions
> are called. On architectures with dynamic page-tables folding that
> might lead to lack of necessary information that is difficult to
> obtain other than from the table entry itself. That already led to
> a subtle memory corruption issue on s390.
> 
> By letting pXd_addr_end() functions know about the page-table entry
> we allow archs not only make extra checks, but also optimizations.
> 
> As result of this change the pXd_addr_end_folded() functions used
> in gup_fast traversal code become unnecessary and get replaced with
> universal pXd_addr_end() variants.
> 
> The arch-specific updates not only add dereferencing of page-table
> entry pointers, but also small changes to the code flow to make those
> dereferences possible, at least for x86 and powerpc. Also for arm64,
> but in way that should not have any impact.
> 

[...]

> 
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> ---
>   arch/arm/include/asm/pgtable-2level.h    |  2 +-
>   arch/arm/mm/idmap.c                      |  6 ++--
>   arch/arm/mm/mmu.c                        |  8 ++---
>   arch/arm64/kernel/hibernate.c            | 16 ++++++----
>   arch/arm64/kvm/mmu.c                     | 16 +++++-----
>   arch/arm64/mm/kasan_init.c               |  8 ++---
>   arch/arm64/mm/mmu.c                      | 25 +++++++--------
>   arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++---
>   arch/powerpc/mm/hugetlbpage.c            |  6 ++--

You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems.

>   arch/s390/include/asm/pgtable.h          |  8 ++---
>   arch/s390/mm/page-states.c               |  8 ++---
>   arch/s390/mm/pageattr.c                  |  8 ++---
>   arch/s390/mm/vmem.c                      |  8 ++---
>   arch/sparc/mm/hugetlbpage.c              |  6 ++--
>   arch/um/kernel/tlb.c                     |  8 ++---
>   arch/x86/mm/init_64.c                    | 15 ++++-----
>   arch/x86/mm/kasan_init_64.c              | 16 +++++-----
>   include/asm-generic/pgtable-nop4d.h      |  2 +-
>   include/asm-generic/pgtable-nopmd.h      |  2 +-
>   include/asm-generic/pgtable-nopud.h      |  2 +-
>   include/linux/pgtable.h                  | 26 ++++-----------
>   mm/gup.c                                 |  8 ++---
>   mm/ioremap.c                             |  8 ++---
>   mm/kasan/init.c                          | 17 +++++-----
>   mm/madvise.c                             |  4 +--
>   mm/memory.c                              | 40 ++++++++++++------------
>   mm/mlock.c                               | 18 ++++++++---
>   mm/mprotect.c                            |  8 ++---
>   mm/pagewalk.c                            |  8 ++---
>   mm/swapfile.c                            |  8 ++---
>   mm/vmalloc.c                             | 16 +++++-----
>   31 files changed, 165 insertions(+), 173 deletions(-)

Christophe

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions
  2020-09-07 18:00   ` Gerald Schaefer
  (?)
  (?)
@ 2020-09-08  5:19     ` Christophe Leroy
  -1 siblings, 0 replies; 254+ messages in thread
From: Christophe Leroy @ 2020-09-08  5:19 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport



On 07/09/2020 at 20:00, Gerald Schaefer wrote:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Since pXd_addr_end() macros take pXd page-table entry as a
> parameter it makes sense to check the entry type on compile.
> Even though most archs do not make use of page-table entries
> in pXd_addr_end() calls, checking the type in traversal code
> paths could help to avoid subtle bugs.
> 
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> ---
>   include/linux/pgtable.h | 36 ++++++++++++++++++++----------------
>   1 file changed, 20 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 67ebc22cf83d..d9e7d16c2263 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>    */
>   
>   #ifndef pgd_addr_end
> -#define pgd_addr_end(pgd, addr, end)					\
> -({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> -})
> +#define pgd_addr_end pgd_addr_end

I think that #define is pointless; usually there is no such #define for
the default case.

> +static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
> +{	unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
> +	return (__boundary - 1 < end - 1) ? __boundary : end;
> +}

Please use the standard layout, i.e. the opening { and the closing } each
alone on their own line, and a blank line between the local variable
declarations and the rest.

Also remove the leading __ from the variable names; it is not needed once
these are no longer macros.

f_name()
{
	some_local_var;

	do_something();
}
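
As a rough sketch of how that layout could look for pgd_addr_end(): the
pgd_t typedef and the PGDIR_SHIFT value below are stand-ins chosen so the
snippet builds in user space, not the kernel definitions, and the selftest
helper is purely illustrative.

```c
#include <assert.h>

/* Stand-ins for illustration only; NOT the kernel definitions. */
typedef struct { unsigned long pgd; } pgd_t;
#define PGDIR_SHIFT	30
#define PGDIR_SIZE	(1UL << PGDIR_SHIFT)
#define PGDIR_MASK	(~(PGDIR_SIZE - 1))

static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr,
					 unsigned long end)
{
	unsigned long boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;

	(void)pgd;	/* the generic version ignores the entry value */

	/* "- 1" on both sides keeps the compare right if boundary wraps to 0 */
	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Hypothetical helper, only here to show the expected results. */
static void pgd_addr_end_selftest(void)
{
	pgd_t pgd = { 0 };
	unsigned long last = 0UL - PGDIR_SIZE;	/* start of the last pgd range */
	unsigned long top = ~0UL & ~0xfffUL;	/* page-aligned end near the top */

	/* end lies beyond the next boundary: stop at the boundary */
	assert(pgd_addr_end(pgd, 0x1000UL, 3 * PGDIR_SIZE) == PGDIR_SIZE);
	/* end lies before the next boundary: stop at end */
	assert(pgd_addr_end(pgd, 0x1000UL, 0x5000UL) == 0x5000UL);
	/* boundary wraps to 0 in the last range: still stop at end */
	assert(pgd_addr_end(pgd, last + 0x1000UL, top) == top);
}
```

The behaviour is meant to be the same as the macro; only the layout and
the variable name change.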

>   #endif
>   
>   #ifndef p4d_addr_end
> -#define p4d_addr_end(p4d, addr, end)					\
> -({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> -})
> +#define p4d_addr_end p4d_addr_end
> +static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
> +{	unsigned long __boundary = (addr + P4D_SIZE) & P4D_MASK;
> +	return (__boundary - 1 < end - 1) ? __boundary : end;
> +}
>   #endif
>   
>   #ifndef pud_addr_end
> -#define pud_addr_end(pud, addr, end)					\
> -({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> -})
> +#define pud_addr_end pud_addr_end
> +static inline unsigned long pud_addr_end(pud_t pud, unsigned long addr, unsigned long end)
> +{	unsigned long __boundary = (addr + PUD_SIZE) & PUD_MASK;
> +	return (__boundary - 1 < end - 1) ? __boundary : end;
> +}
>   #endif
>   
>   #ifndef pmd_addr_end
> -#define pmd_addr_end(pmd, addr, end)					\
> -({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> -})
> +#define pmd_addr_end pmd_addr_end
> +static inline unsigned long pmd_addr_end(pmd_t pmd, unsigned long addr, unsigned long end)
> +{	unsigned long __boundary = (addr + PMD_SIZE) & PMD_MASK;
> +	return (__boundary - 1 < end - 1) ? __boundary : end;
> +}
>   #endif
>   
>   /*
> 

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-07 20:12   ` Mike Rapoport
  (?)
  (?)
@ 2020-09-08  5:22     ` Christophe Leroy
  -1 siblings, 0 replies; 254+ messages in thread
From: Christophe Leroy @ 2020-09-08  5:22 UTC (permalink / raw)
  To: Mike Rapoport, Gerald Schaefer
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Christian Borntraeger,
	Richard Weinberger, linux-x86, Russell King, Jason Gunthorpe,
	Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens,
	Arnd Bergmann, John Hubbard, Jeff Dike, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	linux-power, LKML, Andrew Morton, Linus Torvalds



On 07/09/2020 at 22:12, Mike Rapoport wrote:
> On Mon, Sep 07, 2020 at 08:00:55PM +0200, Gerald Schaefer wrote:
>> This is v2 of an RFC previously discussed here:
>> https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/
>>
>> Patch 1 is a fix for a regression in gup_fast on s390, after our conversion
>> to common gup_fast code. It will introduce special helper functions
>> pXd_addr_end_folded(), which have to be used in places where pagetable walk
>> is done w/o lock and with READ_ONCE, so currently only in gup_fast.
>>
>> Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end()
>> themselves by adding an extra pXd value parameter. That was suggested by
>> Jason during v1 discussion, because he is already thinking of some other
>> places where he might want to switch to the READ_ONCE logic for pagetable
>> walks. In general, that would be the cleanest / safest solution, but there
>> is some impact on other architectures and common code, hence the new and
>> greatly enlarged recipient list.
>>
>> Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline
>> functions instead of #defines, so that we get some type checking for the
>> new pXd value parameter.
>>
>> Not sure about Fixes/stable tags for the generic solution. Only patch 1
>> fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might
>> still be nice to have in stable, to ease future backports, but I guess
>> "nice to have" does not really qualify for stable backports.
> 
> I also think that adding pXd parameter to pXd_addr_end() is a cleaner
> way and with this patch 1 is not really required. I would even merge
> patches 2 and 3 into a single patch and use only it as the fix.

Merging patches 2 and 3 is fine, but I would keep patch 1 separate and
place it after the generic changes, so that we first do the generic
changes and then the s390-specific use of them.

Christophe

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-08  5:14     ` Christophe Leroy
  (?)
  (?)
@ 2020-09-08  7:46       ` Alexander Gordeev
  -1 siblings, 0 replies; 254+ messages in thread
From: Alexander Gordeev @ 2020-09-08  7:46 UTC (permalink / raw)
  To: Christophe Leroy, Michael Ellerman
  Cc: Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra,
	Dave Hansen, linux-mm, Paul Mackerras, linux-sparc,
	Claudio Imbrenda, Will Deacon, linux-arch, linux-s390,
	Vasily Gorbik, Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport

On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote:
> You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems.

Yes, and also two more sources :/
	arch/powerpc/mm/kasan/8xx.c
	arch/powerpc/mm/kasan/kasan_init_32.c

But in these two it is not obvious to me why pgd_addr_end() is used
while traversing pmds. Could you please clarify a bit?


diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
index 2784224..89c5053 100644
--- a/arch/powerpc/mm/kasan/8xx.c
+++ b/arch/powerpc/mm/kasan/8xx.c
@@ -15,8 +15,8 @@
 	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
 		pte_basic_t *new;
 
-		k_next = pgd_addr_end(k_cur, k_end);
-		k_next = pgd_addr_end(k_next, k_end);
+		k_next = pmd_addr_end(k_cur, k_end);
+		k_next = pmd_addr_end(k_next, k_end);
 		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
 			continue;
 
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c
index fb29404..3f7d6dc6 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_
 	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
 		pte_t *new;
 
-		k_next = pgd_addr_end(k_cur, k_end);
+		k_next = pmd_addr_end(k_cur, k_end);
 		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
 			continue;
 
@@ -196,7 +196,7 @@ void __init kasan_early_init(void)
 	kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pmd_addr_end(addr, end);
 		pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte);
 	} while (pmd++, addr = next, addr != end);
 

> Christophe

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-08  7:46       ` Alexander Gordeev
  (?)
  (?)
@ 2020-09-08  8:16         ` Christophe Leroy
  -1 siblings, 0 replies; 254+ messages in thread
From: Christophe Leroy @ 2020-09-08  8:16 UTC (permalink / raw)
  To: Alexander Gordeev, Michael Ellerman
  Cc: Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra,
	Dave Hansen, linux-mm, Paul Mackerras, linux-sparc,
	Claudio Imbrenda, Will Deacon, linux-arch, linux-s390,
	Vasily Gorbik, Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport



On 08/09/2020 at 09:46, Alexander Gordeev wrote:
> On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote:
>> You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems.
> 
> Yes, and also two more sources :/
> 	arch/powerpc/mm/kasan/8xx.c
> 	arch/powerpc/mm/kasan/kasan_init_32.c
> 
> But these two are not quite obvious wrt pgd_addr_end() used
> while traversing pmds. Could you please clarify a bit?
> 
> 
> diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
> index 2784224..89c5053 100644
> --- a/arch/powerpc/mm/kasan/8xx.c
> +++ b/arch/powerpc/mm/kasan/8xx.c
> @@ -15,8 +15,8 @@
>   	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
>   		pte_basic_t *new;
>   
> -		k_next = pgd_addr_end(k_cur, k_end);
> -		k_next = pgd_addr_end(k_next, k_end);
> +		k_next = pmd_addr_end(k_cur, k_end);
> +		k_next = pmd_addr_end(k_next, k_end);

No, I don't think so.
On powerpc32 we have only two levels, so pgd and pmd are more or less
the same.
But pmd_addr_end() as defined in include/asm-generic/pgtable-nopmd.h is
a no-op (it just returns end), so I don't think it will work.

It is likely that this function should iterate on pgd, and then get
pmd = pmd_offset(pud_offset(p4d_offset(pgd)));

>   		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
>   			continue;
>   
> diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c
> index fb29404..3f7d6dc6 100644
> --- a/arch/powerpc/mm/kasan/kasan_init_32.c
> +++ b/arch/powerpc/mm/kasan/kasan_init_32.c
> @@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_
>   	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
>   		pte_t *new;
>   
> -		k_next = pgd_addr_end(k_cur, k_end);
> +		k_next = pmd_addr_end(k_cur, k_end);

Same here I guess: iterate on pgd, then get
pmd = pmd_offset(pud_offset(p4d_offset(pgd)));

>   		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
>   			continue;
>   
> @@ -196,7 +196,7 @@ void __init kasan_early_init(void)
>   	kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
>   
>   	do {
> -		next = pgd_addr_end(addr, end);
> +		next = pmd_addr_end(addr, end);
>   		pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte);
>   	} while (pmd++, addr = next, addr != end);
>   
> 

Christophe

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-08  5:06     ` Christophe Leroy
  (?)
  (?)
@ 2020-09-08 12:09       ` Christian Borntraeger
  -1 siblings, 0 replies; 254+ messages in thread
From: Christian Borntraeger @ 2020-09-08 12:09 UTC (permalink / raw)
  To: Christophe Leroy, Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport



On 08.09.20 07:06, Christophe Leroy wrote:
> 
> 
> On 07/09/2020 at 20:00, Gerald Schaefer wrote:
>> From: Alexander Gordeev <agordeev@linux.ibm.com>
>>
>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
>> dynamic page table folding.
>>
>> The question "What would it require for the generic code to work for s390"
>> has already been discussed here
>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
>> and ended with a promising approach here
>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
>> which in the end unfortunately didn't quite work completely.
>>
>> We tried to mimic static level folding by changing pgd_offset to always
>> calculate top level page table offset, and do nothing in folded pXd_offset.
>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
>> not reflect this dynamic behaviour, and still act like static 5-level
>> page tables.
>>
> 
> [...]
> 
>>
>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
>> additional pXd entry value parameter, that can be used on s390
>> to determine the correct page table level and return corresponding
>> end / boundary. With that, the pointer iteration will always
>> happen in gup_pgd_range for s390. No change for other architectures
>> introduced.
> 
> Not sure pXd_addr_end_folded() is the most understandable name, although I don't have any alternative suggestion at the moment.
> Maybe could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ?
> 
> Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded()

Given that this fixes a data corruption issue, wouldn't it be best to go forward
with this patch ASAP and then handle the other patches on top, taking all the
time that we need?
> 
> 
>>
>> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
>> Cc: <stable@vger.kernel.org> # 5.2+
>> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
>> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
>> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
>> ---
>>   arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++
>>   include/linux/pgtable.h         | 16 +++++++++++++
>>   mm/gup.c                        |  8 +++----
>>   3 files changed, 62 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
>> index 7eb01a5459cd..027206e4959d 100644
>> --- a/arch/s390/include/asm/pgtable.h
>> +++ b/arch/s390/include/asm/pgtable.h
>> @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
>>   }
>>   #define mm_pmd_folded(mm) mm_pmd_folded(mm)
>>   +/*
>> + * With dynamic page table levels on s390, the static pXd_addr_end() functions
>> + * will not return corresponding dynamic boundaries. This is no problem as long
>> + * as only pXd pointers are passed down during page table walk, because
>> + * pXd_offset() will simply return the given pointer for folded levels, and the
>> + * pointer iteration over a range simply happens at the correct page table
>> + * level.
>> + * It is however a problem with gup_fast, or other places walking the page
>> + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
>> + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
>> + * a stack variable, which cannot be used for pointer iteration at the correct
>> + * level. Instead, the iteration then has to happen by going up to pgd level
>> + * again. To allow this, provide pXd_addr_end_folded() functions with an
>> + * additional pXd value parameter, which can be used on s390 to determine the
>> + * folding level and return the corresponding boundary.
>> + */
>> +static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end)
> 
> What does 'rste' stand for?
> 
> Isn't this line a bit long ?

This is the region/segment table entry, according to the architecture.
On our platform the page tables have a different format than the levels
above them (segment table -> 1 MB granularity, region 3rd table -> 2 GB
granularity, region 2nd table -> 4 TB granularity, region 1st table -> 8 PB
granularity). ST, R3, R2 and R1 have the same format and are thus often
called crste (combined region and segment table entry).

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-08 12:09       ` Christian Borntraeger
  0 siblings, 0 replies; 254+ messages in thread
From: Christian Borntraeger @ 2020-09-08 12:09 UTC (permalink / raw)
  To: Christophe Leroy, Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport



On 08.09.20 07:06, Christophe Leroy wrote:
> 
> 
> Le 07/09/2020 à 20:00, Gerald Schaefer a écrit :
>> From: Alexander Gordeev <agordeev@linux.ibm.com>
>>
>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
>> dynamic page table folding.
>>
>> The question "What would it require for the generic code to work for s390"
>> has already been discussed here
>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
>> and ended with a promising approach here
>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
>> which in the end unfortunately didn't quite work completely.
>>
>> We tried to mimic static level folding by changing pgd_offset to always
>> calculate top level page table offset, and do nothing in folded pXd_offset.
>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
>> not reflect this dynamic behaviour, and still act like static 5-level
>> page tables.
>>
> 
> [...]
> 
>>
>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
>> additional pXd entry value parameter, that can be used on s390
>> to determine the correct page table level and return corresponding
>> end / boundary. With that, the pointer iteration will always
>> happen in gup_pgd_range for s390. No change for other architectures
>> introduced.
> 
> Not sure pXd_addr_end_folded() is the best understandable name, allthough I don't have any alternative suggestion at the moment.
> Maybe could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ?
> 
> Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded()

given that this fixes a data corruption issue, wouldnt it be the best to go forward
with this patch ASAP and then handle the other patches on top with all the time that
we need?
> 
> 
>>
>> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
>> Cc: <stable@vger.kernel.org> # 5.2+
>> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
>> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
>> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
>> ---
>>   arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++
>>   include/linux/pgtable.h         | 16 +++++++++++++
>>   mm/gup.c                        |  8 +++----
>>   3 files changed, 62 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
>> index 7eb01a5459cd..027206e4959d 100644
>> --- a/arch/s390/include/asm/pgtable.h
>> +++ b/arch/s390/include/asm/pgtable.h
>> @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
>>   }
>>   #define mm_pmd_folded(mm) mm_pmd_folded(mm)
>>   +/*
>> + * With dynamic page table levels on s390, the static pXd_addr_end() functions
>> + * will not return corresponding dynamic boundaries. This is no problem as long
>> + * as only pXd pointers are passed down during page table walk, because
>> + * pXd_offset() will simply return the given pointer for folded levels, and the
>> + * pointer iteration over a range simply happens at the correct page table
>> + * level.
>> + * It is however a problem with gup_fast, or other places walking the page
>> + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
>> + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
>> + * a stack variable, which cannot be used for pointer iteration at the correct
>> + * level. Instead, the iteration then has to happen by going up to pgd level
>> + * again. To allow this, provide pXd_addr_end_folded() functions with an
>> + * additional pXd value parameter, which can be used on s390 to determine the
>> + * folding level and return the corresponding boundary.
>> + */
>> +static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end)
> 
> What does 'rste' stand for?
> 
> Isn't this line a bit long?

It stands for region/segment table entry, following the architecture's naming.
On our platform the page tables have a different format than the next
levels up (segment table -> 1 MB granularity, region 3rd table -> 2 GB
granularity, region 2nd table -> 4 TB granularity, region 1st table -> 8 PB
granularity). ST, R3, R2 and R1 share the same format and are thus often called
crste (combined region and segment table entry).
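
To make the granularities above concrete, here is a minimal userspace sketch
(not the kernel code) of how an "addr_end" boundary depends on the table level.
The enum, the shift table and level_addr_end() are hypothetical names invented
for illustration; only the 1 MB / 2 GB / 4 TB / 8 PB sizes come from the text
above, and the boundary arithmetic follows the usual pXd_addr_end() pattern.

```c
/* Illustrative levels only; these names are not kernel identifiers. */
enum crste_level { LVL_SEGMENT, LVL_REGION3, LVL_REGION2, LVL_REGION1 };

/* Address bits covered by one entry at each level. */
static const unsigned int level_shift[] = {
	[LVL_SEGMENT] = 20,	/* 1 MB */
	[LVL_REGION3] = 31,	/* 2 GB */
	[LVL_REGION2] = 42,	/* 4 TB */
	[LVL_REGION1] = 53,	/* 8 PB */
};

/*
 * Return the next boundary of the given level after addr, clamped to end.
 * The "- 1" comparison keeps the result correct if boundary wraps to 0.
 */
static unsigned long long level_addr_end(enum crste_level lvl,
					 unsigned long long addr,
					 unsigned long long end)
{
	unsigned long long size = 1ULL << level_shift[lvl];
	unsigned long long boundary = (addr + size) & ~(size - 1);

	return (boundary - 1 < end - 1) ? boundary : end;
}
```

For a range from 0 to 5 GB, the region 1st level (8 PB) yields the end of the
range itself, while the region 3rd level (2 GB) stops at 2 GB. Assuming a
boundary of the wrong level therefore covers the whole range in a single
step instead of stepping entry by entry.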

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-08 12:09       ` Christian Borntraeger
  (?)
  (?)
@ 2020-09-08 12:40         ` Christophe Leroy
  -1 siblings, 0 replies; 254+ messages in thread
From: Christophe Leroy @ 2020-09-08 12:40 UTC (permalink / raw)
  To: Christian Borntraeger, Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport



On 08/09/2020 at 14:09, Christian Borntraeger wrote:
> 
> 
> On 08.09.20 07:06, Christophe Leroy wrote:
>>
>>
>> On 07/09/2020 at 20:00, Gerald Schaefer wrote:
>>> From: Alexander Gordeev <agordeev@linux.ibm.com>
>>>
>>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
>>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
>>> dynamic page table folding.
>>>
>>> The question "What would it require for the generic code to work for s390"
>>> has already been discussed here
>>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
>>> and ended with a promising approach here
>>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
>>> which in the end unfortunately didn't quite work completely.
>>>
>>> We tried to mimic static level folding by changing pgd_offset to always
>>> calculate top level page table offset, and do nothing in folded pXd_offset.
>>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
>>> not reflect this dynamic behaviour, and still act like static 5-level
>>> page tables.
>>>
>>
>> [...]
>>
>>>
>>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
>>> additional pXd entry value parameter, that can be used on s390
>>> to determine the correct page table level and return corresponding
>>> end / boundary. With that, the pointer iteration will always
>>> happen in gup_pgd_range for s390. No change for other architectures
>>> introduced.
>>
>> Not sure pXd_addr_end_folded() is the most understandable name, although I don't have any alternative suggestion at the moment.
>> Maybe it could be something like pXd_addr_end_fixup(), as it will disappear in the next patch, or pXd_addr_end_gup()?
>>
>> Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded()
> 
> Given that this fixes a data corruption issue, wouldn't it be best to go forward
> with this patch ASAP and then handle the other patches on top, taking all the time
> that we need?

I have no strong opinion on this, but it feels rather tricky to have to
change the generic part of GUP to use a new function and then revert that
change in the following patch, just because you want the first patch in
stable and not the second one.

Regardless, I was wondering: why do we need a reference to the pXd at
all when calling pXd_addr_end()?

Couldn't S390 retrieve the pXd by using the pXd_offset() dance with the
passed addr?

Christophe

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-07 18:00   ` Gerald Schaefer
  (?)
  (?)
@ 2020-09-08 13:26     ` Jason Gunthorpe
  -1 siblings, 0 replies; 254+ messages in thread
From: Jason Gunthorpe @ 2020-09-08 13:26 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton,
	Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86,
	linux-arm, linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Mon, Sep 07, 2020 at 08:00:57PM +0200, Gerald Schaefer wrote:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Unlike all other page-table abstractions, the pXd_addr_end() functions do
> not take into account the particular table entry in whose context they
> are called. On architectures with dynamic page-table folding, that
> might lead to a lack of necessary information that is difficult to
> obtain other than from the table entry itself. That already led to
> a subtle memory corruption issue on s390.
> 
> By letting pXd_addr_end() functions know about the page-table entry
> we allow archs not only to make extra checks, but also optimizations.
> 
> As a result of this change, the pXd_addr_end_folded() functions used
> in gup_fast traversal code become unnecessary and get replaced with
> universal pXd_addr_end() variants.
> 
> The arch-specific updates not only add dereferencing of page-table
> entry pointers, but also small changes to the code flow to make those
> dereferences possible, at least for x86 and powerpc. Also for arm64,
> but in a way that should not have any impact.
> 
> So, even though the dereferenced page-table entries are not used on
> archs other than s390, and are optimized out by the compiler, there
> is a small change in kernel size and this is what bloat-o-meter reports:

This looks pretty clean and straightforward; only
__munlock_pagevec_fill() had any real increased complexity.
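
For readers following along, the idea in the quoted commit message (letting
the addr_end helper see the entry value) can be sketched in userspace roughly
as below. This is only an illustration under stated assumptions: the DEMO_*
constants, the 2-bit type field and demo_addr_end() are hypothetical and are
not taken from the patch; the 11 bits per level simply match the
1 MB -> 2 GB -> 4 TB -> 8 PB steps mentioned earlier in the thread.

```c
#include <stdint.h>

/*
 * Hypothetical entry layout, for illustration only: bits 2-3 hold a
 * table-type code (0 = segment, 1 = region 3rd, 2 = region 2nd,
 * 3 = region 1st).
 */
#define DEMO_TYPE_MASK		0x0cULL
#define DEMO_TYPE_SHIFT		2
#define DEMO_SEGMENT_SHIFT	20	/* 1 MB per segment-table entry */
#define DEMO_BITS_PER_LEVEL	11	/* 2048 entries per table */

/*
 * Entry-aware boundary helper: the level, and thus the step size, is
 * decoded from the entry value instead of being a compile-time constant.
 */
static uint64_t demo_addr_end(uint64_t entry, uint64_t addr, uint64_t end)
{
	unsigned int type = (entry & DEMO_TYPE_MASK) >> DEMO_TYPE_SHIFT;
	uint64_t size = 1ULL << (DEMO_SEGMENT_SHIFT + type * DEMO_BITS_PER_LEVEL);
	uint64_t boundary = (addr + size) & ~(size - 1);

	/* The "- 1" comparison stays correct if boundary wraps to 0. */
	return (boundary - 1 < end - 1) ? boundary : end;
}
```

With this shape, a caller that only holds a copied entry value (as in a
lockless walk) can still get the boundary that matches the entry's real
level, without consulting anything other than the value itself.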

Thanks,
Jason

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
@ 2020-09-08 13:26     ` Jason Gunthorpe
  0 siblings, 0 replies; 254+ messages in thread
From: Jason Gunthorpe @ 2020-09-08 13:26 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton,
	Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86,
	linux-arm, linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Mon, Sep 07, 2020 at 08:00:57PM +0200, Gerald Schaefer wrote:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Unlike all other page-table abstractions pXd_addr_end() do not take
> into account a particular table entry in which context the functions
> are called. On architectures with dynamic page-tables folding that
> might lead to lack of necessary information that is difficult to
> obtain other than from the table entry itself. That already led to
> a subtle memory corruption issue on s390.
> 
> By letting pXd_addr_end() functions know about the page-table entry
> we allow archs not only make extra checks, but also optimizations.
> 
> As result of this change the pXd_addr_end_folded() functions used
> in gup_fast traversal code become unnecessary and get replaced with
> universal pXd_addr_end() variants.
> 
> The arch-specific updates not only add dereferencing of page-table
> entry pointers, but also small changes to the code flow to make those
> dereferences possible, at least for x86 and powerpc. Also for arm64,
> but in way that should not have any impact.
> 
> So, even though the dereferenced page-table entries are not used on
> archs other than s390, and are optimized out by the compiler, there
> is a small change in kernel size and this is what bloat-o-meter reports:

This looks pretty clean and straightfoward, only
__munlock_pagevec_fill() had any real increased complexity.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
@ 2020-09-08 13:26     ` Jason Gunthorpe
  0 siblings, 0 replies; 254+ messages in thread
From: Jason Gunthorpe @ 2020-09-08 13:26 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	John Hubbard, Jeff Dike, linux-um, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML,
	Andrew Morton, Linus Torvalds, Mike Rapoport

On Mon, Sep 07, 2020 at 08:00:57PM +0200, Gerald Schaefer wrote:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Unlike all other page-table abstractions pXd_addr_end() do not take
> into account a particular table entry in which context the functions
> are called. On architectures with dynamic page-tables folding that
> might lead to lack of necessary information that is difficult to
> obtain other than from the table entry itself. That already led to
> a subtle memory corruption issue on s390.
> 
> By letting pXd_addr_end() functions know about the page-table entry
> we allow archs not only make extra checks, but also optimizations.
> 
> As result of this change the pXd_addr_end_folded() functions used
> in gup_fast traversal code become unnecessary and get replaced with
> universal pXd_addr_end() variants.
> 
> The arch-specific updates not only add dereferencing of page-table
> entry pointers, but also small changes to the code flow to make those
> dereferences possible, at least for x86 and powerpc. Also for arm64,
> but in way that should not have any impact.
> 
> So, even though the dereferenced page-table entries are not used on
> archs other than s390, and are optimized out by the compiler, there
> is a small change in kernel size and this is what bloat-o-meter reports:

This looks pretty clean and straightforward; only
__munlock_pagevec_fill() had any real increased complexity.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-08 12:40         ` Christophe Leroy
  (?)
  (?)
@ 2020-09-08 13:38           ` Gerald Schaefer
  -1 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-08 13:38 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Christian Borntraeger, Jason Gunthorpe, John Hubbard,
	Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport

On Tue, 8 Sep 2020 14:40:10 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:

> 
> 
> On 08/09/2020 at 14:09, Christian Borntraeger wrote:
> > 
> > 
> > On 08.09.20 07:06, Christophe Leroy wrote:
> >>
> >>
> >> On 07/09/2020 at 20:00, Gerald Schaefer wrote:
> >>> From: Alexander Gordeev <agordeev@linux.ibm.com>
> >>>
> >>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> >>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> >>> dynamic page table folding.
> >>>
> >>> The question "What would it require for the generic code to work for s390"
> >>> has already been discussed here
> >>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> >>> and ended with a promising approach here
> >>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
> >>> which in the end unfortunately didn't quite work completely.
> >>>
> >>> We tried to mimic static level folding by changing pgd_offset to always
> >>> calculate top level page table offset, and do nothing in folded pXd_offset.
> >>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
> >>> not reflect this dynamic behaviour, and still act like static 5-level
> >>> page tables.
> >>>
> >>
> >> [...]
> >>
> >>>
> >>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
> >>> additional pXd entry value parameter, that can be used on s390
> >>> to determine the correct page table level and return corresponding
> >>> end / boundary. With that, the pointer iteration will always
> >>> happen in gup_pgd_range for s390. No change is introduced for other
> >>> architectures.
> >>
> >> Not sure pXd_addr_end_folded() is the most understandable name, although I don't have any alternative suggestion at the moment.
> >> Maybe it could be something like pXd_addr_end_fixup(), as it will disappear in the next patch, or pXd_addr_end_gup()?
> >>
> >> Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded()
> > 
> > Given that this fixes a data corruption issue, wouldn't it be best to go forward
> > with this patch ASAP and then handle the other patches on top, with all the time that
> > we need?
> 
> I have no strong opinion on this, but it feels rather tricky to have to
> change the generic part of GUP to use a new function and then revert that change
> in the following patch, just because you want the first patch in stable
> and not the second one.
> 
> Regardless, I was wondering, why do we need a reference to the pXd at 
> all when calling pXd_addr_end() ?
> 
> Couldn't S390 retrieve the pXd by using the pXd_offset() dance with the 
> passed addr ?

Apart from the performance impact of redoing what has already been
done by the caller, I think we would also break the READ_ONCE semantics.
After all, pXd_offset() would also require some pXd pointer as input,
which we don't have, so we would need to start over again from mm->pgd.

Also, it seems to be more in line with other primitives that take
a pXd value or pointer.
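
To illustrate the point, here is a minimal userspace sketch (a toy model,
not kernel code). It assumes a made-up encoding where the low two bits of
an entry record the table level; the shifts mirror the s390 segment and
region table sizes, but every name below is hypothetical and chosen for
this example only. The entry-aware helper works on the value the caller
already read once and never looks at the table again.

```c
#include <stdint.h>

/* Hypothetical encoding for this sketch: low two bits of an entry = level. */
#define LEVEL_MASK 0x3ULL

/* level 0 -> shift 20 (1 MB), 1 -> 31 (2 GB), 2 -> 42 (4 TB), 3 -> 53 (8 PB) */
static inline unsigned int shift_for_level(unsigned int level)
{
	return 20 + 11 * level;
}

/* Same boundary arithmetic as the generic pXd_addr_end() macros. */
static inline uint64_t boundary(uint64_t addr, uint64_t end, unsigned int shift)
{
	uint64_t size = 1ULL << shift;
	uint64_t next = (addr + size) & ~(size - 1);

	/* The "- 1" keeps the comparison correct if next or end wrapped to 0. */
	return (next - 1 < end - 1) ? next : end;
}

/* Static variant: always assumes the top level is the largest table type. */
static inline uint64_t pgd_addr_end_static(uint64_t addr, uint64_t end)
{
	return boundary(addr, end, shift_for_level(3));
}

/*
 * Entry-aware variant: the level is taken from the entry value that the
 * caller has already read, so the table is not dereferenced a second time.
 */
static inline uint64_t pgd_addr_end_entry(uint64_t entry, uint64_t addr,
					  uint64_t end)
{
	return boundary(addr, end, shift_for_level(entry & LEVEL_MASK));
}
```

With addr = 0x7ff00000, end = 0x80100000 and a top-level entry marked as
level 1, pgd_addr_end_static() returns end, so the walk would treat the
whole range as covered by one entry, while pgd_addr_end_entry() stops at
the 2 GB boundary 0x80000000.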

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-08  8:16         ` Christophe Leroy
  (?)
  (?)
@ 2020-09-08 14:15           ` Alexander Gordeev
  -1 siblings, 0 replies; 254+ messages in thread
From: Alexander Gordeev @ 2020-09-08 14:15 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Michael Ellerman, Gerald Schaefer, Jason Gunthorpe, John Hubbard,
	Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Claudio Imbrenda, Will Deacon, linux-arch,
	linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86,
	Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport

On Tue, Sep 08, 2020 at 10:16:49AM +0200, Christophe Leroy wrote:
> >Yes, and also two more sources :/
> >	arch/powerpc/mm/kasan/8xx.c
> >	arch/powerpc/mm/kasan/kasan_init_32.c
> >
> >But these two are not quite obvious wrt pgd_addr_end() used
> >while traversing pmds. Could you please clarify a bit?
> >
> >
> >diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
> >index 2784224..89c5053 100644
> >--- a/arch/powerpc/mm/kasan/8xx.c
> >+++ b/arch/powerpc/mm/kasan/8xx.c
> >@@ -15,8 +15,8 @@
> >  	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
> >  		pte_basic_t *new;
> >-		k_next = pgd_addr_end(k_cur, k_end);
> >-		k_next = pgd_addr_end(k_next, k_end);
> >+		k_next = pmd_addr_end(k_cur, k_end);
> >+		k_next = pmd_addr_end(k_next, k_end);
> 
> No, I don't think so.
> On powerpc32 we have only two levels, so pgd and pmd are more or
> less the same.
> But pmd_addr_end() as defined in include/asm-generic/pgtable-nopmd.h
> is a no-op, so I don't think it will work.
> 
> It is likely that this function should iterate on pgd, then you get
> pmd = pmd_offset(pud_offset(p4d_offset(pgd)));

It looks like the code iterates over a single pmd table, using
pgd_addr_end() only to skip all the middle levels and bail out
from the loop.

I would be wary of switching from pmds to pgds, since we are
trying to minimize impact (especially functional impact) and the
rework does not seem that obvious.

Assuming pmd and pgd are the same, would such an approach actually
work for now?

diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
index 2784224..94466cc 100644
--- a/arch/powerpc/mm/kasan/8xx.c
+++ b/arch/powerpc/mm/kasan/8xx.c
@@ -15,8 +15,8 @@
 	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
 		pte_basic_t *new;
 
-		k_next = pgd_addr_end(k_cur, k_end);
-		k_next = pgd_addr_end(k_next, k_end);
+		k_next = pgd_addr_end(__pgd(pmd_val(*pmd)), k_cur, k_end);
+		k_next = pgd_addr_end(__pgd(pmd_val(*(pmd + 1))), k_next, k_end);
 		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
 			continue;
 
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c
index fb29404..c0bcd64 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_
 	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
 		pte_t *new;
 
-		k_next = pgd_addr_end(k_cur, k_end);
+		k_next = pgd_addr_end(__pgd(pmd_val(*pmd)), k_cur, k_end);
 		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
 			continue;
 
@@ -196,7 +196,7 @@ void __init kasan_early_init(void)
 	kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(__pgd(pmd_val(*pmd)), addr, end);
 		pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte);
 	} while (pmd++, addr = next, addr != end);
 

Alternatively, we could pass an invalid pgd to keep the code structure
intact, but that is of course less nice.
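
For reference, a minimal userspace sketch of the loop shape discussed
above. It assumes the architecture uses an entry-ignoring pgd_addr_end()
and that one top-level entry maps 4 MB (PGDIR_SHIFT of 22); both are
assumptions of this sketch, and the types and walk_8m_blocks() are
simplified stand-ins, not the real powerpc definitions.

```c
/* Assumption: one top-level entry covers 4 MB. */
#define PGDIR_SHIFT	22
#define PGDIR_SIZE	(1UL << PGDIR_SHIFT)
#define PGDIR_MASK	(~(PGDIR_SIZE - 1))

/* Simplified stand-ins for the kernel's page-table entry types. */
typedef struct { unsigned long pgd; } pgd_t;
typedef struct { unsigned long pmd; } pmd_t;
#define __pgd(x)	((pgd_t){ (x) })
#define pmd_val(x)	((x).pmd)

/* Entry-aware signature, but the entry is ignored (no dynamic folding). */
static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr,
					 unsigned long end)
{
	unsigned long boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;

	(void)pgd;
	return (boundary - 1 < end - 1) ? boundary : end;
}

/*
 * Hypothetical helper: walks [k_start, k_end) in 8 MB steps by calling
 * pgd_addr_end() twice per iteration, and returns the number of steps.
 * The range is expected to cover an even number of pmd entries.
 */
static unsigned int walk_8m_blocks(pmd_t *pmd, unsigned long k_start,
				   unsigned long k_end)
{
	unsigned long k_cur, k_next;
	unsigned int blocks = 0;

	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2) {
		k_next = pgd_addr_end(__pgd(pmd_val(*pmd)), k_cur, k_end);
		k_next = pgd_addr_end(__pgd(pmd_val(*(pmd + 1))), k_next, k_end);
		blocks++;
	}
	return blocks;
}
```

Because the entry is unused in this variant, passing __pgd(0) instead of
the converted pmd value yields the same boundaries, which is why either
option would keep the current behaviour on such an architecture.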

Thanks!

> Christophe

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-08  5:14     ` Christophe Leroy
  (?)
  (?)
@ 2020-09-08 14:25       ` Alexander Gordeev
  -1 siblings, 0 replies; 254+ messages in thread
From: Alexander Gordeev @ 2020-09-08 14:25 UTC (permalink / raw)
  To: Christophe Leroy, Michael Ellerman
  Cc: Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra,
	Dave Hansen, linux-mm, Paul Mackerras, linux-sparc,
	Claudio Imbrenda, Will Deacon, linux-arch, linux-s390,
	Vasily Gorbik, Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport

On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote:
[...]
> You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems.

Would this one be okay?

diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c
index 60c6ea16..3690d22 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -88,6 +88,7 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned long addr,
 static void subpage_prot_clear(unsigned long addr, unsigned long len)
 {
 	struct mm_struct *mm = current->mm;
+	pmd_t *pmd = pmd_off(mm, addr);
 	struct subpage_prot_table *spt;
 	u32 **spm, *spp;
 	unsigned long i;
@@ -103,8 +104,8 @@ static void subpage_prot_clear(unsigned long addr, unsigned long len)
 	limit = addr + len;
 	if (limit > spt->maxaddr)
 		limit = spt->maxaddr;
-	for (; addr < limit; addr = next) {
-		next = pmd_addr_end(addr, limit);
+	for (; addr < limit; addr = next, pmd++) {
+		next = pmd_addr_end(*pmd, addr, limit);
 		if (addr < 0x100000000UL) {
 			spm = spt->low_prot;
 		} else {
@@ -191,6 +192,7 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
 		unsigned long, len, u32 __user *, map)
 {
 	struct mm_struct *mm = current->mm;
+	pmd_t *pmd = pmd_off(mm, addr);
 	struct subpage_prot_table *spt;
 	u32 **spm, *spp;
 	unsigned long i;
@@ -236,8 +238,8 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
 	}
 
 	subpage_mark_vma_nohuge(mm, addr, len);
-	for (limit = addr + len; addr < limit; addr = next) {
-		next = pmd_addr_end(addr, limit);
+	for (limit = addr + len; addr < limit; addr = next, pmd++) {
+		next = pmd_addr_end(*pmd, addr, limit);
 		err = -ENOMEM;
 		if (addr < 0x100000000UL) {
 			spm = spt->low_prot;

Thanks!

> Christophe

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-07 18:00   ` Gerald Schaefer
  (?)
@ 2020-09-08 14:30     ` Dave Hansen
  -1 siblings, 0 replies; 254+ messages in thread
From: Dave Hansen @ 2020-09-08 14:30 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> dynamic page table folding.

Would it be fair to say that the "fake" page table entries s390
allocates on the stack are what's causing the trouble here?  That might
be a nice thing to open up with here.  "Dynamic page table folding"
really means nothing to me.

> @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  	do {
>  		pmd_t pmd = READ_ONCE(*pmdp);
>  
> -		next = pmd_addr_end(addr, end);
> +		next = pmd_addr_end_folded(pmd, addr, end);
>  		if (!pmd_present(pmd))
>  			return 0;

It looks like you fix this up later, but this would be a problem if left
this way.  There's no documentation for whether I use
pmd_addr_end_folded() or pmd_addr_end() when writing a page table walker.
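
For reference, a minimal user-space sketch of the walker pattern in the hunk
above, written against an entry-aware pmd_addr_end(). The pmd_t stand-in,
the PMD_SHIFT value and walk_pmd_range() are assumptions made for
illustration; the helper body is the generic boundary computation and
ignores the entry, whereas an architecture with dynamic folding would
inspect it:

```c
#include <assert.h>
#include <stddef.h>

typedef struct { unsigned long val; } pmd_t;	/* hypothetical stand-in */

#define PMD_SHIFT	21			/* assumed: 2 MiB per entry */
#define PMD_SIZE	(1UL << PMD_SHIFT)
#define PMD_MASK	(~(PMD_SIZE - 1))

/* Next pmd boundary after addr, clamped to end. */
static inline unsigned long pmd_addr_end(pmd_t pmd, unsigned long addr,
					 unsigned long end)
{
	unsigned long boundary = (addr + PMD_SIZE) & PMD_MASK;

	(void)pmd;	/* unused by the generic variant */
	/* The "- 1" on both sides keeps the compare valid when end wraps to 0. */
	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Count the steps a walker takes over a non-empty range [addr, end). */
static inline size_t walk_pmd_range(const pmd_t *pmdp, unsigned long addr,
				    unsigned long end)
{
	unsigned long next;
	size_t steps = 0;

	assert(addr != end);
	do {
		pmd_t pmd = *pmdp;	/* the kernel uses READ_ONCE() here */

		next = pmd_addr_end(pmd, addr, end);
		steps++;
	} while (pmdp++, addr = next, addr != end);

	return steps;
}
```

The point of the sketch is that the value read once at the top of the loop
is the same value handed to the helper, so the boundary decision and the
later checks on the entry cannot disagree.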


^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-07 18:00   ` Gerald Schaefer
  (?)
@ 2020-09-08 14:33     ` Dave Hansen
  -1 siblings, 0 replies; 254+ messages in thread
From: Dave Hansen @ 2020-09-08 14:33 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> x86:
> add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10)
> Function                                     old     new   delta
> vmemmap_populate                             587     592      +5
> munlock_vma_pages_range                      556     561      +5
> Total: Before=15534694, After=15534704, chg +0.00%
...
>  arch/x86/mm/init_64.c                    | 15 ++++-----
>  arch/x86/mm/kasan_init_64.c              | 16 +++++-----

I didn't do a super thorough review on this, but it generally looks OK
and the benefits of sharing more code between arches certainly outweigh
a few bytes of binary growth.  For the x86 bits at least, feel free to
add my ack.

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions
  2020-09-08  5:19     ` Christophe Leroy
  (?)
  (?)
@ 2020-09-08 15:48       ` Alexander Gordeev
  -1 siblings, 0 replies; 254+ messages in thread
From: Alexander Gordeev @ 2020-09-08 15:48 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra,
	Dave Hansen, linux-mm, Paul Mackerras, linux-sparc,
	Claudio Imbrenda, Will Deacon, linux-arch, linux-s390,
	Vasily Gorbik, Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport

On Tue, Sep 08, 2020 at 07:19:38AM +0200, Christophe Leroy wrote:

[...]

> >diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >index 67ebc22cf83d..d9e7d16c2263 100644
> >--- a/include/linux/pgtable.h
> >+++ b/include/linux/pgtable.h
> >@@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> >   */
> >  #ifndef pgd_addr_end
> >-#define pgd_addr_end(pgd, addr, end)					\
> >-({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
> >-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> >-})
> >+#define pgd_addr_end pgd_addr_end
> 
> I think that #define is pointless, usually there is no such #define
> for the default case.

Default pgd_addr_end() gets overridden on s390 (arch/s390/include/asm/pgtable.h):

#define pgd_addr_end pgd_addr_end
static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
{
	return rste_addr_end_folded(pgd_val(pgd), addr, end);
}

> >+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
> >+{	unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
> >+	return (__boundary - 1 < end - 1) ? __boundary : end;
> >+}
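
The override mechanism under discussion can be sketched in user space as
follows. Everything here is a stand-in for illustration: the types, the
shift values and in particular DEMO_ENTRY_FOLDED are made up and do not
reflect the s390 entry format or rste_addr_end_folded(). One translation
unit plays both the "arch header" (which overrides pgd_addr_end() only) and
the "generic header" (which supplies whatever was not overridden):

```c
typedef struct { unsigned long val; } pgd_t;	/* hypothetical stand-ins */
typedef struct { unsigned long val; } pud_t;

#define PGDIR_SHIFT	30			/* assumed values */
#define PGDIR_SIZE	(1UL << PGDIR_SHIFT)
#define PGDIR_MASK	(~(PGDIR_SIZE - 1))
#define PUD_SHIFT	24
#define PUD_SIZE	(1UL << PUD_SHIFT)
#define PUD_MASK	(~(PUD_SIZE - 1))

/* ---- "arch header": entry-aware override of pgd_addr_end() ---- */
#define DEMO_ENTRY_FOLDED	0x1UL	/* made-up flag, not a real entry bit */

#define pgd_addr_end pgd_addr_end	/* this define is what the guard tests */
static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr,
					 unsigned long end)
{
	unsigned long boundary;

	if (pgd.val & DEMO_ENTRY_FOLDED)
		return end;	/* folded level: the whole range is one step */
	boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
	return (boundary - 1 < end - 1) ? boundary : end;
}

/* ---- "generic header": fallbacks for anything not overridden ---- */
#ifndef pgd_addr_end	/* skipped here; compiling it would be a redefinition */
static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr,
					 unsigned long end)
{
	unsigned long boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;

	(void)pgd;
	return (boundary - 1 < end - 1) ? boundary : end;
}
#endif

#ifndef pud_addr_end
#define pud_addr_end pud_addr_end	/* mirrors the patch; harmless here */
static inline unsigned long pud_addr_end(pud_t pud, unsigned long addr,
					 unsigned long end)
{
	unsigned long boundary = (addr + PUD_SIZE) & PUD_MASK;

	(void)pud;
	return (boundary - 1 < end - 1) ? boundary : end;
}
#endif
```

Because the helpers are now functions rather than function-like macros, the
#ifndef guard can only see them through an object-like macro of the same
name, so the arch-side define is the part the guard depends on. Whether the
generic fallback should repeat that define is the style question raised in
the review.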

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions
@ 2020-09-08 15:48       ` Alexander Gordeev
  0 siblings, 0 replies; 254+ messages in thread
From: Alexander Gordeev @ 2020-09-08 15:48 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra,
	Dave Hansen, linux-mm, Paul Mackerras, linux-sparc,
	Claudio Imbrenda, Will Deacon, linux-arch, linux-s390,
	Vasily Gorbik, Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport

On Tue, Sep 08, 2020 at 07:19:38AM +0200, Christophe Leroy wrote:

[...]

> >diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >index 67ebc22cf83d..d9e7d16c2263 100644
> >--- a/include/linux/pgtable.h
> >+++ b/include/linux/pgtable.h
> >@@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> >   */
> >  #ifndef pgd_addr_end
> >-#define pgd_addr_end(pgd, addr, end)					\
> >-({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
> >-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> >-})
> >+#define pgd_addr_end pgd_addr_end
> 
> I think that #define is pointless, usually there is no such #define
> for the default case.

Default pgd_addr_end() gets overriden on s390 (arch/s390/include/asm/pgtable.h):

#define pgd_addr_end pgd_addr_end
static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
{
	return rste_addr_end_folded(pgd_val(pgd), addr, end);
}

> >+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
> >+{	unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
> >+	return (__boundary - 1 < end - 1) ? __boundary : end;
> >+}

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions
@ 2020-09-08 15:48       ` Alexander Gordeev
  0 siblings, 0 replies; 254+ messages in thread
From: Alexander Gordeev @ 2020-09-08 15:48 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Christian Borntraeger,
	Richard Weinberger, linux-x86, Russell King, Jason Gunthorpe,
	Ingo Molnar, Andrey Ryabinin, Gerald Schaefer, Jeff Dike,
	Arnd Bergmann, John Hubbard, Heiko Carstens, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Linus Torvalds, LKML, Andrew Morton, linux-power, Mike Rapoport

On Tue, Sep 08, 2020 at 07:19:38AM +0200, Christophe Leroy wrote:

[...]

> >diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >index 67ebc22cf83d..d9e7d16c2263 100644
> >--- a/include/linux/pgtable.h
> >+++ b/include/linux/pgtable.h
> >@@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> >   */
> >  #ifndef pgd_addr_end
> >-#define pgd_addr_end(pgd, addr, end)					\
> >-({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
> >-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> >-})
> >+#define pgd_addr_end pgd_addr_end
> 
> I think that #define is pointless, usually there is no such #define
> for the default case.

Default pgd_addr_end() gets overriden on s390 (arch/s390/include/asm/pgtable.h):

#define pgd_addr_end pgd_addr_end
static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
{
	return rste_addr_end_folded(pgd_val(pgd), addr, end);
}

> >+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
> >+{	unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
> >+	return (__boundary - 1 < end - 1) ? __boundary : end;
> >+}

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions
@ 2020-09-08 15:48       ` Alexander Gordeev
  0 siblings, 0 replies; 254+ messages in thread
From: Alexander Gordeev @ 2020-09-08 15:48 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Christian Borntraeger,
	Richard Weinberger, linux-x86, Russell King, Jason Gunthorpe,
	Ingo Molnar, Andrey Ryabinin, Gerald Schaefer, Jeff Dike,
	Arnd Bergmann, John Hubbard, Heiko Carstens, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Linus Torvalds, LKML, Andrew Morton, linux-power, Mike Rapoport

On Tue, Sep 08, 2020 at 07:19:38AM +0200, Christophe Leroy wrote:

[...]

> >diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >index 67ebc22cf83d..d9e7d16c2263 100644
> >--- a/include/linux/pgtable.h
> >+++ b/include/linux/pgtable.h
> >@@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> >   */
> >  #ifndef pgd_addr_end
> >-#define pgd_addr_end(pgd, addr, end)					\
> >-({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
> >-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> >-})
> >+#define pgd_addr_end pgd_addr_end
> 
> I think that #define is pointless, usually there is no such #define
> for the default case.

Default pgd_addr_end() gets overriden on s390 (arch/s390/include/asm/pgtable.h):

#define pgd_addr_end pgd_addr_end
static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
{
	return rste_addr_end_folded(pgd_val(pgd), addr, end);
}

> >+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
> >+{	unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
> >+	return (__boundary - 1 < end - 1) ? __boundary : end;
> >+}
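
For readers wondering about the "- 1" on both sides of the comparison in
the helper quoted above: it guards against the rounded-up boundary
wrapping to 0 at the very top of the address space. A minimal sketch of
that arithmetic follows. The names demo_addr_end(), demo_addr_end_naive()
and the "size" parameter (standing in for PGDIR_SIZE) are made up for
illustration and are not part of the patch.

```c
#include <limits.h>

/*
 * Hypothetical stand-in for the generic helper. "size" plays the role
 * of PGDIR_SIZE and must be a power of two.
 */
static inline unsigned long demo_addr_end(unsigned long addr,
					  unsigned long end,
					  unsigned long size)
{
	unsigned long boundary = (addr + size) & ~(size - 1);

	/*
	 * If boundary wrapped to 0, boundary - 1 becomes ULONG_MAX,
	 * so the comparison fails and end is returned.
	 */
	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Naive variant without the "- 1": a wrapped boundary of 0 wins. */
static inline unsigned long demo_addr_end_naive(unsigned long addr,
						unsigned long end,
						unsigned long size)
{
	unsigned long boundary = (addr + size) & ~(size - 1);

	return (boundary < end) ? boundary : end;
}
```

With a range that lies in the last page-table entry of the address space,
the naive variant would hand back 0 as the next address, while the
"- 1" form falls back to end.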

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions
  2020-09-08 15:48       ` Alexander Gordeev
  (?)
  (?)
@ 2020-09-08 17:20         ` Christophe Leroy
  -1 siblings, 0 replies; 254+ messages in thread
From: Christophe Leroy @ 2020-09-08 17:20 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra,
	Dave Hansen, linux-mm, Paul Mackerras, linux-sparc,
	Claudio Imbrenda, Will Deacon, linux-arch, linux-s390,
	Vasily Gorbik, Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport



On 08/09/2020 at 17:48, Alexander Gordeev wrote:
> On Tue, Sep 08, 2020 at 07:19:38AM +0200, Christophe Leroy wrote:
> 
> [...]
> 
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index 67ebc22cf83d..d9e7d16c2263 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>>>    */
>>>   #ifndef pgd_addr_end
>>> -#define pgd_addr_end(pgd, addr, end)					\
>>> -({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
>>> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
>>> -})
>>> +#define pgd_addr_end pgd_addr_end
>>
>> I think that #define is pointless, usually there is no such #define
>> for the default case.
> 
> Default pgd_addr_end() gets overridden on s390 (arch/s390/include/asm/pgtable.h):
> 
> #define pgd_addr_end pgd_addr_end
> static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
> {
> 	return rste_addr_end_folded(pgd_val(pgd), addr, end);
> }

Yes, on s390 the #define is needed so that the #ifndef pgd_addr_end in
include/linux/pgtable.h skips the generic definition.

But in include/linux/pgtable.h itself, I don't think a #define
pgd_addr_end pgd_addr_end is needed.
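
The override mechanism discussed here can be sketched in a single file.
This is only an illustration: demo_addr_end() and the DEMO_* sizes are
invented names and values, and the two halves stand in for an arch header
and the generic header respectively.

```c
/* Illustrative sizes only, not taken from any real architecture. */
#define DEMO_ARCH_SIZE		(1UL << 20)
#define DEMO_GENERIC_SIZE	(1UL << 30)

/*
 * "Arch header" part: provide an override and announce it by defining
 * a macro with the same name as the function.
 */
#define demo_addr_end demo_addr_end
static inline unsigned long demo_addr_end(unsigned long addr,
					  unsigned long end)
{
	unsigned long boundary = (addr + DEMO_ARCH_SIZE) & ~(DEMO_ARCH_SIZE - 1);

	return (boundary - 1 < end - 1) ? boundary : end;
}

/*
 * "Generic header" part: the fallback is compiled only if no arch
 * override announced itself. No self-referencing #define is needed
 * here unless some later code also tests for the macro.
 */
#ifndef demo_addr_end
static inline unsigned long demo_addr_end(unsigned long addr,
					  unsigned long end)
{
	unsigned long boundary = (addr + DEMO_GENERIC_SIZE) & ~(DEMO_GENERIC_SIZE - 1);

	return (boundary - 1 < end - 1) ? boundary : end;
}
#endif
```

Because the macro is defined before the #ifndef is reached, only the
"arch" version is compiled, so a call such as demo_addr_end(0, 1UL << 31)
steps by the arch size rather than the generic one.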

> 
>>> +static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
>>> +{	unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
>>> +	return (__boundary - 1 < end - 1) ? __boundary : end;
>>> +}


Christophe

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-08  5:22     ` Christophe Leroy
  (?)
  (?)
@ 2020-09-08 17:36       ` Gerald Schaefer
  -1 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-08 17:36 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Mike Rapoport, Peter Zijlstra, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Christian Borntraeger, Richard Weinberger, linux-x86,
	Russell King, Jason Gunthorpe, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, John Hubbard,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds

On Tue, 8 Sep 2020 07:22:39 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:

> 
> 
> On 07/09/2020 at 22:12, Mike Rapoport wrote:
> > On Mon, Sep 07, 2020 at 08:00:55PM +0200, Gerald Schaefer wrote:
> >> This is v2 of an RFC previously discussed here:
> >> https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/
> >>
> >> Patch 1 is a fix for a regression in gup_fast on s390, after our conversion
> >> to common gup_fast code. It will introduce special helper functions
> >> pXd_addr_end_folded(), which have to be used in places where pagetable walk
> >> is done w/o lock and with READ_ONCE, so currently only in gup_fast.
> >>
> >> Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end()
> >> themselves by adding an extra pXd value parameter. That was suggested by
> >> Jason during v1 discussion, because he is already thinking of some other
> >> places where he might want to switch to the READ_ONCE logic for pagetable
> >> walks. In general, that would be the cleanest / safest solution, but there
> >> is some impact on other architectures and common code, hence the new and
> >> greatly enlarged recipient list.
> >>
> >> Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline
> >> functions instead of #defines, so that we get some type checking for the
> >> new pXd value parameter.
> >>
> >> Not sure about Fixes/stable tags for the generic solution. Only patch 1
> >> fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might
> >> still be nice to have in stable, to ease future backports, but I guess
> >> "nice to have" does not really qualify for stable backports.
> > 
> > I also think that adding pXd parameter to pXd_addr_end() is a cleaner
> > way and with this patch 1 is not really required. I would even merge
> > patches 2 and 3 into a single patch and use only it as the fix.
> 
> Merging patches 2 and 3 is fine, but I would keep patch 1 separate and
> put it after the generic changes, so that we first do the generic
> changes and then the s390-specific use of them.

Yes, we thought about that approach too. It would at least allow us to
get everything into stable, more or less nicely, as a prerequisite for
the s390 fix.

Two concerns kept us from going that way. For one, it might not be
the nicest way to get it all into stable, and we would not want to risk
further objections, given the imminent and rather scary data corruption
issue that we want to fix asap.

For the same reason, we thought that the generalization part might
need more time and agreement from various people, so that we could at
least get the first patch in as a short-term solution.

It seems now that the generalization is very well accepted so far,
apart from some apparent issues on arm. Also, merging 2 + 3 and
putting them first seems to be acceptable, so we could do that for
v3, if there are no objections.

Of course, we first need to address the few remaining issues for
arm(32?), which do look quite confusing to me so far. BTW, sorry for
the compile error with patch 3, I guess we did the cross-compile only
for 1 + 2 applied, to see the bloat-o-meter changes. But I guess
patch 3 already proved its usefulness by that :-)

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-08 14:30     ` Dave Hansen
  (?)
  (?)
@ 2020-09-08 17:59       ` Gerald Schaefer
  -1 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-08 17:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport,
	Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On Tue, 8 Sep 2020 07:30:50 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> > code") introduced a subtle but severe bug on s390 with gup_fast, due to
> > dynamic page table folding.
> 
> Would it be fair to say that the "fake" page table entries s390
> allocates on the stack are what's causing the trouble here?  That might
> be a nice thing to open up with here.  "Dynamic page table folding"
> really means nothing to me.

We do not really allocate anything on the stack. It is the generic logic
in gup_fast that passes on pXd values (read once before) and uses
pointers to such (stack) variables instead of real pXd pointers.
That is combined with the fact that we just return the passed-in pointer
from pXd_offset() for folded levels.

That works similarly on x86 IIUC, but with static folding, and thus also
with proper pXd_addr_end() results because of the statically (and
correspondingly) defined PxD_INDEX/SHIFT. On s390 we always have static
5-level PxD_INDEX/SHIFT, and that cannot really be made dynamic, so we
just make pXd_addr_end() dynamic instead, and that requires the pXd value
to determine the correct pagetable level.

Still makes my head spin when trying to explain, sorry. It is a very
special s390 oddity, or let's call it a "feature", because I don't think
any other architecture has a "dynamic pagetable folding" capability that
depends on process requirements, for whatever it is worth...
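
The difference between a static and a value-dependent pXd_addr_end() can
be sketched roughly as below. This is not the s390 implementation: the
entry encoding (level in the low two bits), the sizes, and all demo_*
names are assumptions made up for this illustration only.

```c
#include <stdint.h>

/* Hypothetical encoding: the low two bits of an entry name its level. */
#define DEMO_LEVEL_MASK	UINT64_C(0x3)
#define DEMO_LEVEL_PMD	UINT64_C(0)
#define DEMO_LEVEL_PUD	UINT64_C(1)
#define DEMO_LEVEL_P4D	UINT64_C(2)
#define DEMO_LEVEL_PGD	UINT64_C(3)

/* Illustrative sizes: 1 MiB per lowest-level entry, x2048 per level up. */
static inline uint64_t demo_level_size(uint64_t entry)
{
	return UINT64_C(1) << (20 + 11 * (entry & DEMO_LEVEL_MASK));
}

/* Static variant: always assumes the top-level (5-level layout) size. */
static inline uint64_t demo_addr_end_static(uint64_t addr, uint64_t end)
{
	uint64_t size = UINT64_C(1) << 53;
	uint64_t boundary = (addr + size) & ~(size - 1);

	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Dynamic variant: derives the step size from the entry value itself. */
static inline uint64_t demo_addr_end_dynamic(uint64_t entry, uint64_t addr,
					     uint64_t end)
{
	uint64_t size = demo_level_size(entry);
	uint64_t boundary = (addr + size) & ~(size - 1);

	return (boundary - 1 < end - 1) ? boundary : end;
}
```

In this sketch, if the top-level table of a process really holds
PUD-level entries (upper levels folded at runtime), the static variant
treats a 4 GiB range as one step, while the dynamic variant stops at the
2 GiB boundary of the entry it was handed.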

> 
> > @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> >  	do {
> >  		pmd_t pmd = READ_ONCE(*pmdp);
> >  
> > -		next = pmd_addr_end(addr, end);
> > +		next = pmd_addr_end_folded(pmd, addr, end);
> >  		if (!pmd_present(pmd))
> >  			return 0;
> 
> It looks like you fix this up later, but this would be a problem if left
> this way.  There's no documentation for whether I use
> pmd_addr_end_folded() or pmd_addr_end() when writing a page table walker.

Yes, that is very unfortunate. We did have a lengthy comment in
include/linux/pgtable.h where the pXd_addr_end(_folded) helpers were
defined. But that was moved to arch/s390/include/asm/pgtable.h in this
version, probably because we already had the generalization in mind,
where we would not need such an explanation in the common header any more.

So, looking at that comment might help to better understand the issue
that we have with dynamic page table folding and READ_ONCE-style
pagetable walkers.

Thanks for pointing that out, that comment should definitely go into
include/linux/pgtable.h again. At least if we still went for
the "s390 fix first, generalization second" approach, but it
seems we have other / better options now.

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-08 17:59       ` Gerald Schaefer
  0 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-08 17:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport,
	Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On Tue, 8 Sep 2020 07:30:50 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> > code") introduced a subtle but severe bug on s390 with gup_fast, due to
> > dynamic page table folding.
> 
> Would it be fair to say that the "fake" page table entries s390
> allocates on the stack are what's causing the trouble here?  That might
> be a nice thing to open up with here.  "Dynamic page table folding"
> really means nothing to me.

We do not really allocate anything on the stack, it is the generic logic
from gup_fast that passes over pXd values (read once before), and using
pointers to such (stack) variables instead of real pXd pointers.
That, combined with the fact that we just return the passed in pointer in
pXd_offset() for folded levels.

That works similar on x86 IIUC, but with static folding, and thus also
proper pXd_addr_end() results because of statically (and correspondingly)
defined Pxd_INDEX/SHIFT. We always have static 5-level PxD_INDEX/SHIFT, and
that cannot really be made dynamic, so we just make pXd_addr_end()
dynamic instead, and that requires the pXd value to determine the correct
pagetable level.

Still makes my head spin when trying to explain, sorry. It is a very
special s390 oddity, or lets call it "feature", because I don't think any
other architecture has "dynamic pagetable folding" capability, depending
on process requirements, for whatever it is worth...

> 
> > @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> >  	do {
> >  		pmd_t pmd = READ_ONCE(*pmdp);
> >  
> > -		next = pmd_addr_end(addr, end);
> > +		next = pmd_addr_end_folded(pmd, addr, end);
> >  		if (!pmd_present(pmd))
> >  			return 0;
> 
> It looks like you fix this up later, but this would be a problem if left
> this way.  There's no documentation for whether I use
> pmd_addr_end_folded() or pmd_addr_end() when writing a page table walker.

Yes, that is very unfortunate. We did have a lengthy comment in
include/linux/pgtable.h where the pXd_addr_end(_folded) helpers were defined.
But that was moved to arch/s390/include/asm/pgtable.h in this version,
probably because we already had the generalization in mind, where we
would not need such an explanation in the common header any more.

So, looking at that comment might help to better understand the issue
that we have with dynamic page table folding and READ_ONCE-style
pagetable walkers.

Thanks for pointing that out, that comment should definitely go into
include/linux/pgtable.h again. At least if we would still go for
that "s390 fix first, generalization second" approach, but it
seems we have other / better options now.

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-08 14:15           ` Alexander Gordeev
  (?)
  (?)
@ 2020-09-09  8:38             ` Christophe Leroy
  -1 siblings, 0 replies; 254+ messages in thread
From: Christophe Leroy @ 2020-09-09  8:38 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Michael Ellerman, Gerald Schaefer, Jason Gunthorpe, John Hubbard,
	Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Claudio Imbrenda, Will Deacon, linux-arch,
	linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86,
	Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport

On Tue, 2020-09-08 at 16:15 +0200, Alexander Gordeev wrote:
> On Tue, Sep 08, 2020 at 10:16:49AM +0200, Christophe Leroy wrote:
> > >Yes, and also two more sources :/
> > >	arch/powerpc/mm/kasan/8xx.c
> > >	arch/powerpc/mm/kasan/kasan_init_32.c
> > >
> > >But these two are not quite obvious wrt pgd_addr_end() used
> > >while traversing pmds. Could you please clarify a bit?
> > >
> > >
> > >diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
> > >index 2784224..89c5053 100644
> > >--- a/arch/powerpc/mm/kasan/8xx.c
> > >+++ b/arch/powerpc/mm/kasan/8xx.c
> > >@@ -15,8 +15,8 @@
> > >  	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
> > >  		pte_basic_t *new;
> > >-		k_next = pgd_addr_end(k_cur, k_end);
> > >-		k_next = pgd_addr_end(k_next, k_end);
> > >+		k_next = pmd_addr_end(k_cur, k_end);
> > >+		k_next = pmd_addr_end(k_next, k_end);
> > 
> > No, I don't think so.
> > On powerpc32 we have only two levels, so pgd and pmd are more or
> > less the same.
> > But pmd_addr_end() as defined in include/asm-generic/pgtable-nopmd.h
> > is a no-op, so I don't think it will work.
> > 
> > It is likely that this function should iterate on pgd, then you get
> > pmd = pmd_offset(pud_offset(p4d_offset(pgd)));
> 
> It looks like the code iterates over single pmd table while using
> pgd_addr_end() only to skip all the middle levels and bail out
> from the loop.
> 
> I would be wary for switching from pmds to pgds, since we are
> trying to minimize impact (especially functional) and the
> rework does not seem that obvious.
> 

I've just tested the following change, it works and should fix the
oddity:

diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
index 2784224054f8..8e53ddf57b84 100644
--- a/arch/powerpc/mm/kasan/8xx.c
+++ b/arch/powerpc/mm/kasan/8xx.c
@@ -9,11 +9,12 @@
 static int __init
 kasan_init_shadow_8M(unsigned long k_start, unsigned long k_end, void *block)
 {
-	pmd_t *pmd = pmd_off_k(k_start);
+	pgd_t *pgd = pgd_offset_k(k_start);
 	unsigned long k_cur, k_next;
 
-	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
+	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pgd += 2, block += SZ_8M) {
 		pte_basic_t *new;
+		pmd_t *pmd = pmd_offset(pud_offset(p4d_offset(pgd, k_cur), k_cur), k_cur);
 
 		k_next = pgd_addr_end(k_cur, k_end);
 		k_next = pgd_addr_end(k_next, k_end);
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c
index fb294046e00e..e5f524fa71a7 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -30,13 +30,12 @@ static void __init kasan_populate_pte(pte_t *ptep, pgprot_t prot)
 
 int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_end)
 {
-	pmd_t *pmd;
+	pgd_t *pgd = pgd_offset_k(k_start);
 	unsigned long k_cur, k_next;
 
-	pmd = pmd_off_k(k_start);
-
-	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
+	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pgd++) {
 		pte_t *new;
+		pmd_t *pmd = pmd_offset(pud_offset(p4d_offset(pgd, k_cur), k_cur), k_cur);
 
 		k_next = pgd_addr_end(k_cur, k_end);
 		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
@@ -189,16 +188,18 @@ void __init kasan_early_init(void)
 	unsigned long addr = KASAN_SHADOW_START;
 	unsigned long end = KASAN_SHADOW_END;
 	unsigned long next;
-	pmd_t *pmd = pmd_off_k(addr);
+	pgd_t *pgd = pgd_offset_k(addr);
 
 	BUILD_BUG_ON(KASAN_SHADOW_START & ~PGDIR_MASK);
 
 	kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
 
 	do {
+		pmd_t *pmd = pmd_offset(pud_offset(p4d_offset(pgd, addr), addr), addr);
+
 		next = pgd_addr_end(addr, end);
 		pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte);
-	} while (pmd++, addr = next, addr != end);
+	} while (pgd++, addr = next, addr != end);
 
 	if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE))
 		kasan_early_hash_table();
---
Christophe
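
For context on why the loops above keep pgd_addr_end() rather than
switching to pmd_addr_end(), here is a user-space sketch, not kernel code.
It assumes a 4 MiB span per top-level entry (inferred from the 8xx loop
advancing two entries per SZ_8M) and models the folded pmd helper as
simply returning end, as described in the quoted discussion. The names
pmd_addr_end_nopmd() and count_steps() are hypothetical.

```c
#include <assert.h>

#define PGDIR_SHIFT 22			/* assumed: 4 MiB per top-level entry */
#define PGDIR_SIZE  (1UL << PGDIR_SHIFT)
#define PGDIR_MASK  (~(PGDIR_SIZE - 1))

/* Step to the next top-level boundary, or to end if that comes first. */
static unsigned long pgd_addr_end(unsigned long addr, unsigned long end)
{
	unsigned long boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;

	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Folded pmd level: the whole range counts as one entry. */
static unsigned long pmd_addr_end_nopmd(unsigned long addr, unsigned long end)
{
	(void)addr;
	return end;
}

/* Count how many loop iterations a given stepping helper produces. */
static unsigned int count_steps(unsigned long start, unsigned long end,
				unsigned long (*step)(unsigned long,
						      unsigned long))
{
	unsigned int n = 0;
	unsigned long cur;

	assert(start < end);
	for (cur = start; cur != end; cur = step(cur, end))
		n++;
	return n;
}
```

Over a 16 MiB range the pgd helper yields four iterations, while the
folded pmd helper ends the loop after the first one, so only the first
entry would be populated.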


^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-08 14:30     ` Dave Hansen
  (?)
  (?)
@ 2020-09-09 12:29       ` Gerald Schaefer
  -1 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-09 12:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport,
	Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On Tue, 8 Sep 2020 07:30:50 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> > code") introduced a subtle but severe bug on s390 with gup_fast, due to
> > dynamic page table folding.
> 
> Would it be fair to say that the "fake" page table entries s390
> allocates on the stack are what's causing the trouble here?  That might
> be a nice thing to open up with here.  "Dynamic page table folding"
> really means nothing to me.

Sorry, I guess my previous reply does not really explain "what the heck
is dynamic page table folding?".

On s390, we can have a different number of page table levels for different
processes / mms. We always start with 3 levels, and update dynamically
on process demand to 4 or 5 levels, hence the dynamic folding. Still,
PxD_SIZE/SHIFT is defined statically, so that e.g. pXd_addr_end() will
not reflect this dynamic behavior.

For the various pagetable walkers using pXd_addr_end() (w/o READ_ONCE
logic) this is no problem. With static folding, iteration over the folded
levels will always happen at pgd level (top-level folding). For s390,
we stay at the respective level and iterate there (dynamic middle-level
folding), only return to pgd level if there really were 5 levels.

This only works well as long as there are real pagetable pointers involved
that can also be used for iteration. For gup_fast, or any other future
pagetable walkers using the READ_ONCE logic w/o lock, that is not true.
Because of the READ_ONCE logic, the pointers involved point to local pXd
values on the stack, and our middle-level iteration will suddenly iterate
over such stack pointers instead of pagetable pointers.

This will be addressed by making the pXd_addr_end() dynamic, for which
we need to see the pXd value in order to determine its level / type.
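
A rough user-space sketch of what "dynamic" could mean here follows. It is
not the actual s390 implementation: the entry encoding (level type in the
low two bits), the constants, and the helper name addr_end_by_entry() are
all hypothetical. The idea is that the span is derived from the entry value
that was read, so the step matches the level the mm really uses.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_SHIFT     31u	/* assumed span of the lowest region level (2 GiB) */
#define BITS_PER_LEVEL 11u	/* assumed index bits per level */
#define TYPE_MASK      0x3u	/* hypothetical: level type kept in the low bits */

/*
 * Derive the span from the entry's type (0, 1 or 2 in this sketch), then
 * apply the usual boundary arithmetic.
 */
static uint64_t addr_end_by_entry(uint64_t entry, uint64_t addr, uint64_t end)
{
	unsigned int type = (unsigned int)(entry & TYPE_MASK);
	uint64_t size, boundary;

	assert(type <= 2 && addr < end);
	size = UINT64_C(1) << (BASE_SHIFT + BITS_PER_LEVEL * type);
	boundary = (addr + size) & ~(size - 1);

	return (boundary - 1 < end - 1) ? boundary : end;
}
```

For a range from 0 to 4 TiB, an entry of type 0 steps in 2 GiB units, so
the loop stays at the top level and advances over real table entries. An
entry of type 2 covers the whole range in one step, which is what a
statically defined top-level span would always do.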

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-09 12:29       ` Gerald Schaefer
  0 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-09 12:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Christian Borntraeger,
	Richard Weinberger, linux-x86, Russell King, Jason Gunthorpe,
	Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens,
	Arnd Bergmann, John Hubbard, Jeff Dike, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	linux-power, LKML, Andrew Morton, Linus Torvalds, Mike Rapoport

On Tue, 8 Sep 2020 07:30:50 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> > code") introduced a subtle but severe bug on s390 with gup_fast, due to
> > dynamic page table folding.
> 
> Would it be fair to say that the "fake" page table entries s390
> allocates on the stack are what's causing the trouble here?  That might
> be a nice thing to open up with here.  "Dynamic page table folding"
> really means nothing to me.

Sorry, I guess my previous reply does not really explain "what the heck
is dynamic page table folding?".

On s390, we can have different number of page table levels for different
processes / mms. We always start with 3 levels, and update dynamically
on process demand to 4 or 5 levels, hence the dynamic folding. Still,
the PxD_SIZE/SHIFT is defined statically, so that e.g. pXd_addr_end() will
not reflect this dynamic behavior.

For the various pagetable walkers using pXd_addr_end() (w/o READ_ONCE
logic) this is no problem. With static folding, iteration over the folded
levels will always happen at pgd level (top-level folding). For s390,
we stay at the respective level and iterate there (dynamic middle-level
folding), only return to pgd level if there really were 5 levels.

This only works well as long there are real pagetable pointers involved,
that can also be used for iteration. For gup_fast, or any other future
pagetable walkers using the READ_ONCE logic w/o lock, that is not true.
There are pointers involved to local pXd values on the stack, because of
the READ_ONCE logic, and our middle-level iteration will suddenly iterate
over such stack pointers instead of pagetable pointers.

This will be addressed by making the pXd_addr_end() dynamic, for which
we need to see the pXd value in order to determine its level / type.

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-09 12:29       ` Gerald Schaefer
  0 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-09 12:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Benjamin Herrenschmidt, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Christian Borntraeger, Richard Weinberger, linux-x86,
	Russell King, Jason Gunthorpe, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, John Hubbard,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Michael Ellerman,
	Andrew Morton, Linus Torvalds, Mike Rapoport

On Tue, 8 Sep 2020 07:30:50 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> > code") introduced a subtle but severe bug on s390 with gup_fast, due to
> > dynamic page table folding.
> 
> Would it be fair to say that the "fake" page table entries s390
> allocates on the stack are what's causing the trouble here?  That might
> be a nice thing to open up with here.  "Dynamic page table folding"
> really means nothing to me.

Sorry, I guess my previous reply does not really explain "what the heck
is dynamic page table folding?".

On s390, we can have different number of page table levels for different
processes / mms. We always start with 3 levels, and update dynamically
on process demand to 4 or 5 levels, hence the dynamic folding. Still,
the PxD_SIZE/SHIFT is defined statically, so that e.g. pXd_addr_end() will
not reflect this dynamic behavior.

For the various pagetable walkers using pXd_addr_end() (w/o READ_ONCE
logic) this is no problem. With static folding, iteration over the folded
levels will always happen at pgd level (top-level folding). For s390,
we stay at the respective level and iterate there (dynamic middle-level
folding), only return to pgd level if there really were 5 levels.

This only works well as long there are real pagetable pointers involved,
that can also be used for iteration. For gup_fast, or any other future
pagetable walkers using the READ_ONCE logic w/o lock, that is not true.
There are pointers involved to local pXd values on the stack, because of
the READ_ONCE logic, and our middle-level iteration will suddenly iterate
over such stack pointers instead of pagetable pointers.

This will be addressed by making the pXd_addr_end() dynamic, for which
we need to see the pXd value in order to determine its level / type.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-08 17:36       ` Gerald Schaefer
  (?)
  (?)
@ 2020-09-09 16:12         ` Gerald Schaefer
  -1 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-09 16:12 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Mike Rapoport, Peter Zijlstra, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Christian Borntraeger, Richard Weinberger, linux-x86,
	Russell King, Jason Gunthorpe, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, John Hubbard,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds

On Tue, 8 Sep 2020 19:36:50 +0200
Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:

[..]
> 
> It seems now that the generalization is very well accepted so far,
> apart from some apparent issues on arm. Also, merging 2 + 3 and
> putting them first seems to be acceptable, so we could do that for
> v3, if there are no objections.
> 
> Of course, we first need to address the few remaining issues for
> arm(32?), which do look quite confusing to me so far. BTW, sorry for
> the compile error with patch 3, I guess we did the cross-compile only
> for 1 + 2 applied, to see the bloat-o-meter changes. But I guess
> patch 3 already proved its usefulness by that :-)

Umm, replace "arm" with "power", sorry. No issues on arm so far, but
also no ack I think.

Thanks to Christophe for the power change, and to Mike for volunteering
for some cross compilation and cross-arch testing. Will send v3 with
merged and re-ordered patches after some more testing.

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-09 12:29       ` Gerald Schaefer
  (?)
  (?)
@ 2020-09-09 16:18         ` Dave Hansen
  -1 siblings, 0 replies; 254+ messages in thread
From: Dave Hansen @ 2020-09-09 16:18 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport,
	Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On 9/9/20 5:29 AM, Gerald Schaefer wrote:
> This only works well as long there are real pagetable pointers involved,
> that can also be used for iteration. For gup_fast, or any other future
> pagetable walkers using the READ_ONCE logic w/o lock, that is not true.
> There are pointers involved to local pXd values on the stack, because of
> the READ_ONCE logic, and our middle-level iteration will suddenly iterate
> over such stack pointers instead of pagetable pointers.

By "There are pointers involved to local pXd values on the stack", did
you mean "locate" instead of "local"?  That sentence confused me.

Which code is it, exactly, that allocates these troublesome on-stack pXd
values, btw?

> This will be addressed by making the pXd_addr_end() dynamic, for which
> we need to see the pXd value in order to determine its level / type.

Thanks for the explanation!

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-09 16:18         ` Dave Hansen
  (?)
  (?)
@ 2020-09-09 17:25           ` Gerald Schaefer
  -1 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-09 17:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport,
	Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On Wed, 9 Sep 2020 09:18:46 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 9/9/20 5:29 AM, Gerald Schaefer wrote:
> > This only works well as long there are real pagetable pointers involved,
> > that can also be used for iteration. For gup_fast, or any other future
> > pagetable walkers using the READ_ONCE logic w/o lock, that is not true.
> > There are pointers involved to local pXd values on the stack, because of
> > the READ_ONCE logic, and our middle-level iteration will suddenly iterate
> > over such stack pointers instead of pagetable pointers.
> 
> By "There are pointers involved to local pXd values on the stack", did
> you mean "locate" instead of "local"?  That sentence confused me.
> 
> Which code is it, exactly that allocates these troublesome on-stack pXd
> values, btw?

It is the gup_pXd_range() call sequence in mm/gup.c. It starts in
gup_pgd_range() with "pgdp = pgd_offset(current->mm, addr)", followed by
"pgd_t pgd = READ_ONCE(*pgdp)", which creates the first local stack
variable "pgd".

The next-level call to gup_p4d_range() gets this "pgd" value as
input, but not the original pgdp pointer it was read from.
This is already the essential difference from other pagetable walkers
such as walk_pXd_range() in mm/pagewalk.c, where the original
pointer is passed through. With READ_ONCE, that pointer must not
be dereferenced again, so the value is passed on instead.

In gup_p4d_range() we then have "p4dp = p4d_offset(&pgd, addr)",
with &pgd being a pointer to the passed-on pgd value, so that's
the first pXd pointer that does not point directly to the pXd in
the page table, but to a local stack variable.

With folded p4d, p4d_offset(&pgd, addr) will simply return
the passed-in &pgd pointer, so we now also have p4dp point to that.
That continues with "p4d_t p4d = READ_ONCE(*p4dp)", and that second
stack variable is passed to gup_huge_pud() and so on. Due to inlining,
all those variables will not really be passed anywhere, but simply
sit on the stack.
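
The first two steps can be reproduced in a small userspace sketch. This is
an illustration only, not kernel code: read_once_pgd(), p4d_offset_folded(),
points_into_table() and p4dp_is_stack_copy() are made-up names, the
"page table" is just an array, and the p4d_t wrapper is assumed to follow
the generic folded-p4d layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t val; } pgd_t;
typedef struct { pgd_t pgd; } p4d_t;	/* folded level: same storage as a pgd_t */

/* Stand-in for READ_ONCE(): one volatile load, returned by value. */
static pgd_t read_once_pgd(const pgd_t *pgdp)
{
	pgd_t pgd = { *(const volatile uint64_t *)&pgdp->val };

	return pgd;
}

/* Folded p4d_offset(): no table of its own, so it returns what it was given. */
static p4d_t *p4d_offset_folded(pgd_t *pgd)
{
	return (p4d_t *)pgd;
}

/* True if p points at one of the n entries of table. */
static int points_into_table(const void *p, const pgd_t *table, size_t n)
{
	uintptr_t a = (uintptr_t)p, lo = (uintptr_t)table;

	return a >= lo && a < lo + n * sizeof(*table);
}

/*
 * Mirrors the first two steps of the lockless walk. Returns 1 if the
 * resulting p4dp refers to the local copy and not to the table, while
 * still holding the value that was read from the table.
 */
static int p4dp_is_stack_copy(pgd_t *table, size_t n, size_t idx)
{
	pgd_t *pgdp, pgd;
	p4d_t *p4dp;

	assert(idx < n);
	pgdp = &table[idx];		/* real pagetable pointer */
	pgd = read_once_pgd(pgdp);	/* first local stack variable */
	p4dp = p4d_offset_folded(&pgd);	/* points at the copy, not the table */

	return (void *)p4dp == (void *)&pgd &&
	       !points_into_table(p4dp, table, n) &&
	       p4dp->pgd.val == table[idx].val;
}
```

The value seen through p4dp is correct, but the pointer itself no longer
has any neighbours that belong to a page table.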

So far, IIUC, that would also happen on x86 (or everywhere else
actually) for folded levels, i.e. some pXd_offset() calls would
simply return the passed-in pointer to the (stack) value. This works
as designed, and it will not lead to the "iteration over stack
pointer" for anybody but s390, because the pXd_addr_end()
boundaries usually make sure that you always return to pgd
level for iteration, and that is the only level with a real
pagetable pointer. On s390, we stay at the first non-folded
level and do the iteration there, which is fine for other
pagetable walkers using the original pointers, but not for
the READ_ONCE-style gup_fast.

I actually had to draw myself a picture to get some hold of
this, or rather a walk-through with a certain pud-crossing
range in a folded 3-level scenario. Not sure if I would have
understood my explanation above w/o that, but I hope you can
make some sense out of it. Or draw yourself a picture :-)
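
A walk-through of that kind can be approximated in standalone C. The
following is a simplified model under stated assumptions, not the gup.c
code: pud_loop_overruns() and walk_overruns() are hypothetical names, the
shift values are assumed to match s390 (2 GB per pud entry, 2^53 bytes per
pgd entry), and the on-stack copy is modelled as an index so that nothing
is ever read out of bounds.

```c
#include <assert.h>
#include <stdint.h>

#define PUD_SHIFT	31	/* assumed: 2 GB per pud entry */
#define PGDIR_SHIFT	53	/* assumed: static top-level entry size */

/* Same boundary arithmetic as the generic pXd_addr_end() macros. */
static uint64_t addr_end(uint64_t addr, uint64_t end, unsigned int shift)
{
	uint64_t boundary = (addr + (1ULL << shift)) & ~((1ULL << shift) - 1);

	return (boundary - 1 < end - 1) ? boundary : end;
}

/*
 * Models the pud-level loop of a lockless walker that was handed a
 * one-entry on-stack copy instead of a pointer into a real table.
 * Index 0 is the copy itself; any higher index would be read from
 * unrelated stack memory, so it is only counted, never dereferenced.
 */
static unsigned int pud_loop_overruns(uint64_t addr, uint64_t end)
{
	unsigned int idx = 0, overruns = 0;
	uint64_t next;

	do {
		if (idx > 0)
			overruns++;
		next = addr_end(addr, end, PUD_SHIFT);
	} while (idx++, addr = next, addr != end);

	return overruns;
}

/*
 * Models the top-level loop for a 3-level mm, where the top-level table
 * really holds pud-type entries. top_shift selects the boundary logic:
 * PGDIR_SHIFT for a static pgd_addr_end(), PUD_SHIFT for a variant that
 * derives the size from the entry type.
 */
static unsigned int walk_overruns(uint64_t addr, uint64_t end,
				  unsigned int top_shift)
{
	unsigned int overruns = 0;
	uint64_t next;

	assert(addr < end);
	do {
		next = addr_end(addr, end, top_shift);
		/* the folded p4d level passes the copy straight through */
		overruns += pud_loop_overruns(addr, next);
	} while (addr = next, addr != end);

	return overruns;
}
```

With the range 0x7fff0000..0x80010000, which crosses a 2 GB boundary, the
static top-level boundary lets the pud loop step once past the copy, while
the type-aware boundary returns to the top level first and the pud loop
only ever touches index 0.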

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-09 17:25           ` Gerald Schaefer
  0 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-09 17:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport,
	Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On Wed, 9 Sep 2020 09:18:46 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 9/9/20 5:29 AM, Gerald Schaefer wrote:
> > This only works well as long there are real pagetable pointers involved,
> > that can also be used for iteration. For gup_fast, or any other future
> > pagetable walkers using the READ_ONCE logic w/o lock, that is not true.
> > There are pointers involved to local pXd values on the stack, because of
> > the READ_ONCE logic, and our middle-level iteration will suddenly iterate
> > over such stack pointers instead of pagetable pointers.
> 
> By "There are pointers involved to local pXd values on the stack", did
> you mean "locate" instead of "local"?  That sentence confused me.
> 
> Which code is it, exactly that allocates these troublesome on-stack pXd
> values, btw?

It is the gup_pXd_range() call sequence in mm/gup.c. It starts in
gup_pgd_range() with "pgdp = pgd_offset(current->mm, addr)" and then
the "pgd_t pgd = READ_ONCE(*pgdp)" which creates the first local
stack variable "pgd".

The next-level call to gup_p4d_range() gets this "pgd" value as
input, but not the original pgdp pointer where it was read from.
This is already the essential difference to other pagetable walkers
like e.g. walk_pXd_range() in mm/pagewalk.c, where the original
pointer is passed through. With READ_ONCE, that pointer must not
be further de-referenced, so instead the value is passed over.

In gup_p4d_range() we then have "p4dp = p4d_offset(&pgd, addr)",
with &pgd being a pointer to the passed over pgd value, so that's
the first pXd pointer that does not point directly to the pXd in
the page table, but a local stack variable.

With folded p4d, p4d_offset(&pgd, addr) will simply return
the passed-in &pgd pointer, so we now also have p4dp point to that.
That continues with "p4d_t p4d = READ_ONCE(*p4dp)", and that second
stack variable passed to gup_huge_pud() and so on. Due to inlining,
all those variables will not really be passed anywhere, but simply
sit on the stack.

So far, IIUC, that would also happen on x86 (or everywhere else
actually) for folded levels, i.e. some pXd_offset() calls would
simply return the passed in (stack) value pointer. This works
as designed, and it will not lead to the "iteration over stack
pointer" for anybody but s390, because the pXd_addr_end()
boundaries usually take care that you always return to pgd
level for iteration, and that is the only level with a real
pagetable pointer. For s390, we stay at the first non-folded
level and do the iteration there, which is fine for other
pagetable walkers using the original pointers, but not for
the READ_ONCE-style gup_fast.

I actually had to draw myself a picture to get some hold of
this, or rather a walk-through with a certain pud-crossing
range in a folded 3-level scenario. Not sure if I would have
understood my explanation above w/o that, but I hope you can
make some sense out of it. Or draw yourself a picture :-)

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-09 17:25           ` Gerald Schaefer
  0 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-09 17:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Christian Borntraeger,
	Richard Weinberger, linux-x86, Russell King, Jason Gunthorpe,
	Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens,
	Arnd Bergmann, John Hubbard, Jeff Dike, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	linux-power, LKML, Andrew Morton, Linus Torvalds, Mike Rapoport

On Wed, 9 Sep 2020 09:18:46 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 9/9/20 5:29 AM, Gerald Schaefer wrote:
> > This only works well as long there are real pagetable pointers involved,
> > that can also be used for iteration. For gup_fast, or any other future
> > pagetable walkers using the READ_ONCE logic w/o lock, that is not true.
> > There are pointers involved to local pXd values on the stack, because of
> > the READ_ONCE logic, and our middle-level iteration will suddenly iterate
> > over such stack pointers instead of pagetable pointers.
> 
> By "There are pointers involved to local pXd values on the stack", did
> you mean "locate" instead of "local"?  That sentence confused me.
> 
> Which code is it, exactly that allocates these troublesome on-stack pXd
> values, btw?

It is the gup_pXd_range() call sequence in mm/gup.c. It starts in
gup_pgd_range() with "pgdp = pgd_offset(current->mm, addr)" and then
the "pgd_t pgd = READ_ONCE(*pgdp)" which creates the first local
stack variable "pgd".

The next-level call to gup_p4d_range() gets this "pgd" value as
input, but not the original pgdp pointer where it was read from.
This is already the essential difference to other pagetable walkers
like e.g. walk_pXd_range() in mm/pagewalk.c, where the original
pointer is passed through. With READ_ONCE, that pointer must not
be further de-referenced, so instead the value is passed over.

In gup_p4d_range() we then have "p4dp = p4d_offset(&pgd, addr)",
with &pgd being a pointer to the passed over pgd value, so that's
the first pXd pointer that does not point directly to the pXd in
the page table, but a local stack variable.

With folded p4d, p4d_offset(&pgd, addr) will simply return
the passed-in &pgd pointer, so we now also have p4dp point to that.
That continues with "p4d_t p4d = READ_ONCE(*p4dp)", and that second
stack variable passed to gup_huge_pud() and so on. Due to inlining,
all those variables will not really be passed anywhere, but simply
sit on the stack.

So far, IIUC, that would also happen on x86 (or everywhere else
actually) for folded levels, i.e. some pXd_offset() calls would
simply return the passed in (stack) value pointer. This works
as designed, and it will not lead to the "iteration over stack
pointer" for anybody but s390, because the pXd_addr_end()
boundaries usually take care that you always return to pgd
level for iteration, and that is the only level with a real
pagetable pointer. For s390, we stay at the first non-folded
level and do the iteration there, which is fine for other
pagetable walkers using the original pointers, but not for
the READ_ONCE-style gup_fast.
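
The difference between the two walker styles can be sketched in a small
user-space toy model. This is only an illustration under stated assumptions,
not kernel code: all toy_* names are hypothetical, a plain struct copy stands
in for READ_ONCE(), and the folded offset helper simply returns the pointer it
was given, as described above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for pgd_t and a folded p4d_t. */
typedef struct { unsigned long val; } toy_pgd_t;
typedef toy_pgd_t toy_p4d_t;

/* Folded level: the "offset" is just the pointer that was passed in. */
static toy_p4d_t *toy_p4d_offset(toy_pgd_t *pgd, unsigned long addr)
{
	(void)addr;
	return pgd;
}

/* True if p points at one of the n entries of the real table. */
static bool toy_in_table(const toy_pgd_t *table, size_t n, const toy_pgd_t *p)
{
	for (size_t i = 0; i < n; i++)
		if (p == &table[i])
			return true;
	return false;
}

/* pagewalk style: the original table pointer is passed through. */
static bool toy_walk_by_pointer(toy_pgd_t *table, size_t n, size_t idx)
{
	toy_p4d_t *p4dp = toy_p4d_offset(&table[idx], 0);

	return toy_in_table(table, n, p4dp);
}

/* gup_fast style: the entry is copied once, then only the copy is used. */
static bool toy_walk_by_value(toy_pgd_t *table, size_t n, size_t idx)
{
	toy_pgd_t pgd = table[idx];	/* stands in for READ_ONCE(*pgdp) */
	toy_p4d_t *p4dp = toy_p4d_offset(&pgd, 0);

	return toy_in_table(table, n, p4dp);
}
```

In this model toy_walk_by_pointer() ends up with a pointer into the table, so
stepping to the next entry would stay inside it, while toy_walk_by_value() ends
up with a pointer to its own stack copy, where there is no next entry to step
to.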

I actually had to draw myself a picture to get some hold of
this, or rather a walk-through with a certain pud-crossing
range in a folded 3-level scenario. Not sure if I would have
understood my explanation above w/o that, but I hope you can
make some sense out of it. Or draw yourself a picture :-)

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-09 17:25           ` Gerald Schaefer
  (?)
  (?)
@ 2020-09-09 18:03             ` Jason Gunthorpe
  -1 siblings, 0 replies; 254+ messages in thread
From: Jason Gunthorpe @ 2020-09-09 18:03 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport,
	Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On Wed, Sep 09, 2020 at 07:25:34PM +0200, Gerald Schaefer wrote:
> I actually had to draw myself a picture to get some hold of
> this, or rather a walk-through with a certain pud-crossing
> range in a folded 3-level scenario. Not sure if I would have
> understood my explanation above w/o that, but I hope you can
> make some sense out of it. Or draw yourself a picture :-)

What I don't understand is: how does anything work on S390 today?

If the fix is only to change pxx_addr_end(), then generic code
like mm/pagewalk.c will iterate over a *different list* of page table
entries.

Its choice of entries to look at is entirely driven by pxx_addr_end().

Which suggests to me that mm/pagewalk.c also doesn't work properly
today on S390, and that this issue is not really about stack variables?

Fundamentally, pXX_offset() and pXX_addr_end() must be consistent
with each other: if pXX_offset() is folded, then pXX_addr_end() must
cause a single iteration of that level.

Jason

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-09 18:03             ` Jason Gunthorpe
  (?)
  (?)
@ 2020-09-10  9:39               ` Alexander Gordeev
  -1 siblings, 0 replies; 254+ messages in thread
From: Alexander Gordeev @ 2020-09-10  9:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Gerald Schaefer, Dave Hansen, John Hubbard, LKML, linux-mm,
	linux-arch, Andrew Morton, Linus Torvalds, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Wed, Sep 09, 2020 at 03:03:24PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 09, 2020 at 07:25:34PM +0200, Gerald Schaefer wrote:
> > I actually had to draw myself a picture to get some hold of
> > this, or rather a walk-through with a certain pud-crossing
> > range in a folded 3-level scenario. Not sure if I would have
> > understood my explanation above w/o that, but I hope you can
> > make some sense out of it. Or draw yourself a picture :-)
> 
> What I don't understand is how does anything work with S390 today?
> 
> If the fix is only to change pxx_addr_end(), then generic code
> like mm/pagewalk.c will iterate over a *different list* of page table
> entries.
> 
> Its choice of entries to look at is entirely driven by pxx_addr_end().
> 
> Which suggests to me that mm/pagewalk.c also doesn't work properly
> today on S390, and that this issue is not really about stack variables?
> 
> Fundamentally, pXX_offset() and pXX_addr_end() must be consistent
> with each other: if pXX_offset() is folded, then pXX_addr_end() must
> cause a single iteration of that level.

Your observation is correct.

Another way to describe the problem is that the existing pXd_addr_end
helpers could be applied to mismatching levels on s390 (e.g. p4d_addr_end
applied to pud, or pgd_addr_end applied to p4d). As you noticed,
all *_pXd_range iterators could be called with address ranges that
exceed a single pXd table.

However, when that happens with pointers to real page tables (passed to
*_pXd_range iterators), we still operate on valid tables, which just
(luckily for us) happen to be folded. Thus we still reference the correct
table entries.

Only the gup_fast case exposes the issue. It hits because
pointers to stack copies are passed to the gup_pXd_range iterators, not
pointers to the real page tables themselves.

As Gerald mentioned, it is very difficult to explain in a clear way.
Hopefully, one could make sense out of it.

> Jason

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10  9:39               ` Alexander Gordeev
  (?)
  (?)
@ 2020-09-10 13:02                 ` Jason Gunthorpe
  -1 siblings, 0 replies; 254+ messages in thread
From: Jason Gunthorpe @ 2020-09-10 13:02 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Gerald Schaefer, Dave Hansen, John Hubbard, LKML, linux-mm,
	linux-arch, Andrew Morton, Linus Torvalds, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:

> As Gerald mentioned, it is very difficult to explain in a clear way.
> Hopefully, one could make sense out of it.

I would say the page table API requires this invariant:

        pud = pud_offset(p4d, addr);
        do {
                WARN_ON(pud != pud_offset(p4d, addr));
                next = pud_addr_end(addr, end);
        } while (pud++, addr = next, addr != end);

i.e. pud++ is supposed to be a shortcut for
  pud_offset(p4d, next)

S390 does not follow this. Fixing addr_end brings it into
alignment by preventing pud++ from happening.
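
The invariant can be tried out in a user-space toy model. This is a sketch
under assumptions, not kernel code: the toy_* names, the 16-byte entry size and
the 8-entry table are invented for illustration, and the boundary arithmetic is
modeled on the generic pXd_addr_end() macros. The checker returns false as soon
as pud++ and the offset helper disagree.

```c
#include <stdbool.h>

#define TOY_PUD_SHIFT 4UL
#define TOY_PUD_SIZE  (1UL << TOY_PUD_SHIFT)
#define TOY_PUD_MASK  (~(TOY_PUD_SIZE - 1))
#define TOY_PTRS      8UL

typedef struct { unsigned long val; } toy_pud_t;
typedef toy_pud_t *(*toy_offset_fn)(toy_pud_t *parent, unsigned long addr);

/* Non-folded level: index into the table by address bits. */
static toy_pud_t *toy_pud_offset(toy_pud_t *parent, unsigned long addr)
{
	return parent + ((addr >> TOY_PUD_SHIFT) & (TOY_PTRS - 1));
}

/* Folded level: hand back the parent entry, whatever the address. */
static toy_pud_t *toy_pud_offset_folded(toy_pud_t *parent, unsigned long addr)
{
	(void)addr;
	return parent;
}

/* Crop [addr, end) at the next TOY_PUD_SIZE boundary. */
static unsigned long toy_pud_addr_end(unsigned long addr, unsigned long end)
{
	unsigned long boundary = (addr + TOY_PUD_SIZE) & TOY_PUD_MASK;

	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Returns true if pud++ matched offset(parent, addr) on every iteration. */
static bool toy_invariant_holds(toy_offset_fn offset, toy_pud_t *parent,
				unsigned long addr, unsigned long end)
{
	toy_pud_t *pud = offset(parent, addr);
	unsigned long next;

	do {
		if (pud != offset(parent, addr))
			return false;
		next = toy_pud_addr_end(addr, end);
	} while (pud++, addr = next, addr != end);

	return true;
}
```

With toy_pud_offset() the invariant holds for a range spanning several entries.
With toy_pud_offset_folded() it only holds when the range fits into a single
iteration, which is the consistency requirement between the offset and
addr_end helpers discussed in this thread.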

The only currently known side effect is that gup_fast crashes, but it
sure is an unexpected thing.

This suggests another fix, which is to say that pud++ is undefined and
pud_offset() must always be called, but I think that would cause worse
codegen on all other archs.

Jason


^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-09 18:03             ` Jason Gunthorpe
  (?)
  (?)
@ 2020-09-10 13:11               ` Gerald Schaefer
  -1 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-10 13:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport,
	Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On Wed, 9 Sep 2020 15:03:24 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Wed, Sep 09, 2020 at 07:25:34PM +0200, Gerald Schaefer wrote:
> > I actually had to draw myself a picture to get some hold of
> > this, or rather a walk-through with a certain pud-crossing
> > range in a folded 3-level scenario. Not sure if I would have
> > understood my explanation above w/o that, but I hope you can
> > make some sense out of it. Or draw yourself a picture :-)
> 
> What I don't understand is how does anything work with S390 today?

That is totally comprehensible :-)

> If the fix is only to change pxx_addr_end() then generic code
> like mm/pagewalk.c will iterate over a *different list* of page table
> entries.
>
> Its choice of entries to look at is entirely driven by pxx_addr_end().
>
> Which suggests to me that mm/pagewalk.c also doesn't work properly
> today on S390 and this issue is not really about stack variables?

I guess you are confused by the fact that the generic change will indeed
change the logic for _all_ pagetable walkers on s390, not just for
the gup_fast case. But that doesn't mean that they were doing it wrong
before, we simply can do it both ways. However, we probably should
make that (in theory useless) change more explicit.

Let's compare before and after for mm/pagewalk.c on s390, with 3-level
pagetables, range crossing 2 GB pud boundary.

* Before (with pXd_addr_end always using static 5-level PxD_SIZE):

walk_pgd_range()
-> pgd_addr_end() will use static 2^53 PGDIR_SIZE, range is not cropped,
                  no iterations needed, passed over to next level

walk_p4d_range()
-> p4d_addr_end() will use static 2^42 P4D_SIZE, range still not cropped

walk_pud_range()
-> pud_addr_end() now we're cropping, with 2^31 PUD_SIZE, need two
                  iterations for range crossing pud boundary, doing
                  that right here on a pudp which is actually the
                  previously passed-through pgdp/p4dp (pointing to
                  correct pagetable entry)

* After (with dynamic pXd_addr_end using "correct" PxD_SIZE boundaries,
         should be similar to other archs static "top-level folding"):

walk_pgd_range()
-> pgd_addr_end() will now determine "correct" boundary based on pgd
                  value, i.e. 2^31 PUD_SIZE, do cropping now, iteration
                  will now happen here

walk_p4d/pud_range()
->  operate on cropped range, will not iterate, instead return to pgd level,
    which will then use the same pointer for iteration as in the "Before"
    case, but not on the same level.
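
The iteration counts in the two walk-throughs can be checked with a small
userspace sketch. It is only a model: addr_end_sized() and iterations() are
hypothetical helpers, the level sizes are the 2^53 / 2^42 / 2^31 values
named above, and the boundary arithmetic follows the generic pXd_addr_end()
macros.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PGDIR_SIZE (UINT64_C(1) << 53)
#define MODEL_P4D_SIZE   (UINT64_C(1) << 42)
#define MODEL_PUD_SIZE   (UINT64_C(1) << 31)

/* Generic pXd_addr_end() arithmetic with the level size as a parameter. */
static uint64_t addr_end_sized(uint64_t addr, uint64_t end, uint64_t size)
{
    uint64_t boundary;

    assert(size != 0 && (size & (size - 1)) == 0);   /* power of two */
    boundary = (addr + size) & ~(size - 1);
    return (boundary - 1 < end - 1) ? boundary : end;
}

/* Count how often a do/while walker loop runs at a level of this size. */
static unsigned int iterations(uint64_t addr, uint64_t end, uint64_t size)
{
    unsigned int n = 0;
    uint64_t next;

    assert(addr < end);
    do {
        next = addr_end_sized(addr, end, size);
        n++;
    } while (addr = next, addr != end);
    return n;
}
```

For a range from 2 GB - 4 KB to 2 GB + 4 KB, the "Before" sizes give one
iteration at the pgd and p4d levels and two at the pud level. In the "After"
case the pgd level is effectively sized 2^31, so the two iterations happen
there, and each cropped sub-range needs only one pass below.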

IMHO, our "Before" logic is more efficient, and also feels more natural.
After all, it is not really necessary to return to pgd level, and it will
surely cost some extra instructions. We are willing to take that cost
for the sake of doing it in a more generic way, hoping that will reduce
future issues. E.g. you already mentioned that you have plans for using
the READ_ONCE logic also in other places, and that would be such a
"future issue".

> Fundamentally, pXX_offset() and pXX_addr_end() must be consistent
> together: if pXX_offset() is folded then pXX_addr_end() must cause a
> single iteration of that level.

Well, that sounds correct in theory, but I guess it depends on "how
you fold it". E.g. what does "if pXX_offset() is folded" mean?
Take pgd_offset() for the 3-level case above. From our previous
"middle-level folding/iteration" perspective, I would say that
pgd/p4d are folded into pud, so if you say "if pgd_offset() is folded
then pgd_addr_end() must cause a single iteration of that level",
we were doing it all correctly, i.e. only having a single iteration
on pgd/p4d level. You could even say that all others are doing /
using it wrong :-)

Now take pgd_offset() from the "top-level folding/iteration".
Here you would say that p4d/pud are folded into pgd, which again
does not sound like the natural / most efficient way to me,
but IIUC this has to be how it works for all other archs with
(static) pagetable folding. Now you'd say "if pud/p4d_offset()
is folded then pud/p4d_addr_end() must cause a single iteration
of that level", and that would sound correct. At least until
you look more closely, because e.g. p4d_addr_end() in
include/asm-generic/pgtable-nop4d.h is simply this:
#define p4d_addr_end(addr, end) (end)

How can that cause a single iteration? It clearly won't, it only
works because the previous pgd_addr_end already cropped the range
so that there will be only single iterations for p4d/pud.
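
A rough userspace model of that dependency, under stated assumptions:
MODEL_PGDIR_SHIFT is an arbitrary value chosen for illustration, and
slots_touched() and widest_folded_range() are hypothetical helpers. Only
folded_addr_end() mirrors the #define quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PGDIR_SHIFT 39   /* arbitrary model value, not from the thread */
#define MODEL_PGDIR_SIZE  (UINT64_C(1) << MODEL_PGDIR_SHIFT)

/* Mirrors the quoted "#define p4d_addr_end(addr, end) (end)". */
static uint64_t folded_addr_end(uint64_t addr, uint64_t end)
{
    (void)addr;
    return end;
}

/* Same arithmetic as the generic pgd_addr_end() macro. */
static uint64_t model_pgd_addr_end(uint64_t addr, uint64_t end)
{
    uint64_t boundary = (addr + MODEL_PGDIR_SIZE) & ~(MODEL_PGDIR_SIZE - 1);

    return (boundary - 1 < end - 1) ? boundary : end;
}

/* Number of pgd-sized slots that the range [addr, end) touches. */
static unsigned int slots_touched(uint64_t addr, uint64_t end)
{
    assert(addr < end);
    return (unsigned int)(((end - 1) >> MODEL_PGDIR_SHIFT) -
                          (addr >> MODEL_PGDIR_SHIFT) + 1);
}

/* Walk the pgd level; return the widest range the folded level is handed. */
static unsigned int widest_folded_range(uint64_t addr, uint64_t end)
{
    unsigned int worst = 0, slots;
    uint64_t next;

    do {
        next = model_pgd_addr_end(addr, end);
        /* The folded level gets [addr, next) and does not crop it further. */
        slots = slots_touched(addr, folded_addr_end(addr, next));
        if (slots > worst)
            worst = slots;
    } while (addr = next, addr != end);
    return worst;
}
```

The folded helper hands back end unchanged, so it does no cropping of its
own. In this model a range crossing one pgd boundary touches two slots, yet
every sub-range that reaches the folded level touches exactly one, and that
is purely the work of the pgd-level cropping.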

The more I think of it, the more it sounds like s390 "middle-level
folding/iteration" was doing it "the right way", and everybody else
was wrong, or at least not in an optimally efficient way :-) Might
also be that only we could do this because we can determine the
pagetable level from a pagetable entry value.

Anyway, if you are not yet confused enough, I recommend looking
at the other option we had in mind, for fixing the gup_fast issue.
See "Patch 1" from here:
https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/

That would actually have kept that "middle-level iteration" also
for gup_fast, by additionally passing through the pXd pointers.
However, it also needed a gup-specific version of pXd_offset(),
in order to keep the READ_ONCE semantics. For s390, that would
have actually been the best solution, but a generic version of
that might not have been so easy. And doing it like everybody
else cannot be so bad, at least I really hope so.

Of course, at some point in time, we might come up with some fancy
fundamental change that would "do it the right middle-level way
for everybody". At least I think I overheard Vasily and Alexander
discussing some wild ideas, but that is certainly beyond this scope
here...

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 13:02                 ` Jason Gunthorpe
  (?)
  (?)
@ 2020-09-10 13:28                   ` Gerald Schaefer
  -1 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-10 13:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm,
	linux-arch, Andrew Morton, Linus Torvalds, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, 10 Sep 2020 10:02:33 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> 
> > As Gerald mentioned, it is very difficult to explain in a clear way.
> > Hopefully, one could make sense out of it.
> 
> I would say the page table API requires this invariant:
> 
>         pud = pud_offset(p4d, addr);
>         do {
>                 WARN_ON(pud != pud_offset(p4d, addr));
>                 next = pud_addr_end(addr, end);
>         } while (pud++, addr = next, addr != end);
> 
> i.e. pud++ is supposed to be a shortcut for
>   pud_offset(p4d, next)
> 
> While S390 does not follow this. Fixing addr_end brings it into
> alignment by preventing pud++ from happening.
> 
> The only currently known side effect is that gup_fast crashes, but it
> sure is an unexpected thing.

It is only unexpected in a "top-level folding" world, see my other reply.
Consider it an optimization, which was possible because of how our dynamic
folding works, and e.g. because we can determine the correct pagetable
level from a pXd value in pXd_offset.
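
As a purely illustrative userspace sketch of "determining the level from a
pXd value": struct model_entry, enum model_level, and model_p4d_offset()
are hypothetical and much simpler than the real s390 helpers. The assumption
is only that each entry records which table level it lives in, so the
offset helper can decide whether to descend or to treat the level as folded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_P4D_SHIFT 42
#define MODEL_PTRS      2048   /* entries per table (model value) */

enum model_level { LEVEL_PUD = 1, LEVEL_P4D = 2, LEVEL_PGD = 3 };

struct model_entry {
    enum model_level level;     /* level of the table this entry lives in */
    struct model_entry *lower;  /* next lower table, or NULL */
};

/* Descend only if the entry really is a top-level (pgd) entry. */
static struct model_entry *model_p4d_offset(struct model_entry *pgd,
                                            uint64_t addr)
{
    if (pgd->level >= LEVEL_PGD)
        return pgd->lower + ((addr >> MODEL_P4D_SHIFT) & (MODEL_PTRS - 1));
    return pgd;                 /* folded: pass the same pointer through */
}
```

With a real top-level entry the helper indexes into the lower table. With an
entry that already belongs to a lower level it returns the pointer it was
given, which is the pass-through behaviour described above.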

> This suggests another fix, which is to say that pud++ is undefined and
> pud_offset() must always be called, but I think that would cause worse
> codegen on all other archs.

There really is nothing to fix for s390 outside of gup_fast, or other
potential future READ_ONCE pagetable walkers. We do take the side-effect
of the generic change on all other pagetable walkers for s390, but it
is really a slight degradation rather than a fix.

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-10 13:28                   ` Gerald Schaefer
  0 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-10 13:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Zijlstra, Benjamin Herrenschmidt, Dave Hansen, Dave Hansen,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, John Hubbard,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-mm, linux-power, LKML,
	Michael Ellerman, Andrew Morton, Linus Torvalds, Mike Rapoport

On Thu, 10 Sep 2020 10:02:33 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> 
> > As Gerald mentioned, it is very difficult to explain in a clear way.
> > Hopefully, one could make sense ot of it.  
> 
> I would say the page table API requires this invariant:
> 
>         pud = pud_offset(p4d, addr);
>         do {
> 		WARN_ON(pud != pud_offset(p4d, addr);
>                 next = pud_addr_end(addr, end);
>         } while (pud++, addr = next, addr != end);
> 
> ie pud++ is supposed to be a shortcut for 
>   pud_offset(p4d, next)
> 
> While S390 does not follow this. Fixing addr_end brings it into
> alignment by preventing pud++ from happening.
> 
> The only currently known side effect is that gup_fast crashes, but it
> sure is an unexpected thing.

It only is unexpected in a "top-level folding" world, see my other reply.
Consider it an optimization, which was possible because of how our dynamic
folding works, and e.g. because we can determine the correct pagetable
level from a pXd value in pXd_offset.

> This suggests another fix, which is to say that pud++ is undefined and
> pud_offset() must always be called, but I think that would cause worse
> codegen on all other archs.

There really is nothing to fix for s390 outside of gup_fast, or other
potential future READ_ONCE pagetable walkers. We do take the side-effect
of the generic change on all other pagetable walkers for s390, but it
really is rather a slight degradation than a fix.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 13:28                   ` Gerald Schaefer
  (?)
  (?)
@ 2020-09-10 15:10                     ` Jason Gunthorpe
  -1 siblings, 0 replies; 254+ messages in thread
From: Jason Gunthorpe @ 2020-09-10 15:10 UTC (permalink / raw)
  To: Gerald Schaefer, Anshuman Khandual
  Cc: Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm,
	linux-arch, Andrew Morton, Linus Torvalds, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, Sep 10, 2020 at 03:28:03PM +0200, Gerald Schaefer wrote:
> On Thu, 10 Sep 2020 10:02:33 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> > 
> > > As Gerald mentioned, it is very difficult to explain in a clear way.
> > > Hopefully, one could make sense out of it.
> > 
> > I would say the page table API requires this invariant:
> > 
> >         pud = pud_offset(p4d, addr);
> >         do {
> > 		WARN_ON(pud != pud_offset(p4d, addr);
> >                 next = pud_addr_end(addr, end);
> >         } while (pud++, addr = next, addr != end);
> > 
> > ie pud++ is supposed to be a shortcut for 
> >   pud_offset(p4d, next)
> > 
> > While S390 does not follow this. Fixing addr_end brings it into
> > alignment by preventing pud++ from happening.
> > 
> > The only currently known side effect is that gup_fast crashes, but it
> > sure is an unexpected thing.
> 
> It only is unexpected in a "top-level folding" world, see my other reply.
> Consider it an optimization, which was possible because of how our dynamic
> folding works, and e.g. because we can determine the correct pagetable
> level from a pXd value in pXd_offset.

No, I disagree. The page walker API the arch presents has to have well
defined semantics. For instance, there is an effort to define tests
and invariants for the page table accesses to bring this understanding
and uniformity:

 mm/debug_vm_pgtable.c

If we fix S390 using the pX_addr_end() change then the above should be
updated with an invariant to check it. I've added Anshuman for some
thoughts..

For better or worse, that invariant does exclude arches from using
other folding techniques.
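
To make the invariant concrete, here is a rough user-space sketch, not
kernel code. Everything prefixed toy_ is invented for illustration, as
are the 4-entry table and the 16-byte span per entry.
toy_pud_offset_folded() only loosely models a dynamically folded level
in which the offset helper hands back the parent pointer unchanged. The
walker counts the iterations in which the WARN_ON() from the loop
quoted above would fire:

```c
#include <assert.h>

/* Toy model, not kernel code: 4 entries per table, 16 bytes per entry. */
#define TOY_PUD_SHIFT    4
#define TOY_PUD_SIZE     (1UL << TOY_PUD_SHIFT)
#define TOY_PUD_MASK     (~(TOY_PUD_SIZE - 1))
#define TOY_PTRS_PER_PUD 4

typedef struct { unsigned long val; } toy_pud_t;
typedef toy_pud_t *(*toy_offset_fn)(toy_pud_t *base, unsigned long addr);

static toy_pud_t toy_table[TOY_PTRS_PER_PUD];

/* Conventional level: index into the table by address bits. */
static toy_pud_t *toy_pud_offset(toy_pud_t *base, unsigned long addr)
{
	return base + ((addr >> TOY_PUD_SHIFT) & (TOY_PTRS_PER_PUD - 1));
}

/* Assumed model of a folded level: always return the parent pointer. */
static toy_pud_t *toy_pud_offset_folded(toy_pud_t *base, unsigned long addr)
{
	(void)addr;
	return base;
}

/* Boundary arithmetic modelled on the generic pud_addr_end() macro. */
static unsigned long toy_pud_addr_end(unsigned long addr, unsigned long end)
{
	unsigned long boundary = (addr + TOY_PUD_SIZE) & TOY_PUD_MASK;

	return (boundary - 1 < end - 1) ? boundary : end;
}

/*
 * Walk [addr, end) with addr < end, advancing with pud++, and count how
 * often pud differs from what the offset helper returns for addr.
 */
static int toy_count_violations(toy_offset_fn offset,
				unsigned long addr, unsigned long end)
{
	toy_pud_t *pud = offset(toy_table, addr);
	unsigned long next;
	int violations = 0;

	do {
		if (pud != offset(toy_table, addr))
			violations++;
		next = toy_pud_addr_end(addr, end);
	} while (pud++, addr = next, addr != end);

	return violations;
}
```

In this model, toy_count_violations(toy_pud_offset, 3, 50) finds no
violation, while with toy_pud_offset_folded the walker's pud++ drifts
away from what the offset helper returns after the first iteration.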

The other solution would be to address the other side of != and adjust
the pud++

e.g. replace pud++ with something like:
  pud = pud_next_entry(p4d, pud, next)

Such that:
  pud_next_entry(p4d, pud, next) === pud_offset(p4d, next)

In which case the invariant changes to 'callers can never do pointer
arithmetic on the result of pXX_offset()' which is a bit harder to
enforce.
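
A minimal user-space sketch of that alternative, under stated
assumptions: only the name pud_next_entry comes from the text above;
the toy_ types, the table geometry and the helper body are invented
here. The helper simply recomputes the entry from the parent, so the
equivalence with the offset helper holds by construction, and the
walker never does pointer arithmetic on the result:

```c
#include <assert.h>

/* Toy model, not kernel code: 4 entries per table, 16 bytes per entry. */
#define TOY_PUD_SHIFT    4
#define TOY_PUD_SIZE     (1UL << TOY_PUD_SHIFT)
#define TOY_PUD_MASK     (~(TOY_PUD_SIZE - 1))
#define TOY_PTRS_PER_PUD 4

typedef struct { unsigned long val; } toy_pud_t;
typedef struct { toy_pud_t *table; } toy_p4d_t;	/* parent entry */

static toy_pud_t *toy_pud_offset(toy_p4d_t *p4d, unsigned long addr)
{
	return p4d->table + ((addr >> TOY_PUD_SHIFT) & (TOY_PTRS_PER_PUD - 1));
}

/* Boundary arithmetic modelled on the generic pud_addr_end() macro. */
static unsigned long toy_pud_addr_end(unsigned long addr, unsigned long end)
{
	unsigned long boundary = (addr + TOY_PUD_SIZE) & TOY_PUD_MASK;

	return (boundary - 1 < end - 1) ? boundary : end;
}

/*
 * Hypothetical helper: the body is an assumption. The current pud is
 * unused in this model; an arch could use it for a fast path.
 */
static toy_pud_t *toy_pud_next_entry(toy_p4d_t *p4d, toy_pud_t *pud,
				     unsigned long next)
{
	(void)pud;
	return toy_pud_offset(p4d, next);
}

/* Walk [addr, end) with addr < end, mark visited entries, return count. */
static int toy_walk(toy_p4d_t *p4d, unsigned long addr, unsigned long end)
{
	toy_pud_t *pud = toy_pud_offset(p4d, addr);
	unsigned long next;
	int visited = 0;

	do {
		/* The invariant: pud always matches the offset helper. */
		assert(pud == toy_pud_offset(p4d, addr));
		pud->val = 1;
		visited++;
		next = toy_pud_addr_end(addr, end);
	} while (pud = toy_pud_next_entry(p4d, pud, next),
		 addr = next, addr != end);

	return visited;
}
```

On a conventional layout like this one the helper returns the same
pointer that pud + 1 would within a table, so the cost is only the
recomputation; whether that is acceptable codegen on real
architectures is a separate question.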

Jason

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 15:10                     ` Jason Gunthorpe
  (?)
  (?)
@ 2020-09-10 17:07                       ` Gerald Schaefer
  -1 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-10 17:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Anshuman Khandual, Alexander Gordeev, Dave Hansen, John Hubbard,
	LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On Thu, 10 Sep 2020 12:10:26 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Thu, Sep 10, 2020 at 03:28:03PM +0200, Gerald Schaefer wrote:
> > On Thu, 10 Sep 2020 10:02:33 -0300
> > Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >   
> > > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> > >   
> > > > As Gerald mentioned, it is very difficult to explain in a clear way.
> > > > Hopefully, one could make sense out of it.
> > > 
> > > I would say the page table API requires this invariant:
> > > 
> > >         pud = pud_offset(p4d, addr);
> > >         do {
> > > 		WARN_ON(pud != pud_offset(p4d, addr);
> > >                 next = pud_addr_end(addr, end);
> > >         } while (pud++, addr = next, addr != end);
> > > 
> > > ie pud++ is supposed to be a shortcut for 
> > >   pud_offset(p4d, next)
> > > 
> > > While S390 does not follow this. Fixing addr_end brings it into
> > > alignment by preventing pud++ from happening.
> > > 
> > > The only currently known side effect is that gup_fast crashes, but it
> > > sure is an unexpected thing.  
> > 
> > It only is unexpected in a "top-level folding" world, see my other reply.
> > Consider it an optimization, which was possible because of how our dynamic
> > folding works, and e.g. because we can determine the correct pagetable
> > level from a pXd value in pXd_offset.  
> 
> No, I disagree. The page walker API the arch presents has to have well
> defined semantics. For instance, there is an effort to define tests
> and invariants for the page table accesses to bring this understanding
> and uniformity:
> 
>  mm/debug_vm_pgtable.c
> 
> If we fix S390 using the pX_addr_end() change then the above should be
> updated with an invariant to check it. I've added Anshuman for some
> thoughts..

We are very aware of those tests, and are actually big supporters of the
idea. s390 is already among the supported architectures, and the test
has already helped us find / fix some s390 oddities.

However, we did not see any issues wrt our pagetable walking,
neither with the current version, nor with the new generic approach.
We do currently see other issues, Anshuman will know what I mean :-)

> For better or worse, that invariant does exclude arches from using
> other folding techniques.
> 
> The other solution would be to address the other side of != and adjust
> the pud++
> 
> e.g. replace pud++ with something like:
>   pud = pud_next_entry(p4d, pud, next)
> 
> Such that:
>   pud_next_entry(p4d, pud, next) === pud_offset(p4d, next)
> 
> In which case the invariant changes to 'callers can never do pointer
> arithmetic on the result of pXX_offset()' which is a bit harder to
> enforce.

I might have lost track a bit. Are we still talking about possible
functional impacts of either our current pagetable walking with s390
(apart from gup_fast), or the proposed generic change (for s390, or
others?)?

Or is this rather some (other) generic issue / idea that you have,
in order to put "some more structure / enforcement" to generic
pagetable walkers?

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 17:07                       ` Gerald Schaefer
  (?)
  (?)
@ 2020-09-10 17:19                         ` Jason Gunthorpe
  -1 siblings, 0 replies; 254+ messages in thread
From: Jason Gunthorpe @ 2020-09-10 17:19 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Anshuman Khandual, Alexander Gordeev, Dave Hansen, John Hubbard,
	LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On Thu, Sep 10, 2020 at 07:07:57PM +0200, Gerald Schaefer wrote:

> I might have lost track a bit. Are we still talking about possible
> functional impacts of either our current pagetable walking with s390
> (apart from gup_fast), or the proposed generic change (for s390, or
> others?)?

I'm looking for a more understandable explanation of what is wrong with
the S390 implementation.

If the page operations require the invariant I described then it is
quite easy to explain the problem and understand the solution.

Jason

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10  9:39               ` Alexander Gordeev
                                   ` (2 preceding siblings ...)
  (?)
@ 2020-09-10 17:35                 ` Linus Torvalds
  -1 siblings, 0 replies; 254+ messages in thread
From: Linus Torvalds @ 2020-09-10 17:35 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Jason Gunthorpe, Gerald Schaefer, Dave Hansen, John Hubbard,
	LKML, linux-mm, linux-arch, Andrew Morton, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, Sep 10, 2020 at 2:40 AM Alexander Gordeev
<agordeev@linux.ibm.com> wrote:
>
> It is only the gup_fast case that exposes the issue. It hits because
> pointers to stack copies are passed to gup_pXd_range iterators, not
> pointers to the real page tables themselves.

Can we possibly change fast-gup to not do the stack copies?

I'd actually rather do something like that, than the "addr_end" thing.

As you say, none of the other page table walking code does what the
GUP code does, and I don't think it's required.

The GUP code is kind of strange, I'm not quite sure why. Some of it
unusually came from the powerpc code that handled their special odd
hugepage model, and that may be why it's so different.

How painful would it be to just pass the pmd (etc) _pointers_ around,
rather than do the odd "take the address of local copies"?

                  Linus

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 13:02                 ` Jason Gunthorpe
  (?)
  (?)
@ 2020-09-10 17:57                   ` Gerald Schaefer
  -1 siblings, 0 replies; 254+ messages in thread
From: Gerald Schaefer @ 2020-09-10 17:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm,
	linux-arch, Andrew Morton, Linus Torvalds, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, 10 Sep 2020 10:02:33 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> 
> > As Gerald mentioned, it is very difficult to explain in a clear way.
> > Hopefully, one could make sense out of it.
> 
> I would say the page table API requires this invariant:
> 
>         pud = pud_offset(p4d, addr);
>         do {
>                 WARN_ON(pud != pud_offset(p4d, addr));
>                 next = pud_addr_end(addr, end);
>         } while (pud++, addr = next, addr != end);
> 
> ie pud++ is supposed to be a shortcut for 
>   pud_offset(p4d, next)
> 

Hmm, IIUC, all architectures with static folding will simply return
the passed-in p4d pointer for pud_offset(p4d, addr), for 3-level
pagetables. There is no difference for s390. For gup_fast, that p4d
pointer is not really a pointer to a value in a pagetable, but
to some local copy of such a value, and not just for s390.

So, pud = p4d = pointer to copy, and increasing that pud pointer
cannot be the same as pud_offset(p4d, next). I do see your point
however, at least I think :-) My problem is that I do not see where
we would have an s390-specific issue here. Maybe my understanding
of how it works for others with static folding is wrong. That
would explain my difficulties in getting your point...

> While S390 does not follow this. Fixing addr_end brings it into
> alignment by preventing pud++ from happening.

Exactly, only that nobody seems to follow it, IIUC. Fixing it up
with pXd_addr_end was my impression of what we need to do, in order to
have it work the same way as for others.

> The only currently known side effect is that gup_fast crashes, but it
> sure is an unexpected thing.

Well, from my understanding it feels more unexpected that something
that is supposed to be a pointer to an entry in a page table, really is
just a pointer to some copy somewhere.
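
To make the folding argument above concrete, here is a minimal
userspace sketch in C. It is only a model under stated assumptions:
all names (model_pud_offset, addr_end_folded, addr_end_sized,
count_invariant_breaks, MODEL_PUD_SIZE) are hypothetical and not
kernel code. It assumes a folded pud_offset() that returns the
pointer it was given, and counts how often the WARN_ON from the
quoted invariant would fire. With an addr_end that returns "end", the
loop runs once and the stepped pud is never used. With size-based
addr_end arithmetic on a folded level, which is how I read the s390
case discussed here, the pointer steps away from the copy.

```c
#include <stdint.h>

/* Userspace model; names mirror the thread but the code is hypothetical. */
#define MODEL_PUD_SIZE 0x1000UL

typedef struct { uintptr_t val; } model_p4d_t;
typedef struct { model_p4d_t p4d; } model_pud_t;   /* folded: same storage */

/* Folded offset: hands back the pointer it was given. */
static model_pud_t *model_pud_offset(model_p4d_t *p4d, unsigned long addr)
{
    (void)addr;
    return (model_pud_t *)p4d;
}

/* Static folding: the folded level always covers the whole range. */
static unsigned long addr_end_folded(unsigned long addr, unsigned long end)
{
    (void)addr;
    return end;
}

/* Size-based arithmetic: next boundary, clamped to end. */
static unsigned long addr_end_sized(unsigned long addr, unsigned long end)
{
    unsigned long boundary = (addr + MODEL_PUD_SIZE) & ~(MODEL_PUD_SIZE - 1);
    return (boundary - 1 < end - 1) ? boundary : end;
}

typedef unsigned long (*addr_end_fn)(unsigned long, unsigned long);

/*
 * Counts loop iterations in which pud no longer equals
 * model_pud_offset(p4d, addr). The pointer is tracked as an integer
 * and never dereferenced.
 */
static int count_invariant_breaks(model_p4d_t *p4d, unsigned long addr,
                                  unsigned long end, addr_end_fn addr_end)
{
    uintptr_t pud = (uintptr_t)model_pud_offset(p4d, addr);
    unsigned long next;
    int breaks = 0;

    do {
        if (pud != (uintptr_t)model_pud_offset(p4d, addr))
            breaks++;
        next = addr_end(addr, end);
        pud += sizeof(model_pud_t);       /* stands in for pud++ */
    } while (addr = next, addr != end);

    return breaks;
}
```

Passing the address of a local copy together with addr_end_folded
gives zero breaks; the same call with addr_end_sized over a range
spanning several MODEL_PUD_SIZE units reports one break for every
iteration after the first.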

^ permalink raw reply	[flat|nested] 254+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 17:35                 ` Linus Torvalds
  (?)
  (?)
@ 2020-09-10 18:13                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 254+ messages in thread
From: Jason Gunthorpe @ 2020-09-10 18:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Gordeev, Gerald Schaefer, Dave Hansen, John Hubbard,
	LKML, linux-mm, linux-arch, Andrew Morton, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, Sep 10, 2020 at 10:35:38AM -0700, Linus Torvalds wrote:
> On Thu, Sep 10, 2020 at 2:40 AM Alexander Gordeev
> <agordeev@linux.ibm.com> wrote:
> >
> > It is only the gup_fast case that exposes the issue. It hits because
> > pointers to stack copies are passed to gup_pXd_range iterators, not
> > pointers to the real page tables themselves.
> 
> Can we possibly change fast-gup to not do the stack copies?
>
> I'd actually rather do something like that, than the "addr_end" thing.

> As you say, none of the other page table walking code does what the
> GUP code does, and I don't think it's required.

As I understand it, the requirement is because fast-gup walks without
the page table spinlock or mmap_sem held, so it must READ_ONCE the
*pXX.

It then checks that it is a valid page table pointer, then calls
pXX_offset().

The arch implementation of pXX_offset() dereferences the passed pXX
pointer again. So it defeats the READ_ONCE and the 2nd load could observe
something that is no longer a page table pointer and crash.

Passing it the address of the stack value is a way to force
pXX_offset() to use the READ_ONCE result which has already been tested
to be a page table pointer.
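
A rough userspace sketch of that pattern, in C, may help. Everything
in it is hypothetical (entry_t, load_once, lower_offset, walk_reload,
walk_snapshot, and the race hook are illustration names, not kernel
APIs). It assumes an offset helper that dereferences whatever pointer
it receives, and uses a callback to stand in for a concurrent update
between the check and the use. Addresses are returned as integers so
nothing invalid is ever dereferenced.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace model only; every name here is hypothetical, not a kernel API. */
typedef struct { uintptr_t val; } entry_t;

#define ENTRY_NOT_TABLE 0x1UL  /* low bit set: entry does not point to a table */

/* Single volatile load, standing in for READ_ONCE(). */
static entry_t load_once(const entry_t *slot)
{
    entry_t e = { *(const volatile uintptr_t *)&slot->val };
    return e;
}

static int is_table(entry_t e)
{
    return e.val != 0 && !(e.val & ENTRY_NOT_TABLE);
}

/* Like an arch pXX_offset(): dereferences whatever pointer it is given. */
static uintptr_t lower_offset(const entry_t *upper, size_t idx)
{
    return upper->val + idx * sizeof(entry_t);
}

/* Hook that stands in for a concurrent update between check and use. */
typedef void (*race_hook_t)(entry_t *slot);

/* Checks a snapshot, then lets lower_offset() reload the live slot. */
static uintptr_t walk_reload(entry_t *slot, size_t idx, race_hook_t hook)
{
    entry_t snap = load_once(slot);

    if (!is_table(snap))
        return 0;
    if (hook)
        hook(slot);
    return lower_offset(slot, idx);   /* second load, never checked */
}

/* Checks a snapshot and passes the address of that same snapshot down. */
static uintptr_t walk_snapshot(entry_t *slot, size_t idx, race_hook_t hook)
{
    entry_t snap = load_once(slot);

    if (!is_table(snap))
        return 0;
    if (hook)
        hook(slot);
    return lower_offset(&snap, idx);  /* uses the value that was checked */
}

/* Example hook: the slot stops being a table pointer. */
static void make_not_table(entry_t *slot)
{
    slot->val = 0x1000 | ENTRY_NOT_TABLE;
}
```

With make_not_table as the hook, walk_snapshot still yields the
address inside the table that was validated, while walk_reload
computes an address from the changed entry.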

Other page walking code that holds the mmap_sem tends to use
pmd_trans_unstable() which solves this problem by injecting a
barrier. The load hidden in pte_offset() after a pmd_trans_unstable()
can't be re-ordered and will only see a page table entry under the
mmap_sem.

However, I think that logic would have been much clearer following the
GUP model of READ_ONCE vs extra reads and a hidden barrier. At least
it took me a long time to work it out :(

I also think there are real bugs here where places are reading *pXX
multiple times without locking the page table. O