* [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-07 18:00 ` Gerald Schaefer
  0 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

This is v2 of an RFC previously discussed here:
https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/

Patch 1 is a fix for a regression in gup_fast on s390, introduced by our
conversion to the common gup_fast code. It introduces special helper
functions pXd_addr_end_folded(), which have to be used in places where the
page table walk is done without locks and with READ_ONCE, so currently
only in gup_fast.

Patch 2 is an attempt to make that more generic, i.e. change the
pXd_addr_end() helpers themselves by adding an extra pXd value parameter.
That was suggested by Jason during the v1 discussion, because he is
already thinking of other places where he might want to switch to the
READ_ONCE logic for page table walks. In general, that would be the
cleanest / safest solution, but there is some impact on other
architectures and common code, hence the new and greatly enlarged
recipient list.
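
To illustrate the idea, here is a rough sketch (not the actual patch; the
real change touches all levels and all callers): the generic helper would
gain the entry value as a parameter and simply ignore it, while s390 could
derive the real boundary from it:

#define pud_addr_end(pud, addr, end)					\
({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
	(__boundary - 1 < (end) - 1) ? __boundary : (end);		\
})

Callers that already hold the READ_ONCE'd value, like gup_fast, would then
call pud_addr_end(pud, addr, end) instead of pud_addr_end(addr, end).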

Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline
functions instead of #defines, so that we get some type checking for the
new pXd value parameter.
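
As a sketch of what patch 3 would do on top of that (again not the actual
patch), the macro above could become an inline function, so that passing
a wrong entry type is caught at compile time:

static inline unsigned long pud_addr_end(pud_t pud, unsigned long addr,
					 unsigned long end)
{
	unsigned long boundary = (addr + PUD_SIZE) & PUD_MASK;

	return (boundary - 1 < end - 1) ? boundary : end;
}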

Not sure about Fixes/stable tags for the generic solution. Only patch 1
fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might
still be nice to have in stable, to ease future backports, but I guess
"nice to have" does not really qualify for stable backports.

Changes in v2:
- Pick option 2 from v1 discussion (pXd_addr_end_folded helpers)
- Add patch 2 + 3 for more generic approach

Alexander Gordeev (3):
  mm/gup: fix gup_fast with dynamic page table folding
  mm: make pXd_addr_end() functions page-table entry aware
  mm: make generic pXd_addr_end() macros inline functions

 arch/arm/include/asm/pgtable-2level.h    |  2 +-
 arch/arm/mm/idmap.c                      |  6 ++--
 arch/arm/mm/mmu.c                        |  8 ++---
 arch/arm64/kernel/hibernate.c            | 16 +++++----
 arch/arm64/kvm/mmu.c                     | 16 ++++-----
 arch/arm64/mm/kasan_init.c               |  8 ++---
 arch/arm64/mm/mmu.c                      | 25 +++++++-------
 arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++--
 arch/powerpc/mm/hugetlbpage.c            |  6 ++--
 arch/s390/include/asm/pgtable.h          | 42 ++++++++++++++++++++++++
 arch/s390/mm/page-states.c               |  8 ++---
 arch/s390/mm/pageattr.c                  |  8 ++---
 arch/s390/mm/vmem.c                      |  8 ++---
 arch/sparc/mm/hugetlbpage.c              |  6 ++--
 arch/um/kernel/tlb.c                     |  8 ++---
 arch/x86/mm/init_64.c                    | 15 ++++-----
 arch/x86/mm/kasan_init_64.c              | 16 ++++-----
 include/asm-generic/pgtable-nop4d.h      |  2 +-
 include/asm-generic/pgtable-nopmd.h      |  2 +-
 include/asm-generic/pgtable-nopud.h      |  2 +-
 include/linux/pgtable.h                  | 38 ++++++++++++---------
 mm/gup.c                                 |  8 ++---
 mm/ioremap.c                             |  8 ++---
 mm/kasan/init.c                          | 17 +++++-----
 mm/madvise.c                             |  4 +--
 mm/memory.c                              | 40 +++++++++++-----------
 mm/mlock.c                               | 18 +++++++---
 mm/mprotect.c                            |  8 ++---
 mm/pagewalk.c                            |  8 ++---
 mm/swapfile.c                            |  8 ++---
 mm/vmalloc.c                             | 16 ++++-----
 31 files changed, 219 insertions(+), 165 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-07 18:00   ` Gerald Schaefer
  -1 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

From: Alexander Gordeev <agordeev@linux.ibm.com>

Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
code") introduced a subtle but severe bug on s390 with gup_fast, due to
dynamic page table folding.

The question "What would it require for the generic code to work for s390"
has already been discussed here
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
and ended with a promising approach here
https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
which unfortunately did not work out completely in the end.

We tried to mimic static level folding by changing pgd_offset to always
calculate the top-level page table offset, and to do nothing in the folded
pXd_offset helpers. What was overlooked is that PxD_SIZE/MASK, and thus
pXd_addr_end, do not reflect this dynamic behaviour and still act like
static 5-level page tables.
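
For reference, the generic boundary helpers compute the boundary purely
from compile-time constants, roughly like this (pgd shown, the other
levels look the same with their respective SIZE/MASK):

#define pgd_addr_end(addr, end)						\
({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
	(__boundary - 1 < (end) - 1) ? __boundary : (end);		\
})

With dynamic folding, PGDIR_SIZE/PGDIR_MASK still describe the full
5-level layout, so this boundary does not stop at the folded 2 GB level.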

Here is an example of what happens with gup_fast on s390, for a task with
3-level paging, crossing a 2 GB pud boundary:

// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
{
        unsigned long next;
        pud_t *pudp;

        // pud_offset returns &p4d itself (a pointer to a value on stack)
        pudp = pud_offset(&p4d, addr);
        do {
                // on the second iteration this reads a "random" stack value
                pud_t pud = READ_ONCE(*pudp);

                // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                next = pud_addr_end(addr, end);
                ...
        } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

        return 1;
}

pud_addr_end = 0x10080000000 is correct, but the preceding pgd/p4d_addr_end
calls should also have returned that limit, instead of the static 5-level
pgd/p4d limits where PUD_SIZE/MASK != PGDIR_SIZE/MASK. Then the "end"
parameter for gup_pud_range would also have been 0x10080000000, and we
would not iterate further in gup_pud_range, but rather go back and
(correctly) do it in gup_pgd_range.
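
As a worked example (assuming the usual s390 layout, where a region-third
entry maps 2 GB):

  folded boundary:  (0x1007ffff000 + 0x80000000) & ~0x7fffffff = 0x10080000000
  static boundary:  lies beyond end for pgd/p4d, so pgd/p4d_addr_end()
                    simply return end = 0x10080001000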

So, in the second iteration of gup_pud_range, we will increment pudp,
which pointed to a stack value and not to the real pud table. This new
pudp will then point to whatever lies behind the p4d stack value. In
general, this happens to be the previously read pgd, but it could also be
something different, depending on compiler decisions.

Most unfortunately, if it happens to be the pgd value, which is the same
as the p4d / pud due to folding, it is a valid and present entry, so after
the increment we would still point to the same pud entry. However, addr
has been advanced in the second iteration, so we now get different
pmd/pte_index values, which results in very wrong behaviour for the
remaining gup_pmd/pte_range calls: because of the missing pudp increment,
we effectively operate on an address minus 2 GB.

In the "good case", if nothing is mapped there, we will fall back to
the slow gup path. But if something is mapped there, and valid
for gup_fast, we will end up (silently) getting references on the wrong
pages and also add the wrong pages to the **pages result array. This
can cause data corruption.

Fix this by introducing new pXd_addr_end_folded helpers, which take an
additional pXd entry value parameter that can be used on s390 to determine
the correct page table level and return the corresponding end / boundary.
With that, the pointer iteration will always happen in gup_pgd_range for
s390. No change is introduced for other architectures.

Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Cc: <stable@vger.kernel.org> # 5.2+
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++
 include/linux/pgtable.h         | 16 +++++++++++++
 mm/gup.c                        |  8 +++----
 3 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7eb01a5459cd..027206e4959d 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
 }
 #define mm_pmd_folded(mm) mm_pmd_folded(mm)
 
+/*
+ * With dynamic page table levels on s390, the static pXd_addr_end() functions
+ * will not return corresponding dynamic boundaries. This is no problem as long
+ * as only pXd pointers are passed down during page table walk, because
+ * pXd_offset() will simply return the given pointer for folded levels, and the
+ * pointer iteration over a range simply happens at the correct page table
+ * level.
+ * It is however a problem with gup_fast, or other places walking the page
+ * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
+ * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
+ * a stack variable, which cannot be used for pointer iteration at the correct
+ * level. Instead, the iteration then has to happen by going up to pgd level
+ * again. To allow this, provide pXd_addr_end_folded() functions with an
+ * additional pXd value parameter, which can be used on s390 to determine the
+ * folding level and return the corresponding boundary.
+ */
+static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end)
+{
+	unsigned long type = (rste & _REGION_ENTRY_TYPE_MASK) >> 2;
+	unsigned long size = 1UL << (_SEGMENT_SHIFT + type * 11);
+	unsigned long boundary = (addr + size) & ~(size - 1);
+
+	/*
+	 * FIXME The below check is for internal testing only, to be removed
+	 */
+	VM_BUG_ON(type < (_REGION_ENTRY_TYPE_R3 >> 2));
+
+	return (boundary - 1) < (end - 1) ? boundary : end;
+}
+
+#define pgd_addr_end_folded pgd_addr_end_folded
+static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
+{
+	return rste_addr_end_folded(pgd_val(pgd), addr, end);
+}
+
+#define p4d_addr_end_folded p4d_addr_end_folded
+static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
+{
+	return rste_addr_end_folded(p4d_val(p4d), addr, end);
+}
+
 static inline int mm_has_pgste(struct mm_struct *mm)
 {
 #ifdef CONFIG_PGSTE
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8cbc2e795d5..981c4c2a31fe 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -681,6 +681,22 @@ static inline int arch_unmap_one(struct mm_struct *mm,
 })
 #endif
 
+#ifndef pgd_addr_end_folded
+#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
+#endif
+
+#ifndef p4d_addr_end_folded
+#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
+#endif
+
+#ifndef pud_addr_end_folded
+#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
+#endif
+
+#ifndef pmd_addr_end_folded
+#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
+#endif
+
 /*
  * When walking page tables, we usually want to skip any p?d_none entries;
  * and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/gup.c b/mm/gup.c
index bd883a112724..ba4aace5d0f4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end_folded(pmd, addr, end);
 		if (!pmd_present(pmd))
 			return 0;
 
@@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end_folded(pud, addr, end);
 		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
@@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end_folded(p4d, addr, end);
 		if (p4d_none(p4d))
 			return 0;
 		BUILD_BUG_ON(p4d_huge(p4d));
@@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		pgd_t pgd = READ_ONCE(*pgdp);
 
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end_folded(pgd, addr, end);
 		if (pgd_none(pgd))
 			return;
 		if (unlikely(pgd_huge(pgd))) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 313+ messages in thread

* [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-07 18:00   ` Gerald Schaefer
  0 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

From: Alexander Gordeev <agordeev@linux.ibm.com>

Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
code") introduced a subtle but severe bug on s390 with gup_fast, due to
dynamic page table folding.

The question "What would it require for the generic code to work for s390"
has already been discussed here
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
and ended with a promising approach here
https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
which in the end unfortunately didn't quite work completely.

We tried to mimic static level folding by changing pgd_offset to always
calculate top level page table offset, and do nothing in folded pXd_offset.
What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
not reflect this dynamic behaviour, and still act like static 5-level
page tables.

Here is an example of what happens with gup_fast on s390, for a task with
3-levels paging, crossing a 2 GB pud boundary:

// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
{
        unsigned long next;
        pud_t *pudp;

        // pud_offset returns &p4d itself (a pointer to a value on stack)
        pudp = pud_offset(&p4d, addr);
        do {
                // on second iteratation reading "random" stack value
                pud_t pud = READ_ONCE(*pudp);

                // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                next = pud_addr_end(addr, end);
                ...
        } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

        return 1;
}

pud_addr_end = 0x10080000000 is correct, but the previous pgd/p4d_addr_end
should also have returned that limit, instead of the 5-level static
pgd/p4d limits with PUD_SIZE/MASK != PGDIR_SIZE/MASK. Then the "end"
parameter for gup_pud_range would also have been 0x10080000000, and we
would not iterate further in gup_pud_range, but rather go back and
(correctly) do it in gup_pgd_range.

So, for the second iteration in gup_pud_range, we will increase pudp,
which pointed to a stack value and not the real pud table. This new pudp
will then point to whatever lies behind the p4d stack value. In general,
this happens to be the previously read pgd, but it probably could also
be something different, depending on compiler decisions.

Most unfortunately, if it happens to be the pgd value, which is the
same as the p4d / pud due to folding, it is a valid and present entry.
So after the increment, we would still point to the same pud entry.
The addr however has been increased in the second iteration, so that we
now have different pmd/pte_index values, which will result in very wrong
behaviour for the remaining gup_pmd/pte_range calls. We will effectively
operate on an address minus 2 GB, due to missing pudp increase.

In the "good case", if nothing is mapped there, we will fall back to
the slow gup path. But if something is mapped there, and valid
for gup_fast, we will end up (silently) getting references on the wrong
pages and also add the wrong pages to the **pages result array. This
can cause data corruption.

Fix this by introducing new pXd_addr_end_folded helpers, which take an
additional pXd entry value parameter, that can be used on s390
to determine the correct page table level and return corresponding
end / boundary. With that, the pointer iteration will always
happen in gup_pgd_range for s390. No change for other architectures
introduced.

Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Cc: <stable@vger.kernel.org> # 5.2+
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++
 include/linux/pgtable.h         | 16 +++++++++++++
 mm/gup.c                        |  8 +++----
 3 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7eb01a5459cd..027206e4959d 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
 }
 #define mm_pmd_folded(mm) mm_pmd_folded(mm)
 
+/*
+ * With dynamic page table levels on s390, the static pXd_addr_end() functions
+ * will not return corresponding dynamic boundaries. This is no problem as long
+ * as only pXd pointers are passed down during page table walk, because
+ * pXd_offset() will simply return the given pointer for folded levels, and the
+ * pointer iteration over a range simply happens at the correct page table
+ * level.
+ * It is however a problem with gup_fast, or other places walking the page
+ * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
+ * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
+ * a stack variable, which cannot be used for pointer iteration at the correct
+ * level. Instead, the iteration then has to happen by going up to pgd level
+ * again. To allow this, provide pXd_addr_end_folded() functions with an
+ * additional pXd value parameter, which can be used on s390 to determine the
+ * folding level and return the corresponding boundary.
+ */
+static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end)
+{
+	unsigned long type = (rste & _REGION_ENTRY_TYPE_MASK) >> 2;
+	unsigned long size = 1UL << (_SEGMENT_SHIFT + type * 11);
+	unsigned long boundary = (addr + size) & ~(size - 1);
+
+	/*
+	 * FIXME The below check is for internal testing only, to be removed
+	 */
+	VM_BUG_ON(type < (_REGION_ENTRY_TYPE_R3 >> 2));
+
+	return (boundary - 1) < (end - 1) ? boundary : end;
+}
+
+#define pgd_addr_end_folded pgd_addr_end_folded
+static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
+{
+	return rste_addr_end_folded(pgd_val(pgd), addr, end);
+}
+
+#define p4d_addr_end_folded p4d_addr_end_folded
+static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
+{
+	return rste_addr_end_folded(p4d_val(p4d), addr, end);
+}
+
 static inline int mm_has_pgste(struct mm_struct *mm)
 {
 #ifdef CONFIG_PGSTE
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8cbc2e795d5..981c4c2a31fe 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -681,6 +681,22 @@ static inline int arch_unmap_one(struct mm_struct *mm,
 })
 #endif
 
+#ifndef pgd_addr_end_folded
+#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
+#endif
+
+#ifndef p4d_addr_end_folded
+#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
+#endif
+
+#ifndef pud_addr_end_folded
+#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
+#endif
+
+#ifndef pmd_addr_end_folded
+#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
+#endif
+
 /*
  * When walking page tables, we usually want to skip any p?d_none entries;
  * and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/gup.c b/mm/gup.c
index bd883a112724..ba4aace5d0f4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end_folded(pmd, addr, end);
 		if (!pmd_present(pmd))
 			return 0;
 
@@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end_folded(pud, addr, end);
 		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
@@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end_folded(p4d, addr, end);
 		if (p4d_none(p4d))
 			return 0;
 		BUILD_BUG_ON(p4d_huge(p4d));
@@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		pgd_t pgd = READ_ONCE(*pgdp);
 
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end_folded(pgd, addr, end);
 		if (pgd_none(pgd))
 			return;
 		if (unlikely(pgd_huge(pgd))) {
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 313+ messages in thread

* [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-07 18:00   ` Gerald Schaefer
  0 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport

From: Alexander Gordeev <agordeev@linux.ibm.com>

Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
code") introduced a subtle but severe bug on s390 with gup_fast, due to
dynamic page table folding.

The question "What would it require for the generic code to work for s390"
has already been discussed here
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
and ended with a promising approach here
https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
which in the end unfortunately didn't quite work completely.

We tried to mimic static level folding by changing pgd_offset to always
calculate top level page table offset, and do nothing in folded pXd_offset.
What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
not reflect this dynamic behaviour, and still act like static 5-level
page tables.

Here is an example of what happens with gup_fast on s390, for a task with
3-levels paging, crossing a 2 GB pud boundary:

// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
{
        unsigned long next;
        pud_t *pudp;

        // pud_offset returns &p4d itself (a pointer to a value on stack)
        pudp = pud_offset(&p4d, addr);
        do {
                // on second iteratation reading "random" stack value
                pud_t pud = READ_ONCE(*pudp);

                // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                next = pud_addr_end(addr, end);
                ...
        } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

        return 1;
}

pud_addr_end = 0x10080000000 is correct, but the previous pgd/p4d_addr_end
should also have returned that limit, instead of the 5-level static
pgd/p4d limits with PUD_SIZE/MASK != PGDIR_SIZE/MASK. Then the "end"
parameter for gup_pud_range would also have been 0x10080000000, and we
would not iterate further in gup_pud_range, but rather go back and
(correctly) do it in gup_pgd_range.

So, for the second iteration in gup_pud_range, we will increase pudp,
which pointed to a stack value and not the real pud table. This new pudp
will then point to whatever lies behind the p4d stack value. In general,
this happens to be the previously read pgd, but it probably could also
be something different, depending on compiler decisions.

Most unfortunately, if it happens to be the pgd value, which is the
same as the p4d / pud due to folding, it is a valid and present entry.
So after the increment, we would still point to the same pud entry.
The addr however has been increased in the second iteration, so that we
now have different pmd/pte_index values, which will result in very wrong
behaviour for the remaining gup_pmd/pte_range calls. We will effectively
operate on an address minus 2 GB, due to missing pudp increase.

In the "good case", if nothing is mapped there, we will fall back to
the slow gup path. But if something is mapped there, and valid
for gup_fast, we will end up (silently) getting references on the wrong
pages and also add the wrong pages to the **pages result array. This
can cause data corruption.

Fix this by introducing new pXd_addr_end_folded helpers, which take an
additional pXd entry value parameter, that can be used on s390
to determine the correct page table level and return corresponding
end / boundary. With that, the pointer iteration will always
happen in gup_pgd_range for s390. No change for other architectures
introduced.

Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Cc: <stable@vger.kernel.org> # 5.2+
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++
 include/linux/pgtable.h         | 16 +++++++++++++
 mm/gup.c                        |  8 +++----
 3 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7eb01a5459cd..027206e4959d 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
 }
 #define mm_pmd_folded(mm) mm_pmd_folded(mm)
 
+/*
+ * With dynamic page table levels on s390, the static pXd_addr_end() functions
+ * will not return corresponding dynamic boundaries. This is no problem as long
+ * as only pXd pointers are passed down during page table walk, because
+ * pXd_offset() will simply return the given pointer for folded levels, and the
+ * pointer iteration over a range simply happens at the correct page table
+ * level.
+ * It is however a problem with gup_fast, or other places walking the page
+ * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
+ * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
+ * a stack variable, which cannot be used for pointer iteration at the correct
+ * level. Instead, the iteration then has to happen by going up to pgd level
+ * again. To allow this, provide pXd_addr_end_folded() functions with an
+ * additional pXd value parameter, which can be used on s390 to determine the
+ * folding level and return the corresponding boundary.
+ */
+static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end)
+{
+	unsigned long type = (rste & _REGION_ENTRY_TYPE_MASK) >> 2;
+	unsigned long size = 1UL << (_SEGMENT_SHIFT + type * 11);
+	unsigned long boundary = (addr + size) & ~(size - 1);
+
+	/*
+	 * FIXME The below check is for internal testing only, to be removed
+	 */
+	VM_BUG_ON(type < (_REGION_ENTRY_TYPE_R3 >> 2));
+
+	return (boundary - 1) < (end - 1) ? boundary : end;
+}
+
+#define pgd_addr_end_folded pgd_addr_end_folded
+static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
+{
+	return rste_addr_end_folded(pgd_val(pgd), addr, end);
+}
+
+#define p4d_addr_end_folded p4d_addr_end_folded
+static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
+{
+	return rste_addr_end_folded(p4d_val(p4d), addr, end);
+}
+
 static inline int mm_has_pgste(struct mm_struct *mm)
 {
 #ifdef CONFIG_PGSTE
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8cbc2e795d5..981c4c2a31fe 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -681,6 +681,22 @@ static inline int arch_unmap_one(struct mm_struct *mm,
 })
 #endif
 
+#ifndef pgd_addr_end_folded
+#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
+#endif
+
+#ifndef p4d_addr_end_folded
+#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
+#endif
+
+#ifndef pud_addr_end_folded
+#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
+#endif
+
+#ifndef pmd_addr_end_folded
+#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
+#endif
+
 /*
  * When walking page tables, we usually want to skip any p?d_none entries;
  * and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/gup.c b/mm/gup.c
index bd883a112724..ba4aace5d0f4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end_folded(pmd, addr, end);
 		if (!pmd_present(pmd))
 			return 0;
 
@@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end_folded(pud, addr, end);
 		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
@@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end_folded(p4d, addr, end);
 		if (p4d_none(p4d))
 			return 0;
 		BUILD_BUG_ON(p4d_huge(p4d));
@@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		pgd_t pgd = READ_ONCE(*pgdp);
 
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end_folded(pgd, addr, end);
 		if (pgd_none(pgd))
 			return;
 		if (unlikely(pgd_huge(pgd))) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 313+ messages in thread

* [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-07 18:00   ` Gerald Schaefer
  0 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Benjamin Herrenschmidt, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Michael Ellerman, Andrew Morton,
	Linus Torvalds, Mike Rapoport

From: Alexander Gordeev <agordeev@linux.ibm.com>

Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
code") introduced a subtle but severe bug on s390 with gup_fast, due to
dynamic page table folding.

The question "What would it require for the generic code to work for s390"
has already been discussed here
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
and ended with a promising approach here
https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
which in the end unfortunately didn't quite work completely.

We tried to mimic static level folding by changing pgd_offset to always
calculate top level page table offset, and do nothing in folded pXd_offset.
What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
not reflect this dynamic behaviour, and still act like static 5-level
page tables.

Here is an example of what happens with gup_fast on s390, for a task with
3-levels paging, crossing a 2 GB pud boundary:

// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
{
        unsigned long next;
        pud_t *pudp;

        // pud_offset returns &p4d itself (a pointer to a value on stack)
        pudp = pud_offset(&p4d, addr);
        do {
                // on second iteratation reading "random" stack value
                pud_t pud = READ_ONCE(*pudp);

                // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                next = pud_addr_end(addr, end);
                ...
        } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

        return 1;
}

pud_addr_end = 0x10080000000 is correct, but the previous pgd/p4d_addr_end
should also have returned that limit, instead of the 5-level static
pgd/p4d limits with PUD_SIZE/MASK != PGDIR_SIZE/MASK. Then the "end"
parameter for gup_pud_range would also have been 0x10080000000, and we
would not iterate further in gup_pud_range, but rather go back and
(correctly) do it in gup_pgd_range.

So, for the second iteration in gup_pud_range, we will increase pudp,
which pointed to a stack value and not the real pud table. This new pudp
will then point to whatever lies behind the p4d stack value. In general,
this happens to be the previously read pgd, but it probably could also
be something different, depending on compiler decisions.

Most unfortunately, if it happens to be the pgd value, which is the
same as the p4d / pud due to folding, it is a valid and present entry.
So after the increment, we would still point to the same pud entry.
The addr however has been increased in the second iteration, so that we
now have different pmd/pte_index values, which will result in very wrong
behaviour for the remaining gup_pmd/pte_range calls. We will effectively
operate on an address minus 2 GB, due to missing pudp increase.

In the "good case", if nothing is mapped there, we will fall back to
the slow gup path. But if something is mapped there, and valid
for gup_fast, we will end up (silently) getting references on the wrong
pages and also add the wrong pages to the **pages result array. This
can cause data corruption.

Fix this by introducing new pXd_addr_end_folded helpers, which take an
additional pXd entry value parameter, that can be used on s390
to determine the correct page table level and return corresponding
end / boundary. With that, the pointer iteration will always
happen in gup_pgd_range for s390. No change for other architectures
introduced.

Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Cc: <stable@vger.kernel.org> # 5.2+
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++
 include/linux/pgtable.h         | 16 +++++++++++++
 mm/gup.c                        |  8 +++----
 3 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7eb01a5459cd..027206e4959d 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
 }
 #define mm_pmd_folded(mm) mm_pmd_folded(mm)
 
+/*
+ * With dynamic page table levels on s390, the static pXd_addr_end() functions
+ * will not return corresponding dynamic boundaries. This is no problem as long
+ * as only pXd pointers are passed down during page table walk, because
+ * pXd_offset() will simply return the given pointer for folded levels, and the
+ * pointer iteration over a range simply happens at the correct page table
+ * level.
+ * It is however a problem with gup_fast, or other places walking the page
+ * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
+ * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
+ * a stack variable, which cannot be used for pointer iteration at the correct
+ * level. Instead, the iteration then has to happen by going up to pgd level
+ * again. To allow this, provide pXd_addr_end_folded() functions with an
+ * additional pXd value parameter, which can be used on s390 to determine the
+ * folding level and return the corresponding boundary.
+ */
+static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end)
+{
+	unsigned long type = (rste & _REGION_ENTRY_TYPE_MASK) >> 2;
+	unsigned long size = 1UL << (_SEGMENT_SHIFT + type * 11);
+	unsigned long boundary = (addr + size) & ~(size - 1);
+
+	/*
+	 * FIXME The below check is for internal testing only, to be removed
+	 */
+	VM_BUG_ON(type < (_REGION_ENTRY_TYPE_R3 >> 2));
+
+	return (boundary - 1) < (end - 1) ? boundary : end;
+}
+
+#define pgd_addr_end_folded pgd_addr_end_folded
+static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
+{
+	return rste_addr_end_folded(pgd_val(pgd), addr, end);
+}
+
+#define p4d_addr_end_folded p4d_addr_end_folded
+static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
+{
+	return rste_addr_end_folded(p4d_val(p4d), addr, end);
+}
+
 static inline int mm_has_pgste(struct mm_struct *mm)
 {
 #ifdef CONFIG_PGSTE
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8cbc2e795d5..981c4c2a31fe 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -681,6 +681,22 @@ static inline int arch_unmap_one(struct mm_struct *mm,
 })
 #endif
 
+#ifndef pgd_addr_end_folded
+#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
+#endif
+
+#ifndef p4d_addr_end_folded
+#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
+#endif
+
+#ifndef pud_addr_end_folded
+#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
+#endif
+
+#ifndef pmd_addr_end_folded
+#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
+#endif
+
 /*
  * When walking page tables, we usually want to skip any p?d_none entries;
  * and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/gup.c b/mm/gup.c
index bd883a112724..ba4aace5d0f4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end_folded(pmd, addr, end);
 		if (!pmd_present(pmd))
 			return 0;
 
@@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end_folded(pud, addr, end);
 		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
@@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end_folded(p4d, addr, end);
 		if (p4d_none(p4d))
 			return 0;
 		BUILD_BUG_ON(p4d_huge(p4d));
@@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		pgd_t pgd = READ_ONCE(*pgdp);
 
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end_folded(pgd, addr, end);
 		if (pgd_none(pgd))
 			return;
 		if (unlikely(pgd_huge(pgd))) {
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 313+ messages in thread


* [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-07 18:00 ` Gerald Schaefer
                     ` (2 preceding siblings ...)
  (?)
@ 2020-09-07 18:00   ` Gerald Schaefer
  -1 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

From: Alexander Gordeev <agordeev@linux.ibm.com>

Unlike all other page-table abstractions, the pXd_addr_end() functions
do not take into account the particular table entry in whose context
they are called. On architectures with dynamic page-table folding, that
can mean they lack necessary information that is difficult to obtain
other than from the table entry itself. This already led to a subtle
memory corruption issue on s390.

By letting the pXd_addr_end() functions know about the page-table
entry, we allow architectures not only to make extra checks, but also
to apply optimizations.

As a result of this change, the pXd_addr_end_folded() functions used
in the gup_fast traversal code become unnecessary and are replaced
with the universal pXd_addr_end() variants.
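
For architectures that do not provide their own variants, the generic
macros simply ignore the new entry argument, so the calculation is
unchanged there. A minimal standalone sketch (the PMD_SIZE value is
chosen arbitrarily for illustration and is an assumption, not taken
from any particular architecture):

#include <stdio.h>

#define PMD_SIZE (1UL << 21)            /* arbitrary value for the sketch */
#define PMD_MASK (~(PMD_SIZE - 1))

/* generic form after this change: the entry is accepted but unused */
#define pmd_addr_end(pmd, addr, end)					\
({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
	(__boundary - 1 < (end) - 1) ? __boundary : (end);		\
})

int main(void)
{
        unsigned long dummy_pmd = 0;    /* ignored by the generic macro */
        unsigned long next = pmd_addr_end(dummy_pmd, 0x1ff000UL, 0x201000UL);

        printf("next = %lx\n", next);   /* prints 200000 */
        return 0;
}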

The arch-specific updates not only add dereferencing of page-table
entry pointers, but also make small changes to the code flow so that
those dereferences become possible, at least for x86 and powerpc. The
same is done for arm64, but in a way that should not have any impact.

So, even though the dereferenced page-table entries are not used on
architectures other than s390, and are optimized out by the compiler,
there is a small change in kernel size, as reported by bloat-o-meter:

x86:
add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10)
Function                                     old     new   delta
vmemmap_populate                             587     592      +5
munlock_vma_pages_range                      556     561      +5
Total: Before=15534694, After=15534704, chg +0.00%

powerpc:
add/remove: 0/0 grow/shrink: 1/0 up/down: 4/0 (4)
Function                                     old     new   delta
.remove_pagetable                           1648    1652      +4
Total: Before=21478240, After=21478244, chg +0.00%

arm64:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=20240851, After=20240851, chg +0.00%

sparc:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=4907262, After=4907262, chg +0.00%

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/arm/include/asm/pgtable-2level.h    |  2 +-
 arch/arm/mm/idmap.c                      |  6 ++--
 arch/arm/mm/mmu.c                        |  8 ++---
 arch/arm64/kernel/hibernate.c            | 16 ++++++----
 arch/arm64/kvm/mmu.c                     | 16 +++++-----
 arch/arm64/mm/kasan_init.c               |  8 ++---
 arch/arm64/mm/mmu.c                      | 25 +++++++--------
 arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++---
 arch/powerpc/mm/hugetlbpage.c            |  6 ++--
 arch/s390/include/asm/pgtable.h          |  8 ++---
 arch/s390/mm/page-states.c               |  8 ++---
 arch/s390/mm/pageattr.c                  |  8 ++---
 arch/s390/mm/vmem.c                      |  8 ++---
 arch/sparc/mm/hugetlbpage.c              |  6 ++--
 arch/um/kernel/tlb.c                     |  8 ++---
 arch/x86/mm/init_64.c                    | 15 ++++-----
 arch/x86/mm/kasan_init_64.c              | 16 +++++-----
 include/asm-generic/pgtable-nop4d.h      |  2 +-
 include/asm-generic/pgtable-nopmd.h      |  2 +-
 include/asm-generic/pgtable-nopud.h      |  2 +-
 include/linux/pgtable.h                  | 26 ++++-----------
 mm/gup.c                                 |  8 ++---
 mm/ioremap.c                             |  8 ++---
 mm/kasan/init.c                          | 17 +++++-----
 mm/madvise.c                             |  4 +--
 mm/memory.c                              | 40 ++++++++++++------------
 mm/mlock.c                               | 18 ++++++++---
 mm/mprotect.c                            |  8 ++---
 mm/pagewalk.c                            |  8 ++---
 mm/swapfile.c                            |  8 ++---
 mm/vmalloc.c                             | 16 +++++-----
 31 files changed, 165 insertions(+), 173 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-2level.h b/arch/arm/include/asm/pgtable-2level.h
index 3502c2f746ca..5e6416b339f4 100644
--- a/arch/arm/include/asm/pgtable-2level.h
+++ b/arch/arm/include/asm/pgtable-2level.h
@@ -209,7 +209,7 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
 	} while (0)
 
 /* we don't need complex calculations here as the pmd is folded into the pgd */
-#define pmd_addr_end(addr,end) (end)
+#define pmd_addr_end(pmd,addr,end) (end)
 
 #define set_pte_ext(ptep,pte,ext) cpu_set_pte_ext(ptep,pte,ext)
 
diff --git a/arch/arm/mm/idmap.c b/arch/arm/mm/idmap.c
index 448e57c6f653..5437f943ca8b 100644
--- a/arch/arm/mm/idmap.c
+++ b/arch/arm/mm/idmap.c
@@ -46,7 +46,7 @@ static void idmap_add_pmd(pud_t *pud, unsigned long addr, unsigned long end,
 		pmd = pmd_offset(pud, addr);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		*pmd = __pmd((addr & PMD_MASK) | prot);
 		flush_pmd_entry(pmd);
 	} while (pmd++, addr = next, addr != end);
@@ -73,7 +73,7 @@ static void idmap_add_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		idmap_add_pmd(pud, addr, next, prot);
 	} while (pud++, addr = next, addr != end);
 }
@@ -95,7 +95,7 @@ static void identity_mapping_add(pgd_t *pgd, const char *text_start,
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		idmap_add_pud(pgd, addr, next, prot);
 	} while (pgd++, addr = next, addr != end);
 }
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 698cc740c6b8..4013746e4c75 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -777,7 +777,7 @@ static void __init alloc_init_pmd(pud_t *pud, unsigned long addr,
 		 * With LPAE, we must loop over to map
 		 * all the pmds for the given range.
 		 */
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Try a section mapping - addr, next and phys must all be
@@ -805,7 +805,7 @@ static void __init alloc_init_pud(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		alloc_init_pmd(pud, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (pud++, addr = next, addr != end);
@@ -820,7 +820,7 @@ static void __init alloc_init_p4d(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		alloc_init_pud(p4d, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (p4d++, addr = next, addr != end);
@@ -923,7 +923,7 @@ static void __init __create_mapping(struct mm_struct *mm, struct map_desc *md,
 	pgd = pgd_offset(mm, addr);
 	end = addr + length;
 	do {
-		unsigned long next = pgd_addr_end(addr, end);
+		unsigned long next = pgd_addr_end(*pgd, addr, end);
 
 		alloc_init_p4d(pgd, addr, next, phys, type, alloc, ng);
 
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 68e14152d6e9..7be8c9cdc5c8 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -412,7 +412,7 @@ static int copy_pmd(pud_t *dst_pudp, pud_t *src_pudp, unsigned long start,
 	do {
 		pmd_t pmd = READ_ONCE(*src_pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 		if (pmd_table(pmd)) {
@@ -447,7 +447,7 @@ static int copy_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp, unsigned long start,
 	do {
 		pud_t pud = READ_ONCE(*src_pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 		if (pud_table(pud)) {
@@ -473,8 +473,10 @@ static int copy_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp, unsigned long start,
 	dst_p4dp = p4d_offset(dst_pgdp, start);
 	src_p4dp = p4d_offset(src_pgdp, start);
 	do {
-		next = p4d_addr_end(addr, end);
-		if (p4d_none(READ_ONCE(*src_p4dp)))
+		p4d_t p4d = READ_ONCE(*src_p4dp);
+
+		next = p4d_addr_end(p4d, addr, end);
+		if (p4d_none(p4d))
 			continue;
 		if (copy_pud(dst_p4dp, src_p4dp, addr, next))
 			return -ENOMEM;
@@ -492,8 +494,10 @@ static int copy_page_tables(pgd_t *dst_pgdp, unsigned long start,
 
 	dst_pgdp = pgd_offset_pgd(dst_pgdp, start);
 	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none(READ_ONCE(*src_pgdp)))
+		pgd_t pgd = READ_ONCE(*src_pgdp);
+
+		next = pgd_addr_end(pgd, addr, end);
+		if (pgd_none(pgd))
 			continue;
 		if (copy_p4d(dst_pgdp, src_pgdp, addr, next))
 			return -ENOMEM;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ba00bcc0c884..8f470f93a8e9 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -547,7 +547,7 @@ static void unmap_hyp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
 
 	start_pmd = pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		/* Hyp doesn't use huge pmds */
 		if (!pmd_none(*pmd))
 			unmap_hyp_ptes(pmd, addr, next);
@@ -564,7 +564,7 @@ static void unmap_hyp_puds(p4d_t *p4d, phys_addr_t addr, phys_addr_t end)
 
 	start_pud = pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		/* Hyp doesn't use huge puds */
 		if (!pud_none(*pud))
 			unmap_hyp_pmds(pud, addr, next);
@@ -581,7 +581,7 @@ static void unmap_hyp_p4ds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
 
 	start_p4d = p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		/* Hyp doesn't use huge p4ds */
 		if (!p4d_none(*p4d))
 			unmap_hyp_puds(p4d, addr, next);
@@ -609,7 +609,7 @@ static void __unmap_hyp_range(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 	 */
 	pgd = pgdp + kvm_pgd_index(addr, ptrs_per_pgd);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_none(*pgd))
 			unmap_hyp_p4ds(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
@@ -712,7 +712,7 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
 			get_page(virt_to_page(pmd));
 		}
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		create_hyp_pte_mappings(pmd, addr, next, pfn, prot);
 		pfn += (next - addr) >> PAGE_SHIFT;
@@ -744,7 +744,7 @@ static int create_hyp_pud_mappings(p4d_t *p4d, unsigned long start,
 			get_page(virt_to_page(pud));
 		}
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		ret = create_hyp_pmd_mappings(pud, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -777,7 +777,7 @@ static int create_hyp_p4d_mappings(pgd_t *pgd, unsigned long start,
 			get_page(virt_to_page(p4d));
 		}
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		ret = create_hyp_pud_mappings(p4d, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -813,7 +813,7 @@ static int __create_hyp_mappings(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 			get_page(virt_to_page(pgd));
 		}
 
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = create_hyp_p4d_mappings(pgd, addr, next, pfn, prot);
 		if (err)
 			goto out;
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index b24e43d20667..8d1c811fd59e 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -120,7 +120,7 @@ static void __init kasan_pmd_populate(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp = kasan_pmd_offset(pudp, addr, node, early);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		kasan_pte_populate(pmdp, addr, next, node, early);
 	} while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)));
 }
@@ -132,7 +132,7 @@ static void __init kasan_pud_populate(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp = kasan_pud_offset(p4dp, addr, node, early);
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		kasan_pmd_populate(pudp, addr, next, node, early);
 	} while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)));
 }
@@ -144,7 +144,7 @@ static void __init kasan_p4d_populate(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp = p4d_offset(pgdp, addr);
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		kasan_pud_populate(p4dp, addr, next, node, early);
 	} while (p4dp++, addr = next, addr != end);
 }
@@ -157,7 +157,7 @@ static void __init kasan_pgd_populate(unsigned long addr, unsigned long end,
 
 	pgdp = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		kasan_p4d_populate(pgdp, addr, next, node, early);
 	} while (pgdp++, addr = next, addr != end);
 }
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 64211436629d..d679cf024bc8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -209,7 +209,7 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		pmd_t old_pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(old_pmd, addr, end);
 
 		/* try section mapping first */
 		if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
@@ -307,7 +307,7 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
 	do {
 		pud_t old_pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(old_pud, addr, end);
 
 		/*
 		 * For 4K granule only, attempt to put down a 1GB block
@@ -356,7 +356,7 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 	end = PAGE_ALIGN(virt + size);
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		alloc_init_pud(pgdp, addr, next, phys, prot, pgtable_alloc,
 			       flags);
 		phys += next - addr;
@@ -820,9 +820,9 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp, pmd;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -853,9 +853,9 @@ static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp, pud;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -886,9 +886,9 @@ static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -912,9 +912,9 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end,
 	WARN_ON(!free_mapped && altmap);
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -968,9 +968,9 @@ static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -1008,9 +1008,9 @@ static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -1048,9 +1048,9 @@ static void free_empty_p4d_table(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -1066,9 +1066,9 @@ static void free_empty_tables(unsigned long addr, unsigned long end,
 	pgd_t *pgdp, pgd;
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -1097,8 +1097,6 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	pmd_t *pmdp;
 
 	do {
-		next = pmd_addr_end(addr, end);
-
 		pgdp = vmemmap_pgd_populate(addr, node);
 		if (!pgdp)
 			return -ENOMEM;
@@ -1112,6 +1110,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 			return -ENOMEM;
 
 		pmdp = pmd_offset(pudp, addr);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_none(READ_ONCE(*pmdp))) {
 			void *p = NULL;
 
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index cc72666e891a..816e218df285 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -817,7 +817,7 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -847,7 +847,7 @@ static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -878,10 +878,9 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
 	spin_lock(&init_mm.page_table_lock);
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!p4d_present(*p4d))
 			continue;
 
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26292544630f..f0606d6774a4 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -352,7 +352,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 		unsigned long more;
 
 		pmd = pmd_offset(pud, addr);
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!is_hugepd(__hugepd(pmd_val(*pmd)))) {
 			if (pmd_none_or_clear_bad(pmd))
 				continue;
@@ -409,7 +409,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	do {
 		pud = pud_offset(p4d, addr);
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!is_hugepd(__hugepd(pud_val(*pud)))) {
 			if (pud_none_or_clear_bad(pud))
 				continue;
@@ -478,9 +478,9 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	 */
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset(tlb->mm, addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!is_hugepd(__hugepd(pgd_val(*pgd)))) {
 			if (p4d_none_or_clear_bad(p4d))
 				continue;
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 027206e4959d..6fb17ac413be 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -542,14 +542,14 @@ static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned lo
 	return (boundary - 1) < (end - 1) ? boundary : end;
 }
 
-#define pgd_addr_end_folded pgd_addr_end_folded
-static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
+#define pgd_addr_end pgd_addr_end
+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(pgd_val(pgd), addr, end);
 }
 
-#define p4d_addr_end_folded p4d_addr_end_folded
-static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
+#define p4d_addr_end p4d_addr_end
+static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(p4d_val(p4d), addr, end);
 }
diff --git a/arch/s390/mm/page-states.c b/arch/s390/mm/page-states.c
index 567c69f3069e..4aba634b4b26 100644
--- a/arch/s390/mm/page-states.c
+++ b/arch/s390/mm/page-states.c
@@ -109,7 +109,7 @@ static void mark_kernel_pmd(pud_t *pud, unsigned long addr, unsigned long end)
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || pmd_large(*pmd))
 			continue;
 		page = virt_to_page(pmd_val(*pmd));
@@ -126,7 +126,7 @@ static void mark_kernel_pud(p4d_t *p4d, unsigned long addr, unsigned long end)
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || pud_large(*pud))
 			continue;
 		if (!pud_folded(*pud)) {
@@ -147,7 +147,7 @@ static void mark_kernel_p4d(pgd_t *pgd, unsigned long addr, unsigned long end)
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none(*p4d))
 			continue;
 		if (!p4d_folded(*p4d)) {
@@ -169,7 +169,7 @@ static void mark_kernel_pgd(void)
 	addr = 0;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, MODULES_END);
+		next = pgd_addr_end(*pgd, addr, MODULES_END);
 		if (pgd_none(*pgd))
 			continue;
 		if (!pgd_folded(*pgd)) {
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index c5c52ec2b46f..b827d758a17a 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -162,7 +162,7 @@ static int walk_pmd_level(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		if (pmd_none(*pmdp))
 			return -EINVAL;
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_large(*pmdp)) {
 			if (addr & ~PMD_MASK || addr + PMD_SIZE > next) {
 				rc = split_pmd_page(pmdp, addr);
@@ -239,7 +239,7 @@ static int walk_pud_level(p4d_t *p4d, unsigned long addr, unsigned long end,
 	do {
 		if (pud_none(*pudp))
 			return -EINVAL;
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		if (pud_large(*pudp)) {
 			if (addr & ~PUD_MASK || addr + PUD_SIZE > next) {
 				rc = split_pud_page(pudp, addr);
@@ -269,7 +269,7 @@ static int walk_p4d_level(pgd_t *pgd, unsigned long addr, unsigned long end,
 	do {
 		if (p4d_none(*p4dp))
 			return -EINVAL;
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		rc = walk_pud_level(p4dp, addr, next, flags);
 		p4dp++;
 		addr = next;
@@ -296,7 +296,7 @@ static int change_page_attr(unsigned long addr, unsigned long end,
 	do {
 		if (pgd_none(*pgdp))
 			break;
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		rc = walk_p4d_level(pgdp, addr, next, flags);
 		if (rc)
 			break;
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index b239f2ba93b0..672bc89f13e7 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -219,7 +219,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!add) {
 			if (pmd_none(*pmd))
 				continue;
@@ -320,7 +320,7 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 		prot &= ~_REGION_ENTRY_NOEXEC;
 	pud = pud_offset(p4d, addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!add) {
 			if (pud_none(*pud))
 				continue;
@@ -394,7 +394,7 @@ static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!add) {
 			if (p4d_none(*p4d))
 				continue;
@@ -449,8 +449,8 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
 	if (WARN_ON_ONCE(!PAGE_ALIGNED(start | end)))
 		return -EINVAL;
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (!add) {
 			if (pgd_none(*pgd))
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ec423b5f17dd..341c2ff8d31a 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -428,7 +428,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd))
 			continue;
 		if (is_hugetlb_pmd(*pmd))
@@ -465,7 +465,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		if (is_hugetlb_pud(*pud))
@@ -519,7 +519,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	pgd = pgd_offset(tlb->mm, addr);
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		hugetlb_free_pud_range(tlb, p4d, addr, next, floor, ceiling);
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index 61776790cd67..7b4fe31c8df2 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -264,7 +264,7 @@ static inline int update_pmd_range(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_present(*pmd)) {
 			if (hvc->force || pmd_newpage(*pmd)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -286,7 +286,7 @@ static inline int update_pud_range(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_present(*pud)) {
 			if (hvc->force || pud_newpage(*pud)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -308,7 +308,7 @@ static inline int update_p4d_range(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!p4d_present(*p4d)) {
 			if (hvc->force || p4d_newpage(*p4d)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -331,7 +331,7 @@ void fix_range_common(struct mm_struct *mm, unsigned long start_addr,
 	hvc = INIT_HVC(mm, force, userspace);
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end_addr);
+		next = pgd_addr_end(*pgd, addr, end_addr);
 		if (!pgd_present(*pgd)) {
 			if (force || pgd_newpage(*pgd)) {
 				ret = add_munmap(addr, next - addr, &hvc);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a4ac13cc3fdc..e2cb9316a104 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1043,7 +1043,7 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -1099,7 +1099,7 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -1153,7 +1153,7 @@ remove_p4d_table(p4d_t *p4d_start, unsigned long addr, unsigned long end,
 
 	p4d = p4d_start + p4d_index(addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -1186,9 +1186,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct,
 	p4d_t *p4d;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
@@ -1500,8 +1499,6 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 	pmd_t *pmd;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pmd_addr_end(addr, end);
-
 		pgd = vmemmap_pgd_populate(addr, node);
 		if (!pgd)
 			return -ENOMEM;
@@ -1515,6 +1512,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 			return -ENOMEM;
 
 		pmd = pmd_offset(pud, addr);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd)) {
 			void *p;
 
@@ -1623,9 +1621,8 @@ void register_page_bootmem_memmap(unsigned long section_nr,
 			get_page_bootmem(section_nr, pte_page(*pte),
 					 SECTION_INFO);
 		} else {
-			next = pmd_addr_end(addr, end);
-
 			pmd = pmd_offset(pud, addr);
+			next = pmd_addr_end(*pmd, addr, end);
 			if (pmd_none(*pmd))
 				continue;
 
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 1a50434c8a4d..2c105b5154ba 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -96,7 +96,7 @@ static void __init kasan_populate_pud(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_large(*pmd))
 			kasan_populate_pmd(pmd, addr, next, nid);
 	} while (pmd++, addr = next, addr != end);
@@ -116,7 +116,7 @@ static void __init kasan_populate_p4d(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_large(*pud))
 			kasan_populate_pud(pud, addr, next, nid);
 	} while (pud++, addr = next, addr != end);
@@ -136,7 +136,7 @@ static void __init kasan_populate_pgd(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		kasan_populate_p4d(p4d, addr, next, nid);
 	} while (p4d++, addr = next, addr != end);
 }
@@ -151,7 +151,7 @@ static void __init kasan_populate_shadow(unsigned long addr, unsigned long end,
 	end = round_up(end, PAGE_SIZE);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_populate_pgd(pgd, addr, next, nid);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -219,7 +219,7 @@ static void __init kasan_early_p4d_populate(pgd_t *pgd,
 
 	p4d = early_p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_none(*p4d))
 			continue;
@@ -239,7 +239,7 @@ static void __init kasan_map_early_shadow(pgd_t *pgd)
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_early_p4d_populate(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -254,7 +254,7 @@ static void __init kasan_shallow_populate_p4ds(pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (p4d_none(*p4d)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
@@ -272,7 +272,7 @@ static void __init kasan_shallow_populate_pgds(void *start, void *end)
 	addr = (unsigned long)start;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, (unsigned long)end);
+		next = pgd_addr_end(*pgd, addr, (unsigned long)end);
 
 		if (pgd_none(*pgd)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
diff --git a/include/asm-generic/pgtable-nop4d.h b/include/asm-generic/pgtable-nop4d.h
index ce2cbb3c380f..156b42e51424 100644
--- a/include/asm-generic/pgtable-nop4d.h
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -53,7 +53,7 @@ static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
 #define p4d_free_tlb(tlb, x, a)			do { } while (0)
 
 #undef  p4d_addr_end
-#define p4d_addr_end(addr, end)			(end)
+#define p4d_addr_end(p4d, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOP4D_H */
diff --git a/include/asm-generic/pgtable-nopmd.h b/include/asm-generic/pgtable-nopmd.h
index 3e13acd019ae..e988384de1c7 100644
--- a/include/asm-generic/pgtable-nopmd.h
+++ b/include/asm-generic/pgtable-nopmd.h
@@ -64,7 +64,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 #define pmd_free_tlb(tlb, x, a)		do { } while (0)
 
 #undef  pmd_addr_end
-#define pmd_addr_end(addr, end)			(end)
+#define pmd_addr_end(pmd, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index a9d751fbda9e..57a28bade9f9 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -60,7 +60,7 @@ static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
 #define pud_free_tlb(tlb, x, a)		        do { } while (0)
 
 #undef  pud_addr_end
-#define pud_addr_end(addr, end)			(end)
+#define pud_addr_end(pud, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOPUD_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 981c4c2a31fe..67ebc22cf83d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -655,48 +655,34 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
  */
 
-#define pgd_addr_end(addr, end)						\
+#ifndef pgd_addr_end
+#define pgd_addr_end(pgd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
+#endif
 
 #ifndef p4d_addr_end
-#define p4d_addr_end(addr, end)						\
+#define p4d_addr_end(p4d, addr, end)					\
 ({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pud_addr_end
-#define pud_addr_end(addr, end)						\
+#define pud_addr_end(pud, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pmd_addr_end
-#define pmd_addr_end(addr, end)						\
+#define pmd_addr_end(pmd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
-#ifndef pgd_addr_end_folded
-#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
-#endif
-
-#ifndef p4d_addr_end_folded
-#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
-#endif
-
-#ifndef pud_addr_end_folded
-#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
-#endif
-
-#ifndef pmd_addr_end_folded
-#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
-#endif
-
 /*
  * When walking page tables, we usually want to skip any p?d_none entries;
  * and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/gup.c b/mm/gup.c
index ba4aace5d0f4..7826876ae7e0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end_folded(pmd, addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (!pmd_present(pmd))
 			return 0;
 
@@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end_folded(pud, addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
@@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
-		next = p4d_addr_end_folded(p4d, addr, end);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			return 0;
 		BUILD_BUG_ON(p4d_huge(p4d));
@@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		pgd_t pgd = READ_ONCE(*pgdp);
 
-		next = pgd_addr_end_folded(pgd, addr, end);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			return;
 		if (unlikely(pgd_huge(pgd))) {
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..400fa119c09d 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -114,7 +114,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PMD_MODIFIED;
@@ -160,7 +160,7 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PUD_MODIFIED;
@@ -206,7 +206,7 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_P4D_MODIFIED;
@@ -234,7 +234,7 @@ int ioremap_page_range(unsigned long addr,
 	start = addr;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot,
 					&mask);
 		if (err)
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index fe6be0be1f76..829627a92763 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -117,7 +117,7 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (IS_ALIGNED(addr, PMD_SIZE) && end - addr >= PMD_SIZE) {
 			pmd_populate_kernel(&init_mm, pmd,
@@ -150,7 +150,7 @@ static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (IS_ALIGNED(addr, PUD_SIZE) && end - addr >= PUD_SIZE) {
 			pmd_t *pmd;
 
@@ -187,7 +187,7 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (IS_ALIGNED(addr, P4D_SIZE) && end - addr >= P4D_SIZE) {
 			pud_t *pud;
 			pmd_t *pmd;
@@ -236,7 +236,7 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
 	unsigned long next;
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (IS_ALIGNED(addr, PGDIR_SIZE) && end - addr >= PGDIR_SIZE) {
 			p4d_t *p4d;
@@ -370,7 +370,7 @@ static void kasan_remove_pmd_table(pmd_t *pmd, unsigned long addr,
 	for (; addr < end; addr = next, pmd++) {
 		pte_t *pte;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -395,7 +395,7 @@ static void kasan_remove_pud_table(pud_t *pud, unsigned long addr,
 	for (; addr < end; addr = next, pud++) {
 		pmd_t *pmd, *pmd_base;
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -421,7 +421,7 @@ static void kasan_remove_p4d_table(p4d_t *p4d, unsigned long addr,
 	for (; addr < end; addr = next, p4d++) {
 		pud_t *pud;
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -454,9 +454,8 @@ void kasan_remove_zero_shadow(void *start, unsigned long size)
 	for (; addr < end; addr = next) {
 		p4d_t *p4d;
 
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
diff --git a/mm/madvise.c b/mm/madvise.c
index e32e7efbba0f..acfb3441d97e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -326,7 +326,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (pmd_trans_huge(*pmd)) {
 		pmd_t orig_pmd;
-		unsigned long next = pmd_addr_end(addr, end);
+		unsigned long next = pmd_addr_end(*pmd, addr, end);
 
 		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 		ptl = pmd_trans_huge_lock(pmd, vma);
@@ -587,7 +587,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	int nr_swap = 0;
 	unsigned long next;
 
-	next = pmd_addr_end(addr, end);
+	next = pmd_addr_end(*pmd, addr, end);
 	if (pmd_trans_huge(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
diff --git a/mm/memory.c b/mm/memory.c
index fb5463153351..f95424946b0d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -233,7 +233,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		free_pte_range(tlb, pmd, addr);
@@ -267,7 +267,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		free_pmd_range(tlb, pud, addr, next, floor, ceiling);
@@ -301,7 +301,7 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
 	start = addr;
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		free_pud_range(tlb, p4d, addr, next, floor, ceiling);
@@ -381,7 +381,7 @@ void free_pgd_range(struct mmu_gather *tlb,
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	pgd = pgd_offset(tlb->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		free_p4d_range(tlb, pgd, addr, next, floor, ceiling);
@@ -887,7 +887,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*src_pmd, addr, end);
 		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
 			|| pmd_devmap(*src_pmd)) {
 			int err;
@@ -921,7 +921,7 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pud = pud_offset(src_p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*src_pud, addr, end);
 		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
 			int err;
 
@@ -955,7 +955,7 @@ static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_p4d = p4d_offset(src_pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*src_p4d, addr, end);
 		if (p4d_none_or_clear_bad(src_p4d))
 			continue;
 		if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
@@ -1017,7 +1017,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*src_pgd, addr, end);
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
@@ -1177,7 +1177,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
@@ -1212,7 +1212,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
 			if (next - addr != HPAGE_PUD_SIZE) {
 				mmap_assert_locked(tlb->mm);
@@ -1241,7 +1241,7 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		next = zap_pud_range(tlb, vma, p4d, addr, next, details);
@@ -1262,7 +1262,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
@@ -2030,7 +2030,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 		return -ENOMEM;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		err = remap_pte_range(mm, pmd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2052,7 +2052,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		err = remap_pmd_range(mm, pud, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2074,7 +2074,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		err = remap_pud_range(mm, p4d, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2143,7 +2143,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = remap_p4d_range(mm, pgd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2266,7 +2266,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 		pmd = pmd_offset(pud, addr);
 	}
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (create || !pmd_none_or_clear_bad(pmd)) {
 			err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
 						 create, mask);
@@ -2294,7 +2294,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 		pud = pud_offset(p4d, addr);
 	}
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (create || !pud_none_or_clear_bad(pud)) {
 			err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
 						 create, mask);
@@ -2322,7 +2322,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 		p4d = p4d_offset(pgd, addr);
 	}
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (create || !p4d_none_or_clear_bad(p4d)) {
 			err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
 						 create, mask);
@@ -2348,7 +2348,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!create && pgd_none_or_clear_bad(pgd))
 			continue;
 		err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create, &mask);
diff --git a/mm/mlock.c b/mm/mlock.c
index 93ca2bf30b4f..5898e8fe2288 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -374,8 +374,12 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 			struct vm_area_struct *vma, struct zone *zone,
 			unsigned long start, unsigned long end)
 {
-	pte_t *pte;
 	spinlock_t *ptl;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
 
 	/*
 	 * Initialize pte walk starting at the already pinned page where we
@@ -384,10 +388,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 	 */
 	pte = get_locked_pte(vma->vm_mm, start,	&ptl);
 	/* Make sure we do not cross the page table boundary */
-	end = pgd_addr_end(start, end);
-	end = p4d_addr_end(start, end);
-	end = pud_addr_end(start, end);
-	end = pmd_addr_end(start, end);
+	pgd = pgd_offset(vma->vm_mm, start);
+	end = pgd_addr_end(*pgd, start, end);
+	p4d = p4d_offset(pgd, start);
+	end = p4d_addr_end(*p4d, start, end);
+	pud = pud_offset(p4d, start);
+	end = pud_addr_end(*pud, start, end);
+	pmd = pmd_offset(pud, start);
+	end = pmd_addr_end(*pmd, start, end);
 
 	/* The page next to the pinned page is the first we will try to get */
 	start += PAGE_SIZE;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..278f2dbd1f20 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -225,7 +225,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	do {
 		unsigned long this_pages;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
@@ -291,7 +291,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
@@ -311,7 +311,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
@@ -336,7 +336,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	flush_cache_range(vma, addr, end);
 	inc_tlb_flush_pending(mm);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..a5b9f61b5d45 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -70,7 +70,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	pmd = pmd_offset(pud, addr);
 	do {
 again:
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -128,7 +128,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 	pud = pud_offset(p4d, addr);
 	do {
  again:
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -176,7 +176,7 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -211,7 +211,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
 	else
 		pgd = pgd_offset(walk->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, 0, walk);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 20012c0c0252..b1dd815aee6b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2018,7 +2018,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	pmd = pmd_offset(pud, addr);
 	do {
 		cond_resched();
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type,
@@ -2040,7 +2040,7 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		ret = unuse_pmd_range(vma, pud, addr, next, type,
@@ -2062,7 +2062,7 @@ static inline int unuse_p4d_range(struct vm_area_struct *vma, pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		ret = unuse_pud_range(vma, p4d, addr, next, type,
@@ -2085,7 +2085,7 @@ static int unuse_vma(struct vm_area_struct *vma, unsigned int type,
 
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		ret = unuse_p4d_range(vma, pgd, addr, next, type,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index be4724b916b3..09ff0d5ecbc1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -93,7 +93,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		cleared = pmd_clear_huge(pmd);
 		if (cleared || pmd_bad(*pmd))
@@ -118,7 +118,7 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		cleared = pud_clear_huge(pud);
 		if (cleared || pud_bad(*pud))
@@ -141,7 +141,7 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		cleared = p4d_clear_huge(p4d);
 		if (cleared || p4d_bad(*p4d))
@@ -179,7 +179,7 @@ void unmap_kernel_range_noflush(unsigned long start, unsigned long size)
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		if (pgd_none_or_clear_bad(pgd))
@@ -230,7 +230,7 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
@@ -248,7 +248,7 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
@@ -266,7 +266,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
@@ -305,7 +305,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 313+ messages in thread

* [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
@ 2020-09-07 18:00   ` Gerald Schaefer
  0 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

From: Alexander Gordeev <agordeev@linux.ibm.com>

Unlike all other page-table abstractions, the pXd_addr_end() functions
do not take into account the particular table entry in whose context
they are called. On architectures with dynamic page-table folding this
means that necessary information is missing which is difficult to
obtain other than from the table entry itself. That already led to a
subtle memory corruption issue on s390.

By letting the pXd_addr_end() functions know about the page-table
entry, we allow archs not only to make extra checks, but also to apply
optimizations.

As a result of this change, the pXd_addr_end_folded() functions used
in the gup_fast traversal code become unnecessary and get replaced
with the universal pXd_addr_end() variants.
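
For illustration, a minimal sketch of the lockless walk pattern this
enables, modelled on gup_pmd_range() in the mm/gup.c hunk below; the
wrapper name and the elided body are only for the example:

static int pmd_walk_sketch(pud_t pud, unsigned long addr, unsigned long end)
{
	unsigned long next;
	pmd_t *pmdp;

	pmdp = pmd_offset(&pud, addr);
	do {
		/* read the entry once, without holding a page table lock */
		pmd_t pmd = READ_ONCE(*pmdp);

		/* the same entry value now also determines the boundary */
		next = pmd_addr_end(pmd, addr, end);
		if (pmd_none(pmd))
			return 0;
		/* ... handle the present entry for [addr, next) ... */
	} while (pmdp++, addr = next, addr != end);

	return 1;
}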

The arch-specific updates not only add dereferencing of page-table
entry pointers, but also small changes to the code flow to make those
dereferences possible, at least for x86 and powerpc. The same is done
for arm64, but in a way that should not have any impact.

So, even though the dereferenced page-table entries are not used on
archs other than s390, and are optimized out by the compiler, there is
a small change in kernel size. This is what bloat-o-meter reports:

x86:
add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10)
Function                                     old     new   delta
vmemmap_populate                             587     592      +5
munlock_vma_pages_range                      556     561      +5
Total: Before=15534694, After=15534704, chg +0.00%

powerpc:
add/remove: 0/0 grow/shrink: 1/0 up/down: 4/0 (4)
Function                                     old     new   delta
.remove_pagetable                           1648    1652      +4
Total: Before=21478240, After=21478244, chg +0.00%

arm64:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=20240851, After=20240851, chg +0.00%

sparc:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=4907262, After=4907262, chg +0.00%

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/arm/include/asm/pgtable-2level.h    |  2 +-
 arch/arm/mm/idmap.c                      |  6 ++--
 arch/arm/mm/mmu.c                        |  8 ++---
 arch/arm64/kernel/hibernate.c            | 16 ++++++----
 arch/arm64/kvm/mmu.c                     | 16 +++++-----
 arch/arm64/mm/kasan_init.c               |  8 ++---
 arch/arm64/mm/mmu.c                      | 25 +++++++--------
 arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++---
 arch/powerpc/mm/hugetlbpage.c            |  6 ++--
 arch/s390/include/asm/pgtable.h          |  8 ++---
 arch/s390/mm/page-states.c               |  8 ++---
 arch/s390/mm/pageattr.c                  |  8 ++---
 arch/s390/mm/vmem.c                      |  8 ++---
 arch/sparc/mm/hugetlbpage.c              |  6 ++--
 arch/um/kernel/tlb.c                     |  8 ++---
 arch/x86/mm/init_64.c                    | 15 ++++-----
 arch/x86/mm/kasan_init_64.c              | 16 +++++-----
 include/asm-generic/pgtable-nop4d.h      |  2 +-
 include/asm-generic/pgtable-nopmd.h      |  2 +-
 include/asm-generic/pgtable-nopud.h      |  2 +-
 include/linux/pgtable.h                  | 26 ++++-----------
 mm/gup.c                                 |  8 ++---
 mm/ioremap.c                             |  8 ++---
 mm/kasan/init.c                          | 17 +++++-----
 mm/madvise.c                             |  4 +--
 mm/memory.c                              | 40 ++++++++++++------------
 mm/mlock.c                               | 18 ++++++++---
 mm/mprotect.c                            |  8 ++---
 mm/pagewalk.c                            |  8 ++---
 mm/swapfile.c                            |  8 ++---
 mm/vmalloc.c                             | 16 +++++-----
 31 files changed, 165 insertions(+), 173 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-2level.h b/arch/arm/include/asm/pgtable-2level.h
index 3502c2f746ca..5e6416b339f4 100644
--- a/arch/arm/include/asm/pgtable-2level.h
+++ b/arch/arm/include/asm/pgtable-2level.h
@@ -209,7 +209,7 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
 	} while (0)
 
 /* we don't need complex calculations here as the pmd is folded into the pgd */
-#define pmd_addr_end(addr,end) (end)
+#define pmd_addr_end(pmd,addr,end) (end)
 
 #define set_pte_ext(ptep,pte,ext) cpu_set_pte_ext(ptep,pte,ext)
 
diff --git a/arch/arm/mm/idmap.c b/arch/arm/mm/idmap.c
index 448e57c6f653..5437f943ca8b 100644
--- a/arch/arm/mm/idmap.c
+++ b/arch/arm/mm/idmap.c
@@ -46,7 +46,7 @@ static void idmap_add_pmd(pud_t *pud, unsigned long addr, unsigned long end,
 		pmd = pmd_offset(pud, addr);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		*pmd = __pmd((addr & PMD_MASK) | prot);
 		flush_pmd_entry(pmd);
 	} while (pmd++, addr = next, addr != end);
@@ -73,7 +73,7 @@ static void idmap_add_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		idmap_add_pmd(pud, addr, next, prot);
 	} while (pud++, addr = next, addr != end);
 }
@@ -95,7 +95,7 @@ static void identity_mapping_add(pgd_t *pgd, const char *text_start,
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		idmap_add_pud(pgd, addr, next, prot);
 	} while (pgd++, addr = next, addr != end);
 }
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 698cc740c6b8..4013746e4c75 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -777,7 +777,7 @@ static void __init alloc_init_pmd(pud_t *pud, unsigned long addr,
 		 * With LPAE, we must loop over to map
 		 * all the pmds for the given range.
 		 */
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Try a section mapping - addr, next and phys must all be
@@ -805,7 +805,7 @@ static void __init alloc_init_pud(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		alloc_init_pmd(pud, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (pud++, addr = next, addr != end);
@@ -820,7 +820,7 @@ static void __init alloc_init_p4d(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		alloc_init_pud(p4d, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (p4d++, addr = next, addr != end);
@@ -923,7 +923,7 @@ static void __init __create_mapping(struct mm_struct *mm, struct map_desc *md,
 	pgd = pgd_offset(mm, addr);
 	end = addr + length;
 	do {
-		unsigned long next = pgd_addr_end(addr, end);
+		unsigned long next = pgd_addr_end(*pgd, addr, end);
 
 		alloc_init_p4d(pgd, addr, next, phys, type, alloc, ng);
 
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 68e14152d6e9..7be8c9cdc5c8 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -412,7 +412,7 @@ static int copy_pmd(pud_t *dst_pudp, pud_t *src_pudp, unsigned long start,
 	do {
 		pmd_t pmd = READ_ONCE(*src_pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 		if (pmd_table(pmd)) {
@@ -447,7 +447,7 @@ static int copy_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp, unsigned long start,
 	do {
 		pud_t pud = READ_ONCE(*src_pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 		if (pud_table(pud)) {
@@ -473,8 +473,10 @@ static int copy_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp, unsigned long start,
 	dst_p4dp = p4d_offset(dst_pgdp, start);
 	src_p4dp = p4d_offset(src_pgdp, start);
 	do {
-		next = p4d_addr_end(addr, end);
-		if (p4d_none(READ_ONCE(*src_p4dp)))
+		p4d_t p4d = READ_ONCE(*src_p4dp);
+
+		next = p4d_addr_end(p4d, addr, end);
+		if (p4d_none(p4d))
 			continue;
 		if (copy_pud(dst_p4dp, src_p4dp, addr, next))
 			return -ENOMEM;
@@ -492,8 +494,10 @@ static int copy_page_tables(pgd_t *dst_pgdp, unsigned long start,
 
 	dst_pgdp = pgd_offset_pgd(dst_pgdp, start);
 	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none(READ_ONCE(*src_pgdp)))
+		pgd_t pgd = READ_ONCE(*src_pgdp);
+
+		next = pgd_addr_end(pgd, addr, end);
+		if (pgd_none(pgd))
 			continue;
 		if (copy_p4d(dst_pgdp, src_pgdp, addr, next))
 			return -ENOMEM;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ba00bcc0c884..8f470f93a8e9 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -547,7 +547,7 @@ static void unmap_hyp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
 
 	start_pmd = pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		/* Hyp doesn't use huge pmds */
 		if (!pmd_none(*pmd))
 			unmap_hyp_ptes(pmd, addr, next);
@@ -564,7 +564,7 @@ static void unmap_hyp_puds(p4d_t *p4d, phys_addr_t addr, phys_addr_t end)
 
 	start_pud = pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		/* Hyp doesn't use huge puds */
 		if (!pud_none(*pud))
 			unmap_hyp_pmds(pud, addr, next);
@@ -581,7 +581,7 @@ static void unmap_hyp_p4ds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
 
 	start_p4d = p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		/* Hyp doesn't use huge p4ds */
 		if (!p4d_none(*p4d))
 			unmap_hyp_puds(p4d, addr, next);
@@ -609,7 +609,7 @@ static void __unmap_hyp_range(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 	 */
 	pgd = pgdp + kvm_pgd_index(addr, ptrs_per_pgd);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_none(*pgd))
 			unmap_hyp_p4ds(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
@@ -712,7 +712,7 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
 			get_page(virt_to_page(pmd));
 		}
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		create_hyp_pte_mappings(pmd, addr, next, pfn, prot);
 		pfn += (next - addr) >> PAGE_SHIFT;
@@ -744,7 +744,7 @@ static int create_hyp_pud_mappings(p4d_t *p4d, unsigned long start,
 			get_page(virt_to_page(pud));
 		}
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		ret = create_hyp_pmd_mappings(pud, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -777,7 +777,7 @@ static int create_hyp_p4d_mappings(pgd_t *pgd, unsigned long start,
 			get_page(virt_to_page(p4d));
 		}
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		ret = create_hyp_pud_mappings(p4d, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -813,7 +813,7 @@ static int __create_hyp_mappings(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 			get_page(virt_to_page(pgd));
 		}
 
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = create_hyp_p4d_mappings(pgd, addr, next, pfn, prot);
 		if (err)
 			goto out;
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index b24e43d20667..8d1c811fd59e 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -120,7 +120,7 @@ static void __init kasan_pmd_populate(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp = kasan_pmd_offset(pudp, addr, node, early);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		kasan_pte_populate(pmdp, addr, next, node, early);
 	} while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)));
 }
@@ -132,7 +132,7 @@ static void __init kasan_pud_populate(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp = kasan_pud_offset(p4dp, addr, node, early);
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		kasan_pmd_populate(pudp, addr, next, node, early);
 	} while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)));
 }
@@ -144,7 +144,7 @@ static void __init kasan_p4d_populate(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp = p4d_offset(pgdp, addr);
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		kasan_pud_populate(p4dp, addr, next, node, early);
 	} while (p4dp++, addr = next, addr != end);
 }
@@ -157,7 +157,7 @@ static void __init kasan_pgd_populate(unsigned long addr, unsigned long end,
 
 	pgdp = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		kasan_p4d_populate(pgdp, addr, next, node, early);
 	} while (pgdp++, addr = next, addr != end);
 }
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 64211436629d..d679cf024bc8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -209,7 +209,7 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		pmd_t old_pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(old_pmd, addr, end);
 
 		/* try section mapping first */
 		if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
@@ -307,7 +307,7 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
 	do {
 		pud_t old_pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(old_pud, addr, end);
 
 		/*
 		 * For 4K granule only, attempt to put down a 1GB block
@@ -356,7 +356,7 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 	end = PAGE_ALIGN(virt + size);
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		alloc_init_pud(pgdp, addr, next, phys, prot, pgtable_alloc,
 			       flags);
 		phys += next - addr;
@@ -820,9 +820,9 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp, pmd;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -853,9 +853,9 @@ static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp, pud;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -886,9 +886,9 @@ static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -912,9 +912,9 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end,
 	WARN_ON(!free_mapped && altmap);
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -968,9 +968,9 @@ static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -1008,9 +1008,9 @@ static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -1048,9 +1048,9 @@ static void free_empty_p4d_table(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -1066,9 +1066,9 @@ static void free_empty_tables(unsigned long addr, unsigned long end,
 	pgd_t *pgdp, pgd;
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -1097,8 +1097,6 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	pmd_t *pmdp;
 
 	do {
-		next = pmd_addr_end(addr, end);
-
 		pgdp = vmemmap_pgd_populate(addr, node);
 		if (!pgdp)
 			return -ENOMEM;
@@ -1112,6 +1110,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 			return -ENOMEM;
 
 		pmdp = pmd_offset(pudp, addr);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_none(READ_ONCE(*pmdp))) {
 			void *p = NULL;
 
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index cc72666e891a..816e218df285 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -817,7 +817,7 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -847,7 +847,7 @@ static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -878,10 +878,9 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
 	spin_lock(&init_mm.page_table_lock);
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!p4d_present(*p4d))
 			continue;
 
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26292544630f..f0606d6774a4 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -352,7 +352,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 		unsigned long more;
 
 		pmd = pmd_offset(pud, addr);
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!is_hugepd(__hugepd(pmd_val(*pmd)))) {
 			if (pmd_none_or_clear_bad(pmd))
 				continue;
@@ -409,7 +409,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	do {
 		pud = pud_offset(p4d, addr);
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!is_hugepd(__hugepd(pud_val(*pud)))) {
 			if (pud_none_or_clear_bad(pud))
 				continue;
@@ -478,9 +478,9 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	 */
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset(tlb->mm, addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!is_hugepd(__hugepd(pgd_val(*pgd)))) {
 			if (p4d_none_or_clear_bad(p4d))
 				continue;
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 027206e4959d..6fb17ac413be 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -542,14 +542,14 @@ static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned lo
 	return (boundary - 1) < (end - 1) ? boundary : end;
 }
 
-#define pgd_addr_end_folded pgd_addr_end_folded
-static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
+#define pgd_addr_end pgd_addr_end
+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(pgd_val(pgd), addr, end);
 }
 
-#define p4d_addr_end_folded p4d_addr_end_folded
-static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
+#define p4d_addr_end p4d_addr_end
+static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(p4d_val(p4d), addr, end);
 }
diff --git a/arch/s390/mm/page-states.c b/arch/s390/mm/page-states.c
index 567c69f3069e..4aba634b4b26 100644
--- a/arch/s390/mm/page-states.c
+++ b/arch/s390/mm/page-states.c
@@ -109,7 +109,7 @@ static void mark_kernel_pmd(pud_t *pud, unsigned long addr, unsigned long end)
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || pmd_large(*pmd))
 			continue;
 		page = virt_to_page(pmd_val(*pmd));
@@ -126,7 +126,7 @@ static void mark_kernel_pud(p4d_t *p4d, unsigned long addr, unsigned long end)
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || pud_large(*pud))
 			continue;
 		if (!pud_folded(*pud)) {
@@ -147,7 +147,7 @@ static void mark_kernel_p4d(pgd_t *pgd, unsigned long addr, unsigned long end)
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none(*p4d))
 			continue;
 		if (!p4d_folded(*p4d)) {
@@ -169,7 +169,7 @@ static void mark_kernel_pgd(void)
 	addr = 0;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, MODULES_END);
+		next = pgd_addr_end(*pgd, addr, MODULES_END);
 		if (pgd_none(*pgd))
 			continue;
 		if (!pgd_folded(*pgd)) {
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index c5c52ec2b46f..b827d758a17a 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -162,7 +162,7 @@ static int walk_pmd_level(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		if (pmd_none(*pmdp))
 			return -EINVAL;
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_large(*pmdp)) {
 			if (addr & ~PMD_MASK || addr + PMD_SIZE > next) {
 				rc = split_pmd_page(pmdp, addr);
@@ -239,7 +239,7 @@ static int walk_pud_level(p4d_t *p4d, unsigned long addr, unsigned long end,
 	do {
 		if (pud_none(*pudp))
 			return -EINVAL;
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		if (pud_large(*pudp)) {
 			if (addr & ~PUD_MASK || addr + PUD_SIZE > next) {
 				rc = split_pud_page(pudp, addr);
@@ -269,7 +269,7 @@ static int walk_p4d_level(pgd_t *pgd, unsigned long addr, unsigned long end,
 	do {
 		if (p4d_none(*p4dp))
 			return -EINVAL;
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		rc = walk_pud_level(p4dp, addr, next, flags);
 		p4dp++;
 		addr = next;
@@ -296,7 +296,7 @@ static int change_page_attr(unsigned long addr, unsigned long end,
 	do {
 		if (pgd_none(*pgdp))
 			break;
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		rc = walk_p4d_level(pgdp, addr, next, flags);
 		if (rc)
 			break;
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index b239f2ba93b0..672bc89f13e7 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -219,7 +219,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!add) {
 			if (pmd_none(*pmd))
 				continue;
@@ -320,7 +320,7 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 		prot &= ~_REGION_ENTRY_NOEXEC;
 	pud = pud_offset(p4d, addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!add) {
 			if (pud_none(*pud))
 				continue;
@@ -394,7 +394,7 @@ static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!add) {
 			if (p4d_none(*p4d))
 				continue;
@@ -449,8 +449,8 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
 	if (WARN_ON_ONCE(!PAGE_ALIGNED(start | end)))
 		return -EINVAL;
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (!add) {
 			if (pgd_none(*pgd))
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ec423b5f17dd..341c2ff8d31a 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -428,7 +428,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd))
 			continue;
 		if (is_hugetlb_pmd(*pmd))
@@ -465,7 +465,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		if (is_hugetlb_pud(*pud))
@@ -519,7 +519,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	pgd = pgd_offset(tlb->mm, addr);
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		hugetlb_free_pud_range(tlb, p4d, addr, next, floor, ceiling);
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index 61776790cd67..7b4fe31c8df2 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -264,7 +264,7 @@ static inline int update_pmd_range(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_present(*pmd)) {
 			if (hvc->force || pmd_newpage(*pmd)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -286,7 +286,7 @@ static inline int update_pud_range(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_present(*pud)) {
 			if (hvc->force || pud_newpage(*pud)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -308,7 +308,7 @@ static inline int update_p4d_range(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!p4d_present(*p4d)) {
 			if (hvc->force || p4d_newpage(*p4d)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -331,7 +331,7 @@ void fix_range_common(struct mm_struct *mm, unsigned long start_addr,
 	hvc = INIT_HVC(mm, force, userspace);
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end_addr);
+		next = pgd_addr_end(*pgd, addr, end_addr);
 		if (!pgd_present(*pgd)) {
 			if (force || pgd_newpage(*pgd)) {
 				ret = add_munmap(addr, next - addr, &hvc);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a4ac13cc3fdc..e2cb9316a104 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1043,7 +1043,7 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -1099,7 +1099,7 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -1153,7 +1153,7 @@ remove_p4d_table(p4d_t *p4d_start, unsigned long addr, unsigned long end,
 
 	p4d = p4d_start + p4d_index(addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -1186,9 +1186,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct,
 	p4d_t *p4d;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
@@ -1500,8 +1499,6 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 	pmd_t *pmd;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pmd_addr_end(addr, end);
-
 		pgd = vmemmap_pgd_populate(addr, node);
 		if (!pgd)
 			return -ENOMEM;
@@ -1515,6 +1512,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 			return -ENOMEM;
 
 		pmd = pmd_offset(pud, addr);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd)) {
 			void *p;
 
@@ -1623,9 +1621,8 @@ void register_page_bootmem_memmap(unsigned long section_nr,
 			get_page_bootmem(section_nr, pte_page(*pte),
 					 SECTION_INFO);
 		} else {
-			next = pmd_addr_end(addr, end);
-
 			pmd = pmd_offset(pud, addr);
+			next = pmd_addr_end(*pmd, addr, end);
 			if (pmd_none(*pmd))
 				continue;
 
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 1a50434c8a4d..2c105b5154ba 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -96,7 +96,7 @@ static void __init kasan_populate_pud(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_large(*pmd))
 			kasan_populate_pmd(pmd, addr, next, nid);
 	} while (pmd++, addr = next, addr != end);
@@ -116,7 +116,7 @@ static void __init kasan_populate_p4d(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_large(*pud))
 			kasan_populate_pud(pud, addr, next, nid);
 	} while (pud++, addr = next, addr != end);
@@ -136,7 +136,7 @@ static void __init kasan_populate_pgd(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		kasan_populate_p4d(p4d, addr, next, nid);
 	} while (p4d++, addr = next, addr != end);
 }
@@ -151,7 +151,7 @@ static void __init kasan_populate_shadow(unsigned long addr, unsigned long end,
 	end = round_up(end, PAGE_SIZE);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_populate_pgd(pgd, addr, next, nid);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -219,7 +219,7 @@ static void __init kasan_early_p4d_populate(pgd_t *pgd,
 
 	p4d = early_p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_none(*p4d))
 			continue;
@@ -239,7 +239,7 @@ static void __init kasan_map_early_shadow(pgd_t *pgd)
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_early_p4d_populate(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -254,7 +254,7 @@ static void __init kasan_shallow_populate_p4ds(pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (p4d_none(*p4d)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
@@ -272,7 +272,7 @@ static void __init kasan_shallow_populate_pgds(void *start, void *end)
 	addr = (unsigned long)start;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, (unsigned long)end);
+		next = pgd_addr_end(*pgd, addr, (unsigned long)end);
 
 		if (pgd_none(*pgd)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
diff --git a/include/asm-generic/pgtable-nop4d.h b/include/asm-generic/pgtable-nop4d.h
index ce2cbb3c380f..156b42e51424 100644
--- a/include/asm-generic/pgtable-nop4d.h
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -53,7 +53,7 @@ static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
 #define p4d_free_tlb(tlb, x, a)			do { } while (0)
 
 #undef  p4d_addr_end
-#define p4d_addr_end(addr, end)			(end)
+#define p4d_addr_end(p4d, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOP4D_H */
diff --git a/include/asm-generic/pgtable-nopmd.h b/include/asm-generic/pgtable-nopmd.h
index 3e13acd019ae..e988384de1c7 100644
--- a/include/asm-generic/pgtable-nopmd.h
+++ b/include/asm-generic/pgtable-nopmd.h
@@ -64,7 +64,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 #define pmd_free_tlb(tlb, x, a)		do { } while (0)
 
 #undef  pmd_addr_end
-#define pmd_addr_end(addr, end)			(end)
+#define pmd_addr_end(pmd, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index a9d751fbda9e..57a28bade9f9 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -60,7 +60,7 @@ static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
 #define pud_free_tlb(tlb, x, a)		        do { } while (0)
 
 #undef  pud_addr_end
-#define pud_addr_end(addr, end)			(end)
+#define pud_addr_end(pud, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOPUD_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 981c4c2a31fe..67ebc22cf83d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -655,48 +655,34 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
  */
 
-#define pgd_addr_end(addr, end)						\
+#ifndef pgd_addr_end
+#define pgd_addr_end(pgd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
+#endif
 
 #ifndef p4d_addr_end
-#define p4d_addr_end(addr, end)						\
+#define p4d_addr_end(p4d, addr, end)					\
 ({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pud_addr_end
-#define pud_addr_end(addr, end)						\
+#define pud_addr_end(pud, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pmd_addr_end
-#define pmd_addr_end(addr, end)						\
+#define pmd_addr_end(pmd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
-#ifndef pgd_addr_end_folded
-#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
-#endif
-
-#ifndef p4d_addr_end_folded
-#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
-#endif
-
-#ifndef pud_addr_end_folded
-#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
-#endif
-
-#ifndef pmd_addr_end_folded
-#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
-#endif
-
 /*
  * When walking page tables, we usually want to skip any p?d_none entries;
  * and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/gup.c b/mm/gup.c
index ba4aace5d0f4..7826876ae7e0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end_folded(pmd, addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (!pmd_present(pmd))
 			return 0;
 
@@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end_folded(pud, addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
@@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
-		next = p4d_addr_end_folded(p4d, addr, end);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			return 0;
 		BUILD_BUG_ON(p4d_huge(p4d));
@@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		pgd_t pgd = READ_ONCE(*pgdp);
 
-		next = pgd_addr_end_folded(pgd, addr, end);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			return;
 		if (unlikely(pgd_huge(pgd))) {
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..400fa119c09d 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -114,7 +114,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PMD_MODIFIED;
@@ -160,7 +160,7 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PUD_MODIFIED;
@@ -206,7 +206,7 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_P4D_MODIFIED;
@@ -234,7 +234,7 @@ int ioremap_page_range(unsigned long addr,
 	start = addr;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot,
 					&mask);
 		if (err)
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index fe6be0be1f76..829627a92763 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -117,7 +117,7 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (IS_ALIGNED(addr, PMD_SIZE) && end - addr >= PMD_SIZE) {
 			pmd_populate_kernel(&init_mm, pmd,
@@ -150,7 +150,7 @@ static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (IS_ALIGNED(addr, PUD_SIZE) && end - addr >= PUD_SIZE) {
 			pmd_t *pmd;
 
@@ -187,7 +187,7 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (IS_ALIGNED(addr, P4D_SIZE) && end - addr >= P4D_SIZE) {
 			pud_t *pud;
 			pmd_t *pmd;
@@ -236,7 +236,7 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
 	unsigned long next;
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (IS_ALIGNED(addr, PGDIR_SIZE) && end - addr >= PGDIR_SIZE) {
 			p4d_t *p4d;
@@ -370,7 +370,7 @@ static void kasan_remove_pmd_table(pmd_t *pmd, unsigned long addr,
 	for (; addr < end; addr = next, pmd++) {
 		pte_t *pte;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -395,7 +395,7 @@ static void kasan_remove_pud_table(pud_t *pud, unsigned long addr,
 	for (; addr < end; addr = next, pud++) {
 		pmd_t *pmd, *pmd_base;
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -421,7 +421,7 @@ static void kasan_remove_p4d_table(p4d_t *p4d, unsigned long addr,
 	for (; addr < end; addr = next, p4d++) {
 		pud_t *pud;
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -454,9 +454,8 @@ void kasan_remove_zero_shadow(void *start, unsigned long size)
 	for (; addr < end; addr = next) {
 		p4d_t *p4d;
 
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
diff --git a/mm/madvise.c b/mm/madvise.c
index e32e7efbba0f..acfb3441d97e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -326,7 +326,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (pmd_trans_huge(*pmd)) {
 		pmd_t orig_pmd;
-		unsigned long next = pmd_addr_end(addr, end);
+		unsigned long next = pmd_addr_end(*pmd, addr, end);
 
 		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 		ptl = pmd_trans_huge_lock(pmd, vma);
@@ -587,7 +587,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	int nr_swap = 0;
 	unsigned long next;
 
-	next = pmd_addr_end(addr, end);
+	next = pmd_addr_end(*pmd, addr, end);
 	if (pmd_trans_huge(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
diff --git a/mm/memory.c b/mm/memory.c
index fb5463153351..f95424946b0d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -233,7 +233,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		free_pte_range(tlb, pmd, addr);
@@ -267,7 +267,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		free_pmd_range(tlb, pud, addr, next, floor, ceiling);
@@ -301,7 +301,7 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
 	start = addr;
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		free_pud_range(tlb, p4d, addr, next, floor, ceiling);
@@ -381,7 +381,7 @@ void free_pgd_range(struct mmu_gather *tlb,
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	pgd = pgd_offset(tlb->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		free_p4d_range(tlb, pgd, addr, next, floor, ceiling);
@@ -887,7 +887,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*src_pmd, addr, end);
 		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
 			|| pmd_devmap(*src_pmd)) {
 			int err;
@@ -921,7 +921,7 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pud = pud_offset(src_p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*src_pud, addr, end);
 		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
 			int err;
 
@@ -955,7 +955,7 @@ static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_p4d = p4d_offset(src_pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*src_p4d, addr, end);
 		if (p4d_none_or_clear_bad(src_p4d))
 			continue;
 		if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
@@ -1017,7 +1017,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*src_pgd, addr, end);
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
@@ -1177,7 +1177,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
@@ -1212,7 +1212,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
 			if (next - addr != HPAGE_PUD_SIZE) {
 				mmap_assert_locked(tlb->mm);
@@ -1241,7 +1241,7 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		next = zap_pud_range(tlb, vma, p4d, addr, next, details);
@@ -1262,7 +1262,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
@@ -2030,7 +2030,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 		return -ENOMEM;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		err = remap_pte_range(mm, pmd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2052,7 +2052,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		err = remap_pmd_range(mm, pud, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2074,7 +2074,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		err = remap_pud_range(mm, p4d, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2143,7 +2143,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = remap_p4d_range(mm, pgd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2266,7 +2266,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 		pmd = pmd_offset(pud, addr);
 	}
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (create || !pmd_none_or_clear_bad(pmd)) {
 			err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
 						 create, mask);
@@ -2294,7 +2294,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 		pud = pud_offset(p4d, addr);
 	}
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (create || !pud_none_or_clear_bad(pud)) {
 			err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
 						 create, mask);
@@ -2322,7 +2322,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 		p4d = p4d_offset(pgd, addr);
 	}
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (create || !p4d_none_or_clear_bad(p4d)) {
 			err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
 						 create, mask);
@@ -2348,7 +2348,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!create && pgd_none_or_clear_bad(pgd))
 			continue;
 		err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create, &mask);
diff --git a/mm/mlock.c b/mm/mlock.c
index 93ca2bf30b4f..5898e8fe2288 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -374,8 +374,12 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 			struct vm_area_struct *vma, struct zone *zone,
 			unsigned long start, unsigned long end)
 {
-	pte_t *pte;
 	spinlock_t *ptl;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
 
 	/*
 	 * Initialize pte walk starting at the already pinned page where we
@@ -384,10 +388,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 	 */
 	pte = get_locked_pte(vma->vm_mm, start,	&ptl);
 	/* Make sure we do not cross the page table boundary */
-	end = pgd_addr_end(start, end);
-	end = p4d_addr_end(start, end);
-	end = pud_addr_end(start, end);
-	end = pmd_addr_end(start, end);
+	pgd = pgd_offset(vma->vm_mm, start);
+	end = pgd_addr_end(*pgd, start, end);
+	p4d = p4d_offset(pgd, start);
+	end = p4d_addr_end(*p4d, start, end);
+	pud = pud_offset(p4d, start);
+	end = pud_addr_end(*pud, start, end);
+	pmd = pmd_offset(pud, start);
+	end = pmd_addr_end(*pmd, start, end);
 
 	/* The page next to the pinned page is the first we will try to get */
 	start += PAGE_SIZE;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..278f2dbd1f20 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -225,7 +225,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	do {
 		unsigned long this_pages;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
@@ -291,7 +291,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
@@ -311,7 +311,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
@@ -336,7 +336,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	flush_cache_range(vma, addr, end);
 	inc_tlb_flush_pending(mm);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..a5b9f61b5d45 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -70,7 +70,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	pmd = pmd_offset(pud, addr);
 	do {
 again:
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -128,7 +128,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 	pud = pud_offset(p4d, addr);
 	do {
  again:
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -176,7 +176,7 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -211,7 +211,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
 	else
 		pgd = pgd_offset(walk->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, 0, walk);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 20012c0c0252..b1dd815aee6b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2018,7 +2018,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	pmd = pmd_offset(pud, addr);
 	do {
 		cond_resched();
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type,
@@ -2040,7 +2040,7 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		ret = unuse_pmd_range(vma, pud, addr, next, type,
@@ -2062,7 +2062,7 @@ static inline int unuse_p4d_range(struct vm_area_struct *vma, pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		ret = unuse_pud_range(vma, p4d, addr, next, type,
@@ -2085,7 +2085,7 @@ static int unuse_vma(struct vm_area_struct *vma, unsigned int type,
 
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		ret = unuse_p4d_range(vma, pgd, addr, next, type,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index be4724b916b3..09ff0d5ecbc1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -93,7 +93,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		cleared = pmd_clear_huge(pmd);
 		if (cleared || pmd_bad(*pmd))
@@ -118,7 +118,7 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		cleared = pud_clear_huge(pud);
 		if (cleared || pud_bad(*pud))
@@ -141,7 +141,7 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		cleared = p4d_clear_huge(p4d);
 		if (cleared || p4d_bad(*p4d))
@@ -179,7 +179,7 @@ void unmap_kernel_range_noflush(unsigned long start, unsigned long size)
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		if (pgd_none_or_clear_bad(pgd))
@@ -230,7 +230,7 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
@@ -248,7 +248,7 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
@@ -266,7 +266,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
@@ -305,7 +305,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 313+ messages in thread

* [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
@ 2020-09-07 18:00   ` Gerald Schaefer
  0 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport

From: Alexander Gordeev <agordeev@linux.ibm.com>

Unlike all other page-table abstractions, the pXd_addr_end() functions
do not take into account the particular table entry in whose context
they are called. On architectures with dynamic page-table folding this
can leave them without information that is hard to obtain other than
from the table entry itself. That already led to a subtle memory
corruption issue on s390.

By letting the pXd_addr_end() functions know about the page-table
entry, we allow architectures not only to make extra checks, but also
to apply optimizations.
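
For illustration, the generic fallback keeps the old address arithmetic
and merely gains the (unused) entry parameter; condensed from the
include/linux/pgtable.h hunk below, with the pmd level as an example:

#ifndef pmd_addr_end
#define pmd_addr_end(pmd, addr, end)					\
({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
})
#endif

Only an architecture override, such as the s390 pgd/p4d_addr_end()
added by this patch, actually looks at the entry value.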

As a result of this change, the pXd_addr_end_folded() functions used
in the gup_fast traversal code become unnecessary and are replaced
with the universal pXd_addr_end() variants.
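
Condensed from the mm/gup.c hunk below, the pmd-level gup_fast walk
then becomes (huge-page handling and the call down to gup_pte_range()
are elided here):

	do {
		pmd_t pmd = READ_ONCE(*pmdp);

		next = pmd_addr_end(pmd, addr, end);	/* was pmd_addr_end_folded() */
		if (!pmd_present(pmd))
			return 0;
		/* ... huge pmd and gup_pte_range() handling ... */
	} while (pmdp++, addr = next, addr != end);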

The arch-specific updates not only add dereferencing of page-table
entry pointers, but also small changes to the code flow that make
those dereferences possible, at least for x86 and powerpc. The same
applies to arm64, but in a way that should not have any impact.
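
The typical flow change is that the boundary is now computed after the
entry has been read, rather than before the entry pointer is even
known. A condensed sketch of the pattern, as in the arm64
unmap_hotplug_pmd_range() hunk below (the loop body and terminator are
abbreviated here):

	do {
		pmdp = pmd_offset(pudp, addr);
		pmd = READ_ONCE(*pmdp);
		next = pmd_addr_end(pmd, addr, end);	/* moved below READ_ONCE() */
		if (pmd_none(pmd))
			continue;
		/* ... unmap and free the pte level ... */
	} while (addr = next, addr < end);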

So, even though the dereferenced page-table entries are not used on
architectures other than s390, and are optimized out by the compiler,
there is a small change in kernel size; this is what bloat-o-meter
reports:

x86:
add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10)
Function                                     old     new   delta
vmemmap_populate                             587     592      +5
munlock_vma_pages_range                      556     561      +5
Total: Before=15534694, After=15534704, chg +0.00%

powerpc:
add/remove: 0/0 grow/shrink: 1/0 up/down: 4/0 (4)
Function                                     old     new   delta
.remove_pagetable                           1648    1652      +4
Total: Before=21478240, After=21478244, chg +0.00%

arm64:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=20240851, After=20240851, chg +0.00%

sparc:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=4907262, After=4907262, chg +0.00%

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/arm/include/asm/pgtable-2level.h    |  2 +-
 arch/arm/mm/idmap.c                      |  6 ++--
 arch/arm/mm/mmu.c                        |  8 ++---
 arch/arm64/kernel/hibernate.c            | 16 ++++++----
 arch/arm64/kvm/mmu.c                     | 16 +++++-----
 arch/arm64/mm/kasan_init.c               |  8 ++---
 arch/arm64/mm/mmu.c                      | 25 +++++++--------
 arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++---
 arch/powerpc/mm/hugetlbpage.c            |  6 ++--
 arch/s390/include/asm/pgtable.h          |  8 ++---
 arch/s390/mm/page-states.c               |  8 ++---
 arch/s390/mm/pageattr.c                  |  8 ++---
 arch/s390/mm/vmem.c                      |  8 ++---
 arch/sparc/mm/hugetlbpage.c              |  6 ++--
 arch/um/kernel/tlb.c                     |  8 ++---
 arch/x86/mm/init_64.c                    | 15 ++++-----
 arch/x86/mm/kasan_init_64.c              | 16 +++++-----
 include/asm-generic/pgtable-nop4d.h      |  2 +-
 include/asm-generic/pgtable-nopmd.h      |  2 +-
 include/asm-generic/pgtable-nopud.h      |  2 +-
 include/linux/pgtable.h                  | 26 ++++-----------
 mm/gup.c                                 |  8 ++---
 mm/ioremap.c                             |  8 ++---
 mm/kasan/init.c                          | 17 +++++-----
 mm/madvise.c                             |  4 +--
 mm/memory.c                              | 40 ++++++++++++------------
 mm/mlock.c                               | 18 ++++++++---
 mm/mprotect.c                            |  8 ++---
 mm/pagewalk.c                            |  8 ++---
 mm/swapfile.c                            |  8 ++---
 mm/vmalloc.c                             | 16 +++++-----
 31 files changed, 165 insertions(+), 173 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-2level.h b/arch/arm/include/asm/pgtable-2level.h
index 3502c2f746ca..5e6416b339f4 100644
--- a/arch/arm/include/asm/pgtable-2level.h
+++ b/arch/arm/include/asm/pgtable-2level.h
@@ -209,7 +209,7 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
 	} while (0)
 
 /* we don't need complex calculations here as the pmd is folded into the pgd */
-#define pmd_addr_end(addr,end) (end)
+#define pmd_addr_end(pmd,addr,end) (end)
 
 #define set_pte_ext(ptep,pte,ext) cpu_set_pte_ext(ptep,pte,ext)
 
diff --git a/arch/arm/mm/idmap.c b/arch/arm/mm/idmap.c
index 448e57c6f653..5437f943ca8b 100644
--- a/arch/arm/mm/idmap.c
+++ b/arch/arm/mm/idmap.c
@@ -46,7 +46,7 @@ static void idmap_add_pmd(pud_t *pud, unsigned long addr, unsigned long end,
 		pmd = pmd_offset(pud, addr);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		*pmd = __pmd((addr & PMD_MASK) | prot);
 		flush_pmd_entry(pmd);
 	} while (pmd++, addr = next, addr != end);
@@ -73,7 +73,7 @@ static void idmap_add_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		idmap_add_pmd(pud, addr, next, prot);
 	} while (pud++, addr = next, addr != end);
 }
@@ -95,7 +95,7 @@ static void identity_mapping_add(pgd_t *pgd, const char *text_start,
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		idmap_add_pud(pgd, addr, next, prot);
 	} while (pgd++, addr = next, addr != end);
 }
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 698cc740c6b8..4013746e4c75 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -777,7 +777,7 @@ static void __init alloc_init_pmd(pud_t *pud, unsigned long addr,
 		 * With LPAE, we must loop over to map
 		 * all the pmds for the given range.
 		 */
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Try a section mapping - addr, next and phys must all be
@@ -805,7 +805,7 @@ static void __init alloc_init_pud(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		alloc_init_pmd(pud, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (pud++, addr = next, addr != end);
@@ -820,7 +820,7 @@ static void __init alloc_init_p4d(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		alloc_init_pud(p4d, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (p4d++, addr = next, addr != end);
@@ -923,7 +923,7 @@ static void __init __create_mapping(struct mm_struct *mm, struct map_desc *md,
 	pgd = pgd_offset(mm, addr);
 	end = addr + length;
 	do {
-		unsigned long next = pgd_addr_end(addr, end);
+		unsigned long next = pgd_addr_end(*pgd, addr, end);
 
 		alloc_init_p4d(pgd, addr, next, phys, type, alloc, ng);
 
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 68e14152d6e9..7be8c9cdc5c8 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -412,7 +412,7 @@ static int copy_pmd(pud_t *dst_pudp, pud_t *src_pudp, unsigned long start,
 	do {
 		pmd_t pmd = READ_ONCE(*src_pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 		if (pmd_table(pmd)) {
@@ -447,7 +447,7 @@ static int copy_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp, unsigned long start,
 	do {
 		pud_t pud = READ_ONCE(*src_pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 		if (pud_table(pud)) {
@@ -473,8 +473,10 @@ static int copy_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp, unsigned long start,
 	dst_p4dp = p4d_offset(dst_pgdp, start);
 	src_p4dp = p4d_offset(src_pgdp, start);
 	do {
-		next = p4d_addr_end(addr, end);
-		if (p4d_none(READ_ONCE(*src_p4dp)))
+		p4d_t p4d = READ_ONCE(*src_p4dp);
+
+		next = p4d_addr_end(p4d, addr, end);
+		if (p4d_none(p4d))
 			continue;
 		if (copy_pud(dst_p4dp, src_p4dp, addr, next))
 			return -ENOMEM;
@@ -492,8 +494,10 @@ static int copy_page_tables(pgd_t *dst_pgdp, unsigned long start,
 
 	dst_pgdp = pgd_offset_pgd(dst_pgdp, start);
 	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none(READ_ONCE(*src_pgdp)))
+		pgd_t pgd = READ_ONCE(*src_pgdp);
+
+		next = pgd_addr_end(pgd, addr, end);
+		if (pgd_none(pgd))
 			continue;
 		if (copy_p4d(dst_pgdp, src_pgdp, addr, next))
 			return -ENOMEM;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ba00bcc0c884..8f470f93a8e9 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -547,7 +547,7 @@ static void unmap_hyp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
 
 	start_pmd = pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		/* Hyp doesn't use huge pmds */
 		if (!pmd_none(*pmd))
 			unmap_hyp_ptes(pmd, addr, next);
@@ -564,7 +564,7 @@ static void unmap_hyp_puds(p4d_t *p4d, phys_addr_t addr, phys_addr_t end)
 
 	start_pud = pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		/* Hyp doesn't use huge puds */
 		if (!pud_none(*pud))
 			unmap_hyp_pmds(pud, addr, next);
@@ -581,7 +581,7 @@ static void unmap_hyp_p4ds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
 
 	start_p4d = p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		/* Hyp doesn't use huge p4ds */
 		if (!p4d_none(*p4d))
 			unmap_hyp_puds(p4d, addr, next);
@@ -609,7 +609,7 @@ static void __unmap_hyp_range(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 	 */
 	pgd = pgdp + kvm_pgd_index(addr, ptrs_per_pgd);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_none(*pgd))
 			unmap_hyp_p4ds(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
@@ -712,7 +712,7 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
 			get_page(virt_to_page(pmd));
 		}
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		create_hyp_pte_mappings(pmd, addr, next, pfn, prot);
 		pfn += (next - addr) >> PAGE_SHIFT;
@@ -744,7 +744,7 @@ static int create_hyp_pud_mappings(p4d_t *p4d, unsigned long start,
 			get_page(virt_to_page(pud));
 		}
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		ret = create_hyp_pmd_mappings(pud, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -777,7 +777,7 @@ static int create_hyp_p4d_mappings(pgd_t *pgd, unsigned long start,
 			get_page(virt_to_page(p4d));
 		}
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		ret = create_hyp_pud_mappings(p4d, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -813,7 +813,7 @@ static int __create_hyp_mappings(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 			get_page(virt_to_page(pgd));
 		}
 
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = create_hyp_p4d_mappings(pgd, addr, next, pfn, prot);
 		if (err)
 			goto out;
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index b24e43d20667..8d1c811fd59e 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -120,7 +120,7 @@ static void __init kasan_pmd_populate(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp = kasan_pmd_offset(pudp, addr, node, early);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		kasan_pte_populate(pmdp, addr, next, node, early);
 	} while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)));
 }
@@ -132,7 +132,7 @@ static void __init kasan_pud_populate(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp = kasan_pud_offset(p4dp, addr, node, early);
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		kasan_pmd_populate(pudp, addr, next, node, early);
 	} while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)));
 }
@@ -144,7 +144,7 @@ static void __init kasan_p4d_populate(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp = p4d_offset(pgdp, addr);
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		kasan_pud_populate(p4dp, addr, next, node, early);
 	} while (p4dp++, addr = next, addr != end);
 }
@@ -157,7 +157,7 @@ static void __init kasan_pgd_populate(unsigned long addr, unsigned long end,
 
 	pgdp = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		kasan_p4d_populate(pgdp, addr, next, node, early);
 	} while (pgdp++, addr = next, addr != end);
 }
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 64211436629d..d679cf024bc8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -209,7 +209,7 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		pmd_t old_pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(old_pmd, addr, end);
 
 		/* try section mapping first */
 		if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
@@ -307,7 +307,7 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
 	do {
 		pud_t old_pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(old_pud, addr, end);
 
 		/*
 		 * For 4K granule only, attempt to put down a 1GB block
@@ -356,7 +356,7 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 	end = PAGE_ALIGN(virt + size);
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		alloc_init_pud(pgdp, addr, next, phys, prot, pgtable_alloc,
 			       flags);
 		phys += next - addr;
@@ -820,9 +820,9 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp, pmd;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -853,9 +853,9 @@ static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp, pud;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -886,9 +886,9 @@ static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -912,9 +912,9 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end,
 	WARN_ON(!free_mapped && altmap);
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -968,9 +968,9 @@ static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -1008,9 +1008,9 @@ static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -1048,9 +1048,9 @@ static void free_empty_p4d_table(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -1066,9 +1066,9 @@ static void free_empty_tables(unsigned long addr, unsigned long end,
 	pgd_t *pgdp, pgd;
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -1097,8 +1097,6 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	pmd_t *pmdp;
 
 	do {
-		next = pmd_addr_end(addr, end);
-
 		pgdp = vmemmap_pgd_populate(addr, node);
 		if (!pgdp)
 			return -ENOMEM;
@@ -1112,6 +1110,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 			return -ENOMEM;
 
 		pmdp = pmd_offset(pudp, addr);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_none(READ_ONCE(*pmdp))) {
 			void *p = NULL;
 
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index cc72666e891a..816e218df285 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -817,7 +817,7 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -847,7 +847,7 @@ static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -878,10 +878,9 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
 	spin_lock(&init_mm.page_table_lock);
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!p4d_present(*p4d))
 			continue;
 
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26292544630f..f0606d6774a4 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -352,7 +352,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 		unsigned long more;
 
 		pmd = pmd_offset(pud, addr);
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!is_hugepd(__hugepd(pmd_val(*pmd)))) {
 			if (pmd_none_or_clear_bad(pmd))
 				continue;
@@ -409,7 +409,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	do {
 		pud = pud_offset(p4d, addr);
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!is_hugepd(__hugepd(pud_val(*pud)))) {
 			if (pud_none_or_clear_bad(pud))
 				continue;
@@ -478,9 +478,9 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	 */
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset(tlb->mm, addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!is_hugepd(__hugepd(pgd_val(*pgd)))) {
 			if (p4d_none_or_clear_bad(p4d))
 				continue;
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 027206e4959d..6fb17ac413be 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -542,14 +542,14 @@ static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned lo
 	return (boundary - 1) < (end - 1) ? boundary : end;
 }
 
-#define pgd_addr_end_folded pgd_addr_end_folded
-static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
+#define pgd_addr_end pgd_addr_end
+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(pgd_val(pgd), addr, end);
 }
 
-#define p4d_addr_end_folded p4d_addr_end_folded
-static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
+#define p4d_addr_end p4d_addr_end
+static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(p4d_val(p4d), addr, end);
 }
diff --git a/arch/s390/mm/page-states.c b/arch/s390/mm/page-states.c
index 567c69f3069e..4aba634b4b26 100644
--- a/arch/s390/mm/page-states.c
+++ b/arch/s390/mm/page-states.c
@@ -109,7 +109,7 @@ static void mark_kernel_pmd(pud_t *pud, unsigned long addr, unsigned long end)
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || pmd_large(*pmd))
 			continue;
 		page = virt_to_page(pmd_val(*pmd));
@@ -126,7 +126,7 @@ static void mark_kernel_pud(p4d_t *p4d, unsigned long addr, unsigned long end)
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || pud_large(*pud))
 			continue;
 		if (!pud_folded(*pud)) {
@@ -147,7 +147,7 @@ static void mark_kernel_p4d(pgd_t *pgd, unsigned long addr, unsigned long end)
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none(*p4d))
 			continue;
 		if (!p4d_folded(*p4d)) {
@@ -169,7 +169,7 @@ static void mark_kernel_pgd(void)
 	addr = 0;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, MODULES_END);
+		next = pgd_addr_end(*pgd, addr, MODULES_END);
 		if (pgd_none(*pgd))
 			continue;
 		if (!pgd_folded(*pgd)) {
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index c5c52ec2b46f..b827d758a17a 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -162,7 +162,7 @@ static int walk_pmd_level(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		if (pmd_none(*pmdp))
 			return -EINVAL;
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_large(*pmdp)) {
 			if (addr & ~PMD_MASK || addr + PMD_SIZE > next) {
 				rc = split_pmd_page(pmdp, addr);
@@ -239,7 +239,7 @@ static int walk_pud_level(p4d_t *p4d, unsigned long addr, unsigned long end,
 	do {
 		if (pud_none(*pudp))
 			return -EINVAL;
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		if (pud_large(*pudp)) {
 			if (addr & ~PUD_MASK || addr + PUD_SIZE > next) {
 				rc = split_pud_page(pudp, addr);
@@ -269,7 +269,7 @@ static int walk_p4d_level(pgd_t *pgd, unsigned long addr, unsigned long end,
 	do {
 		if (p4d_none(*p4dp))
 			return -EINVAL;
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		rc = walk_pud_level(p4dp, addr, next, flags);
 		p4dp++;
 		addr = next;
@@ -296,7 +296,7 @@ static int change_page_attr(unsigned long addr, unsigned long end,
 	do {
 		if (pgd_none(*pgdp))
 			break;
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		rc = walk_p4d_level(pgdp, addr, next, flags);
 		if (rc)
 			break;
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index b239f2ba93b0..672bc89f13e7 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -219,7 +219,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!add) {
 			if (pmd_none(*pmd))
 				continue;
@@ -320,7 +320,7 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 		prot &= ~_REGION_ENTRY_NOEXEC;
 	pud = pud_offset(p4d, addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!add) {
 			if (pud_none(*pud))
 				continue;
@@ -394,7 +394,7 @@ static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!add) {
 			if (p4d_none(*p4d))
 				continue;
@@ -449,8 +449,8 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
 	if (WARN_ON_ONCE(!PAGE_ALIGNED(start | end)))
 		return -EINVAL;
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (!add) {
 			if (pgd_none(*pgd))
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ec423b5f17dd..341c2ff8d31a 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -428,7 +428,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd))
 			continue;
 		if (is_hugetlb_pmd(*pmd))
@@ -465,7 +465,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		if (is_hugetlb_pud(*pud))
@@ -519,7 +519,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	pgd = pgd_offset(tlb->mm, addr);
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		hugetlb_free_pud_range(tlb, p4d, addr, next, floor, ceiling);
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index 61776790cd67..7b4fe31c8df2 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -264,7 +264,7 @@ static inline int update_pmd_range(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_present(*pmd)) {
 			if (hvc->force || pmd_newpage(*pmd)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -286,7 +286,7 @@ static inline int update_pud_range(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_present(*pud)) {
 			if (hvc->force || pud_newpage(*pud)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -308,7 +308,7 @@ static inline int update_p4d_range(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!p4d_present(*p4d)) {
 			if (hvc->force || p4d_newpage(*p4d)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -331,7 +331,7 @@ void fix_range_common(struct mm_struct *mm, unsigned long start_addr,
 	hvc = INIT_HVC(mm, force, userspace);
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end_addr);
+		next = pgd_addr_end(*pgd, addr, end_addr);
 		if (!pgd_present(*pgd)) {
 			if (force || pgd_newpage(*pgd)) {
 				ret = add_munmap(addr, next - addr, &hvc);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a4ac13cc3fdc..e2cb9316a104 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1043,7 +1043,7 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -1099,7 +1099,7 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -1153,7 +1153,7 @@ remove_p4d_table(p4d_t *p4d_start, unsigned long addr, unsigned long end,
 
 	p4d = p4d_start + p4d_index(addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -1186,9 +1186,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct,
 	p4d_t *p4d;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
@@ -1500,8 +1499,6 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 	pmd_t *pmd;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pmd_addr_end(addr, end);
-
 		pgd = vmemmap_pgd_populate(addr, node);
 		if (!pgd)
 			return -ENOMEM;
@@ -1515,6 +1512,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 			return -ENOMEM;
 
 		pmd = pmd_offset(pud, addr);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd)) {
 			void *p;
 
@@ -1623,9 +1621,8 @@ void register_page_bootmem_memmap(unsigned long section_nr,
 			get_page_bootmem(section_nr, pte_page(*pte),
 					 SECTION_INFO);
 		} else {
-			next = pmd_addr_end(addr, end);
-
 			pmd = pmd_offset(pud, addr);
+			next = pmd_addr_end(*pmd, addr, end);
 			if (pmd_none(*pmd))
 				continue;
 
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 1a50434c8a4d..2c105b5154ba 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -96,7 +96,7 @@ static void __init kasan_populate_pud(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_large(*pmd))
 			kasan_populate_pmd(pmd, addr, next, nid);
 	} while (pmd++, addr = next, addr != end);
@@ -116,7 +116,7 @@ static void __init kasan_populate_p4d(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_large(*pud))
 			kasan_populate_pud(pud, addr, next, nid);
 	} while (pud++, addr = next, addr != end);
@@ -136,7 +136,7 @@ static void __init kasan_populate_pgd(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		kasan_populate_p4d(p4d, addr, next, nid);
 	} while (p4d++, addr = next, addr != end);
 }
@@ -151,7 +151,7 @@ static void __init kasan_populate_shadow(unsigned long addr, unsigned long end,
 	end = round_up(end, PAGE_SIZE);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_populate_pgd(pgd, addr, next, nid);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -219,7 +219,7 @@ static void __init kasan_early_p4d_populate(pgd_t *pgd,
 
 	p4d = early_p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_none(*p4d))
 			continue;
@@ -239,7 +239,7 @@ static void __init kasan_map_early_shadow(pgd_t *pgd)
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_early_p4d_populate(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -254,7 +254,7 @@ static void __init kasan_shallow_populate_p4ds(pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (p4d_none(*p4d)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
@@ -272,7 +272,7 @@ static void __init kasan_shallow_populate_pgds(void *start, void *end)
 	addr = (unsigned long)start;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, (unsigned long)end);
+		next = pgd_addr_end(*pgd, addr, (unsigned long)end);
 
 		if (pgd_none(*pgd)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
diff --git a/include/asm-generic/pgtable-nop4d.h b/include/asm-generic/pgtable-nop4d.h
index ce2cbb3c380f..156b42e51424 100644
--- a/include/asm-generic/pgtable-nop4d.h
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -53,7 +53,7 @@ static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
 #define p4d_free_tlb(tlb, x, a)			do { } while (0)
 
 #undef  p4d_addr_end
-#define p4d_addr_end(addr, end)			(end)
+#define p4d_addr_end(p4d, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOP4D_H */
diff --git a/include/asm-generic/pgtable-nopmd.h b/include/asm-generic/pgtable-nopmd.h
index 3e13acd019ae..e988384de1c7 100644
--- a/include/asm-generic/pgtable-nopmd.h
+++ b/include/asm-generic/pgtable-nopmd.h
@@ -64,7 +64,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 #define pmd_free_tlb(tlb, x, a)		do { } while (0)
 
 #undef  pmd_addr_end
-#define pmd_addr_end(addr, end)			(end)
+#define pmd_addr_end(pmd, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index a9d751fbda9e..57a28bade9f9 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -60,7 +60,7 @@ static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
 #define pud_free_tlb(tlb, x, a)		        do { } while (0)
 
 #undef  pud_addr_end
-#define pud_addr_end(addr, end)			(end)
+#define pud_addr_end(pud, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOPUD_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 981c4c2a31fe..67ebc22cf83d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -655,48 +655,34 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
  */
 
-#define pgd_addr_end(addr, end)						\
+#ifndef pgd_addr_end
+#define pgd_addr_end(pgd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
+#endif
 
 #ifndef p4d_addr_end
-#define p4d_addr_end(addr, end)						\
+#define p4d_addr_end(p4d, addr, end)					\
 ({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pud_addr_end
-#define pud_addr_end(addr, end)						\
+#define pud_addr_end(pud, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pmd_addr_end
-#define pmd_addr_end(addr, end)						\
+#define pmd_addr_end(pmd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
-#ifndef pgd_addr_end_folded
-#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
-#endif
-
-#ifndef p4d_addr_end_folded
-#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
-#endif
-
-#ifndef pud_addr_end_folded
-#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
-#endif
-
-#ifndef pmd_addr_end_folded
-#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
-#endif
-
 /*
  * When walking page tables, we usually want to skip any p?d_none entries;
  * and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/gup.c b/mm/gup.c
index ba4aace5d0f4..7826876ae7e0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end_folded(pmd, addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (!pmd_present(pmd))
 			return 0;
 
@@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end_folded(pud, addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
@@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
-		next = p4d_addr_end_folded(p4d, addr, end);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			return 0;
 		BUILD_BUG_ON(p4d_huge(p4d));
@@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		pgd_t pgd = READ_ONCE(*pgdp);
 
-		next = pgd_addr_end_folded(pgd, addr, end);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			return;
 		if (unlikely(pgd_huge(pgd))) {
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..400fa119c09d 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -114,7 +114,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PMD_MODIFIED;
@@ -160,7 +160,7 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PUD_MODIFIED;
@@ -206,7 +206,7 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_P4D_MODIFIED;
@@ -234,7 +234,7 @@ int ioremap_page_range(unsigned long addr,
 	start = addr;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot,
 					&mask);
 		if (err)
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index fe6be0be1f76..829627a92763 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -117,7 +117,7 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (IS_ALIGNED(addr, PMD_SIZE) && end - addr >= PMD_SIZE) {
 			pmd_populate_kernel(&init_mm, pmd,
@@ -150,7 +150,7 @@ static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (IS_ALIGNED(addr, PUD_SIZE) && end - addr >= PUD_SIZE) {
 			pmd_t *pmd;
 
@@ -187,7 +187,7 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (IS_ALIGNED(addr, P4D_SIZE) && end - addr >= P4D_SIZE) {
 			pud_t *pud;
 			pmd_t *pmd;
@@ -236,7 +236,7 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
 	unsigned long next;
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (IS_ALIGNED(addr, PGDIR_SIZE) && end - addr >= PGDIR_SIZE) {
 			p4d_t *p4d;
@@ -370,7 +370,7 @@ static void kasan_remove_pmd_table(pmd_t *pmd, unsigned long addr,
 	for (; addr < end; addr = next, pmd++) {
 		pte_t *pte;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -395,7 +395,7 @@ static void kasan_remove_pud_table(pud_t *pud, unsigned long addr,
 	for (; addr < end; addr = next, pud++) {
 		pmd_t *pmd, *pmd_base;
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -421,7 +421,7 @@ static void kasan_remove_p4d_table(p4d_t *p4d, unsigned long addr,
 	for (; addr < end; addr = next, p4d++) {
 		pud_t *pud;
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -454,9 +454,8 @@ void kasan_remove_zero_shadow(void *start, unsigned long size)
 	for (; addr < end; addr = next) {
 		p4d_t *p4d;
 
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
diff --git a/mm/madvise.c b/mm/madvise.c
index e32e7efbba0f..acfb3441d97e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -326,7 +326,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (pmd_trans_huge(*pmd)) {
 		pmd_t orig_pmd;
-		unsigned long next = pmd_addr_end(addr, end);
+		unsigned long next = pmd_addr_end(*pmd, addr, end);
 
 		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 		ptl = pmd_trans_huge_lock(pmd, vma);
@@ -587,7 +587,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	int nr_swap = 0;
 	unsigned long next;
 
-	next = pmd_addr_end(addr, end);
+	next = pmd_addr_end(*pmd, addr, end);
 	if (pmd_trans_huge(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
diff --git a/mm/memory.c b/mm/memory.c
index fb5463153351..f95424946b0d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -233,7 +233,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		free_pte_range(tlb, pmd, addr);
@@ -267,7 +267,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		free_pmd_range(tlb, pud, addr, next, floor, ceiling);
@@ -301,7 +301,7 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
 	start = addr;
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		free_pud_range(tlb, p4d, addr, next, floor, ceiling);
@@ -381,7 +381,7 @@ void free_pgd_range(struct mmu_gather *tlb,
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	pgd = pgd_offset(tlb->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		free_p4d_range(tlb, pgd, addr, next, floor, ceiling);
@@ -887,7 +887,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*src_pmd, addr, end);
 		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
 			|| pmd_devmap(*src_pmd)) {
 			int err;
@@ -921,7 +921,7 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pud = pud_offset(src_p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*src_pud, addr, end);
 		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
 			int err;
 
@@ -955,7 +955,7 @@ static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_p4d = p4d_offset(src_pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*src_p4d, addr, end);
 		if (p4d_none_or_clear_bad(src_p4d))
 			continue;
 		if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
@@ -1017,7 +1017,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*src_pgd, addr, end);
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
@@ -1177,7 +1177,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
@@ -1212,7 +1212,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
 			if (next - addr != HPAGE_PUD_SIZE) {
 				mmap_assert_locked(tlb->mm);
@@ -1241,7 +1241,7 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		next = zap_pud_range(tlb, vma, p4d, addr, next, details);
@@ -1262,7 +1262,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
@@ -2030,7 +2030,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 		return -ENOMEM;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		err = remap_pte_range(mm, pmd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2052,7 +2052,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		err = remap_pmd_range(mm, pud, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2074,7 +2074,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		err = remap_pud_range(mm, p4d, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2143,7 +2143,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = remap_p4d_range(mm, pgd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2266,7 +2266,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 		pmd = pmd_offset(pud, addr);
 	}
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (create || !pmd_none_or_clear_bad(pmd)) {
 			err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
 						 create, mask);
@@ -2294,7 +2294,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 		pud = pud_offset(p4d, addr);
 	}
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (create || !pud_none_or_clear_bad(pud)) {
 			err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
 						 create, mask);
@@ -2322,7 +2322,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 		p4d = p4d_offset(pgd, addr);
 	}
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (create || !p4d_none_or_clear_bad(p4d)) {
 			err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
 						 create, mask);
@@ -2348,7 +2348,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!create && pgd_none_or_clear_bad(pgd))
 			continue;
 		err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create, &mask);
diff --git a/mm/mlock.c b/mm/mlock.c
index 93ca2bf30b4f..5898e8fe2288 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -374,8 +374,12 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 			struct vm_area_struct *vma, struct zone *zone,
 			unsigned long start, unsigned long end)
 {
-	pte_t *pte;
 	spinlock_t *ptl;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
 
 	/*
 	 * Initialize pte walk starting at the already pinned page where we
@@ -384,10 +388,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 	 */
 	pte = get_locked_pte(vma->vm_mm, start,	&ptl);
 	/* Make sure we do not cross the page table boundary */
-	end = pgd_addr_end(start, end);
-	end = p4d_addr_end(start, end);
-	end = pud_addr_end(start, end);
-	end = pmd_addr_end(start, end);
+	pgd = pgd_offset(vma->vm_mm, start);
+	end = pgd_addr_end(*pgd, start, end);
+	p4d = p4d_offset(pgd, start);
+	end = p4d_addr_end(*p4d, start, end);
+	pud = pud_offset(p4d, start);
+	end = pud_addr_end(*pud, start, end);
+	pmd = pmd_offset(pud, start);
+	end = pmd_addr_end(*pmd, start, end);
 
 	/* The page next to the pinned page is the first we will try to get */
 	start += PAGE_SIZE;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..278f2dbd1f20 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -225,7 +225,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	do {
 		unsigned long this_pages;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
@@ -291,7 +291,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
@@ -311,7 +311,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
@@ -336,7 +336,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	flush_cache_range(vma, addr, end);
 	inc_tlb_flush_pending(mm);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..a5b9f61b5d45 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -70,7 +70,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	pmd = pmd_offset(pud, addr);
 	do {
 again:
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -128,7 +128,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 	pud = pud_offset(p4d, addr);
 	do {
  again:
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -176,7 +176,7 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -211,7 +211,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
 	else
 		pgd = pgd_offset(walk->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, 0, walk);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 20012c0c0252..b1dd815aee6b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2018,7 +2018,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	pmd = pmd_offset(pud, addr);
 	do {
 		cond_resched();
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type,
@@ -2040,7 +2040,7 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		ret = unuse_pmd_range(vma, pud, addr, next, type,
@@ -2062,7 +2062,7 @@ static inline int unuse_p4d_range(struct vm_area_struct *vma, pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		ret = unuse_pud_range(vma, p4d, addr, next, type,
@@ -2085,7 +2085,7 @@ static int unuse_vma(struct vm_area_struct *vma, unsigned int type,
 
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		ret = unuse_p4d_range(vma, pgd, addr, next, type,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index be4724b916b3..09ff0d5ecbc1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -93,7 +93,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		cleared = pmd_clear_huge(pmd);
 		if (cleared || pmd_bad(*pmd))
@@ -118,7 +118,7 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		cleared = pud_clear_huge(pud);
 		if (cleared || pud_bad(*pud))
@@ -141,7 +141,7 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		cleared = p4d_clear_huge(p4d);
 		if (cleared || p4d_bad(*p4d))
@@ -179,7 +179,7 @@ void unmap_kernel_range_noflush(unsigned long start, unsigned long size)
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		if (pgd_none_or_clear_bad(pgd))
@@ -230,7 +230,7 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
@@ -248,7 +248,7 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
@@ -266,7 +266,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
@@ -305,7 +305,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 313+ messages in thread

* [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
@ 2020-09-07 18:00   ` Gerald Schaefer
  0 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Benjamin Herrenschmidt, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Michael Ellerman, Andrew Morton,
	Linus Torvalds, Mike Rapoport

From: Alexander Gordeev <agordeev@linux.ibm.com>

Unlike all other page-table abstractions, the pXd_addr_end() helpers
do not take into account the particular table entry in whose context
they are called. On architectures with dynamic page-table folding this
can leave them without information that is hard to obtain other than
from the table entry itself. That already led to a subtle memory
corruption issue on s390.
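
A toy userspace model of the point, not kernel code and not the real
s390 region/segment-table format -- the "entry" value, its low bit and
both step sizes are invented for illustration only:

/*
 * Illustrates that the correct boundary can depend on the entry when
 * the number of page-table levels is not fixed at compile time.
 * Everything here is made up; see the note above.
 */
#include <stdio.h>

#define FULL_STEP	(1ULL << 39)	/* hypothetical step of a full walk */
#define FOLDED_STEP	(1ULL << 53)	/* hypothetical step when folded */

static unsigned long long addr_end(unsigned long long step,
				   unsigned long long addr,
				   unsigned long long end)
{
	unsigned long long boundary = (addr + step) & ~(step - 1);

	return boundary - 1 < end - 1 ? boundary : end;
}

int main(void)
{
	unsigned long long addr = 0, end = 1ULL << 53;
	unsigned long long entry = 1;	/* pretend: low bit = "folded" */

	/* an address-only helper has to assume the compile-time step */
	printf("fixed step:  next = %#llx\n", addr_end(FULL_STEP, addr, end));

	/* an entry-aware helper can derive the real step from the entry */
	printf("entry-aware: next = %#llx\n",
	       addr_end(entry & 1 ? FOLDED_STEP : FULL_STEP, addr, end));
	return 0;
}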

By letting the pXd_addr_end() functions know about the page-table
entry, we allow archs not only to make extra checks, but also to apply
optimizations.
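
On s390, for example, the override added in the
arch/s390/include/asm/pgtable.h hunk below derives the boundary from
the region/segment-table entry value itself:

#define pgd_addr_end pgd_addr_end
static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
{
	return rste_addr_end_folded(pgd_val(pgd), addr, end);
}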

As a result of this change, the pXd_addr_end_folded() functions used
in the gup_fast traversal code become unnecessary and are replaced
with the universal pXd_addr_end() variants.
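
The gup_fast walkers already read each entry once with READ_ONCE();
that same snapshot is now handed to the range helper, as in the
mm/gup.c hunk below (huge-page and pte-range handling elided):

do {
	pmd_t pmd = READ_ONCE(*pmdp);

	next = pmd_addr_end(pmd, addr, end);
	if (!pmd_present(pmd))
		return 0;
	/* ... huge pmd and gup_pte_range() handling ... */
} while (pmdp++, addr = next, addr != end);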

The arch-specific updates not only add dereferencing of page-table
entry pointers, but also small changes to the code flow to make those
dereferences possible, at least for x86 and powerpc. The same applies
to arm64, but in a way that should not have any impact.
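
Where the entry was not yet available at the point the boundary was
computed, the loop body is reordered so that the offset lookup and the
READ_ONCE() come first; e.g. in the arm64 hotplug-unmap path below the
boundary is now computed as

	pmdp = pmd_offset(pudp, addr);
	pmd = READ_ONCE(*pmdp);
	next = pmd_addr_end(pmd, addr, end);

instead of computing next from addr alone before the entry is read.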

So, even though the dereferenced page-table entries are not used on
archs other than s390 and are optimized out by the compiler, there is
a small change in kernel size; this is what bloat-o-meter reports:

x86:
add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10)
Function                                     old     new   delta
vmemmap_populate                             587     592      +5
munlock_vma_pages_range                      556     561      +5
Total: Before=15534694, After=15534704, chg +0.00%

powerpc:
add/remove: 0/0 grow/shrink: 1/0 up/down: 4/0 (4)
Function                                     old     new   delta
.remove_pagetable                           1648    1652      +4
Total: Before=21478240, After=21478244, chg +0.00%

arm64:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=20240851, After=20240851, chg +0.00%

sparc:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=4907262, After=4907262, chg +0.00%

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/arm/include/asm/pgtable-2level.h    |  2 +-
 arch/arm/mm/idmap.c                      |  6 ++--
 arch/arm/mm/mmu.c                        |  8 ++---
 arch/arm64/kernel/hibernate.c            | 16 ++++++----
 arch/arm64/kvm/mmu.c                     | 16 +++++-----
 arch/arm64/mm/kasan_init.c               |  8 ++---
 arch/arm64/mm/mmu.c                      | 25 +++++++--------
 arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++---
 arch/powerpc/mm/hugetlbpage.c            |  6 ++--
 arch/s390/include/asm/pgtable.h          |  8 ++---
 arch/s390/mm/page-states.c               |  8 ++---
 arch/s390/mm/pageattr.c                  |  8 ++---
 arch/s390/mm/vmem.c                      |  8 ++---
 arch/sparc/mm/hugetlbpage.c              |  6 ++--
 arch/um/kernel/tlb.c                     |  8 ++---
 arch/x86/mm/init_64.c                    | 15 ++++-----
 arch/x86/mm/kasan_init_64.c              | 16 +++++-----
 include/asm-generic/pgtable-nop4d.h      |  2 +-
 include/asm-generic/pgtable-nopmd.h      |  2 +-
 include/asm-generic/pgtable-nopud.h      |  2 +-
 include/linux/pgtable.h                  | 26 ++++-----------
 mm/gup.c                                 |  8 ++---
 mm/ioremap.c                             |  8 ++---
 mm/kasan/init.c                          | 17 +++++-----
 mm/madvise.c                             |  4 +--
 mm/memory.c                              | 40 ++++++++++++------------
 mm/mlock.c                               | 18 ++++++++---
 mm/mprotect.c                            |  8 ++---
 mm/pagewalk.c                            |  8 ++---
 mm/swapfile.c                            |  8 ++---
 mm/vmalloc.c                             | 16 +++++-----
 31 files changed, 165 insertions(+), 173 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-2level.h b/arch/arm/include/asm/pgtable-2level.h
index 3502c2f746ca..5e6416b339f4 100644
--- a/arch/arm/include/asm/pgtable-2level.h
+++ b/arch/arm/include/asm/pgtable-2level.h
@@ -209,7 +209,7 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
 	} while (0)
 
 /* we don't need complex calculations here as the pmd is folded into the pgd */
-#define pmd_addr_end(addr,end) (end)
+#define pmd_addr_end(pmd,addr,end) (end)
 
 #define set_pte_ext(ptep,pte,ext) cpu_set_pte_ext(ptep,pte,ext)
 
diff --git a/arch/arm/mm/idmap.c b/arch/arm/mm/idmap.c
index 448e57c6f653..5437f943ca8b 100644
--- a/arch/arm/mm/idmap.c
+++ b/arch/arm/mm/idmap.c
@@ -46,7 +46,7 @@ static void idmap_add_pmd(pud_t *pud, unsigned long addr, unsigned long end,
 		pmd = pmd_offset(pud, addr);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		*pmd = __pmd((addr & PMD_MASK) | prot);
 		flush_pmd_entry(pmd);
 	} while (pmd++, addr = next, addr != end);
@@ -73,7 +73,7 @@ static void idmap_add_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		idmap_add_pmd(pud, addr, next, prot);
 	} while (pud++, addr = next, addr != end);
 }
@@ -95,7 +95,7 @@ static void identity_mapping_add(pgd_t *pgd, const char *text_start,
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		idmap_add_pud(pgd, addr, next, prot);
 	} while (pgd++, addr = next, addr != end);
 }
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 698cc740c6b8..4013746e4c75 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -777,7 +777,7 @@ static void __init alloc_init_pmd(pud_t *pud, unsigned long addr,
 		 * With LPAE, we must loop over to map
 		 * all the pmds for the given range.
 		 */
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Try a section mapping - addr, next and phys must all be
@@ -805,7 +805,7 @@ static void __init alloc_init_pud(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		alloc_init_pmd(pud, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (pud++, addr = next, addr != end);
@@ -820,7 +820,7 @@ static void __init alloc_init_p4d(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		alloc_init_pud(p4d, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (p4d++, addr = next, addr != end);
@@ -923,7 +923,7 @@ static void __init __create_mapping(struct mm_struct *mm, struct map_desc *md,
 	pgd = pgd_offset(mm, addr);
 	end = addr + length;
 	do {
-		unsigned long next = pgd_addr_end(addr, end);
+		unsigned long next = pgd_addr_end(*pgd, addr, end);
 
 		alloc_init_p4d(pgd, addr, next, phys, type, alloc, ng);
 
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 68e14152d6e9..7be8c9cdc5c8 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -412,7 +412,7 @@ static int copy_pmd(pud_t *dst_pudp, pud_t *src_pudp, unsigned long start,
 	do {
 		pmd_t pmd = READ_ONCE(*src_pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 		if (pmd_table(pmd)) {
@@ -447,7 +447,7 @@ static int copy_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp, unsigned long start,
 	do {
 		pud_t pud = READ_ONCE(*src_pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 		if (pud_table(pud)) {
@@ -473,8 +473,10 @@ static int copy_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp, unsigned long start,
 	dst_p4dp = p4d_offset(dst_pgdp, start);
 	src_p4dp = p4d_offset(src_pgdp, start);
 	do {
-		next = p4d_addr_end(addr, end);
-		if (p4d_none(READ_ONCE(*src_p4dp)))
+		p4d_t p4d = READ_ONCE(*src_p4dp);
+
+		next = p4d_addr_end(p4d, addr, end);
+		if (p4d_none(p4d))
 			continue;
 		if (copy_pud(dst_p4dp, src_p4dp, addr, next))
 			return -ENOMEM;
@@ -492,8 +494,10 @@ static int copy_page_tables(pgd_t *dst_pgdp, unsigned long start,
 
 	dst_pgdp = pgd_offset_pgd(dst_pgdp, start);
 	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none(READ_ONCE(*src_pgdp)))
+		pgd_t pgd = READ_ONCE(*src_pgdp);
+
+		next = pgd_addr_end(pgd, addr, end);
+		if (pgd_none(pgd))
 			continue;
 		if (copy_p4d(dst_pgdp, src_pgdp, addr, next))
 			return -ENOMEM;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ba00bcc0c884..8f470f93a8e9 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -547,7 +547,7 @@ static void unmap_hyp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
 
 	start_pmd = pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		/* Hyp doesn't use huge pmds */
 		if (!pmd_none(*pmd))
 			unmap_hyp_ptes(pmd, addr, next);
@@ -564,7 +564,7 @@ static void unmap_hyp_puds(p4d_t *p4d, phys_addr_t addr, phys_addr_t end)
 
 	start_pud = pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		/* Hyp doesn't use huge puds */
 		if (!pud_none(*pud))
 			unmap_hyp_pmds(pud, addr, next);
@@ -581,7 +581,7 @@ static void unmap_hyp_p4ds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
 
 	start_p4d = p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		/* Hyp doesn't use huge p4ds */
 		if (!p4d_none(*p4d))
 			unmap_hyp_puds(p4d, addr, next);
@@ -609,7 +609,7 @@ static void __unmap_hyp_range(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 	 */
 	pgd = pgdp + kvm_pgd_index(addr, ptrs_per_pgd);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_none(*pgd))
 			unmap_hyp_p4ds(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
@@ -712,7 +712,7 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
 			get_page(virt_to_page(pmd));
 		}
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		create_hyp_pte_mappings(pmd, addr, next, pfn, prot);
 		pfn += (next - addr) >> PAGE_SHIFT;
@@ -744,7 +744,7 @@ static int create_hyp_pud_mappings(p4d_t *p4d, unsigned long start,
 			get_page(virt_to_page(pud));
 		}
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		ret = create_hyp_pmd_mappings(pud, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -777,7 +777,7 @@ static int create_hyp_p4d_mappings(pgd_t *pgd, unsigned long start,
 			get_page(virt_to_page(p4d));
 		}
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		ret = create_hyp_pud_mappings(p4d, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -813,7 +813,7 @@ static int __create_hyp_mappings(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 			get_page(virt_to_page(pgd));
 		}
 
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = create_hyp_p4d_mappings(pgd, addr, next, pfn, prot);
 		if (err)
 			goto out;
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index b24e43d20667..8d1c811fd59e 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -120,7 +120,7 @@ static void __init kasan_pmd_populate(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp = kasan_pmd_offset(pudp, addr, node, early);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		kasan_pte_populate(pmdp, addr, next, node, early);
 	} while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)));
 }
@@ -132,7 +132,7 @@ static void __init kasan_pud_populate(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp = kasan_pud_offset(p4dp, addr, node, early);
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		kasan_pmd_populate(pudp, addr, next, node, early);
 	} while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)));
 }
@@ -144,7 +144,7 @@ static void __init kasan_p4d_populate(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp = p4d_offset(pgdp, addr);
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		kasan_pud_populate(p4dp, addr, next, node, early);
 	} while (p4dp++, addr = next, addr != end);
 }
@@ -157,7 +157,7 @@ static void __init kasan_pgd_populate(unsigned long addr, unsigned long end,
 
 	pgdp = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		kasan_p4d_populate(pgdp, addr, next, node, early);
 	} while (pgdp++, addr = next, addr != end);
 }
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 64211436629d..d679cf024bc8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -209,7 +209,7 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		pmd_t old_pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(old_pmd, addr, end);
 
 		/* try section mapping first */
 		if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
@@ -307,7 +307,7 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
 	do {
 		pud_t old_pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(old_pud, addr, end);
 
 		/*
 		 * For 4K granule only, attempt to put down a 1GB block
@@ -356,7 +356,7 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 	end = PAGE_ALIGN(virt + size);
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		alloc_init_pud(pgdp, addr, next, phys, prot, pgtable_alloc,
 			       flags);
 		phys += next - addr;
@@ -820,9 +820,9 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp, pmd;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -853,9 +853,9 @@ static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp, pud;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -886,9 +886,9 @@ static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -912,9 +912,9 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end,
 	WARN_ON(!free_mapped && altmap);
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -968,9 +968,9 @@ static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -1008,9 +1008,9 @@ static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -1048,9 +1048,9 @@ static void free_empty_p4d_table(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -1066,9 +1066,9 @@ static void free_empty_tables(unsigned long addr, unsigned long end,
 	pgd_t *pgdp, pgd;
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -1097,8 +1097,6 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	pmd_t *pmdp;
 
 	do {
-		next = pmd_addr_end(addr, end);
-
 		pgdp = vmemmap_pgd_populate(addr, node);
 		if (!pgdp)
 			return -ENOMEM;
@@ -1112,6 +1110,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 			return -ENOMEM;
 
 		pmdp = pmd_offset(pudp, addr);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_none(READ_ONCE(*pmdp))) {
 			void *p = NULL;
 
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index cc72666e891a..816e218df285 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -817,7 +817,7 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -847,7 +847,7 @@ static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -878,10 +878,9 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
 	spin_lock(&init_mm.page_table_lock);
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!p4d_present(*p4d))
 			continue;
 
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26292544630f..f0606d6774a4 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -352,7 +352,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 		unsigned long more;
 
 		pmd = pmd_offset(pud, addr);
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!is_hugepd(__hugepd(pmd_val(*pmd)))) {
 			if (pmd_none_or_clear_bad(pmd))
 				continue;
@@ -409,7 +409,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	do {
 		pud = pud_offset(p4d, addr);
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!is_hugepd(__hugepd(pud_val(*pud)))) {
 			if (pud_none_or_clear_bad(pud))
 				continue;
@@ -478,9 +478,9 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	 */
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset(tlb->mm, addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!is_hugepd(__hugepd(pgd_val(*pgd)))) {
 			if (p4d_none_or_clear_bad(p4d))
 				continue;
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 027206e4959d..6fb17ac413be 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -542,14 +542,14 @@ static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned lo
 	return (boundary - 1) < (end - 1) ? boundary : end;
 }
 
-#define pgd_addr_end_folded pgd_addr_end_folded
-static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
+#define pgd_addr_end pgd_addr_end
+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(pgd_val(pgd), addr, end);
 }
 
-#define p4d_addr_end_folded p4d_addr_end_folded
-static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
+#define p4d_addr_end p4d_addr_end
+static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(p4d_val(p4d), addr, end);
 }
diff --git a/arch/s390/mm/page-states.c b/arch/s390/mm/page-states.c
index 567c69f3069e..4aba634b4b26 100644
--- a/arch/s390/mm/page-states.c
+++ b/arch/s390/mm/page-states.c
@@ -109,7 +109,7 @@ static void mark_kernel_pmd(pud_t *pud, unsigned long addr, unsigned long end)
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || pmd_large(*pmd))
 			continue;
 		page = virt_to_page(pmd_val(*pmd));
@@ -126,7 +126,7 @@ static void mark_kernel_pud(p4d_t *p4d, unsigned long addr, unsigned long end)
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || pud_large(*pud))
 			continue;
 		if (!pud_folded(*pud)) {
@@ -147,7 +147,7 @@ static void mark_kernel_p4d(pgd_t *pgd, unsigned long addr, unsigned long end)
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none(*p4d))
 			continue;
 		if (!p4d_folded(*p4d)) {
@@ -169,7 +169,7 @@ static void mark_kernel_pgd(void)
 	addr = 0;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, MODULES_END);
+		next = pgd_addr_end(*pgd, addr, MODULES_END);
 		if (pgd_none(*pgd))
 			continue;
 		if (!pgd_folded(*pgd)) {
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index c5c52ec2b46f..b827d758a17a 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -162,7 +162,7 @@ static int walk_pmd_level(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		if (pmd_none(*pmdp))
 			return -EINVAL;
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_large(*pmdp)) {
 			if (addr & ~PMD_MASK || addr + PMD_SIZE > next) {
 				rc = split_pmd_page(pmdp, addr);
@@ -239,7 +239,7 @@ static int walk_pud_level(p4d_t *p4d, unsigned long addr, unsigned long end,
 	do {
 		if (pud_none(*pudp))
 			return -EINVAL;
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		if (pud_large(*pudp)) {
 			if (addr & ~PUD_MASK || addr + PUD_SIZE > next) {
 				rc = split_pud_page(pudp, addr);
@@ -269,7 +269,7 @@ static int walk_p4d_level(pgd_t *pgd, unsigned long addr, unsigned long end,
 	do {
 		if (p4d_none(*p4dp))
 			return -EINVAL;
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		rc = walk_pud_level(p4dp, addr, next, flags);
 		p4dp++;
 		addr = next;
@@ -296,7 +296,7 @@ static int change_page_attr(unsigned long addr, unsigned long end,
 	do {
 		if (pgd_none(*pgdp))
 			break;
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		rc = walk_p4d_level(pgdp, addr, next, flags);
 		if (rc)
 			break;
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index b239f2ba93b0..672bc89f13e7 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -219,7 +219,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!add) {
 			if (pmd_none(*pmd))
 				continue;
@@ -320,7 +320,7 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 		prot &= ~_REGION_ENTRY_NOEXEC;
 	pud = pud_offset(p4d, addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!add) {
 			if (pud_none(*pud))
 				continue;
@@ -394,7 +394,7 @@ static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!add) {
 			if (p4d_none(*p4d))
 				continue;
@@ -449,8 +449,8 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
 	if (WARN_ON_ONCE(!PAGE_ALIGNED(start | end)))
 		return -EINVAL;
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (!add) {
 			if (pgd_none(*pgd))
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ec423b5f17dd..341c2ff8d31a 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -428,7 +428,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd))
 			continue;
 		if (is_hugetlb_pmd(*pmd))
@@ -465,7 +465,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		if (is_hugetlb_pud(*pud))
@@ -519,7 +519,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	pgd = pgd_offset(tlb->mm, addr);
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		hugetlb_free_pud_range(tlb, p4d, addr, next, floor, ceiling);
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index 61776790cd67..7b4fe31c8df2 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -264,7 +264,7 @@ static inline int update_pmd_range(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_present(*pmd)) {
 			if (hvc->force || pmd_newpage(*pmd)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -286,7 +286,7 @@ static inline int update_pud_range(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_present(*pud)) {
 			if (hvc->force || pud_newpage(*pud)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -308,7 +308,7 @@ static inline int update_p4d_range(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!p4d_present(*p4d)) {
 			if (hvc->force || p4d_newpage(*p4d)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -331,7 +331,7 @@ void fix_range_common(struct mm_struct *mm, unsigned long start_addr,
 	hvc = INIT_HVC(mm, force, userspace);
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end_addr);
+		next = pgd_addr_end(*pgd, addr, end_addr);
 		if (!pgd_present(*pgd)) {
 			if (force || pgd_newpage(*pgd)) {
 				ret = add_munmap(addr, next - addr, &hvc);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a4ac13cc3fdc..e2cb9316a104 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1043,7 +1043,7 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -1099,7 +1099,7 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -1153,7 +1153,7 @@ remove_p4d_table(p4d_t *p4d_start, unsigned long addr, unsigned long end,
 
 	p4d = p4d_start + p4d_index(addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -1186,9 +1186,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct,
 	p4d_t *p4d;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
@@ -1500,8 +1499,6 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 	pmd_t *pmd;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pmd_addr_end(addr, end);
-
 		pgd = vmemmap_pgd_populate(addr, node);
 		if (!pgd)
 			return -ENOMEM;
@@ -1515,6 +1512,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 			return -ENOMEM;
 
 		pmd = pmd_offset(pud, addr);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd)) {
 			void *p;
 
@@ -1623,9 +1621,8 @@ void register_page_bootmem_memmap(unsigned long section_nr,
 			get_page_bootmem(section_nr, pte_page(*pte),
 					 SECTION_INFO);
 		} else {
-			next = pmd_addr_end(addr, end);
-
 			pmd = pmd_offset(pud, addr);
+			next = pmd_addr_end(*pmd, addr, end);
 			if (pmd_none(*pmd))
 				continue;
 
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 1a50434c8a4d..2c105b5154ba 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -96,7 +96,7 @@ static void __init kasan_populate_pud(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_large(*pmd))
 			kasan_populate_pmd(pmd, addr, next, nid);
 	} while (pmd++, addr = next, addr != end);
@@ -116,7 +116,7 @@ static void __init kasan_populate_p4d(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_large(*pud))
 			kasan_populate_pud(pud, addr, next, nid);
 	} while (pud++, addr = next, addr != end);
@@ -136,7 +136,7 @@ static void __init kasan_populate_pgd(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		kasan_populate_p4d(p4d, addr, next, nid);
 	} while (p4d++, addr = next, addr != end);
 }
@@ -151,7 +151,7 @@ static void __init kasan_populate_shadow(unsigned long addr, unsigned long end,
 	end = round_up(end, PAGE_SIZE);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_populate_pgd(pgd, addr, next, nid);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -219,7 +219,7 @@ static void __init kasan_early_p4d_populate(pgd_t *pgd,
 
 	p4d = early_p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_none(*p4d))
 			continue;
@@ -239,7 +239,7 @@ static void __init kasan_map_early_shadow(pgd_t *pgd)
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_early_p4d_populate(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -254,7 +254,7 @@ static void __init kasan_shallow_populate_p4ds(pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (p4d_none(*p4d)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
@@ -272,7 +272,7 @@ static void __init kasan_shallow_populate_pgds(void *start, void *end)
 	addr = (unsigned long)start;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, (unsigned long)end);
+		next = pgd_addr_end(*pgd, addr, (unsigned long)end);
 
 		if (pgd_none(*pgd)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
diff --git a/include/asm-generic/pgtable-nop4d.h b/include/asm-generic/pgtable-nop4d.h
index ce2cbb3c380f..156b42e51424 100644
--- a/include/asm-generic/pgtable-nop4d.h
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -53,7 +53,7 @@ static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
 #define p4d_free_tlb(tlb, x, a)			do { } while (0)
 
 #undef  p4d_addr_end
-#define p4d_addr_end(addr, end)			(end)
+#define p4d_addr_end(p4d, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOP4D_H */
diff --git a/include/asm-generic/pgtable-nopmd.h b/include/asm-generic/pgtable-nopmd.h
index 3e13acd019ae..e988384de1c7 100644
--- a/include/asm-generic/pgtable-nopmd.h
+++ b/include/asm-generic/pgtable-nopmd.h
@@ -64,7 +64,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 #define pmd_free_tlb(tlb, x, a)		do { } while (0)
 
 #undef  pmd_addr_end
-#define pmd_addr_end(addr, end)			(end)
+#define pmd_addr_end(pmd, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index a9d751fbda9e..57a28bade9f9 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -60,7 +60,7 @@ static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
 #define pud_free_tlb(tlb, x, a)		        do { } while (0)
 
 #undef  pud_addr_end
-#define pud_addr_end(addr, end)			(end)
+#define pud_addr_end(pud, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOPUD_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 981c4c2a31fe..67ebc22cf83d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -655,48 +655,34 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
  */
 
-#define pgd_addr_end(addr, end)						\
+#ifndef pgd_addr_end
+#define pgd_addr_end(pgd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
+#endif
 
 #ifndef p4d_addr_end
-#define p4d_addr_end(addr, end)						\
+#define p4d_addr_end(p4d, addr, end)					\
 ({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pud_addr_end
-#define pud_addr_end(addr, end)						\
+#define pud_addr_end(pud, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pmd_addr_end
-#define pmd_addr_end(addr, end)						\
+#define pmd_addr_end(pmd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
-#ifndef pgd_addr_end_folded
-#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
-#endif
-
-#ifndef p4d_addr_end_folded
-#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
-#endif
-
-#ifndef pud_addr_end_folded
-#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
-#endif
-
-#ifndef pmd_addr_end_folded
-#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
-#endif
-
 /*
  * When walking page tables, we usually want to skip any p?d_none entries;
  * and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/gup.c b/mm/gup.c
index ba4aace5d0f4..7826876ae7e0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end_folded(pmd, addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (!pmd_present(pmd))
 			return 0;
 
@@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end_folded(pud, addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
@@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
-		next = p4d_addr_end_folded(p4d, addr, end);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			return 0;
 		BUILD_BUG_ON(p4d_huge(p4d));
@@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		pgd_t pgd = READ_ONCE(*pgdp);
 
-		next = pgd_addr_end_folded(pgd, addr, end);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			return;
 		if (unlikely(pgd_huge(pgd))) {
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..400fa119c09d 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -114,7 +114,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PMD_MODIFIED;
@@ -160,7 +160,7 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PUD_MODIFIED;
@@ -206,7 +206,7 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_P4D_MODIFIED;
@@ -234,7 +234,7 @@ int ioremap_page_range(unsigned long addr,
 	start = addr;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot,
 					&mask);
 		if (err)
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index fe6be0be1f76..829627a92763 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -117,7 +117,7 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (IS_ALIGNED(addr, PMD_SIZE) && end - addr >= PMD_SIZE) {
 			pmd_populate_kernel(&init_mm, pmd,
@@ -150,7 +150,7 @@ static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (IS_ALIGNED(addr, PUD_SIZE) && end - addr >= PUD_SIZE) {
 			pmd_t *pmd;
 
@@ -187,7 +187,7 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (IS_ALIGNED(addr, P4D_SIZE) && end - addr >= P4D_SIZE) {
 			pud_t *pud;
 			pmd_t *pmd;
@@ -236,7 +236,7 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
 	unsigned long next;
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (IS_ALIGNED(addr, PGDIR_SIZE) && end - addr >= PGDIR_SIZE) {
 			p4d_t *p4d;
@@ -370,7 +370,7 @@ static void kasan_remove_pmd_table(pmd_t *pmd, unsigned long addr,
 	for (; addr < end; addr = next, pmd++) {
 		pte_t *pte;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -395,7 +395,7 @@ static void kasan_remove_pud_table(pud_t *pud, unsigned long addr,
 	for (; addr < end; addr = next, pud++) {
 		pmd_t *pmd, *pmd_base;
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -421,7 +421,7 @@ static void kasan_remove_p4d_table(p4d_t *p4d, unsigned long addr,
 	for (; addr < end; addr = next, p4d++) {
 		pud_t *pud;
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -454,9 +454,8 @@ void kasan_remove_zero_shadow(void *start, unsigned long size)
 	for (; addr < end; addr = next) {
 		p4d_t *p4d;
 
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
diff --git a/mm/madvise.c b/mm/madvise.c
index e32e7efbba0f..acfb3441d97e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -326,7 +326,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (pmd_trans_huge(*pmd)) {
 		pmd_t orig_pmd;
-		unsigned long next = pmd_addr_end(addr, end);
+		unsigned long next = pmd_addr_end(*pmd, addr, end);
 
 		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 		ptl = pmd_trans_huge_lock(pmd, vma);
@@ -587,7 +587,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	int nr_swap = 0;
 	unsigned long next;
 
-	next = pmd_addr_end(addr, end);
+	next = pmd_addr_end(*pmd, addr, end);
 	if (pmd_trans_huge(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
diff --git a/mm/memory.c b/mm/memory.c
index fb5463153351..f95424946b0d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -233,7 +233,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		free_pte_range(tlb, pmd, addr);
@@ -267,7 +267,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		free_pmd_range(tlb, pud, addr, next, floor, ceiling);
@@ -301,7 +301,7 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
 	start = addr;
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		free_pud_range(tlb, p4d, addr, next, floor, ceiling);
@@ -381,7 +381,7 @@ void free_pgd_range(struct mmu_gather *tlb,
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	pgd = pgd_offset(tlb->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		free_p4d_range(tlb, pgd, addr, next, floor, ceiling);
@@ -887,7 +887,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*src_pmd, addr, end);
 		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
 			|| pmd_devmap(*src_pmd)) {
 			int err;
@@ -921,7 +921,7 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pud = pud_offset(src_p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*src_pud, addr, end);
 		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
 			int err;
 
@@ -955,7 +955,7 @@ static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_p4d = p4d_offset(src_pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*src_p4d, addr, end);
 		if (p4d_none_or_clear_bad(src_p4d))
 			continue;
 		if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
@@ -1017,7 +1017,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*src_pgd, addr, end);
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
@@ -1177,7 +1177,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
@@ -1212,7 +1212,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
 			if (next - addr != HPAGE_PUD_SIZE) {
 				mmap_assert_locked(tlb->mm);
@@ -1241,7 +1241,7 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		next = zap_pud_range(tlb, vma, p4d, addr, next, details);
@@ -1262,7 +1262,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
@@ -2030,7 +2030,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 		return -ENOMEM;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		err = remap_pte_range(mm, pmd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2052,7 +2052,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		err = remap_pmd_range(mm, pud, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2074,7 +2074,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		err = remap_pud_range(mm, p4d, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2143,7 +2143,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = remap_p4d_range(mm, pgd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2266,7 +2266,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 		pmd = pmd_offset(pud, addr);
 	}
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (create || !pmd_none_or_clear_bad(pmd)) {
 			err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
 						 create, mask);
@@ -2294,7 +2294,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 		pud = pud_offset(p4d, addr);
 	}
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (create || !pud_none_or_clear_bad(pud)) {
 			err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
 						 create, mask);
@@ -2322,7 +2322,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 		p4d = p4d_offset(pgd, addr);
 	}
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (create || !p4d_none_or_clear_bad(p4d)) {
 			err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
 						 create, mask);
@@ -2348,7 +2348,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!create && pgd_none_or_clear_bad(pgd))
 			continue;
 		err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create, &mask);
diff --git a/mm/mlock.c b/mm/mlock.c
index 93ca2bf30b4f..5898e8fe2288 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -374,8 +374,12 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 			struct vm_area_struct *vma, struct zone *zone,
 			unsigned long start, unsigned long end)
 {
-	pte_t *pte;
 	spinlock_t *ptl;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
 
 	/*
 	 * Initialize pte walk starting at the already pinned page where we
@@ -384,10 +388,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 	 */
 	pte = get_locked_pte(vma->vm_mm, start,	&ptl);
 	/* Make sure we do not cross the page table boundary */
-	end = pgd_addr_end(start, end);
-	end = p4d_addr_end(start, end);
-	end = pud_addr_end(start, end);
-	end = pmd_addr_end(start, end);
+	pgd = pgd_offset(vma->vm_mm, start);
+	end = pgd_addr_end(*pgd, start, end);
+	p4d = p4d_offset(pgd, start);
+	end = p4d_addr_end(*p4d, start, end);
+	pud = pud_offset(p4d, start);
+	end = pud_addr_end(*pud, start, end);
+	pmd = pmd_offset(pud, start);
+	end = pmd_addr_end(*pmd, start, end);
 
 	/* The page next to the pinned page is the first we will try to get */
 	start += PAGE_SIZE;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..278f2dbd1f20 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -225,7 +225,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	do {
 		unsigned long this_pages;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
@@ -291,7 +291,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
@@ -311,7 +311,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
@@ -336,7 +336,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	flush_cache_range(vma, addr, end);
 	inc_tlb_flush_pending(mm);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..a5b9f61b5d45 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -70,7 +70,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	pmd = pmd_offset(pud, addr);
 	do {
 again:
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -128,7 +128,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 	pud = pud_offset(p4d, addr);
 	do {
  again:
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -176,7 +176,7 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -211,7 +211,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
 	else
 		pgd = pgd_offset(walk->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, 0, walk);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 20012c0c0252..b1dd815aee6b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2018,7 +2018,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	pmd = pmd_offset(pud, addr);
 	do {
 		cond_resched();
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type,
@@ -2040,7 +2040,7 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		ret = unuse_pmd_range(vma, pud, addr, next, type,
@@ -2062,7 +2062,7 @@ static inline int unuse_p4d_range(struct vm_area_struct *vma, pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		ret = unuse_pud_range(vma, p4d, addr, next, type,
@@ -2085,7 +2085,7 @@ static int unuse_vma(struct vm_area_struct *vma, unsigned int type,
 
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		ret = unuse_p4d_range(vma, pgd, addr, next, type,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index be4724b916b3..09ff0d5ecbc1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -93,7 +93,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		cleared = pmd_clear_huge(pmd);
 		if (cleared || pmd_bad(*pmd))
@@ -118,7 +118,7 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		cleared = pud_clear_huge(pud);
 		if (cleared || pud_bad(*pud))
@@ -141,7 +141,7 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		cleared = p4d_clear_huge(p4d);
 		if (cleared || p4d_bad(*p4d))
@@ -179,7 +179,7 @@ void unmap_kernel_range_noflush(unsigned long start, unsigned long size)
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		if (pgd_none_or_clear_bad(pgd))
@@ -230,7 +230,7 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
@@ -248,7 +248,7 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
@@ -266,7 +266,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
@@ -305,7 +305,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 313+ messages in thread

* [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
@ 2020-09-07 18:00   ` Gerald Schaefer
  0 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Benjamin Herrenschmidt, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Michael Ellerman, Andrew Morton,
	Linus Torvalds, Mike Rapoport

From: Alexander Gordeev <agordeev@linux.ibm.com>

Unlike all other page-table abstractions, the pXd_addr_end() functions
do not take into account the particular table entry in whose context
they are called. On architectures with dynamic page-table folding this
can lead to a lack of necessary information that is difficult to
obtain other than from the table entry itself. That already led to
a subtle memory corruption issue on s390.

By letting the pXd_addr_end() functions know about the page-table
entry, we allow architectures not only to make extra checks, but also
to apply optimizations.

As a result of this change, the pXd_addr_end_folded() functions used
in the gup_fast traversal code become unnecessary and are replaced
with the universal pXd_addr_end() variants.
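
To illustrate the intended calling convention (a minimal sketch, not
part of the patch; the walker name is made up and the pattern simply
mirrors the mm/gup.c hunk further down), a lockless walker reads the
entry once and feeds that same snapshot to both the boundary
calculation and all subsequent checks:

	static int lockless_pmd_walk(pud_t pud, unsigned long addr,
				     unsigned long end)
	{
		unsigned long next;
		pmd_t *pmdp = pmd_offset(&pud, addr);

		do {
			pmd_t pmd = READ_ONCE(*pmdp);	/* read the entry once */

			/* entry-aware boundary, also correct with folded levels */
			next = pmd_addr_end(pmd, addr, end);
			if (pmd_none(pmd))
				return 0;
			/* ... act only on the snapshot, do not re-read *pmdp ... */
		} while (pmdp++, addr = next, addr != end);

		return 1;
	}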

The arch-specific updates not only add dereferencing of the page-table
entry pointers, but also small changes to the code flow that make those
dereferences possible, at least for x86 and powerpc. The same goes for
arm64, but in a way that should not have any impact.
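
In most of those places the change is just a reordering (sketch only,
condensed from the x86 remove_pagetable() hunk below): the boundary
calculation moves after the pXd_offset() lookup so that the entry
pointer can be dereferenced:

	/* before */
	next = pgd_addr_end(addr, end);
	pgd = pgd_offset_k(addr);

	/* after */
	pgd = pgd_offset_k(addr);
	next = pgd_addr_end(*pgd, addr, end);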

So, even though the dereferenced page-table entries are not used on
architectures other than s390 and are optimized out by the compiler,
there is a small change in kernel size. This is what bloat-o-meter
reports:

x86:
add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10)
Function                                     old     new   delta
vmemmap_populate                             587     592      +5
munlock_vma_pages_range                      556     561      +5
Total: Before=15534694, After=15534704, chg +0.00%

powerpc:
add/remove: 0/0 grow/shrink: 1/0 up/down: 4/0 (4)
Function                                     old     new   delta
.remove_pagetable                           1648    1652      +4
Total: Before=21478240, After=21478244, chg +0.00%

arm64:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=20240851, After=20240851, chg +0.00%

sparc:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function                                     old     new   delta
Total: Before=4907262, After=4907262, chg +0.00%

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/arm/include/asm/pgtable-2level.h    |  2 +-
 arch/arm/mm/idmap.c                      |  6 ++--
 arch/arm/mm/mmu.c                        |  8 ++---
 arch/arm64/kernel/hibernate.c            | 16 ++++++----
 arch/arm64/kvm/mmu.c                     | 16 +++++-----
 arch/arm64/mm/kasan_init.c               |  8 ++---
 arch/arm64/mm/mmu.c                      | 25 +++++++--------
 arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++---
 arch/powerpc/mm/hugetlbpage.c            |  6 ++--
 arch/s390/include/asm/pgtable.h          |  8 ++---
 arch/s390/mm/page-states.c               |  8 ++---
 arch/s390/mm/pageattr.c                  |  8 ++---
 arch/s390/mm/vmem.c                      |  8 ++---
 arch/sparc/mm/hugetlbpage.c              |  6 ++--
 arch/um/kernel/tlb.c                     |  8 ++---
 arch/x86/mm/init_64.c                    | 15 ++++-----
 arch/x86/mm/kasan_init_64.c              | 16 +++++-----
 include/asm-generic/pgtable-nop4d.h      |  2 +-
 include/asm-generic/pgtable-nopmd.h      |  2 +-
 include/asm-generic/pgtable-nopud.h      |  2 +-
 include/linux/pgtable.h                  | 26 ++++-----------
 mm/gup.c                                 |  8 ++---
 mm/ioremap.c                             |  8 ++---
 mm/kasan/init.c                          | 17 +++++-----
 mm/madvise.c                             |  4 +--
 mm/memory.c                              | 40 ++++++++++++------------
 mm/mlock.c                               | 18 ++++++++---
 mm/mprotect.c                            |  8 ++---
 mm/pagewalk.c                            |  8 ++---
 mm/swapfile.c                            |  8 ++---
 mm/vmalloc.c                             | 16 +++++-----
 31 files changed, 165 insertions(+), 173 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-2level.h b/arch/arm/include/asm/pgtable-2level.h
index 3502c2f746ca..5e6416b339f4 100644
--- a/arch/arm/include/asm/pgtable-2level.h
+++ b/arch/arm/include/asm/pgtable-2level.h
@@ -209,7 +209,7 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
 	} while (0)
 
 /* we don't need complex calculations here as the pmd is folded into the pgd */
-#define pmd_addr_end(addr,end) (end)
+#define pmd_addr_end(pmd,addr,end) (end)
 
 #define set_pte_ext(ptep,pte,ext) cpu_set_pte_ext(ptep,pte,ext)
 
diff --git a/arch/arm/mm/idmap.c b/arch/arm/mm/idmap.c
index 448e57c6f653..5437f943ca8b 100644
--- a/arch/arm/mm/idmap.c
+++ b/arch/arm/mm/idmap.c
@@ -46,7 +46,7 @@ static void idmap_add_pmd(pud_t *pud, unsigned long addr, unsigned long end,
 		pmd = pmd_offset(pud, addr);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		*pmd = __pmd((addr & PMD_MASK) | prot);
 		flush_pmd_entry(pmd);
 	} while (pmd++, addr = next, addr != end);
@@ -73,7 +73,7 @@ static void idmap_add_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		idmap_add_pmd(pud, addr, next, prot);
 	} while (pud++, addr = next, addr != end);
 }
@@ -95,7 +95,7 @@ static void identity_mapping_add(pgd_t *pgd, const char *text_start,
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		idmap_add_pud(pgd, addr, next, prot);
 	} while (pgd++, addr = next, addr != end);
 }
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 698cc740c6b8..4013746e4c75 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -777,7 +777,7 @@ static void __init alloc_init_pmd(pud_t *pud, unsigned long addr,
 		 * With LPAE, we must loop over to map
 		 * all the pmds for the given range.
 		 */
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Try a section mapping - addr, next and phys must all be
@@ -805,7 +805,7 @@ static void __init alloc_init_pud(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		alloc_init_pmd(pud, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (pud++, addr = next, addr != end);
@@ -820,7 +820,7 @@ static void __init alloc_init_p4d(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		alloc_init_pud(p4d, addr, next, phys, type, alloc, ng);
 		phys += next - addr;
 	} while (p4d++, addr = next, addr != end);
@@ -923,7 +923,7 @@ static void __init __create_mapping(struct mm_struct *mm, struct map_desc *md,
 	pgd = pgd_offset(mm, addr);
 	end = addr + length;
 	do {
-		unsigned long next = pgd_addr_end(addr, end);
+		unsigned long next = pgd_addr_end(*pgd, addr, end);
 
 		alloc_init_p4d(pgd, addr, next, phys, type, alloc, ng);
 
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 68e14152d6e9..7be8c9cdc5c8 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -412,7 +412,7 @@ static int copy_pmd(pud_t *dst_pudp, pud_t *src_pudp, unsigned long start,
 	do {
 		pmd_t pmd = READ_ONCE(*src_pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 		if (pmd_table(pmd)) {
@@ -447,7 +447,7 @@ static int copy_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp, unsigned long start,
 	do {
 		pud_t pud = READ_ONCE(*src_pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 		if (pud_table(pud)) {
@@ -473,8 +473,10 @@ static int copy_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp, unsigned long start,
 	dst_p4dp = p4d_offset(dst_pgdp, start);
 	src_p4dp = p4d_offset(src_pgdp, start);
 	do {
-		next = p4d_addr_end(addr, end);
-		if (p4d_none(READ_ONCE(*src_p4dp)))
+		p4d_t p4d = READ_ONCE(*src_p4dp);
+
+		next = p4d_addr_end(p4d, addr, end);
+		if (p4d_none(p4d))
 			continue;
 		if (copy_pud(dst_p4dp, src_p4dp, addr, next))
 			return -ENOMEM;
@@ -492,8 +494,10 @@ static int copy_page_tables(pgd_t *dst_pgdp, unsigned long start,
 
 	dst_pgdp = pgd_offset_pgd(dst_pgdp, start);
 	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none(READ_ONCE(*src_pgdp)))
+		pgd_t pgd = READ_ONCE(*src_pgdp);
+
+		next = pgd_addr_end(pgd, addr, end);
+		if (pgd_none(pgd))
 			continue;
 		if (copy_p4d(dst_pgdp, src_pgdp, addr, next))
 			return -ENOMEM;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ba00bcc0c884..8f470f93a8e9 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -547,7 +547,7 @@ static void unmap_hyp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
 
 	start_pmd = pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		/* Hyp doesn't use huge pmds */
 		if (!pmd_none(*pmd))
 			unmap_hyp_ptes(pmd, addr, next);
@@ -564,7 +564,7 @@ static void unmap_hyp_puds(p4d_t *p4d, phys_addr_t addr, phys_addr_t end)
 
 	start_pud = pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		/* Hyp doesn't use huge puds */
 		if (!pud_none(*pud))
 			unmap_hyp_pmds(pud, addr, next);
@@ -581,7 +581,7 @@ static void unmap_hyp_p4ds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
 
 	start_p4d = p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		/* Hyp doesn't use huge p4ds */
 		if (!p4d_none(*p4d))
 			unmap_hyp_puds(p4d, addr, next);
@@ -609,7 +609,7 @@ static void __unmap_hyp_range(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 	 */
 	pgd = pgdp + kvm_pgd_index(addr, ptrs_per_pgd);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_none(*pgd))
 			unmap_hyp_p4ds(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
@@ -712,7 +712,7 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
 			get_page(virt_to_page(pmd));
 		}
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		create_hyp_pte_mappings(pmd, addr, next, pfn, prot);
 		pfn += (next - addr) >> PAGE_SHIFT;
@@ -744,7 +744,7 @@ static int create_hyp_pud_mappings(p4d_t *p4d, unsigned long start,
 			get_page(virt_to_page(pud));
 		}
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		ret = create_hyp_pmd_mappings(pud, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -777,7 +777,7 @@ static int create_hyp_p4d_mappings(pgd_t *pgd, unsigned long start,
 			get_page(virt_to_page(p4d));
 		}
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		ret = create_hyp_pud_mappings(p4d, addr, next, pfn, prot);
 		if (ret)
 			return ret;
@@ -813,7 +813,7 @@ static int __create_hyp_mappings(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 			get_page(virt_to_page(pgd));
 		}
 
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = create_hyp_p4d_mappings(pgd, addr, next, pfn, prot);
 		if (err)
 			goto out;
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index b24e43d20667..8d1c811fd59e 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -120,7 +120,7 @@ static void __init kasan_pmd_populate(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp = kasan_pmd_offset(pudp, addr, node, early);
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		kasan_pte_populate(pmdp, addr, next, node, early);
 	} while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)));
 }
@@ -132,7 +132,7 @@ static void __init kasan_pud_populate(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp = kasan_pud_offset(p4dp, addr, node, early);
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		kasan_pmd_populate(pudp, addr, next, node, early);
 	} while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)));
 }
@@ -144,7 +144,7 @@ static void __init kasan_p4d_populate(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp = p4d_offset(pgdp, addr);
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		kasan_pud_populate(p4dp, addr, next, node, early);
 	} while (p4dp++, addr = next, addr != end);
 }
@@ -157,7 +157,7 @@ static void __init kasan_pgd_populate(unsigned long addr, unsigned long end,
 
 	pgdp = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		kasan_p4d_populate(pgdp, addr, next, node, early);
 	} while (pgdp++, addr = next, addr != end);
 }
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 64211436629d..d679cf024bc8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -209,7 +209,7 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		pmd_t old_pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(old_pmd, addr, end);
 
 		/* try section mapping first */
 		if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
@@ -307,7 +307,7 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
 	do {
 		pud_t old_pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(old_pud, addr, end);
 
 		/*
 		 * For 4K granule only, attempt to put down a 1GB block
@@ -356,7 +356,7 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 	end = PAGE_ALIGN(virt + size);
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		alloc_init_pud(pgdp, addr, next, phys, prot, pgtable_alloc,
 			       flags);
 		phys += next - addr;
@@ -820,9 +820,9 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 	pmd_t *pmdp, pmd;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -853,9 +853,9 @@ static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
 	pud_t *pudp, pud;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -886,9 +886,9 @@ static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -912,9 +912,9 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end,
 	WARN_ON(!free_mapped && altmap);
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -968,9 +968,9 @@ static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pmd_addr_end(addr, end);
 		pmdp = pmd_offset(pudp, addr);
 		pmd = READ_ONCE(*pmdp);
+		next = pmd_addr_end(pmd, addr, end);
 		if (pmd_none(pmd))
 			continue;
 
@@ -1008,9 +1008,9 @@ static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
 	unsigned long i, next, start = addr;
 
 	do {
-		next = pud_addr_end(addr, end);
 		pudp = pud_offset(p4dp, addr);
 		pud = READ_ONCE(*pudp);
+		next = pud_addr_end(pud, addr, end);
 		if (pud_none(pud))
 			continue;
 
@@ -1048,9 +1048,9 @@ static void free_empty_p4d_table(pgd_t *pgdp, unsigned long addr,
 	p4d_t *p4dp, p4d;
 
 	do {
-		next = p4d_addr_end(addr, end);
 		p4dp = p4d_offset(pgdp, addr);
 		p4d = READ_ONCE(*p4dp);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			continue;
 
@@ -1066,9 +1066,9 @@ static void free_empty_tables(unsigned long addr, unsigned long end,
 	pgd_t *pgdp, pgd;
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset_k(addr);
 		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			continue;
 
@@ -1097,8 +1097,6 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	pmd_t *pmdp;
 
 	do {
-		next = pmd_addr_end(addr, end);
-
 		pgdp = vmemmap_pgd_populate(addr, node);
 		if (!pgdp)
 			return -ENOMEM;
@@ -1112,6 +1110,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 			return -ENOMEM;
 
 		pmdp = pmd_offset(pudp, addr);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_none(READ_ONCE(*pmdp))) {
 			void *p = NULL;
 
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index cc72666e891a..816e218df285 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -817,7 +817,7 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -847,7 +847,7 @@ static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -878,10 +878,9 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
 	spin_lock(&init_mm.page_table_lock);
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!p4d_present(*p4d))
 			continue;
 
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26292544630f..f0606d6774a4 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -352,7 +352,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 		unsigned long more;
 
 		pmd = pmd_offset(pud, addr);
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!is_hugepd(__hugepd(pmd_val(*pmd)))) {
 			if (pmd_none_or_clear_bad(pmd))
 				continue;
@@ -409,7 +409,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	do {
 		pud = pud_offset(p4d, addr);
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!is_hugepd(__hugepd(pud_val(*pud)))) {
 			if (pud_none_or_clear_bad(pud))
 				continue;
@@ -478,9 +478,9 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	 */
 
 	do {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset(tlb->mm, addr);
 		p4d = p4d_offset(pgd, addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!is_hugepd(__hugepd(pgd_val(*pgd)))) {
 			if (p4d_none_or_clear_bad(p4d))
 				continue;
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 027206e4959d..6fb17ac413be 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -542,14 +542,14 @@ static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned lo
 	return (boundary - 1) < (end - 1) ? boundary : end;
 }
 
-#define pgd_addr_end_folded pgd_addr_end_folded
-static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
+#define pgd_addr_end pgd_addr_end
+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(pgd_val(pgd), addr, end);
 }
 
-#define p4d_addr_end_folded p4d_addr_end_folded
-static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
+#define p4d_addr_end p4d_addr_end
+static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
 {
 	return rste_addr_end_folded(p4d_val(p4d), addr, end);
 }
diff --git a/arch/s390/mm/page-states.c b/arch/s390/mm/page-states.c
index 567c69f3069e..4aba634b4b26 100644
--- a/arch/s390/mm/page-states.c
+++ b/arch/s390/mm/page-states.c
@@ -109,7 +109,7 @@ static void mark_kernel_pmd(pud_t *pud, unsigned long addr, unsigned long end)
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || pmd_large(*pmd))
 			continue;
 		page = virt_to_page(pmd_val(*pmd));
@@ -126,7 +126,7 @@ static void mark_kernel_pud(p4d_t *p4d, unsigned long addr, unsigned long end)
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || pud_large(*pud))
 			continue;
 		if (!pud_folded(*pud)) {
@@ -147,7 +147,7 @@ static void mark_kernel_p4d(pgd_t *pgd, unsigned long addr, unsigned long end)
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none(*p4d))
 			continue;
 		if (!p4d_folded(*p4d)) {
@@ -169,7 +169,7 @@ static void mark_kernel_pgd(void)
 	addr = 0;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, MODULES_END);
+		next = pgd_addr_end(*pgd, addr, MODULES_END);
 		if (pgd_none(*pgd))
 			continue;
 		if (!pgd_folded(*pgd)) {
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index c5c52ec2b46f..b827d758a17a 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -162,7 +162,7 @@ static int walk_pmd_level(pud_t *pudp, unsigned long addr, unsigned long end,
 	do {
 		if (pmd_none(*pmdp))
 			return -EINVAL;
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmdp, addr, end);
 		if (pmd_large(*pmdp)) {
 			if (addr & ~PMD_MASK || addr + PMD_SIZE > next) {
 				rc = split_pmd_page(pmdp, addr);
@@ -239,7 +239,7 @@ static int walk_pud_level(p4d_t *p4d, unsigned long addr, unsigned long end,
 	do {
 		if (pud_none(*pudp))
 			return -EINVAL;
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pudp, addr, end);
 		if (pud_large(*pudp)) {
 			if (addr & ~PUD_MASK || addr + PUD_SIZE > next) {
 				rc = split_pud_page(pudp, addr);
@@ -269,7 +269,7 @@ static int walk_p4d_level(pgd_t *pgd, unsigned long addr, unsigned long end,
 	do {
 		if (p4d_none(*p4dp))
 			return -EINVAL;
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4dp, addr, end);
 		rc = walk_pud_level(p4dp, addr, next, flags);
 		p4dp++;
 		addr = next;
@@ -296,7 +296,7 @@ static int change_page_attr(unsigned long addr, unsigned long end,
 	do {
 		if (pgd_none(*pgdp))
 			break;
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgdp, addr, end);
 		rc = walk_p4d_level(pgdp, addr, next, flags);
 		if (rc)
 			break;
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index b239f2ba93b0..672bc89f13e7 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -219,7 +219,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!add) {
 			if (pmd_none(*pmd))
 				continue;
@@ -320,7 +320,7 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 		prot &= ~_REGION_ENTRY_NOEXEC;
 	pud = pud_offset(p4d, addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!add) {
 			if (pud_none(*pud))
 				continue;
@@ -394,7 +394,7 @@ static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!add) {
 			if (p4d_none(*p4d))
 				continue;
@@ -449,8 +449,8 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
 	if (WARN_ON_ONCE(!PAGE_ALIGNED(start | end)))
 		return -EINVAL;
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (!add) {
 			if (pgd_none(*pgd))
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ec423b5f17dd..341c2ff8d31a 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -428,7 +428,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd))
 			continue;
 		if (is_hugetlb_pmd(*pmd))
@@ -465,7 +465,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		if (is_hugetlb_pud(*pud))
@@ -519,7 +519,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	pgd = pgd_offset(tlb->mm, addr);
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		hugetlb_free_pud_range(tlb, p4d, addr, next, floor, ceiling);
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index 61776790cd67..7b4fe31c8df2 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -264,7 +264,7 @@ static inline int update_pmd_range(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_present(*pmd)) {
 			if (hvc->force || pmd_newpage(*pmd)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -286,7 +286,7 @@ static inline int update_pud_range(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_present(*pud)) {
 			if (hvc->force || pud_newpage(*pud)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -308,7 +308,7 @@ static inline int update_p4d_range(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (!p4d_present(*p4d)) {
 			if (hvc->force || p4d_newpage(*p4d)) {
 				ret = add_munmap(addr, next - addr, hvc);
@@ -331,7 +331,7 @@ void fix_range_common(struct mm_struct *mm, unsigned long start_addr,
 	hvc = INIT_HVC(mm, force, userspace);
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end_addr);
+		next = pgd_addr_end(*pgd, addr, end_addr);
 		if (!pgd_present(*pgd)) {
 			if (force || pgd_newpage(*pgd)) {
 				ret = add_munmap(addr, next - addr, &hvc);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a4ac13cc3fdc..e2cb9316a104 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1043,7 +1043,7 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -1099,7 +1099,7 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -1153,7 +1153,7 @@ remove_p4d_table(p4d_t *p4d_start, unsigned long addr, unsigned long end,
 
 	p4d = p4d_start + p4d_index(addr);
 	for (; addr < end; addr = next, p4d++) {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -1186,9 +1186,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct,
 	p4d_t *p4d;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
@@ -1500,8 +1499,6 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 	pmd_t *pmd;
 
 	for (addr = start; addr < end; addr = next) {
-		next = pmd_addr_end(addr, end);
-
 		pgd = vmemmap_pgd_populate(addr, node);
 		if (!pgd)
 			return -ENOMEM;
@@ -1515,6 +1512,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 			return -ENOMEM;
 
 		pmd = pmd_offset(pud, addr);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd)) {
 			void *p;
 
@@ -1623,9 +1621,8 @@ void register_page_bootmem_memmap(unsigned long section_nr,
 			get_page_bootmem(section_nr, pte_page(*pte),
 					 SECTION_INFO);
 		} else {
-			next = pmd_addr_end(addr, end);
-
 			pmd = pmd_offset(pud, addr);
+			next = pmd_addr_end(*pmd, addr, end);
 			if (pmd_none(*pmd))
 				continue;
 
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 1a50434c8a4d..2c105b5154ba 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -96,7 +96,7 @@ static void __init kasan_populate_pud(pud_t *pud, unsigned long addr,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (!pmd_large(*pmd))
 			kasan_populate_pmd(pmd, addr, next, nid);
 	} while (pmd++, addr = next, addr != end);
@@ -116,7 +116,7 @@ static void __init kasan_populate_p4d(p4d_t *p4d, unsigned long addr,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (!pud_large(*pud))
 			kasan_populate_pud(pud, addr, next, nid);
 	} while (pud++, addr = next, addr != end);
@@ -136,7 +136,7 @@ static void __init kasan_populate_pgd(pgd_t *pgd, unsigned long addr,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		kasan_populate_p4d(p4d, addr, next, nid);
 	} while (p4d++, addr = next, addr != end);
 }
@@ -151,7 +151,7 @@ static void __init kasan_populate_shadow(unsigned long addr, unsigned long end,
 	end = round_up(end, PAGE_SIZE);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_populate_pgd(pgd, addr, next, nid);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -219,7 +219,7 @@ static void __init kasan_early_p4d_populate(pgd_t *pgd,
 
 	p4d = early_p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_none(*p4d))
 			continue;
@@ -239,7 +239,7 @@ static void __init kasan_map_early_shadow(pgd_t *pgd)
 
 	pgd += pgd_index(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		kasan_early_p4d_populate(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
 }
@@ -254,7 +254,7 @@ static void __init kasan_shallow_populate_p4ds(pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (p4d_none(*p4d)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
@@ -272,7 +272,7 @@ static void __init kasan_shallow_populate_pgds(void *start, void *end)
 	addr = (unsigned long)start;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, (unsigned long)end);
+		next = pgd_addr_end(*pgd, addr, (unsigned long)end);
 
 		if (pgd_none(*pgd)) {
 			p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
diff --git a/include/asm-generic/pgtable-nop4d.h b/include/asm-generic/pgtable-nop4d.h
index ce2cbb3c380f..156b42e51424 100644
--- a/include/asm-generic/pgtable-nop4d.h
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -53,7 +53,7 @@ static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
 #define p4d_free_tlb(tlb, x, a)			do { } while (0)
 
 #undef  p4d_addr_end
-#define p4d_addr_end(addr, end)			(end)
+#define p4d_addr_end(p4d, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOP4D_H */
diff --git a/include/asm-generic/pgtable-nopmd.h b/include/asm-generic/pgtable-nopmd.h
index 3e13acd019ae..e988384de1c7 100644
--- a/include/asm-generic/pgtable-nopmd.h
+++ b/include/asm-generic/pgtable-nopmd.h
@@ -64,7 +64,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 #define pmd_free_tlb(tlb, x, a)		do { } while (0)
 
 #undef  pmd_addr_end
-#define pmd_addr_end(addr, end)			(end)
+#define pmd_addr_end(pmd, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index a9d751fbda9e..57a28bade9f9 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -60,7 +60,7 @@ static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
 #define pud_free_tlb(tlb, x, a)		        do { } while (0)
 
 #undef  pud_addr_end
-#define pud_addr_end(addr, end)			(end)
+#define pud_addr_end(pud, addr, end)		(end)
 
 #endif /* __ASSEMBLY__ */
 #endif /* _PGTABLE_NOPUD_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 981c4c2a31fe..67ebc22cf83d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -655,48 +655,34 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
  */
 
-#define pgd_addr_end(addr, end)						\
+#ifndef pgd_addr_end
+#define pgd_addr_end(pgd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
+#endif
 
 #ifndef p4d_addr_end
-#define p4d_addr_end(addr, end)						\
+#define p4d_addr_end(p4d, addr, end)					\
 ({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pud_addr_end
-#define pud_addr_end(addr, end)						\
+#define pud_addr_end(pud, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
 #ifndef pmd_addr_end
-#define pmd_addr_end(addr, end)						\
+#define pmd_addr_end(pmd, addr, end)					\
 ({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 #endif
 
-#ifndef pgd_addr_end_folded
-#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
-#endif
-
-#ifndef p4d_addr_end_folded
-#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
-#endif
-
-#ifndef pud_addr_end_folded
-#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
-#endif
-
-#ifndef pmd_addr_end_folded
-#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
-#endif
-
 /*
  * When walking page tables, we usually want to skip any p?d_none entries;
  * and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/gup.c b/mm/gup.c
index ba4aace5d0f4..7826876ae7e0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
-		next = pmd_addr_end_folded(pmd, addr, end);
+		next = pmd_addr_end(pmd, addr, end);
 		if (!pmd_present(pmd))
 			return 0;
 
@@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end_folded(pud, addr, end);
+		next = pud_addr_end(pud, addr, end);
 		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
@@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
-		next = p4d_addr_end_folded(p4d, addr, end);
+		next = p4d_addr_end(p4d, addr, end);
 		if (p4d_none(p4d))
 			return 0;
 		BUILD_BUG_ON(p4d_huge(p4d));
@@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		pgd_t pgd = READ_ONCE(*pgdp);
 
-		next = pgd_addr_end_folded(pgd, addr, end);
+		next = pgd_addr_end(pgd, addr, end);
 		if (pgd_none(pgd))
 			return;
 		if (unlikely(pgd_huge(pgd))) {
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..400fa119c09d 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -114,7 +114,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PMD_MODIFIED;
@@ -160,7 +160,7 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PUD_MODIFIED;
@@ -206,7 +206,7 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_P4D_MODIFIED;
@@ -234,7 +234,7 @@ int ioremap_page_range(unsigned long addr,
 	start = addr;
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot,
 					&mask);
 		if (err)
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index fe6be0be1f76..829627a92763 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -117,7 +117,7 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (IS_ALIGNED(addr, PMD_SIZE) && end - addr >= PMD_SIZE) {
 			pmd_populate_kernel(&init_mm, pmd,
@@ -150,7 +150,7 @@ static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (IS_ALIGNED(addr, PUD_SIZE) && end - addr >= PUD_SIZE) {
 			pmd_t *pmd;
 
@@ -187,7 +187,7 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
 	unsigned long next;
 
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (IS_ALIGNED(addr, P4D_SIZE) && end - addr >= P4D_SIZE) {
 			pud_t *pud;
 			pmd_t *pmd;
@@ -236,7 +236,7 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
 	unsigned long next;
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 
 		if (IS_ALIGNED(addr, PGDIR_SIZE) && end - addr >= PGDIR_SIZE) {
 			p4d_t *p4d;
@@ -370,7 +370,7 @@ static void kasan_remove_pmd_table(pmd_t *pmd, unsigned long addr,
 	for (; addr < end; addr = next, pmd++) {
 		pte_t *pte;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		if (!pmd_present(*pmd))
 			continue;
@@ -395,7 +395,7 @@ static void kasan_remove_pud_table(pud_t *pud, unsigned long addr,
 	for (; addr < end; addr = next, pud++) {
 		pmd_t *pmd, *pmd_base;
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		if (!pud_present(*pud))
 			continue;
@@ -421,7 +421,7 @@ static void kasan_remove_p4d_table(p4d_t *p4d, unsigned long addr,
 	for (; addr < end; addr = next, p4d++) {
 		pud_t *pud;
 
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		if (!p4d_present(*p4d))
 			continue;
@@ -454,9 +454,8 @@ void kasan_remove_zero_shadow(void *start, unsigned long size)
 	for (; addr < end; addr = next) {
 		p4d_t *p4d;
 
-		next = pgd_addr_end(addr, end);
-
 		pgd = pgd_offset_k(addr);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!pgd_present(*pgd))
 			continue;
 
diff --git a/mm/madvise.c b/mm/madvise.c
index e32e7efbba0f..acfb3441d97e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -326,7 +326,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (pmd_trans_huge(*pmd)) {
 		pmd_t orig_pmd;
-		unsigned long next = pmd_addr_end(addr, end);
+		unsigned long next = pmd_addr_end(*pmd, addr, end);
 
 		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 		ptl = pmd_trans_huge_lock(pmd, vma);
@@ -587,7 +587,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	int nr_swap = 0;
 	unsigned long next;
 
-	next = pmd_addr_end(addr, end);
+	next = pmd_addr_end(*pmd, addr, end);
 	if (pmd_trans_huge(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
diff --git a/mm/memory.c b/mm/memory.c
index fb5463153351..f95424946b0d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -233,7 +233,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	start = addr;
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		free_pte_range(tlb, pmd, addr);
@@ -267,7 +267,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	start = addr;
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		free_pmd_range(tlb, pud, addr, next, floor, ceiling);
@@ -301,7 +301,7 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
 	start = addr;
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		free_pud_range(tlb, p4d, addr, next, floor, ceiling);
@@ -381,7 +381,7 @@ void free_pgd_range(struct mmu_gather *tlb,
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	pgd = pgd_offset(tlb->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		free_p4d_range(tlb, pgd, addr, next, floor, ceiling);
@@ -887,7 +887,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*src_pmd, addr, end);
 		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
 			|| pmd_devmap(*src_pmd)) {
 			int err;
@@ -921,7 +921,7 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_pud = pud_offset(src_p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*src_pud, addr, end);
 		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
 			int err;
 
@@ -955,7 +955,7 @@ static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src
 		return -ENOMEM;
 	src_p4d = p4d_offset(src_pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*src_p4d, addr, end);
 		if (p4d_none_or_clear_bad(src_p4d))
 			continue;
 		if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
@@ -1017,7 +1017,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*src_pgd, addr, end);
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
@@ -1177,7 +1177,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
@@ -1212,7 +1212,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
 			if (next - addr != HPAGE_PUD_SIZE) {
 				mmap_assert_locked(tlb->mm);
@@ -1241,7 +1241,7 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		next = zap_pud_range(tlb, vma, p4d, addr, next, details);
@@ -1262,7 +1262,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
@@ -2030,7 +2030,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 		return -ENOMEM;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		err = remap_pte_range(mm, pmd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2052,7 +2052,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		err = remap_pmd_range(mm, pud, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2074,7 +2074,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		err = remap_pud_range(mm, p4d, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2143,7 +2143,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		err = remap_p4d_range(mm, pgd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
@@ -2266,7 +2266,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 		pmd = pmd_offset(pud, addr);
 	}
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (create || !pmd_none_or_clear_bad(pmd)) {
 			err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
 						 create, mask);
@@ -2294,7 +2294,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 		pud = pud_offset(p4d, addr);
 	}
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (create || !pud_none_or_clear_bad(pud)) {
 			err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
 						 create, mask);
@@ -2322,7 +2322,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 		p4d = p4d_offset(pgd, addr);
 	}
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (create || !p4d_none_or_clear_bad(p4d)) {
 			err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
 						 create, mask);
@@ -2348,7 +2348,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (!create && pgd_none_or_clear_bad(pgd))
 			continue;
 		err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create, &mask);
diff --git a/mm/mlock.c b/mm/mlock.c
index 93ca2bf30b4f..5898e8fe2288 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -374,8 +374,12 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 			struct vm_area_struct *vma, struct zone *zone,
 			unsigned long start, unsigned long end)
 {
-	pte_t *pte;
 	spinlock_t *ptl;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
 
 	/*
 	 * Initialize pte walk starting at the already pinned page where we
@@ -384,10 +388,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 	 */
 	pte = get_locked_pte(vma->vm_mm, start,	&ptl);
 	/* Make sure we do not cross the page table boundary */
-	end = pgd_addr_end(start, end);
-	end = p4d_addr_end(start, end);
-	end = pud_addr_end(start, end);
-	end = pmd_addr_end(start, end);
+	pgd = pgd_offset(vma->vm_mm, start);
+	end = pgd_addr_end(*pgd, start, end);
+	p4d = p4d_offset(pgd, start);
+	end = p4d_addr_end(*p4d, start, end);
+	pud = pud_offset(p4d, start);
+	end = pud_addr_end(*pud, start, end);
+	pmd = pmd_offset(pud, start);
+	end = pmd_addr_end(*pmd, start, end);
 
 	/* The page next to the pinned page is the first we will try to get */
 	start += PAGE_SIZE;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..278f2dbd1f20 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -225,7 +225,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	do {
 		unsigned long this_pages;
 
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
@@ -291,7 +291,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
@@ -311,7 +311,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
@@ -336,7 +336,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	flush_cache_range(vma, addr, end);
 	inc_tlb_flush_pending(mm);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..a5b9f61b5d45 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -70,7 +70,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	pmd = pmd_offset(pud, addr);
 	do {
 again:
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -128,7 +128,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 	pud = pud_offset(p4d, addr);
 	do {
  again:
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -176,7 +176,7 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -211,7 +211,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
 	else
 		pgd = pgd_offset(walk->mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, 0, walk);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 20012c0c0252..b1dd815aee6b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2018,7 +2018,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	pmd = pmd_offset(pud, addr);
 	do {
 		cond_resched();
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type,
@@ -2040,7 +2040,7 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		ret = unuse_pmd_range(vma, pud, addr, next, type,
@@ -2062,7 +2062,7 @@ static inline int unuse_p4d_range(struct vm_area_struct *vma, pgd_t *pgd,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		ret = unuse_pud_range(vma, p4d, addr, next, type,
@@ -2085,7 +2085,7 @@ static int unuse_vma(struct vm_area_struct *vma, unsigned int type,
 
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		ret = unuse_p4d_range(vma, pgd, addr, next, type,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index be4724b916b3..09ff0d5ecbc1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -93,7 +93,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 
 		cleared = pmd_clear_huge(pmd);
 		if (cleared || pmd_bad(*pmd))
@@ -118,7 +118,7 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 
 	pud = pud_offset(p4d, addr);
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 
 		cleared = pud_clear_huge(pud);
 		if (cleared || pud_bad(*pud))
@@ -141,7 +141,7 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4d = p4d_offset(pgd, addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 
 		cleared = p4d_clear_huge(p4d);
 		if (cleared || p4d_bad(*p4d))
@@ -179,7 +179,7 @@ void unmap_kernel_range_noflush(unsigned long start, unsigned long size)
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		if (pgd_none_or_clear_bad(pgd))
@@ -230,7 +230,7 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
 	if (!pmd)
 		return -ENOMEM;
 	do {
-		next = pmd_addr_end(addr, end);
+		next = pmd_addr_end(*pmd, addr, end);
 		if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
@@ -248,7 +248,7 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
 	if (!pud)
 		return -ENOMEM;
 	do {
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end(*pud, addr, end);
 		if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
@@ -266,7 +266,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
 	if (!p4d)
 		return -ENOMEM;
 	do {
-		next = p4d_addr_end(addr, end);
+		next = p4d_addr_end(*p4d, addr, end);
 		if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
@@ -305,7 +305,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
 	BUG_ON(addr >= end);
 	pgd = pgd_offset_k(addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(*pgd, addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
 		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
-- 
2.17.1




^ permalink raw reply related	[flat|nested] 313+ messages in thread

* [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions
  2020-09-07 18:00 ` Gerald Schaefer
                     ` (2 preceding siblings ...)
  (?)
@ 2020-09-07 18:00   ` Gerald Schaefer
  -1 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-07 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

From: Alexander Gordeev <agordeev@linux.ibm.com>

Since the pXd_addr_end() macros take a pXd page-table entry as a
parameter, it makes sense to check the entry type at compile time.
Even though most architectures do not make use of the page-table
entry in their pXd_addr_end() calls, checking the type in the
traversal code paths could help to avoid subtle bugs.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 include/linux/pgtable.h | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 67ebc22cf83d..d9e7d16c2263 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  */
 
 #ifndef pgd_addr_end
-#define pgd_addr_end(pgd, addr, end)					\
-({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
-})
+#define pgd_addr_end pgd_addr_end
+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
+{	unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
+	return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef p4d_addr_end
-#define p4d_addr_end(p4d, addr, end)					\
-({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
-})
+#define p4d_addr_end p4d_addr_end
+static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
+{	unsigned long __boundary = (addr + P4D_SIZE) & P4D_MASK;
+	return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef pud_addr_end
-#define pud_addr_end(pud, addr, end)					\
-({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
-})
+#define pud_addr_end pud_addr_end
+static inline unsigned long pud_addr_end(pud_t pud, unsigned long addr, unsigned long end)
+{	unsigned long __boundary = (addr + PUD_SIZE) & PUD_MASK;
+	return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef pmd_addr_end
-#define pmd_addr_end(pmd, addr, end)					\
-({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
-})
+#define pmd_addr_end pmd_addr_end
+static inline unsigned long pmd_addr_end(pmd_t pmd, unsigned long addr, unsigned long end)
+{	unsigned long __boundary = (addr + PMD_SIZE) & PMD_MASK;
+	return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 313+ messages in thread
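For illustration, here is a minimal sketch (not part of the patch, kernel
context assumed) of the kind of mistake the typed inline helpers catch.
With the old #defines the pXd argument was simply ignored, so passing an
entry from the wrong page-table level compiled silently; with the inline
functions it becomes a build error, because pgd_t/p4d_t/pud_t/pmd_t are
distinct types:

/*
 * Hypothetical walker fragment, assuming the inline pXd_addr_end()
 * helpers from this patch.  Not taken from the kernel tree.
 */
static void walk_example(pud_t *pud, unsigned long addr, unsigned long end)
{
	pmd_t *pmd = pmd_offset(pud, addr);
	unsigned long next;

	do {
		/* Correct: the dereferenced pmd entry matches pmd_addr_end(). */
		next = pmd_addr_end(*pmd, addr, end);

		/*
		 * Wrong level: pmd_addr_end(*pud, addr, end) would still have
		 * compiled with the old macro, since the first parameter was
		 * unused; with the inline function the compiler rejects it,
		 * because *pud is a pud_t where a pmd_t is expected.
		 */
	} while (pmd++, addr = next, addr != end);
}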

* Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-07 18:00 ` Gerald Schaefer
                     ` (2 preceding siblings ...)
  (?)
@ 2020-09-07 20:12   ` Mike Rapoport
  -1 siblings, 0 replies; 313+ messages in thread
From: Mike Rapoport @ 2020-09-07 20:12 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86,
	linux-arm, linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Mon, Sep 07, 2020 at 08:00:55PM +0200, Gerald Schaefer wrote:
> This is v2 of an RFC previously discussed here:
> https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/
> 
> Patch 1 is a fix for a regression in gup_fast on s390, after our conversion
> to common gup_fast code. It will introduce special helper functions
> pXd_addr_end_folded(), which have to be used in places where pagetable walk
> is done w/o lock and with READ_ONCE, so currently only in gup_fast.
> 
> Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end()
> themselves by adding an extra pXd value parameter. That was suggested by
> Jason during v1 discussion, because he is already thinking of some other
> places where he might want to switch to the READ_ONCE logic for pagetable
> walks. In general, that would be the cleanest / safest solution, but there
> is some impact on other architectures and common code, hence the new and
> greatly enlarged recipient list.
> 
> Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline
> functions instead of #defines, so that we get some type checking for the
> new pXd value parameter.
> 
> Not sure about Fixes/stable tags for the generic solution. Only patch 1
> fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might
> still be nice to have in stable, to ease future backports, but I guess
> "nice to have" does not really qualify for stable backports.

I also think that adding a pXd parameter to pXd_addr_end() is the cleaner
way, and with that change patch 1 is not really required. I would even
merge patches 2 and 3 into a single patch and use only that as the fix.

[ /me apologises to stable@ team :-) ]

> Changes in v2:
> - Pick option 2 from v1 discussion (pXd_addr_end_folded helpers)
> - Add patch 2 + 3 for more generic approach
> 
> Alexander Gordeev (3):
>   mm/gup: fix gup_fast with dynamic page table folding
>   mm: make pXd_addr_end() functions page-table entry aware
>   mm: make generic pXd_addr_end() macros inline functions
> 
>  arch/arm/include/asm/pgtable-2level.h    |  2 +-
>  arch/arm/mm/idmap.c                      |  6 ++--
>  arch/arm/mm/mmu.c                        |  8 ++---
>  arch/arm64/kernel/hibernate.c            | 16 +++++----
>  arch/arm64/kvm/mmu.c                     | 16 ++++-----
>  arch/arm64/mm/kasan_init.c               |  8 ++---
>  arch/arm64/mm/mmu.c                      | 25 +++++++-------
>  arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++--
>  arch/powerpc/mm/hugetlbpage.c            |  6 ++--
>  arch/s390/include/asm/pgtable.h          | 42 ++++++++++++++++++++++++
>  arch/s390/mm/page-states.c               |  8 ++---
>  arch/s390/mm/pageattr.c                  |  8 ++---
>  arch/s390/mm/vmem.c                      |  8 ++---
>  arch/sparc/mm/hugetlbpage.c              |  6 ++--
>  arch/um/kernel/tlb.c                     |  8 ++---
>  arch/x86/mm/init_64.c                    | 15 ++++-----
>  arch/x86/mm/kasan_init_64.c              | 16 ++++-----
>  include/asm-generic/pgtable-nop4d.h      |  2 +-
>  include/asm-generic/pgtable-nopmd.h      |  2 +-
>  include/asm-generic/pgtable-nopud.h      |  2 +-
>  include/linux/pgtable.h                  | 38 ++++++++++++---------
>  mm/gup.c                                 |  8 ++---
>  mm/ioremap.c                             |  8 ++---
>  mm/kasan/init.c                          | 17 +++++-----
>  mm/madvise.c                             |  4 +--
>  mm/memory.c                              | 40 +++++++++++-----------
>  mm/mlock.c                               | 18 +++++++---
>  mm/mprotect.c                            |  8 ++---
>  mm/pagewalk.c                            |  8 ++---
>  mm/swapfile.c                            |  8 ++---
>  mm/vmalloc.c                             | 16 ++++-----
>  31 files changed, 219 insertions(+), 165 deletions(-)
> 
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 313+ messages in thread
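To make the motivation quoted above more concrete, the following is a
simplified sketch of the lockless gup_fast-style walk the cover letter
refers to (not the exact mm/gup.c code): the pmd entry is read once with
READ_ONCE(), and that same snapshot is passed to pmd_addr_end() as well
as used for the presence checks, so an architecture with dynamic page
table folding (s390) can derive the step size from the value actually
being walked instead of re-reading the table:

/* Simplified, for illustration only; huge-page and error handling elided. */
static int gup_pmd_range_sketch(pud_t pud, unsigned long addr, unsigned long end)
{
	unsigned long next;
	pmd_t *pmdp = pmd_offset(&pud, addr);

	do {
		pmd_t pmd = READ_ONCE(*pmdp);

		next = pmd_addr_end(pmd, addr, end);	/* boundary from the snapshot */
		if (pmd_none(pmd))
			return 0;
		/* ... pte-level walk on the same snapshot ... */
	} while (pmdp++, addr = next, addr != end);

	return 1;
}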

* Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions
  2020-09-07 18:00   ` Gerald Schaefer
                       ` (2 preceding siblings ...)
  (?)
@ 2020-09-07 20:15     ` Mike Rapoport
  -1 siblings, 0 replies; 313+ messages in thread
From: Mike Rapoport @ 2020-09-07 20:15 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86,
	linux-arm, linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

Hi,

Some style comments below.

On Mon, Sep 07, 2020 at 08:00:58PM +0200, Gerald Schaefer wrote:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Since pXd_addr_end() macros take pXd page-table entry as a
> parameter it makes sense to check the entry type on compile.
> Even though most archs do not make use of page-table entries
> in pXd_addr_end() calls, checking the type in traversal code
> paths could help to avoid subtle bugs.
> 
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> ---
>  include/linux/pgtable.h | 36 ++++++++++++++++++++----------------
>  1 file changed, 20 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 67ebc22cf83d..d9e7d16c2263 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>   */
>  
>  #ifndef pgd_addr_end
> -#define pgd_addr_end(pgd, addr, end)					\
> -({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> -})
> +#define pgd_addr_end pgd_addr_end
> +static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
> +{	unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;

The code should be on a separate line from the curly brace.
Besides, since this is not a macro anymore, I think it would be nicer to
use 'boundary' without underscores.
This applies to the changes below as well.

> +	return (__boundary - 1 < end - 1) ? __boundary : end;
> +}
>  #endif
>  
>  #ifndef p4d_addr_end
> -#define p4d_addr_end(p4d, addr, end)					\
> -({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> -})
> +#define p4d_addr_end p4d_addr_end
> +static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
> +{	unsigned long __boundary = (addr + P4D_SIZE) & P4D_MASK;
> +	return (__boundary - 1 < end - 1) ? __boundary : end;
> +}
>  #endif
>  
>  #ifndef pud_addr_end
> -#define pud_addr_end(pud, addr, end)					\
> -({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> -})
> +#define pud_addr_end pud_addr_end
> +static inline unsigned long pud_addr_end(pud_t pud, unsigned long addr, unsigned long end)
> +{	unsigned long __boundary = (addr + PUD_SIZE) & PUD_MASK;
> +	return (__boundary - 1 < end - 1) ? __boundary : end;
> +}
>  #endif
>  
>  #ifndef pmd_addr_end
> -#define pmd_addr_end(pmd, addr, end)					\
> -({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> -})
> +#define pmd_addr_end pmd_addr_end
> +static inline unsigned long pmd_addr_end(pmd_t pmd, unsigned long addr, unsigned long end)
> +{	unsigned long __boundary = (addr + PMD_SIZE) & PMD_MASK;
> +	return (__boundary - 1 < end - 1) ? __boundary : end;
> +}
>  #endif
>  
>  /*
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 313+ messages in thread
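As a rough illustration of the style comments above (brace placement and
dropping the underscores), the pgd_addr_end() helper could end up looking
like the sketch below; the p4d/pud/pmd variants would follow the same
pattern. This is only a possible shape, not a posted revision of the patch:

#ifndef pgd_addr_end
#define pgd_addr_end pgd_addr_end
static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
{
	unsigned long boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;

	return (boundary - 1 < end - 1) ? boundary : end;
}
#endif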

* Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-07 18:00 ` Gerald Schaefer
                     ` (2 preceding siblings ...)
  (?)
@ 2020-09-08  4:42   ` Christophe Leroy
  -1 siblings, 0 replies; 313+ messages in thread
From: Christophe Leroy @ 2020-09-08  4:42 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport



On 07/09/2020 at 20:00, Gerald Schaefer wrote:
> This is v2 of an RFC previously discussed here:
> https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/
> 
> Patch 1 is a fix for a regression in gup_fast on s390, after our conversion
> to common gup_fast code. It will introduce special helper functions
> pXd_addr_end_folded(), which have to be used in places where pagetable walk
> is done w/o lock and with READ_ONCE, so currently only in gup_fast.
> 
> Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end()
> themselves by adding an extra pXd value parameter. That was suggested by
> Jason during v1 discussion, because he is already thinking of some other
> places where he might want to switch to the READ_ONCE logic for pagetable
> walks. In general, that would be the cleanest / safest solution, but there
> is some impact on other architectures and common code, hence the new and
> greatly enlarged recipient list.
> 
> Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline
> functions instead of #defines, so that we get some type checking for the
> new pXd value parameter.
> 
> Not sure about Fixes/stable tags for the generic solution. Only patch 1
> fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might
> still be nice to have in stable, to ease future backports, but I guess
> "nice to have" does not really qualify for stable backports.

If one day you have to backport a fix that requires patch 2 and/or 3, 
just mark it "depends-on:" and the patches will go in stable at the 
relevant time.
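
(For reference, if I read Documentation/process/stable-kernel-rules.rst
correctly, a prerequisite can also be spelled out directly in the stable
tag, e.g. something like:

    Cc: <stable@vger.kernel.org> # 5.2.x: <commit id>: mm: make pXd_addr_end() functions page-table entry aware
    Cc: <stable@vger.kernel.org> # 5.2.x

where <commit id> is only a placeholder for the prerequisite commit.)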

Christophe

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-07 18:00   ` Gerald Schaefer
                       ` (2 preceding siblings ...)
  (?)
@ 2020-09-08  5:06     ` Christophe Leroy
  -1 siblings, 0 replies; 313+ messages in thread
From: Christophe Leroy @ 2020-09-08  5:06 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport



On 07/09/2020 at 20:00, Gerald Schaefer wrote:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> dynamic page table folding.
> 
> The question "What would it require for the generic code to work for s390"
> has already been discussed here
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> and ended with a promising approach here
> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
> which in the end unfortunately didn't quite work completely.
> 
> We tried to mimic static level folding by changing pgd_offset to always
> calculate top level page table offset, and do nothing in folded pXd_offset.
> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
> not reflect this dynamic behaviour, and still act like static 5-level
> page tables.
> 

[...]

> 
> Fix this by introducing new pXd_addr_end_folded helpers, which take an
> additional pXd entry value parameter, that can be used on s390
> to determine the correct page table level and return corresponding
> end / boundary. With that, the pointer iteration will always
> happen in gup_pgd_range for s390. No change for other architectures
> introduced.

Not sure pXd_addr_end_folded() is the most understandable name,
although I don't have an alternative suggestion at the moment.
Maybe it could be something like pXd_addr_end_fixup(), as it will
disappear in the next patch, or pXd_addr_end_gup()?

Also, if it turns out to be acceptable to get patch 2 into stable, I think
you should swap patch 1 and patch 2 to avoid the intermediate step through
pXd_addr_end_folded().


> 
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Cc: <stable@vger.kernel.org> # 5.2+
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> ---
>   arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++
>   include/linux/pgtable.h         | 16 +++++++++++++
>   mm/gup.c                        |  8 +++----
>   3 files changed, 62 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 7eb01a5459cd..027206e4959d 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
>   }
>   #define mm_pmd_folded(mm) mm_pmd_folded(mm)
>   
> +/*
> + * With dynamic page table levels on s390, the static pXd_addr_end() functions
> + * will not return corresponding dynamic boundaries. This is no problem as long
> + * as only pXd pointers are passed down during page table walk, because
> + * pXd_offset() will simply return the given pointer for folded levels, and the
> + * pointer iteration over a range simply happens at the correct page table
> + * level.
> + * It is however a problem with gup_fast, or other places walking the page
> + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
> + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
> + * a stack variable, which cannot be used for pointer iteration at the correct
> + * level. Instead, the iteration then has to happen by going up to pgd level
> + * again. To allow this, provide pXd_addr_end_folded() functions with an
> + * additional pXd value parameter, which can be used on s390 to determine the
> + * folding level and return the corresponding boundary.
> + */
> +static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end)

What does 'rste' stand for?

Isn't this line a bit long?
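E.g. the declaration could be wrapped like this (sketch only):

static inline unsigned long rste_addr_end_folded(unsigned long rste,
						 unsigned long addr,
						 unsigned long end)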

> +{
> +	unsigned long type = (rste & _REGION_ENTRY_TYPE_MASK) >> 2;
> +	unsigned long size = 1UL << (_SEGMENT_SHIFT + type * 11);
> +	unsigned long boundary = (addr + size) & ~(size - 1);
> +
> +	/*
> +	 * FIXME The below check is for internal testing only, to be removed
> +	 */
> +	VM_BUG_ON(type < (_REGION_ENTRY_TYPE_R3 >> 2));
> +
> +	return (boundary - 1) < (end - 1) ? boundary : end;
> +}
> +
> +#define pgd_addr_end_folded pgd_addr_end_folded
> +static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
> +{
> +	return rste_addr_end_folded(pgd_val(pgd), addr, end);
> +}
> +
> +#define p4d_addr_end_folded p4d_addr_end_folded
> +static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
> +{
> +	return rste_addr_end_folded(p4d_val(p4d), addr, end);
> +}
> +
>   static inline int mm_has_pgste(struct mm_struct *mm)
>   {
>   #ifdef CONFIG_PGSTE
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..981c4c2a31fe 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -681,6 +681,22 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>   })
>   #endif
>   
> +#ifndef pgd_addr_end_folded
> +#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
> +#endif
> +
> +#ifndef p4d_addr_end_folded
> +#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
> +#endif
> +
> +#ifndef pud_addr_end_folded
> +#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
> +#endif
> +
> +#ifndef pmd_addr_end_folded
> +#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
> +#endif
> +
>   /*
>    * When walking page tables, we usually want to skip any p?d_none entries;
>    * and any p?d_bad entries - reporting the error before resetting to none.
> diff --git a/mm/gup.c b/mm/gup.c
> index bd883a112724..ba4aace5d0f4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>   	do {
>   		pmd_t pmd = READ_ONCE(*pmdp);
>   
> -		next = pmd_addr_end(addr, end);
> +		next = pmd_addr_end_folded(pmd, addr, end);
>   		if (!pmd_present(pmd))
>   			return 0;
>   
> @@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>   	do {
>   		pud_t pud = READ_ONCE(*pudp);
>   
> -		next = pud_addr_end(addr, end);
> +		next = pud_addr_end_folded(pud, addr, end);
>   		if (unlikely(!pud_present(pud)))
>   			return 0;
>   		if (unlikely(pud_huge(pud))) {
> @@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>   	do {
>   		p4d_t p4d = READ_ONCE(*p4dp);
>   
> -		next = p4d_addr_end(addr, end);
> +		next = p4d_addr_end_folded(p4d, addr, end);
>   		if (p4d_none(p4d))
>   			return 0;
>   		BUILD_BUG_ON(p4d_huge(p4d));
> @@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>   	do {
>   		pgd_t pgd = READ_ONCE(*pgdp);
>   
> -		next = pgd_addr_end(addr, end);
> +		next = pgd_addr_end_folded(pgd, addr, end);
>   		if (pgd_none(pgd))
>   			return;
>   		if (unlikely(pgd_huge(pgd))) {
> 

Christophe

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-08  5:06     ` Christophe Leroy
  0 siblings, 0 replies; 313+ messages in thread
From: Christophe Leroy @ 2020-09-08  5:06 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport



Le 07/09/2020 à 20:00, Gerald Schaefer a écrit :
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> dynamic page table folding.
> 
> The question "What would it require for the generic code to work for s390"
> has already been discussed here
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> and ended with a promising approach here
> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
> which in the end unfortunately didn't quite work completely.
> 
> We tried to mimic static level folding by changing pgd_offset to always
> calculate top level page table offset, and do nothing in folded pXd_offset.
> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
> not reflect this dynamic behaviour, and still act like static 5-level
> page tables.
> 

[...]

> 
> Fix this by introducing new pXd_addr_end_folded helpers, which take an
> additional pXd entry value parameter, that can be used on s390
> to determine the correct page table level and return corresponding
> end / boundary. With that, the pointer iteration will always
> happen in gup_pgd_range for s390. No change for other architectures
> introduced.

Not sure pXd_addr_end_folded() is the best understandable name, 
allthough I don't have any alternative suggestion at the moment.
Maybe could be something like pXd_addr_end_fixup() as it will disappear 
in the next patch, or pXd_addr_end_gup() ?

Also, if it happens to be acceptable to get patch 2 in stable, I think 
you should switch patch 1 and patch 2 to avoid the step through 
pXd_addr_end_folded()


> 
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Cc: <stable@vger.kernel.org> # 5.2+
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> ---
>   arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++
>   include/linux/pgtable.h         | 16 +++++++++++++
>   mm/gup.c                        |  8 +++----
>   3 files changed, 62 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 7eb01a5459cd..027206e4959d 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
>   }
>   #define mm_pmd_folded(mm) mm_pmd_folded(mm)
>   
> +/*
> + * With dynamic page table levels on s390, the static pXd_addr_end() functions
> + * will not return corresponding dynamic boundaries. This is no problem as long
> + * as only pXd pointers are passed down during page table walk, because
> + * pXd_offset() will simply return the given pointer for folded levels, and the
> + * pointer iteration over a range simply happens at the correct page table
> + * level.
> + * It is however a problem with gup_fast, or other places walking the page
> + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
> + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
> + * a stack variable, which cannot be used for pointer iteration at the correct
> + * level. Instead, the iteration then has to happen by going up to pgd level
> + * again. To allow this, provide pXd_addr_end_folded() functions with an
> + * additional pXd value parameter, which can be used on s390 to determine the
> + * folding level and return the corresponding boundary.
> + */
> +static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end)

What does 'rste' stands for ?

Isn't this line a bit long ?

> +{
> +	unsigned long type = (rste & _REGION_ENTRY_TYPE_MASK) >> 2;
> +	unsigned long size = 1UL << (_SEGMENT_SHIFT + type * 11);
> +	unsigned long boundary = (addr + size) & ~(size - 1);
> +
> +	/*
> +	 * FIXME The below check is for internal testing only, to be removed
> +	 */
> +	VM_BUG_ON(type < (_REGION_ENTRY_TYPE_R3 >> 2));
> +
> +	return (boundary - 1) < (end - 1) ? boundary : end;
> +}
> +
> +#define pgd_addr_end_folded pgd_addr_end_folded
> +static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
> +{
> +	return rste_addr_end_folded(pgd_val(pgd), addr, end);
> +}
> +
> +#define p4d_addr_end_folded p4d_addr_end_folded
> +static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
> +{
> +	return rste_addr_end_folded(p4d_val(p4d), addr, end);
> +}
> +
>   static inline int mm_has_pgste(struct mm_struct *mm)
>   {
>   #ifdef CONFIG_PGSTE
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..981c4c2a31fe 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -681,6 +681,22 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>   })
>   #endif
>   
> +#ifndef pgd_addr_end_folded
> +#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
> +#endif
> +
> +#ifndef p4d_addr_end_folded
> +#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
> +#endif
> +
> +#ifndef pud_addr_end_folded
> +#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
> +#endif
> +
> +#ifndef pmd_addr_end_folded
> +#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
> +#endif
> +
>   /*
>    * When walking page tables, we usually want to skip any p?d_none entries;
>    * and any p?d_bad entries - reporting the error before resetting to none.
> diff --git a/mm/gup.c b/mm/gup.c
> index bd883a112724..ba4aace5d0f4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>   	do {
>   		pmd_t pmd = READ_ONCE(*pmdp);
>   
> -		next = pmd_addr_end(addr, end);
> +		next = pmd_addr_end_folded(pmd, addr, end);
>   		if (!pmd_present(pmd))
>   			return 0;
>   
> @@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>   	do {
>   		pud_t pud = READ_ONCE(*pudp);
>   
> -		next = pud_addr_end(addr, end);
> +		next = pud_addr_end_folded(pud, addr, end);
>   		if (unlikely(!pud_present(pud)))
>   			return 0;
>   		if (unlikely(pud_huge(pud))) {
> @@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>   	do {
>   		p4d_t p4d = READ_ONCE(*p4dp);
>   
> -		next = p4d_addr_end(addr, end);
> +		next = p4d_addr_end_folded(p4d, addr, end);
>   		if (p4d_none(p4d))
>   			return 0;
>   		BUILD_BUG_ON(p4d_huge(p4d));
> @@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>   	do {
>   		pgd_t pgd = READ_ONCE(*pgdp);
>   
> -		next = pgd_addr_end(addr, end);
> +		next = pgd_addr_end_folded(pgd, addr, end);
>   		if (pgd_none(pgd))
>   			return;
>   		if (unlikely(pgd_huge(pgd))) {
> 

Christophe

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-08  5:06     ` Christophe Leroy
  0 siblings, 0 replies; 313+ messages in thread
From: Christophe Leroy @ 2020-09-08  5:06 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Andrey Ryabinin, Jeff Dike,
	Arnd Bergmann, Heiko Carstens, linux-um, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, linux-arm, Linus Torvalds,
	LKML, Andrew Morton, linux-power, Mike Rapoport



Le 07/09/2020 à 20:00, Gerald Schaefer a écrit :
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> dynamic page table folding.
> 
> The question "What would it require for the generic code to work for s390"
> has already been discussed here
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> and ended with a promising approach here
> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
> which in the end unfortunately didn't quite work completely.
> 
> We tried to mimic static level folding by changing pgd_offset to always
> calculate top level page table offset, and do nothing in folded pXd_offset.
> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
> not reflect this dynamic behaviour, and still act like static 5-level
> page tables.
> 

[...]

> 
> Fix this by introducing new pXd_addr_end_folded helpers, which take an
> additional pXd entry value parameter, that can be used on s390
> to determine the correct page table level and return corresponding
> end / boundary. With that, the pointer iteration will always
> happen in gup_pgd_range for s390. No change for other architectures
> introduced.

Not sure pXd_addr_end_folded() is the best understandable name, 
allthough I don't have any alternative suggestion at the moment.
Maybe could be something like pXd_addr_end_fixup() as it will disappear 
in the next patch, or pXd_addr_end_gup() ?

Also, if it happens to be acceptable to get patch 2 in stable, I think 
you should switch patch 1 and patch 2 to avoid the step through 
pXd_addr_end_folded()


> 
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Cc: <stable@vger.kernel.org> # 5.2+
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> ---
>   arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++
>   include/linux/pgtable.h         | 16 +++++++++++++
>   mm/gup.c                        |  8 +++----
>   3 files changed, 62 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 7eb01a5459cd..027206e4959d 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
>   }
>   #define mm_pmd_folded(mm) mm_pmd_folded(mm)
>   
> +/*
> + * With dynamic page table levels on s390, the static pXd_addr_end() functions
> + * will not return corresponding dynamic boundaries. This is no problem as long
> + * as only pXd pointers are passed down during page table walk, because
> + * pXd_offset() will simply return the given pointer for folded levels, and the
> + * pointer iteration over a range simply happens at the correct page table
> + * level.
> + * It is however a problem with gup_fast, or other places walking the page
> + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
> + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
> + * a stack variable, which cannot be used for pointer iteration at the correct
> + * level. Instead, the iteration then has to happen by going up to pgd level
> + * again. To allow this, provide pXd_addr_end_folded() functions with an
> + * additional pXd value parameter, which can be used on s390 to determine the
> + * folding level and return the corresponding boundary.
> + */
> +static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end)

What does 'rste' stands for ?

Isn't this line a bit long ?

> +{
> +	unsigned long type = (rste & _REGION_ENTRY_TYPE_MASK) >> 2;
> +	unsigned long size = 1UL << (_SEGMENT_SHIFT + type * 11);
> +	unsigned long boundary = (addr + size) & ~(size - 1);
> +
> +	/*
> +	 * FIXME The below check is for internal testing only, to be removed
> +	 */
> +	VM_BUG_ON(type < (_REGION_ENTRY_TYPE_R3 >> 2));
> +
> +	return (boundary - 1) < (end - 1) ? boundary : end;
> +}
> +
> +#define pgd_addr_end_folded pgd_addr_end_folded
> +static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
> +{
> +	return rste_addr_end_folded(pgd_val(pgd), addr, end);
> +}
> +
> +#define p4d_addr_end_folded p4d_addr_end_folded
> +static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
> +{
> +	return rste_addr_end_folded(p4d_val(p4d), addr, end);
> +}
> +
>   static inline int mm_has_pgste(struct mm_struct *mm)
>   {
>   #ifdef CONFIG_PGSTE
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..981c4c2a31fe 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -681,6 +681,22 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>   })
>   #endif
>   
> +#ifndef pgd_addr_end_folded
> +#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
> +#endif
> +
> +#ifndef p4d_addr_end_folded
> +#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
> +#endif
> +
> +#ifndef pud_addr_end_folded
> +#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
> +#endif
> +
> +#ifndef pmd_addr_end_folded
> +#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
> +#endif
> +
>   /*
>    * When walking page tables, we usually want to skip any p?d_none entries;
>    * and any p?d_bad entries - reporting the error before resetting to none.
> diff --git a/mm/gup.c b/mm/gup.c
> index bd883a112724..ba4aace5d0f4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>   	do {
>   		pmd_t pmd = READ_ONCE(*pmdp);
>   
> -		next = pmd_addr_end(addr, end);
> +		next = pmd_addr_end_folded(pmd, addr, end);
>   		if (!pmd_present(pmd))
>   			return 0;
>   
> @@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>   	do {
>   		pud_t pud = READ_ONCE(*pudp);
>   
> -		next = pud_addr_end(addr, end);
> +		next = pud_addr_end_folded(pud, addr, end);
>   		if (unlikely(!pud_present(pud)))
>   			return 0;
>   		if (unlikely(pud_huge(pud))) {
> @@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>   	do {
>   		p4d_t p4d = READ_ONCE(*p4dp);
>   
> -		next = p4d_addr_end(addr, end);
> +		next = p4d_addr_end_folded(p4d, addr, end);
>   		if (p4d_none(p4d))
>   			return 0;
>   		BUILD_BUG_ON(p4d_huge(p4d));
> @@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>   	do {
>   		pgd_t pgd = READ_ONCE(*pgdp);
>   
> -		next = pgd_addr_end(addr, end);
> +		next = pgd_addr_end_folded(pgd, addr, end);
>   		if (pgd_none(pgd))
>   			return;
>   		if (unlikely(pgd_huge(pgd))) {
> 

Christophe

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-08  5:06     ` Christophe Leroy
  0 siblings, 0 replies; 313+ messages in thread
From: Christophe Leroy @ 2020-09-08  5:06 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Andrey Ryabinin, Jeff Dike,
	Arnd Bergmann, Heiko Carstens, linux-um, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, linux-arm, Linus Torvalds,
	LKML, Andrew Morton, linux-power, Mike Rapoport



Le 07/09/2020 à 20:00, Gerald Schaefer a écrit :
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> dynamic page table folding.
> 
> The question "What would it require for the generic code to work for s390"
> has already been discussed here
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> and ended with a promising approach here
> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
> which in the end unfortunately didn't quite work completely.
> 
> We tried to mimic static level folding by changing pgd_offset to always
> calculate top level page table offset, and do nothing in folded pXd_offset.
> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
> not reflect this dynamic behaviour, and still act like static 5-level
> page tables.
> 

[...]

> 
> Fix this by introducing new pXd_addr_end_folded helpers, which take an
> additional pXd entry value parameter, that can be used on s390
> to determine the correct page table level and return corresponding
> end / boundary. With that, the pointer iteration will always
> happen in gup_pgd_range for s390. No change for other architectures
> introduced.

Not sure pXd_addr_end_folded() is the best understandable name, 
allthough I don't have any alternative suggestion at the moment.
Maybe could be something like pXd_addr_end_fixup() as it will disappear 
in the next patch, or pXd_addr_end_gup() ?

Also, if it happens to be acceptable to get patch 2 in stable, I think 
you should switch patch 1 and patch 2 to avoid the step through 
pXd_addr_end_folded()


> 
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Cc: <stable@vger.kernel.org> # 5.2+
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> ---
>   arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++
>   include/linux/pgtable.h         | 16 +++++++++++++
>   mm/gup.c                        |  8 +++----
>   3 files changed, 62 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 7eb01a5459cd..027206e4959d 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
>   }
>   #define mm_pmd_folded(mm) mm_pmd_folded(mm)
>   
> +/*
> + * With dynamic page table levels on s390, the static pXd_addr_end() functions
> + * will not return corresponding dynamic boundaries. This is no problem as long
> + * as only pXd pointers are passed down during page table walk, because
> + * pXd_offset() will simply return the given pointer for folded levels, and the
> + * pointer iteration over a range simply happens at the correct page table
> + * level.
> + * It is however a problem with gup_fast, or other places walking the page
> + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
> + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
> + * a stack variable, which cannot be used for pointer iteration at the correct
> + * level. Instead, the iteration then has to happen by going up to pgd level
> + * again. To allow this, provide pXd_addr_end_folded() functions with an
> + * additional pXd value parameter, which can be used on s390 to determine the
> + * folding level and return the corresponding boundary.
> + */
> +static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end)

What does 'rste' stands for ?

Isn't this line a bit long ?

> +{
> +	unsigned long type = (rste & _REGION_ENTRY_TYPE_MASK) >> 2;
> +	unsigned long size = 1UL << (_SEGMENT_SHIFT + type * 11);
> +	unsigned long boundary = (addr + size) & ~(size - 1);
> +
> +	/*
> +	 * FIXME The below check is for internal testing only, to be removed
> +	 */
> +	VM_BUG_ON(type < (_REGION_ENTRY_TYPE_R3 >> 2));
> +
> +	return (boundary - 1) < (end - 1) ? boundary : end;
> +}
> +
> +#define pgd_addr_end_folded pgd_addr_end_folded
> +static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end)
> +{
> +	return rste_addr_end_folded(pgd_val(pgd), addr, end);
> +}
> +
> +#define p4d_addr_end_folded p4d_addr_end_folded
> +static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end)
> +{
> +	return rste_addr_end_folded(p4d_val(p4d), addr, end);
> +}
> +
>   static inline int mm_has_pgste(struct mm_struct *mm)
>   {
>   #ifdef CONFIG_PGSTE
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..981c4c2a31fe 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -681,6 +681,22 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>   })
>   #endif
>   
> +#ifndef pgd_addr_end_folded
> +#define pgd_addr_end_folded(pgd, addr, end)	pgd_addr_end(addr, end)
> +#endif
> +
> +#ifndef p4d_addr_end_folded
> +#define p4d_addr_end_folded(p4d, addr, end)	p4d_addr_end(addr, end)
> +#endif
> +
> +#ifndef pud_addr_end_folded
> +#define pud_addr_end_folded(pud, addr, end)	pud_addr_end(addr, end)
> +#endif
> +
> +#ifndef pmd_addr_end_folded
> +#define pmd_addr_end_folded(pmd, addr, end)	pmd_addr_end(addr, end)
> +#endif
> +
>   /*
>    * When walking page tables, we usually want to skip any p?d_none entries;
>    * and any p?d_bad entries - reporting the error before resetting to none.
> diff --git a/mm/gup.c b/mm/gup.c
> index bd883a112724..ba4aace5d0f4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>   	do {
>   		pmd_t pmd = READ_ONCE(*pmdp);
>   
> -		next = pmd_addr_end(addr, end);
> +		next = pmd_addr_end_folded(pmd, addr, end);
>   		if (!pmd_present(pmd))
>   			return 0;
>   
> @@ -2564,7 +2564,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>   	do {
>   		pud_t pud = READ_ONCE(*pudp);
>   
> -		next = pud_addr_end(addr, end);
> +		next = pud_addr_end_folded(pud, addr, end);
>   		if (unlikely(!pud_present(pud)))
>   			return 0;
>   		if (unlikely(pud_huge(pud))) {
> @@ -2592,7 +2592,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>   	do {
>   		p4d_t p4d = READ_ONCE(*p4dp);
>   
> -		next = p4d_addr_end(addr, end);
> +		next = p4d_addr_end_folded(p4d, addr, end);
>   		if (p4d_none(p4d))
>   			return 0;
>   		BUILD_BUG_ON(p4d_huge(p4d));
> @@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>   	do {
>   		pgd_t pgd = READ_ONCE(*pgdp);
>   
> -		next = pgd_addr_end(addr, end);
> +		next = pgd_addr_end_folded(pgd, addr, end);
>   		if (pgd_none(pgd))
>   			return;
>   		if (unlikely(pgd_huge(pgd))) {
> 

Christophe

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-07 18:00   ` Gerald Schaefer
@ 2020-09-08  5:14     ` Christophe Leroy
  0 siblings, 0 replies; 313+ messages in thread
From: Christophe Leroy @ 2020-09-08  5:14 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport



On 07/09/2020 at 20:00, Gerald Schaefer wrote:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Unlike all other page-table abstractions pXd_addr_end() do not take
> into account a particular table entry in which context the functions
> are called. On architectures with dynamic page-tables folding that
> might lead to lack of necessary information that is difficult to
> obtain other than from the table entry itself. That already led to
> a subtle memory corruption issue on s390.
> 
> By letting pXd_addr_end() functions know about the page-table entry
> we allow archs not only make extra checks, but also optimizations.
> 
> As result of this change the pXd_addr_end_folded() functions used
> in gup_fast traversal code become unnecessary and get replaced with
> universal pXd_addr_end() variants.
> 
> The arch-specific updates not only add dereferencing of page-table
> entry pointers, but also small changes to the code flow to make those
> dereferences possible, at least for x86 and powerpc. Also for arm64,
> but in way that should not have any impact.
> 

[...]

> 
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> ---
>   arch/arm/include/asm/pgtable-2level.h    |  2 +-
>   arch/arm/mm/idmap.c                      |  6 ++--
>   arch/arm/mm/mmu.c                        |  8 ++---
>   arch/arm64/kernel/hibernate.c            | 16 ++++++----
>   arch/arm64/kvm/mmu.c                     | 16 +++++-----
>   arch/arm64/mm/kasan_init.c               |  8 ++---
>   arch/arm64/mm/mmu.c                      | 25 +++++++--------
>   arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++---
>   arch/powerpc/mm/hugetlbpage.c            |  6 ++--

You forgot arch/powerpc/mm/book3s64/subpage_prot.c, it seems.
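
Presumably it only needs the same mechanical update as the other call
sites, i.e. passing the entry value that is already available in the walk
loop, roughly like this (variable names are only illustrative, I did not
check that file in detail):

	next = pmd_addr_end(*pmd, addr, end);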

>   arch/s390/include/asm/pgtable.h          |  8 ++---
>   arch/s390/mm/page-states.c               |  8 ++---
>   arch/s390/mm/pageattr.c                  |  8 ++---
>   arch/s390/mm/vmem.c                      |  8 ++---
>   arch/sparc/mm/hugetlbpage.c              |  6 ++--
>   arch/um/kernel/tlb.c                     |  8 ++---
>   arch/x86/mm/init_64.c                    | 15 ++++-----
>   arch/x86/mm/kasan_init_64.c              | 16 +++++-----
>   include/asm-generic/pgtable-nop4d.h      |  2 +-
>   include/asm-generic/pgtable-nopmd.h      |  2 +-
>   include/asm-generic/pgtable-nopud.h      |  2 +-
>   include/linux/pgtable.h                  | 26 ++++-----------
>   mm/gup.c                                 |  8 ++---
>   mm/ioremap.c                             |  8 ++---
>   mm/kasan/init.c                          | 17 +++++-----
>   mm/madvise.c                             |  4 +--
>   mm/memory.c                              | 40 ++++++++++++------------
>   mm/mlock.c                               | 18 ++++++++---
>   mm/mprotect.c                            |  8 ++---
>   mm/pagewalk.c                            |  8 ++---
>   mm/swapfile.c                            |  8 ++---
>   mm/vmalloc.c                             | 16 +++++-----
>   31 files changed, 165 insertions(+), 173 deletions(-)

Christophe

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions
  2020-09-07 18:00   ` Gerald Schaefer
@ 2020-09-08  5:19     ` Christophe Leroy
  0 siblings, 0 replies; 313+ messages in thread
From: Christophe Leroy @ 2020-09-08  5:19 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport



On 07/09/2020 at 20:00, Gerald Schaefer wrote:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Since pXd_addr_end() macros take pXd page-table entry as a
> parameter it makes sense to check the entry type on compile.
> Even though most archs do not make use of page-table entries
> in pXd_addr_end() calls, checking the type in traversal code
> paths could help to avoid subtle bugs.
> 
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> ---
>   include/linux/pgtable.h | 36 ++++++++++++++++++++----------------
>   1 file changed, 20 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 67ebc22cf83d..d9e7d16c2263 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>    */
>   
>   #ifndef pgd_addr_end
> -#define pgd_addr_end(pgd, addr, end)					\
> -({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> -})
> +#define pgd_addr_end pgd_addr_end

I think that #define is pointless; usually there is no such #define for
the default case.

> +static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
> +{	unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
> +	return (__boundary - 1 < end - 1) ? __boundary : end;
> +}

Please use the standard layout, i.e. the opening { and closing } alone on
their own line, and a blank line between the local variable declarations
and the rest.

Also remove the leading __ in front of the variable names, as it is no
longer needed once these are not macros anymore.

f_name()
{
	some_local_var;

	do_something();
}
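
Applied to the pgd variant, that would give something like this (just a
sketch):

static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
{
	unsigned long boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;

	return (boundary - 1 < end - 1) ? boundary : end;
}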

>   #endif
>   
>   #ifndef p4d_addr_end
> -#define p4d_addr_end(p4d, addr, end)					\
> -({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> -})
> +#define p4d_addr_end p4d_addr_end
> +static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
> +{	unsigned long __boundary = (addr + P4D_SIZE) & P4D_MASK;
> +	return (__boundary - 1 < end - 1) ? __boundary : end;
> +}
>   #endif
>   
>   #ifndef pud_addr_end
> -#define pud_addr_end(pud, addr, end)					\
> -({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> -})
> +#define pud_addr_end pud_addr_end
> +static inline unsigned long pud_addr_end(pud_t pud, unsigned long addr, unsigned long end)
> +{	unsigned long __boundary = (addr + PUD_SIZE) & PUD_MASK;
> +	return (__boundary - 1 < end - 1) ? __boundary : end;
> +}
>   #endif
>   
>   #ifndef pmd_addr_end
> -#define pmd_addr_end(pmd, addr, end)					\
> -({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> -})
> +#define pmd_addr_end pmd_addr_end
> +static inline unsigned long pmd_addr_end(pmd_t pmd, unsigned long addr, unsigned long end)
> +{	unsigned long __boundary = (addr + PMD_SIZE) & PMD_MASK;
> +	return (__boundary - 1 < end - 1) ? __boundary : end;
> +}
>   #endif
>   
>   /*
> 

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-07 20:12   ` Mike Rapoport
@ 2020-09-08  5:22     ` Christophe Leroy
  0 siblings, 0 replies; 313+ messages in thread
From: Christophe Leroy @ 2020-09-08  5:22 UTC (permalink / raw)
  To: Mike Rapoport, Gerald Schaefer
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Christian Borntraeger,
	Richard Weinberger, linux-x86, Russell King, Jason Gunthorpe,
	Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Heiko Carstens,
	Arnd Bergmann, John Hubbard, Jeff Dike, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	linux-power, LKML, Andrew Morton, Linus Torvalds



On 07/09/2020 at 22:12, Mike Rapoport wrote:
> On Mon, Sep 07, 2020 at 08:00:55PM +0200, Gerald Schaefer wrote:
>> This is v2 of an RFC previously discussed here:
>> https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/
>>
>> Patch 1 is a fix for a regression in gup_fast on s390, after our conversion
>> to common gup_fast code. It will introduce special helper functions
>> pXd_addr_end_folded(), which have to be used in places where pagetable walk
>> is done w/o lock and with READ_ONCE, so currently only in gup_fast.
>>
>> Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end()
>> themselves by adding an extra pXd value parameter. That was suggested by
>> Jason during v1 discussion, because he is already thinking of some other
>> places where he might want to switch to the READ_ONCE logic for pagetable
>> walks. In general, that would be the cleanest / safest solution, but there
>> is some impact on other architectures and common code, hence the new and
>> greatly enlarged recipient list.
>>
>> Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline
>> functions instead of #defines, so that we get some type checking for the
>> new pXd value parameter.
>>
>> Not sure about Fixes/stable tags for the generic solution. Only patch 1
>> fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might
>> still be nice to have in stable, to ease future backports, but I guess
>> "nice to have" does not really qualify for stable backports.
> 
> I also think that adding pXd parameter to pXd_addr_end() is a cleaner
> way and with this patch 1 is not really required. I would even merge
> patches 2 and 3 into a single patch and use only it as the fix.

Merging patches 2 and 3 is fine with me, but I would keep patch 1 separate,
and move it after the generic changes, so that we first do the generic
changes and then the s390-specific use of them.

Christophe

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-08  5:22     ` Christophe Leroy
  0 siblings, 0 replies; 313+ messages in thread
From: Christophe Leroy @ 2020-09-08  5:22 UTC (permalink / raw)
  To: Mike Rapoport, Gerald Schaefer
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Jason Gunthorpe, Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Andrey Ryabinin, Jeff Dike,
	Arnd Bergmann, John Hubbard, Heiko Carstens, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Linus Torvalds, LKML, Andrew Morton, linux-power



Le 07/09/2020 à 22:12, Mike Rapoport a écrit :
> On Mon, Sep 07, 2020 at 08:00:55PM +0200, Gerald Schaefer wrote:
>> This is v2 of an RFC previously discussed here:
>> https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/
>>
>> Patch 1 is a fix for a regression in gup_fast on s390, after our conversion
>> to common gup_fast code. It will introduce special helper functions
>> pXd_addr_end_folded(), which have to be used in places where pagetable walk
>> is done w/o lock and with READ_ONCE, so currently only in gup_fast.
>>
>> Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end()
>> themselves by adding an extra pXd value parameter. That was suggested by
>> Jason during v1 discussion, because he is already thinking of some other
>> places where he might want to switch to the READ_ONCE logic for pagetable
>> walks. In general, that would be the cleanest / safest solution, but there
>> is some impact on other architectures and common code, hence the new and
>> greatly enlarged recipient list.
>>
>> Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline
>> functions instead of #defines, so that we get some type checking for the
>> new pXd value parameter.
>>
>> Not sure about Fixes/stable tags for the generic solution. Only patch 1
>> fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might
>> still be nice to have in stable, to ease future backports, but I guess
>> "nice to have" does not really qualify for stable backports.
> 
> I also think that adding pXd parameter to pXd_addr_end() is a cleaner
> way and with this patch 1 is not really required. I would even merge
> patches 2 and 3 into a single patch and use only it as the fix.

Why not merging patches 2 and 3, but I would keep patch 1 separate but 
after the generic changes, so that we first do the generic changes, then 
we do the specific S390 use of it.

Christophe

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-08  5:22     ` Christophe Leroy
  0 siblings, 0 replies; 313+ messages in thread
From: Christophe Leroy @ 2020-09-08  5:22 UTC (permalink / raw)
  To: Mike Rapoport, Gerald Schaefer
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Jason Gunthorpe, Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Andrey Ryabinin, Jeff Dike,
	Arnd Bergmann, John Hubbard, Heiko Carstens, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Linus Torvalds, LKML, Andrew Morton, linux-power



Le 07/09/2020 à 22:12, Mike Rapoport a écrit :
> On Mon, Sep 07, 2020 at 08:00:55PM +0200, Gerald Schaefer wrote:
>> This is v2 of an RFC previously discussed here:
>> https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/
>>
>> Patch 1 is a fix for a regression in gup_fast on s390, after our conversion
>> to common gup_fast code. It will introduce special helper functions
>> pXd_addr_end_folded(), which have to be used in places where pagetable walk
>> is done w/o lock and with READ_ONCE, so currently only in gup_fast.
>>
>> Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end()
>> themselves by adding an extra pXd value parameter. That was suggested by
>> Jason during v1 discussion, because he is already thinking of some other
>> places where he might want to switch to the READ_ONCE logic for pagetable
>> walks. In general, that would be the cleanest / safest solution, but there
>> is some impact on other architectures and common code, hence the new and
>> greatly enlarged recipient list.
>>
>> Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline
>> functions instead of #defines, so that we get some type checking for the
>> new pXd value parameter.
>>
>> Not sure about Fixes/stable tags for the generic solution. Only patch 1
>> fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might
>> still be nice to have in stable, to ease future backports, but I guess
>> "nice to have" does not really qualify for stable backports.
> 
> I also think that adding pXd parameter to pXd_addr_end() is a cleaner
> way and with this patch 1 is not really required. I would even merge
> patches 2 and 3 into a single patch and use only it as the fix.

Why not merging patches 2 and 3, but I would keep patch 1 separate but 
after the generic changes, so that we first do the generic changes, then 
we do the specific S390 use of it.

Christophe

_______________________________________________
linux-um mailing list
linux-um@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-um

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-08  5:14     ` Christophe Leroy
                         ` (2 preceding siblings ...)
  (?)
@ 2020-09-08  7:46       ` Alexander Gordeev
  -1 siblings, 0 replies; 313+ messages in thread
From: Alexander Gordeev @ 2020-09-08  7:46 UTC (permalink / raw)
  To: Christophe Leroy, Michael Ellerman
  Cc: Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra,
	Dave Hansen, linux-mm, Paul Mackerras, linux-sparc,
	Claudio Imbrenda, Will Deacon, linux-arch, linux-s390,
	Vasily Gorbik, Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport

On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote:
> You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems.

Yes, and also two more sources :/
	arch/powerpc/mm/kasan/8xx.c
	arch/powerpc/mm/kasan/kasan_init_32.c

But these two are not quite obvious wrt pgd_addr_end() used
while traversing pmds. Could you please clarify a bit?


diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
index 2784224..89c5053 100644
--- a/arch/powerpc/mm/kasan/8xx.c
+++ b/arch/powerpc/mm/kasan/8xx.c
@@ -15,8 +15,8 @@
 	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
 		pte_basic_t *new;
 
-		k_next = pgd_addr_end(k_cur, k_end);
-		k_next = pgd_addr_end(k_next, k_end);
+		k_next = pmd_addr_end(k_cur, k_end);
+		k_next = pmd_addr_end(k_next, k_end);
 		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
 			continue;
 
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c
index fb29404..3f7d6dc6 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_
 	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
 		pte_t *new;
 
-		k_next = pgd_addr_end(k_cur, k_end);
+		k_next = pmd_addr_end(k_cur, k_end);
 		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
 			continue;
 
@@ -196,7 +196,7 @@ void __init kasan_early_init(void)
 	kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pmd_addr_end(addr, end);
 		pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte);
 	} while (pmd++, addr = next, addr != end);
 

> Christophe

^ permalink raw reply related	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-08  7:46       ` Alexander Gordeev
                           ` (2 preceding siblings ...)
  (?)
@ 2020-09-08  8:16         ` Christophe Leroy
  -1 siblings, 0 replies; 313+ messages in thread
From: Christophe Leroy @ 2020-09-08  8:16 UTC (permalink / raw)
  To: Alexander Gordeev, Michael Ellerman
  Cc: Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra,
	Dave Hansen, linux-mm, Paul Mackerras, linux-sparc,
	Claudio Imbrenda, Will Deacon, linux-arch, linux-s390,
	Vasily Gorbik, Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport



On 08/09/2020 at 09:46, Alexander Gordeev wrote:
> On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote:
>> You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems.
> 
> Yes, and also two more sources :/
> 	arch/powerpc/mm/kasan/8xx.c
> 	arch/powerpc/mm/kasan/kasan_init_32.c
> 
> But these two are not quite obvious wrt pgd_addr_end() used
> while traversing pmds. Could you please clarify a bit?
> 
> 
> diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
> index 2784224..89c5053 100644
> --- a/arch/powerpc/mm/kasan/8xx.c
> +++ b/arch/powerpc/mm/kasan/8xx.c
> @@ -15,8 +15,8 @@
>   	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
>   		pte_basic_t *new;
>   
> -		k_next = pgd_addr_end(k_cur, k_end);
> -		k_next = pgd_addr_end(k_next, k_end);
> +		k_next = pmd_addr_end(k_cur, k_end);
> +		k_next = pmd_addr_end(k_next, k_end);

No, I don't think so.
On powerpc32 we have only two levels, so pgd and pmd are more or less 
the same.
But pmd_addr_end() as defined in include/asm-generic/pgtable-nopmd.h is 
a no-op, so I don't think it will work.

It is likely that this function should iterate on pgd, then you get pmd 
= pmd_offset(pud_offset(p4d_offset(pgd)));
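
For illustration, a minimal sketch of what that pgd-based iteration could
look like for the kasan_init_32.c loop quoted further down (names are taken
from the existing code, but this is only a sketch of the idea, not the
actual patch):

	/* Sketch only: iterate at pgd level and derive the pmd at each step. */
	static int __init kasan_init_shadow_page_tables_sketch(unsigned long k_start,
							       unsigned long k_end)
	{
		pgd_t *pgd = pgd_offset_k(k_start);
		unsigned long k_cur, k_next;

		for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pgd++) {
			pmd_t *pmd = pmd_offset(pud_offset(p4d_offset(pgd, k_cur), k_cur),
						k_cur);

			/* pgd_addr_end() splits the range at pgd granularity */
			k_next = pgd_addr_end(k_cur, k_end);
			if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
				continue;
			/* ... allocate and populate the shadow pte table as before ... */
		}
		return 0;
	}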

>   		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
>   			continue;
>   
> diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c
> index fb29404..3f7d6dc6 100644
> --- a/arch/powerpc/mm/kasan/kasan_init_32.c
> +++ b/arch/powerpc/mm/kasan/kasan_init_32.c
> @@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_
>   	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
>   		pte_t *new;
>   
> -		k_next = pgd_addr_end(k_cur, k_end);
> +		k_next = pmd_addr_end(k_cur, k_end);

Same here I guess: iterate on pgd, then get pmd = 
pmd_offset(pud_offset(p4d_offset(pgd)));

>   		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
>   			continue;
>   
> @@ -196,7 +196,7 @@ void __init kasan_early_init(void)
>   	kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
>   
>   	do {
> -		next = pgd_addr_end(addr, end);
> +		next = pmd_addr_end(addr, end);
>   		pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte);
>   	} while (pmd++, addr = next, addr != end);
>   
> 

Christophe

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-08  5:06     ` Christophe Leroy
                         ` (2 preceding siblings ...)
  (?)
@ 2020-09-08 12:09       ` Christian Borntraeger
  -1 siblings, 0 replies; 313+ messages in thread
From: Christian Borntraeger @ 2020-09-08 12:09 UTC (permalink / raw)
  To: Christophe Leroy, Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport



On 08.09.20 07:06, Christophe Leroy wrote:
> 
> 
> On 07/09/2020 at 20:00, Gerald Schaefer wrote:
>> From: Alexander Gordeev <agordeev@linux.ibm.com>
>>
>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
>> dynamic page table folding.
>>
>> The question "What would it require for the generic code to work for s390"
>> has already been discussed here
>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
>> and ended with a promising approach here
>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
>> which in the end unfortunately didn't quite work completely.
>>
>> We tried to mimic static level folding by changing pgd_offset to always
>> calculate top level page table offset, and do nothing in folded pXd_offset.
>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
>> not reflect this dynamic behaviour, and still act like static 5-level
>> page tables.
>>
> 
> [...]
> 
>>
>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
>> additional pXd entry value parameter, that can be used on s390
>> to determine the correct page table level and return corresponding
>> end / boundary. With that, the pointer iteration will always
>> happen in gup_pgd_range for s390. No change for other architectures
>> introduced.
> 
> Not sure pXd_addr_end_folded() is the best understandable name, allthough I don't have any alternative suggestion at the moment.
> Maybe could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ?
> 
> Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded()

Given that this fixes a data corruption issue, wouldn't it be best to go forward
with this patch ASAP and then handle the other patches on top, with all the time
that we need?
> 
> 
>>
>> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
>> Cc: <stable@vger.kernel.org> # 5.2+
>> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
>> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
>> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
>> ---
>>   arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++
>>   include/linux/pgtable.h         | 16 +++++++++++++
>>   mm/gup.c                        |  8 +++----
>>   3 files changed, 62 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
>> index 7eb01a5459cd..027206e4959d 100644
>> --- a/arch/s390/include/asm/pgtable.h
>> +++ b/arch/s390/include/asm/pgtable.h
>> @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
>>   }
>>   #define mm_pmd_folded(mm) mm_pmd_folded(mm)
>>   +/*
>> + * With dynamic page table levels on s390, the static pXd_addr_end() functions
>> + * will not return corresponding dynamic boundaries. This is no problem as long
>> + * as only pXd pointers are passed down during page table walk, because
>> + * pXd_offset() will simply return the given pointer for folded levels, and the
>> + * pointer iteration over a range simply happens at the correct page table
>> + * level.
>> + * It is however a problem with gup_fast, or other places walking the page
>> + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
>> + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
>> + * a stack variable, which cannot be used for pointer iteration at the correct
>> + * level. Instead, the iteration then has to happen by going up to pgd level
>> + * again. To allow this, provide pXd_addr_end_folded() functions with an
>> + * additional pXd value parameter, which can be used on s390 to determine the
>> + * folding level and return the corresponding boundary.
>> + */
>> +static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end)
> 
> What does 'rste' stands for ?
> 
> Isn't this line a bit long ?

"rste" stands for region/segment table entry, according to the architecture.
On our platform the page tables have a different format than the next
levels (segment table -> 1 MB granularity, region 3rd table -> 2 GB
granularity, region 2nd table -> 4 TB granularity, region 1st table -> 8 PB
granularity). ST, R3, R2 and R1 have the same format and are thus often called
crste (combined region and segment table entry).
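
Just to illustrate the idea of deriving the boundary from the entry value,
a rough sketch (the helper name, the type-bit extraction and the constants
below are made up for illustration and do not match the actual s390
implementation):

	/*
	 * Hypothetical helper: pick the step size from the table type encoded
	 * in the region/segment table entry and clamp the boundary to 'end'.
	 * Levels 0..3 map to 1 MB, 2 GB, 4 TB and 8 PB steps (20 + 11 * level bits).
	 */
	static inline unsigned long entry_addr_end_sketch(unsigned long entry,
							  unsigned long addr,
							  unsigned long end)
	{
		unsigned int level = (entry >> 2) & 0x3;	/* assumed TT bits */
		unsigned long size = 1UL << (20 + 11 * level);
		unsigned long boundary = (addr + size) & ~(size - 1);

		/* overflow-safe clamp, like the generic pXd_addr_end() macros */
		return (boundary - 1 < end - 1) ? boundary : end;
	}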

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-08 12:09       ` Christian Borntraeger
  0 siblings, 0 replies; 313+ messages in thread
From: Christian Borntraeger @ 2020-09-08 12:09 UTC (permalink / raw)
  To: Christophe Leroy, Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Richard Weinberger, linux-x86, Russell King, Ingo Molnar,
	Andrey Ryabinin, Jeff Dike, Arnd Bergmann, Heiko Carstens,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, Linus Torvalds, LKML, Andrew Morton, linux-power,
	Mike Rapoport



On 08.09.20 07:06, Christophe Leroy wrote:
> 
> 
> Le 07/09/2020 à 20:00, Gerald Schaefer a écrit :
>> From: Alexander Gordeev <agordeev@linux.ibm.com>
>>
>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
>> dynamic page table folding.
>>
>> The question "What would it require for the generic code to work for s390"
>> has already been discussed here
>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
>> and ended with a promising approach here
>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
>> which in the end unfortunately didn't quite work completely.
>>
>> We tried to mimic static level folding by changing pgd_offset to always
>> calculate top level page table offset, and do nothing in folded pXd_offset.
>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
>> not reflect this dynamic behaviour, and still act like static 5-level
>> page tables.
>>
> 
> [...]
> 
>>
>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
>> additional pXd entry value parameter, that can be used on s390
>> to determine the correct page table level and return corresponding
>> end / boundary. With that, the pointer iteration will always
>> happen in gup_pgd_range for s390. No change for other architectures
>> introduced.
> 
> Not sure pXd_addr_end_folded() is the best understandable name, allthough I don't have any alternative suggestion at the moment.
> Maybe could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ?
> 
> Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded()

given that this fixes a data corruption issue, wouldnt it be the best to go forward
with this patch ASAP and then handle the other patches on top with all the time that
we need?
> 
> 
>>
>> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
>> Cc: <stable@vger.kernel.org> # 5.2+
>> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
>> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
>> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
>> ---
>>   arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++
>>   include/linux/pgtable.h         | 16 +++++++++++++
>>   mm/gup.c                        |  8 +++----
>>   3 files changed, 62 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
>> index 7eb01a5459cd..027206e4959d 100644
>> --- a/arch/s390/include/asm/pgtable.h
>> +++ b/arch/s390/include/asm/pgtable.h
>> @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
>>   }
>>   #define mm_pmd_folded(mm) mm_pmd_folded(mm)
>>   +/*
>> + * With dynamic page table levels on s390, the static pXd_addr_end() functions
>> + * will not return corresponding dynamic boundaries. This is no problem as long
>> + * as only pXd pointers are passed down during page table walk, because
>> + * pXd_offset() will simply return the given pointer for folded levels, and the
>> + * pointer iteration over a range simply happens at the correct page table
>> + * level.
>> + * It is however a problem with gup_fast, or other places walking the page
>> + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
>> + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
>> + * a stack variable, which cannot be used for pointer iteration at the correct
>> + * level. Instead, the iteration then has to happen by going up to pgd level
>> + * again. To allow this, provide pXd_addr_end_folded() functions with an
>> + * additional pXd value parameter, which can be used on s390 to determine the
>> + * folding level and return the corresponding boundary.
>> + */
>> +static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end)
> 
> What does 'rste' stands for ?
> 
> Isn't this line a bit long ?

this is region/segment table entry according to the architecture. 
On our platform we do have the pagetables with a different format that
next levels (segment table -> 1MB granularity, region 3rd table -> 2 GB
granularity, region 2nd table -> 4TB granularity, region 1st table -> 8 PB
granularity. ST,R3,R2,R1 have the same format and are thus often called
crste (combined region and segment table entry).

_______________________________________________
linux-um mailing list
linux-um@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-um

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-08 12:09       ` Christian Borntraeger
@ 2020-09-08 12:40         ` Christophe Leroy
  -1 siblings, 0 replies; 313+ messages in thread
From: Christophe Leroy @ 2020-09-08 12:40 UTC (permalink / raw)
  To: Christian Borntraeger, Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport



On 08/09/2020 at 14:09, Christian Borntraeger wrote:
> 
> 
> On 08.09.20 07:06, Christophe Leroy wrote:
>>
>>
>> On 07/09/2020 at 20:00, Gerald Schaefer wrote:
>>> From: Alexander Gordeev <agordeev@linux.ibm.com>
>>>
>>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
>>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
>>> dynamic page table folding.
>>>
>>> The question "What would it require for the generic code to work for s390"
>>> has already been discussed here
>>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
>>> and ended with a promising approach here
>>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
>>> which in the end unfortunately didn't quite work completely.
>>>
>>> We tried to mimic static level folding by changing pgd_offset to always
>>> calculate top level page table offset, and do nothing in folded pXd_offset.
>>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
>>> not reflect this dynamic behaviour, and still act like static 5-level
>>> page tables.
>>>
>>
>> [...]
>>
>>>
>>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
>>> additional pXd entry value parameter, that can be used on s390
>>> to determine the correct page table level and return corresponding
>>> end / boundary. With that, the pointer iteration will always
>>> happen in gup_pgd_range for s390. No change for other architectures
>>> introduced.
>>
>> Not sure pXd_addr_end_folded() is the most understandable name, although I don't have any alternative suggestion at the moment.
>> Maybe it could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ?
>>
>> Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded()
> 
> given that this fixes a data corruption issue, wouldn't it be best to go forward
> with this patch ASAP and then handle the other patches on top with all the time that
> we need?

I have no strong opinion on this, but I feel it is rather tricky to have to
change the generic part of GUP to use a new function and then revert that
change in the following patch, just because you want the first patch in
stable and not the second one.

Regardless, I was wondering, why do we need a reference to the pXd at 
all when calling pXd_addr_end() ?

Couldn't S390 retrieve the pXd by using the pXd_offset() dance with the 
passed addr ?

Christophe

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-07 18:00   ` Gerald Schaefer
@ 2020-09-08 13:26     ` Jason Gunthorpe
  -1 siblings, 0 replies; 313+ messages in thread
From: Jason Gunthorpe @ 2020-09-08 13:26 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton,
	Linus Torvalds, Russell King, Mike Rapoport, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Andrey Ryabinin, linux-x86,
	linux-arm, linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Mon, Sep 07, 2020 at 08:00:57PM +0200, Gerald Schaefer wrote:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Unlike all other page-table abstractions, the pXd_addr_end() functions do
> not take into account the particular table entry in whose context they are
> called. On architectures with dynamic page-table folding that might lead
> to a lack of necessary information that is difficult to obtain other than
> from the table entry itself. That already led to a subtle memory
> corruption issue on s390.
> 
> By letting the pXd_addr_end() functions know about the page-table entry,
> we allow archs not only to make extra checks, but also optimizations.
> 
> As a result of this change, the pXd_addr_end_folded() functions used in
> the gup_fast traversal code become unnecessary and get replaced with the
> universal pXd_addr_end() variants.
> 
> The arch-specific updates not only add dereferencing of page-table entry
> pointers, but also small changes to the code flow to make those
> dereferences possible, at least for x86 and powerpc. Also for arm64, but
> in a way that should not have any impact.
> 
> So, even though the dereferenced page-table entries are not used on
> archs other than s390, and are optimized out by the compiler, there
> is a small change in kernel size and this is what bloat-o-meter reports:

This looks pretty clean and straightforward; only
__munlock_pagevec_fill() had any real increase in complexity.

Thanks,
Jason
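
For reference, the entry-aware generic variant discussed here would look
roughly like the sketch below. It is not the actual patch, only an
illustration: the body mirrors the classic boundary computation using the
usual PGDIR_SIZE/PGDIR_MASK macros, with the extra, so far unused, entry
parameter that an arch like s390 can override and evaluate:

#ifndef pgd_addr_end
static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr,
					 unsigned long end)
{
	/* same step size as the classic macro; the pgd value is unused here */
	unsigned long boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;

	return (boundary - 1) < (end - 1) ? boundary : end;
}
#define pgd_addr_end pgd_addr_end
#endif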

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-08 12:40         ` Christophe Leroy
@ 2020-09-08 13:38           ` Gerald Schaefer
  -1 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-08 13:38 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Christian Borntraeger, Jason Gunthorpe, John Hubbard,
	Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport

On Tue, 8 Sep 2020 14:40:10 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:

> 
> 
> On 08/09/2020 at 14:09, Christian Borntraeger wrote:
> > 
> > 
> > On 08.09.20 07:06, Christophe Leroy wrote:
> >>
> >>
> >> On 07/09/2020 at 20:00, Gerald Schaefer wrote:
> >>> From: Alexander Gordeev <agordeev@linux.ibm.com>
> >>>
> >>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> >>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> >>> dynamic page table folding.
> >>>
> >>> The question "What would it require for the generic code to work for s390"
> >>> has already been discussed here
> >>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> >>> and ended with a promising approach here
> >>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
> >>> which in the end unfortunately didn't quite work completely.
> >>>
> >>> We tried to mimic static level folding by changing pgd_offset to always
> >>> calculate top level page table offset, and do nothing in folded pXd_offset.
> >>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
> >>> not reflect this dynamic behaviour, and still act like static 5-level
> >>> page tables.
> >>>
> >>
> >> [...]
> >>
> >>>
> >>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
> >>> additional pXd entry value parameter, that can be used on s390
> >>> to determine the correct page table level and return corresponding
> >>> end / boundary. With that, the pointer iteration will always
> >>> happen in gup_pgd_range for s390. No change for other architectures
> >>> introduced.
> >>
> >> Not sure pXd_addr_end_folded() is the most understandable name, although I don't have any alternative suggestion at the moment.
> >> Maybe it could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ?
> >>
> >> Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded()
> > 
> > given that this fixes a data corruption issue, wouldn't it be best to go forward
> > with this patch ASAP and then handle the other patches on top with all the time that
> > we need?
> 
> I have no strong opinion on this, but I feel it is rather tricky to have to
> change the generic part of GUP to use a new function and then revert that
> change in the following patch, just because you want the first patch in
> stable and not the second one.
> 
> Regardless, I was wondering, why do we need a reference to the pXd at 
> all when calling pXd_addr_end() ?
> 
> Couldn't S390 retrieve the pXd by using the pXd_offset() dance with the 
> passed addr ?

Apart from the performance impact of redoing what has already been done
by the caller, I think we would also break the READ_ONCE semantics.
After all, pXd_offset() would also require some pXd pointer input,
which we don't have. So we would need to start over again from mm->pgd.

Also, it seems to be more in line with other primitives that take
a pXd value or pointer.
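
For illustration, the lockless walk pattern in question looks roughly like
the sketch below. This is a simplified, hypothetical walker (the function
name is made up), not the actual mm/gup.c code, and it assumes the
entry-aware pgd_addr_end(pgd, addr, end) signature from this series:

static int walk_pgd_range_lockless(struct mm_struct *mm, unsigned long addr,
				   unsigned long end)
{
	pgd_t *pgdp = pgd_offset(mm, addr);
	unsigned long next;

	do {
		/* snapshot the entry once, no locks are held */
		pgd_t pgd = READ_ONCE(*pgdp);

		/*
		 * The boundary has to be derived from the snapshotted value:
		 * with dynamic folding, the pointer alone does not tell us
		 * at which level we really are.
		 */
		next = pgd_addr_end(pgd, addr, end);
		if (pgd_none(pgd))
			return 0;
		/* descend further using the value 'pgd', not '*pgdp' ... */
	} while (pgdp++, addr = next, addr != end);

	return 1;
}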

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-08 13:38           ` Gerald Schaefer
  0 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-08 13:38 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Christian Borntraeger, Jason Gunthorpe, John Hubbard,
	Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport

On Tue, 8 Sep 2020 14:40:10 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:

> 
> 
> Le 08/09/2020 à 14:09, Christian Borntraeger a écrit :
> > 
> > 
> > On 08.09.20 07:06, Christophe Leroy wrote:
> >>
> >>
> >> Le 07/09/2020 à 20:00, Gerald Schaefer a écrit :
> >>> From: Alexander Gordeev <agordeev@linux.ibm.com>
> >>>
> >>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> >>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> >>> dynamic page table folding.
> >>>
> >>> The question "What would it require for the generic code to work for s390"
> >>> has already been discussed here
> >>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> >>> and ended with a promising approach here
> >>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
> >>> which in the end unfortunately didn't quite work completely.
> >>>
> >>> We tried to mimic static level folding by changing pgd_offset to always
> >>> calculate top level page table offset, and do nothing in folded pXd_offset.
> >>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
> >>> not reflect this dynamic behaviour, and still act like static 5-level
> >>> page tables.
> >>>
> >>
> >> [...]
> >>
> >>>
> >>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
> >>> additional pXd entry value parameter, that can be used on s390
> >>> to determine the correct page table level and return corresponding
> >>> end / boundary. With that, the pointer iteration will always
> >>> happen in gup_pgd_range for s390. No change for other architectures
> >>> introduced.
> >>
> >> Not sure pXd_addr_end_folded() is the best understandable name, allthough I don't have any alternative suggestion at the moment.
> >> Maybe could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ?
> >>
> >> Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded()
> > 
> > given that this fixes a data corruption issue, wouldnt it be the best to go forward
> > with this patch ASAP and then handle the other patches on top with all the time that
> > we need?
> 
> I have no strong opinion on this, but I feel rather tricky to have to 
> change generic part of GUP to use a new fonction then revert that change 
> in the following patch, just because you want the first patch in stable 
> and not the second one.
> 
> Regardless, I was wondering, why do we need a reference to the pXd at 
> all when calling pXd_addr_end() ?
> 
> Couldn't S390 retrieve the pXd by using the pXd_offset() dance with the 
> passed addr ?

Apart from performance impact when re-doing that what has already been
done by the caller, I think we would also break the READ_ONCE semantics.
After all, the pXd_offset() would also require some pXd pointer input,
which we don't have. So we would need to start over again from mm->pgd.

Also, it seems to be more in line with other primitives that take
a pXd value or pointer.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-08 13:38           ` Gerald Schaefer
  0 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-08 13:38 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Christian Borntraeger, Richard Weinberger, linux-x86,
	Russell King, Jason Gunthorpe, Ingo Molnar, Andrey Ryabinin,
	Jeff Dike, Arnd Bergmann, John Hubbard, Heiko Carstens, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Linus Torvalds, LKML, Andrew Morton, linux-power, Mike Rapoport

On Tue, 8 Sep 2020 14:40:10 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:

> 
> 
> Le 08/09/2020 à 14:09, Christian Borntraeger a écrit :
> > 
> > 
> > On 08.09.20 07:06, Christophe Leroy wrote:
> >>
> >>
> >> Le 07/09/2020 à 20:00, Gerald Schaefer a écrit :
> >>> From: Alexander Gordeev <agordeev@linux.ibm.com>
> >>>
> >>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> >>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> >>> dynamic page table folding.
> >>>
> >>> The question "What would it require for the generic code to work for s390"
> >>> has already been discussed here
> >>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> >>> and ended with a promising approach here
> >>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
> >>> which in the end unfortunately didn't quite work completely.
> >>>
> >>> We tried to mimic static level folding by changing pgd_offset to always
> >>> calculate top level page table offset, and do nothing in folded pXd_offset.
> >>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
> >>> not reflect this dynamic behaviour, and still act like static 5-level
> >>> page tables.
> >>>
> >>
> >> [...]
> >>
> >>>
> >>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
> >>> additional pXd entry value parameter, that can be used on s390
> >>> to determine the correct page table level and return corresponding
> >>> end / boundary. With that, the pointer iteration will always
> >>> happen in gup_pgd_range for s390. No change for other architectures
> >>> introduced.
> >>
> >> Not sure pXd_addr_end_folded() is the best understandable name, allthough I don't have any alternative suggestion at the moment.
> >> Maybe could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ?
> >>
> >> Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded()
> > 
> > given that this fixes a data corruption issue, wouldnt it be the best to go forward
> > with this patch ASAP and then handle the other patches on top with all the time that
> > we need?
> 
> I have no strong opinion on this, but I feel rather tricky to have to 
> change generic part of GUP to use a new fonction then revert that change 
> in the following patch, just because you want the first patch in stable 
> and not the second one.
> 
> Regardless, I was wondering, why do we need a reference to the pXd at 
> all when calling pXd_addr_end() ?
> 
> Couldn't S390 retrieve the pXd by using the pXd_offset() dance with the 
> passed addr ?

Apart from performance impact when re-doing that what has already been
done by the caller, I think we would also break the READ_ONCE semantics.
After all, the pXd_offset() would also require some pXd pointer input,
which we don't have. So we would need to start over again from mm->pgd.

Also, it seems to be more in line with other primitives that take
a pXd value or pointer.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-08 13:38           ` Gerald Schaefer
  0 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-08 13:38 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Christian Borntraeger, Richard Weinberger, linux-x86,
	Russell King, Jason Gunthorpe, Ingo Molnar, Andrey Ryabinin,
	Jeff Dike, Arnd Bergmann, John Hubbard, Heiko Carstens, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Linus Torvalds, LKML, Andrew Morton, linux-power, Mike Rapoport

On Tue, 8 Sep 2020 14:40:10 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:

> 
> 
> Le 08/09/2020 à 14:09, Christian Borntraeger a écrit :
> > 
> > 
> > On 08.09.20 07:06, Christophe Leroy wrote:
> >>
> >>
> >> Le 07/09/2020 à 20:00, Gerald Schaefer a écrit :
> >>> From: Alexander Gordeev <agordeev@linux.ibm.com>
> >>>
> >>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> >>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> >>> dynamic page table folding.
> >>>
> >>> The question "What would it require for the generic code to work for s390"
> >>> has already been discussed here
> >>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> >>> and ended with a promising approach here
> >>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
> >>> which in the end unfortunately didn't quite work completely.
> >>>
> >>> We tried to mimic static level folding by changing pgd_offset to always
> >>> calculate top level page table offset, and do nothing in folded pXd_offset.
> >>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
> >>> not reflect this dynamic behaviour, and still act like static 5-level
> >>> page tables.
> >>>
> >>
> >> [...]
> >>
> >>>
> >>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
> >>> additional pXd entry value parameter, that can be used on s390
> >>> to determine the correct page table level and return corresponding
> >>> end / boundary. With that, the pointer iteration will always
> >>> happen in gup_pgd_range for s390. No change for other architectures
> >>> introduced.
> >>
> >> Not sure pXd_addr_end_folded() is the best understandable name, allthough I don't have any alternative suggestion at the moment.
> >> Maybe could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ?
> >>
> >> Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded()
> > 
> > given that this fixes a data corruption issue, wouldnt it be the best to go forward
> > with this patch ASAP and then handle the other patches on top with all the time that
> > we need?
> 
> I have no strong opinion on this, but I feel rather tricky to have to 
> change generic part of GUP to use a new fonction then revert that change 
> in the following patch, just because you want the first patch in stable 
> and not the second one.
> 
> Regardless, I was wondering, why do we need a reference to the pXd at 
> all when calling pXd_addr_end() ?
> 
> Couldn't S390 retrieve the pXd by using the pXd_offset() dance with the 
> passed addr ?

Apart from performance impact when re-doing that what has already been
done by the caller, I think we would also break the READ_ONCE semantics.
After all, the pXd_offset() would also require some pXd pointer input,
which we don't have. So we would need to start over again from mm->pgd.

Also, it seems to be more in line with other primitives that take
a pXd value or pointer.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-08 13:38           ` Gerald Schaefer
  0 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-08 13:38 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Christian Borntraeger, Richard Weinberger, linux-x86,
	Russell King, Jason Gunthorpe, Ingo Molnar, Andrey Ryabinin,
	Jeff Dike, Arnd Bergmann, John Hubbard, Heiko Carstens, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Linus Torvalds, LKML, Andrew Morton, linux-power, Mike Rapoport

On Tue, 8 Sep 2020 14:40:10 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:

> 
> 
> Le 08/09/2020 à 14:09, Christian Borntraeger a écrit :
> > 
> > 
> > On 08.09.20 07:06, Christophe Leroy wrote:
> >>
> >>
> >> Le 07/09/2020 à 20:00, Gerald Schaefer a écrit :
> >>> From: Alexander Gordeev <agordeev@linux.ibm.com>
> >>>
> >>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> >>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> >>> dynamic page table folding.
> >>>
> >>> The question "What would it require for the generic code to work for s390"
> >>> has already been discussed here
> >>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> >>> and ended with a promising approach here
> >>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
> >>> which in the end unfortunately didn't quite work completely.
> >>>
> >>> We tried to mimic static level folding by changing pgd_offset to always
> >>> calculate top level page table offset, and do nothing in folded pXd_offset.
> >>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
> >>> not reflect this dynamic behaviour, and still act like static 5-level
> >>> page tables.
> >>>
> >>
> >> [...]
> >>
> >>>
> >>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
> >>> additional pXd entry value parameter, that can be used on s390
> >>> to determine the correct page table level and return corresponding
> >>> end / boundary. With that, the pointer iteration will always
> >>> happen in gup_pgd_range for s390. No change for other architectures
> >>> introduced.
> >>
> >> Not sure pXd_addr_end_folded() is the best understandable name, allthough I don't have any alternative suggestion at the moment.
> >> Maybe could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ?
> >>
> >> Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded()
> > 
> > given that this fixes a data corruption issue, wouldnt it be the best to go forward
> > with this patch ASAP and then handle the other patches on top with all the time that
> > we need?
> 
> I have no strong opinion on this, but I feel rather tricky to have to 
> change generic part of GUP to use a new fonction then revert that change 
> in the following patch, just because you want the first patch in stable 
> and not the second one.
> 
> Regardless, I was wondering, why do we need a reference to the pXd at 
> all when calling pXd_addr_end() ?
> 
> Couldn't S390 retrieve the pXd by using the pXd_offset() dance with the 
> passed addr ?

Apart from the performance impact of re-doing what the caller has already
done, I think we would also break the READ_ONCE semantics. After all,
pXd_offset() would also require some pXd pointer input, which we don't
have, so we would need to start over again from mm->pgd.

Also, passing the entry seems to be more in line with other primitives
that take a pXd value or pointer.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-08  8:16         ` Christophe Leroy
                             ` (2 preceding siblings ...)
  (?)
@ 2020-09-08 14:15           ` Alexander Gordeev
  -1 siblings, 0 replies; 313+ messages in thread
From: Alexander Gordeev @ 2020-09-08 14:15 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Michael Ellerman, Gerald Schaefer, Jason Gunthorpe, John Hubbard,
	Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Claudio Imbrenda, Will Deacon, linux-arch,
	linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86,
	Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport

On Tue, Sep 08, 2020 at 10:16:49AM +0200, Christophe Leroy wrote:
> >Yes, and also two more sources :/
> >	arch/powerpc/mm/kasan/8xx.c
> >	arch/powerpc/mm/kasan/kasan_init_32.c
> >
> >But these two are not quite obvious wrt pgd_addr_end() used
> >while traversing pmds. Could you please clarify a bit?
> >
> >
> >diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
> >index 2784224..89c5053 100644
> >--- a/arch/powerpc/mm/kasan/8xx.c
> >+++ b/arch/powerpc/mm/kasan/8xx.c
> >@@ -15,8 +15,8 @@
> >  	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
> >  		pte_basic_t *new;
> >-		k_next = pgd_addr_end(k_cur, k_end);
> >-		k_next = pgd_addr_end(k_next, k_end);
> >+		k_next = pmd_addr_end(k_cur, k_end);
> >+		k_next = pmd_addr_end(k_next, k_end);
> 
> No, I don't think so.
> On powerpc32 we have only two levels, so pgd and pmd are more or
> less the same.
> But pmd_addr_end() as defined in include/asm-generic/pgtable-nopmd.h
> is a no-op, so I don't think it will work.
> 
> It is likely that this function should iterate on pgd, then you get
> pmd = pmd_offset(pud_offset(p4d_offset(pgd)));

It looks like the code iterates over a single pmd table while using
pgd_addr_end() only to skip all the middle levels and bail out
from the loop.

I would be wary of switching from pmds to pgds, since we are
trying to minimize the impact (especially functional) and the
rework does not seem that obvious.

Assuming pmd and pgd are the same, would such an approach
actually work for now?

diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
index 2784224..94466cc 100644
--- a/arch/powerpc/mm/kasan/8xx.c
+++ b/arch/powerpc/mm/kasan/8xx.c
@@ -15,8 +15,8 @@
 	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
 		pte_basic_t *new;
 
-		k_next = pgd_addr_end(k_cur, k_end);
-		k_next = pgd_addr_end(k_next, k_end);
+		k_next = pgd_addr_end(__pgd(pmd_val(*pmd)), k_cur, k_end);
+		k_next = pgd_addr_end(__pgd(pmd_val(*(pmd + 1))), k_next, k_end);
 		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
 			continue;
 
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c
index fb29404..c0bcd64 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_
 	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
 		pte_t *new;
 
-		k_next = pgd_addr_end(k_cur, k_end);
+		k_next = pgd_addr_end(__pgd(pmd_val(*pmd)), k_cur, k_end);
 		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
 			continue;
 
@@ -196,7 +196,7 @@ void __init kasan_early_init(void)
 	kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(__pgd(pmd_val(*pmd)), addr, end);
 		pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte);
 	} while (pmd++, addr = next, addr != end);
 

Alternatively, we could pass an invalid pgd to keep the code structure
intact, but that is of course less nice.
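
Purely as a rough, untested illustration of that alternative (taking
__pgd(0) as the "invalid" entry), the 8xx loop would keep its shape and
just feed a dummy value:

	k_next = pgd_addr_end(__pgd(0), k_cur, k_end);
	k_next = pgd_addr_end(__pgd(0), k_next, k_end);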

Thanks!

> Christophe

^ permalink raw reply related	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-08  5:14     ` Christophe Leroy
                         ` (2 preceding siblings ...)
  (?)
@ 2020-09-08 14:25       ` Alexander Gordeev
  -1 siblings, 0 replies; 313+ messages in thread
From: Alexander Gordeev @ 2020-09-08 14:25 UTC (permalink / raw)
  To: Christophe Leroy, Michael Ellerman
  Cc: Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra,
	Dave Hansen, linux-mm, Paul Mackerras, linux-sparc,
	Claudio Imbrenda, Will Deacon, linux-arch, linux-s390,
	Vasily Gorbik, Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport

On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote:
[...]
> You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems.

Would this one be okay?

diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c
index 60c6ea16..3690d22 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -88,6 +88,7 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned long addr,
 static void subpage_prot_clear(unsigned long addr, unsigned long len)
 {
 	struct mm_struct *mm = current->mm;
+	pmd_t *pmd = pmd_off(mm, addr);
 	struct subpage_prot_table *spt;
 	u32 **spm, *spp;
 	unsigned long i;
@@ -103,8 +104,8 @@ static void subpage_prot_clear(unsigned long addr, unsigned long len)
 	limit = addr + len;
 	if (limit > spt->maxaddr)
 		limit = spt->maxaddr;
-	for (; addr < limit; addr = next) {
-		next = pmd_addr_end(addr, limit);
+	for (; addr < limit; addr = next, pmd++) {
+		next = pmd_addr_end(*pmd, addr, limit);
 		if (addr < 0x100000000UL) {
 			spm = spt->low_prot;
 		} else {
@@ -191,6 +192,7 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
 		unsigned long, len, u32 __user *, map)
 {
 	struct mm_struct *mm = current->mm;
+	pmd_t *pmd = pmd_off(mm, addr);
 	struct subpage_prot_table *spt;
 	u32 **spm, *spp;
 	unsigned long i;
@@ -236,8 +238,8 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
 	}
 
 	subpage_mark_vma_nohuge(mm, addr, len);
-	for (limit = addr + len; addr < limit; addr = next) {
-		next = pmd_addr_end(addr, limit);
+	for (limit = addr + len; addr < limit; addr = next, pmd++) {
+		next = pmd_addr_end(*pmd, addr, limit);
 		err = -ENOMEM;
 		if (addr < 0x100000000UL) {
 			spm = spt->low_prot;
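
For reference, pmd_off(mm, addr) here is assumed to be the usual offset
chain, i.e. roughly:

	pmd_offset(pud_offset(p4d_offset(pgd_offset(mm, addr), addr), addr), addr)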

Thanks!

> Christophe

^ permalink raw reply related	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-07 18:00   ` Gerald Schaefer
  (?)
  (?)
@ 2020-09-08 14:30     ` Dave Hansen
  -1 siblings, 0 replies; 313+ messages in thread
From: Dave Hansen @ 2020-09-08 14:30 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> dynamic page table folding.

Would it be fair to say that the "fake" page table entries s390
allocates on the stack are what's causing the trouble here?  That might
be a nice thing to open up with here.  "Dynamic page table folding"
really means nothing to me.

> @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  	do {
>  		pmd_t pmd = READ_ONCE(*pmdp);
>  
> -		next = pmd_addr_end(addr, end);
> +		next = pmd_addr_end_folded(pmd, addr, end);
>  		if (!pmd_present(pmd))
>  			return 0;

It looks like you fix this up later, but this would be a problem if left
this way.  There's no documentation saying whether I should use
pmd_addr_end_folded() or pmd_addr_end() when writing a page table walker.
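
For instance, something along these lines (just a sketch of the kind of
guidance I mean, the wording is made up):

	/*
	 * pXd_addr_end_folded(): only for lockless walks (currently gup_fast)
	 * where the entry was fetched with READ_ONCE(); everywhere else use
	 * plain pXd_addr_end().
	 */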


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-07 18:00   ` Gerald Schaefer
  (?)
  (?)
@ 2020-09-08 14:33     ` Dave Hansen
  -1 siblings, 0 replies; 313+ messages in thread
From: Dave Hansen @ 2020-09-08 14:33 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390,
	Alexander Gordeev, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> x86:
> add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10)
> Function                                     old     new   delta
> vmemmap_populate                             587     592      +5
> munlock_vma_pages_range                      556     561      +5
> Total: Before=15534694, After=15534704, chg +0.00%
...
>  arch/x86/mm/init_64.c                    | 15 ++++-----
>  arch/x86/mm/kasan_init_64.c              | 16 +++++-----

I didn't do a super thorough review on this, but it generally looks OK
and the benefits of sharing more code between arches certainly outweigh
a few bytes of binary growth.  For the x86 bits at least, feel free to
add my ack.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions
  2020-09-08  5:19     ` Christophe Leroy
                         ` (2 preceding siblings ...)
  (?)
@ 2020-09-08 15:48       ` Alexander Gordeev
  -1 siblings, 0 replies; 313+ messages in thread
From: Alexander Gordeev @ 2020-09-08 15:48 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra,
	Dave Hansen, linux-mm, Paul Mackerras, linux-sparc,
	Claudio Imbrenda, Will Deacon, linux-arch, linux-s390,
	Vasily Gorbik, Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport

On Tue, Sep 08, 2020 at 07:19:38AM +0200, Christophe Leroy wrote:

[...]

> >diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >index 67ebc22cf83d..d9e7d16c2263 100644
> >--- a/include/linux/pgtable.h
> >+++ b/include/linux/pgtable.h
> >@@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> >   */
> >  #ifndef pgd_addr_end
> >-#define pgd_addr_end(pgd, addr, end)					\
> >-({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
> >-	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> >-})
> >+#define pgd_addr_end pgd_addr_end
> 
> I think that #define is pointless; usually there is no such #define
> for the default case.

The default pgd_addr_end() gets overridden on s390 (arch/s390/include/asm/pgtable.h):

#define pgd_addr_end pgd_addr_end
static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
{
	return rste_addr_end_folded(pgd_val(pgd), addr, end);
}

> >+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
> >+{	unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
> >+	return (__boundary - 1 < end - 1) ? __boundary : end;
> >+}

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions
  2020-09-07 18:00   ` Gerald Schaefer
                     ` (5 preceding siblings ...)
  (?)
@ 2020-09-08 16:05   ` kernel test robot
  -1 siblings, 0 replies; 313+ messages in thread
From: kernel test robot @ 2020-09-08 16:05 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 6603 bytes --]

Hi Gerald,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on linus/master]
[also build test ERROR on v5.9-rc4]
[cannot apply to mmotm/master hnaz-linux-mm/master s390/features arm64/for-next/core asm-generic/master next-20200908]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Gerald-Schaefer/mm-gup-fix-gup_fast-with-dynamic-page-table-folding/20200908-020629
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git f4d51dffc6c01a9e94650d95ce0104964f8ae822
config: powerpc-randconfig-r016-20200908 (attached as .config)
compiler: powerpc-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arch/powerpc/mm/kasan/kasan_init_32.c: In function 'kasan_init_shadow_page_tables':
>> arch/powerpc/mm/kasan/kasan_init_32.c:41:25: error: incompatible type for argument 1 of 'pgd_addr_end'
      41 |   k_next = pgd_addr_end(k_cur, k_end);
         |                         ^~~~~
         |                         |
         |                         long unsigned int
   In file included from include/linux/kasan.h:14,
                    from arch/powerpc/mm/kasan/kasan_init_32.c:5:
   include/linux/pgtable.h:660:48: note: expected 'pgd_t' {aka 'struct <anonymous>'} but argument is of type 'long unsigned int'
     660 | static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
         |                                          ~~~~~~^~~
>> include/linux/pgtable.h:659:22: error: too few arguments to function 'pgd_addr_end'
     659 | #define pgd_addr_end pgd_addr_end
         |                      ^~~~~~~~~~~~
   arch/powerpc/mm/kasan/kasan_init_32.c:41:12: note: in expansion of macro 'pgd_addr_end'
      41 |   k_next = pgd_addr_end(k_cur, k_end);
         |            ^~~~~~~~~~~~
   include/linux/pgtable.h:659:22: note: declared here
     659 | #define pgd_addr_end pgd_addr_end
         |                      ^~~~~~~~~~~~
   include/linux/pgtable.h:660:29: note: in expansion of macro 'pgd_addr_end'
     660 | static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
         |                             ^~~~~~~~~~~~
   arch/powerpc/mm/kasan/kasan_init_32.c: In function 'kasan_early_init':
   arch/powerpc/mm/kasan/kasan_init_32.c:199:23: error: incompatible type for argument 1 of 'pgd_addr_end'
     199 |   next = pgd_addr_end(addr, end);
         |                       ^~~~
         |                       |
         |                       long unsigned int
   In file included from include/linux/kasan.h:14,
                    from arch/powerpc/mm/kasan/kasan_init_32.c:5:
   include/linux/pgtable.h:660:48: note: expected 'pgd_t' {aka 'struct <anonymous>'} but argument is of type 'long unsigned int'
     660 | static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
         |                                          ~~~~~~^~~
>> include/linux/pgtable.h:659:22: error: too few arguments to function 'pgd_addr_end'
     659 | #define pgd_addr_end pgd_addr_end
         |                      ^~~~~~~~~~~~
   arch/powerpc/mm/kasan/kasan_init_32.c:199:10: note: in expansion of macro 'pgd_addr_end'
     199 |   next = pgd_addr_end(addr, end);
         |          ^~~~~~~~~~~~
   include/linux/pgtable.h:659:22: note: declared here
     659 | #define pgd_addr_end pgd_addr_end
         |                      ^~~~~~~~~~~~
   include/linux/pgtable.h:660:29: note: in expansion of macro 'pgd_addr_end'
     660 | static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
         |                             ^~~~~~~~~~~~

# https://github.com/0day-ci/linux/commit/b01038fe015c54a504215f859204205199b8fdc7
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Gerald-Schaefer/mm-gup-fix-gup_fast-with-dynamic-page-table-folding/20200908-020629
git checkout b01038fe015c54a504215f859204205199b8fdc7
vim +/pgd_addr_end +41 arch/powerpc/mm/kasan/kasan_init_32.c

2edb16efc899f9 Christophe Leroy 2019-04-26  30  
ec97d022f621c6 Christophe Leroy 2020-05-19  31  int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_end)
2edb16efc899f9 Christophe Leroy 2019-04-26  32  {
2edb16efc899f9 Christophe Leroy 2019-04-26  33  	pmd_t *pmd;
2edb16efc899f9 Christophe Leroy 2019-04-26  34  	unsigned long k_cur, k_next;
2edb16efc899f9 Christophe Leroy 2019-04-26  35  
e05c7b1f2bc4b7 Mike Rapoport    2020-06-08  36  	pmd = pmd_off_k(k_start);
2edb16efc899f9 Christophe Leroy 2019-04-26  37  
2edb16efc899f9 Christophe Leroy 2019-04-26  38  	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
7c31c05e00fc5f Christophe Leroy 2020-05-19  39  		pte_t *new;
7c31c05e00fc5f Christophe Leroy 2020-05-19  40  
2edb16efc899f9 Christophe Leroy 2019-04-26 @41  		k_next = pgd_addr_end(k_cur, k_end);
2edb16efc899f9 Christophe Leroy 2019-04-26  42  		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
2edb16efc899f9 Christophe Leroy 2019-04-26  43  			continue;
2edb16efc899f9 Christophe Leroy 2019-04-26  44  
d7e23b887f6717 Christophe Leroy 2019-07-31  45  		new = memblock_alloc(PTE_FRAG_SIZE, PTE_FRAG_SIZE);
2edb16efc899f9 Christophe Leroy 2019-04-26  46  
2edb16efc899f9 Christophe Leroy 2019-04-26  47  		if (!new)
2edb16efc899f9 Christophe Leroy 2019-04-26  48  			return -ENOMEM;
509cd3f2b47333 Christophe Leroy 2020-01-14  49  		kasan_populate_pte(new, PAGE_KERNEL);
2edb16efc899f9 Christophe Leroy 2019-04-26  50  		pmd_populate_kernel(&init_mm, pmd, new);
2edb16efc899f9 Christophe Leroy 2019-04-26  51  	}
2edb16efc899f9 Christophe Leroy 2019-04-26  52  	return 0;
2edb16efc899f9 Christophe Leroy 2019-04-26  53  }
2edb16efc899f9 Christophe Leroy 2019-04-26  54  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 28174 bytes --]

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions
  2020-09-08 15:48       ` Alexander Gordeev
                           ` (2 preceding siblings ...)
  (?)
@ 2020-09-08 17:20         ` Christophe Leroy
  -1 siblings, 0 replies; 313+ messages in thread
From: Christophe Leroy @ 2020-09-08 17:20 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Gerald Schaefer, Jason Gunthorpe, John Hubbard, Peter Zijlstra,
	Dave Hansen, linux-mm, Paul Mackerras, linux-sparc,
	Claudio Imbrenda, Will Deacon, linux-arch, linux-s390,
	Vasily Gorbik, Richard Weinberger, linux-x86, Russell King,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, Jeff Dike,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport



Le 08/09/2020 à 17:48, Alexander Gordeev a écrit :
> On Tue, Sep 08, 2020 at 07:19:38AM +0200, Christophe Leroy wrote:
> 
> [...]
> 
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index 67ebc22cf83d..d9e7d16c2263 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>>>    */
>>>   #ifndef pgd_addr_end
>>> -#define pgd_addr_end(pgd, addr, end)					\
>>> -({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
>>> -	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
>>> -})
>>> +#define pgd_addr_end pgd_addr_end
>>
>> I think that #define is pointless, usually there is no such #define
>> for the default case.
> 
> Default pgd_addr_end() gets overridden on s390 (arch/s390/include/asm/pgtable.h):
> 
> #define pgd_addr_end pgd_addr_end
> static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
> {
> 	return rste_addr_end_folded(pgd_val(pgd), addr, end);
> }

Yes, on s390 the #define is needed to hit the #ifndef pgd_addr_end
that's in include/linux/pgtable.h.

But in include/linux/pgtable.h itself, there is no need for a
#define pgd_addr_end pgd_addr_end, I think.

> 
>>> +static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
>>> +{	unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
>>> +	return (__boundary - 1 < end - 1) ? __boundary : end;
>>> +}


Christophe

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-08  5:22     ` Christophe Leroy
                         ` (2 preceding siblings ...)
  (?)
@ 2020-09-08 17:36       ` Gerald Schaefer
  -1 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-08 17:36 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Mike Rapoport, Peter Zijlstra, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Christian Borntraeger, Richard Weinberger, linux-x86,
	Russell King, Jason Gunthorpe, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, John Hubbard,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds

On Tue, 8 Sep 2020 07:22:39 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:

> 
> 
> Le 07/09/2020 à 22:12, Mike Rapoport a écrit :
> > On Mon, Sep 07, 2020 at 08:00:55PM +0200, Gerald Schaefer wrote:
> >> This is v2 of an RFC previously discussed here:
> >> https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/
> >>
> >> Patch 1 is a fix for a regression in gup_fast on s390, after our conversion
> >> to common gup_fast code. It will introduce special helper functions
> >> pXd_addr_end_folded(), which have to be used in places where pagetable walk
> >> is done w/o lock and with READ_ONCE, so currently only in gup_fast.
> >>
> >> Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end()
> >> themselves by adding an extra pXd value parameter. That was suggested by
> >> Jason during v1 discussion, because he is already thinking of some other
> >> places where he might want to switch to the READ_ONCE logic for pagetable
> >> walks. In general, that would be the cleanest / safest solution, but there
> >> is some impact on other architectures and common code, hence the new and
> >> greatly enlarged recipient list.
> >>
> >> Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline
> >> functions instead of #defines, so that we get some type checking for the
> >> new pXd value parameter.
> >>
> >> Not sure about Fixes/stable tags for the generic solution. Only patch 1
> >> fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might
> >> still be nice to have in stable, to ease future backports, but I guess
> >> "nice to have" does not really qualify for stable backports.
> > 
> > I also think that adding pXd parameter to pXd_addr_end() is a cleaner
> > way and with this patch 1 is not really required. I would even merge
> > patches 2 and 3 into a single patch and use only it as the fix.
> 
> Why not merging patches 2 and 3, but I would keep patch 1 separate but 
> after the generic changes, so that we first do the generic changes, then 
> we do the specific S390 use of it.

Yes, we thought about that approach too. It would at least allow getting
it all into stable, more or less nicely, as a prerequisite for the s390
fix.

Two concerns kept us from going that way. For one, it might not be
the nicest way to get it all into stable, and we would not want to risk
further objections, given the imminent and rather scary data corruption
issue that we want to fix asap.

For the same reason, we thought that the generalization part might
need more time and agreement from various people, so that we could at
least get the first patch in as a short-term solution.

It seems now that the generalization is very well accepted so far,
apart from some apparent issues on arm. Also, merging 2 + 3 and
putting them first seems to be acceptable, so we could do that for
v3, if there are no objections.

Of course, we first need to address the few remaining issues for
arm(32?), which do look quite confusing to me so far. BTW, sorry for
the compile error with patch 3; I guess we did the cross-compile only
with patches 1 + 2 applied, to see the bloat-o-meter changes. But I
guess patch 3 already proved its usefulness that way :-)
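
For illustration only, a two-argument caller like the kasan code from the
robot report would then also need to pass an entry value, e.g. something
like this hypothetical adjustment (whether that code even has a suitable
entry at hand is a separate question):

/* hypothetical adjustment of the flagged call site, for illustration only */
pgd_t *pgdp = pgd_offset_k(k_cur);

k_next = pgd_addr_end(*pgdp, k_cur, k_end);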

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-08 14:30     ` Dave Hansen
                         ` (2 preceding siblings ...)
  (?)
@ 2020-09-08 17:59       ` Gerald Schaefer
  -1 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-08 17:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport,
	Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On Tue, 8 Sep 2020 07:30:50 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> > code") introduced a subtle but severe bug on s390 with gup_fast, due to
> > dynamic page table folding.
> 
> Would it be fair to say that the "fake" page table entries s390
> allocates on the stack are what's causing the trouble here?  That might
> be a nice thing to open up with here.  "Dynamic page table folding"
> really means nothing to me.

We do not really allocate anything on the stack; it is the generic logic
in gup_fast that reads the pXd values once and then passes around
pointers to those local (stack) variables instead of real pXd pointers.
That is combined with the fact that, for folded levels, we just return
the passed-in pointer from pXd_offset().
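
Roughly, the pattern in mm/gup.c looks like this (simplified):

pgd_t pgd = READ_ONCE(*pgdp);		/* entry value copied to a local variable */
p4d_t *p4dp = p4d_offset(&pgd, addr);	/* with folding, s390 just returns the
					   passed-in pointer, i.e. &pgd here */
p4d_t p4d = READ_ONCE(*p4dp);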

That works similarly on x86 IIUC, but with static folding, and thus
pXd_addr_end() still gives proper results because PxD_INDEX/SHIFT are
defined statically (and correspondingly). We always have static 5-level
PxD_INDEX/SHIFT, and that cannot really be made dynamic, so we just make
pXd_addr_end() dynamic instead, and that requires the pXd value to
determine the correct page table level.

Still makes my head spin when trying to explain it, sorry. It is a very
special s390 oddity, or let's call it a "feature", because I don't think
any other architecture has this "dynamic page table folding" capability,
depending on process requirements, for whatever it is worth...
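
If it helps, the idea is roughly this (hypothetical sketch, not the actual
s390 code): the entry value tells us which level it really belongs to, and
the boundary is then computed with that level's shift instead of a fixed
PGDIR_SIZE:

/* hypothetical helper, for illustration only */
static inline unsigned long addr_end_by_shift(unsigned long addr,
					      unsigned long end,
					      unsigned int shift)
{
	unsigned long size = 1UL << shift;
	unsigned long boundary = (addr + size) & ~(size - 1);

	return (boundary - 1 < end - 1) ? boundary : end;
}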

> 
> > @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> >  	do {
> >  		pmd_t pmd = READ_ONCE(*pmdp);
> >  
> > -		next = pmd_addr_end(addr, end);
> > +		next = pmd_addr_end_folded(pmd, addr, end);
> >  		if (!pmd_present(pmd))
> >  			return 0;
> 
> It looks like you fix this up later, but this would be a problem if left
> this way.  There's no documentation for whether I use
> pmd_addr_end_folded() or pmd_addr_end() when writing a page table walker.

Yes, that is very unfortunate. We did have a lengthy comment in
include/linux/pgtable.h where the pXd_addr_end(_folded) helpers were
defined. But that was moved to arch/s390/include/asm/pgtable.h in this
version, probably because we already had the generalization in mind,
where we would not need such an explanation in the common header any more.

So, looking at that comment might help to better understand the issue
that we have with dynamic page table folding and READ_ONCE-style
page table walkers.

Thanks for pointing that out; the comment should definitely go into
include/linux/pgtable.h again, at least if we still went for that
"s390 fix first, generalization second" approach. But it seems we have
other / better options now.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
  2020-09-08 14:15           ` Alexander Gordeev
                               ` (2 preceding siblings ...)
  (?)
@ 2020-09-09  8:38             ` Christophe Leroy
  -1 siblings, 0 replies; 313+ messages in thread
From: Christophe Leroy @ 2020-09-09  8:38 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Michael Ellerman, Gerald Schaefer, Jason Gunthorpe, John Hubbard,
	Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Claudio Imbrenda, Will Deacon, linux-arch,
	linux-s390, Vasily Gorbik, Richard Weinberger, linux-x86,
	Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport

On Tue, 2020-09-08 at 16:15 +0200, Alexander Gordeev wrote:
> On Tue, Sep 08, 2020 at 10:16:49AM +0200, Christophe Leroy wrote:
> > >Yes, and also two more sources :/
> > >	arch/powerpc/mm/kasan/8xx.c
> > >	arch/powerpc/mm/kasan/kasan_init_32.c
> > >
> > >But these two are not quite obvious wrt pgd_addr_end() used
> > >while traversing pmds. Could you please clarify a bit?
> > >
> > >
> > >diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
> > >index 2784224..89c5053 100644
> > >--- a/arch/powerpc/mm/kasan/8xx.c
> > >+++ b/arch/powerpc/mm/kasan/8xx.c
> > >@@ -15,8 +15,8 @@
> > >  	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
> > >  		pte_basic_t *new;
> > >-		k_next = pgd_addr_end(k_cur, k_end);
> > >-		k_next = pgd_addr_end(k_next, k_end);
> > >+		k_next = pmd_addr_end(k_cur, k_end);
> > >+		k_next = pmd_addr_end(k_next, k_end);
> > 
> > No, I don't think so.
> > On powerpc32 we have only two levels, so pgd and pmd are more or
> > less the same.
> > But pmd_addr_end() as defined in include/asm-generic/pgtable-nopmd.h
> > is a no-op, so I don't think it will work.
> > 
> > It is likely that this function should iterate on pgd, then you get
> > pmd = pmd_offset(pud_offset(p4d_offset(pgd)));
> 
> It looks like the code iterates over single pmd table while using
> pgd_addr_end() only to skip all the middle levels and bail out
> from the loop.
> 
> I would be wary for switching from pmds to pgds, since we are
> trying to minimize impact (especially functional) and the
> rework does not seem that obvious.
> 

I've just tested the following change; it works and should fix the
oddity:

diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
index 2784224054f8..8e53ddf57b84 100644
--- a/arch/powerpc/mm/kasan/8xx.c
+++ b/arch/powerpc/mm/kasan/8xx.c
@@ -9,11 +9,12 @@
 static int __init
 kasan_init_shadow_8M(unsigned long k_start, unsigned long k_end, void *block)
 {
-	pmd_t *pmd = pmd_off_k(k_start);
+	pgd_t *pgd = pgd_offset_k(k_start);
 	unsigned long k_cur, k_next;
 
-	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
+	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pgd += 2, block += SZ_8M) {
 		pte_basic_t *new;
+		pmd_t *pmd = pmd_offset(pud_offset(p4d_offset(pgd, k_cur), k_cur), k_cur);
 
 		k_next = pgd_addr_end(k_cur, k_end);
 		k_next = pgd_addr_end(k_next, k_end);
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c
index fb294046e00e..e5f524fa71a7 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -30,13 +30,12 @@ static void __init kasan_populate_pte(pte_t *ptep, pgprot_t prot)
 
 int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_end)
 {
-	pmd_t *pmd;
+	pgd_t *pgd = pgd_offset_k(k_start);
 	unsigned long k_cur, k_next;
 
-	pmd = pmd_off_k(k_start);
-
-	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
+	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pgd++) {
 		pte_t *new;
+		pmd_t *pmd = pmd_offset(pud_offset(p4d_offset(pgd, k_cur), k_cur), k_cur);
 
 		k_next = pgd_addr_end(k_cur, k_end);
 		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
@@ -189,16 +188,18 @@ void __init kasan_early_init(void)
 	unsigned long addr = KASAN_SHADOW_START;
 	unsigned long end = KASAN_SHADOW_END;
 	unsigned long next;
-	pmd_t *pmd = pmd_off_k(addr);
+	pgd_t *pgd = pgd_offset_k(addr);
 
 	BUILD_BUG_ON(KASAN_SHADOW_START & ~PGDIR_MASK);
 
 	kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
 
 	do {
+		pmd_t *pmd = pmd_offset(pud_offset(p4d_offset(pgd, addr), addr), addr);
+
 		next = pgd_addr_end(addr, end);
 		pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte);
-	} while (pmd++, addr = next, addr != end);
+	} while (pgd++, addr = next, addr != end);
 
 	if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE))
 		kasan_early_hash_table();
---
Christophe


^ permalink raw reply related	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-08 14:30     ` Dave Hansen
                         ` (2 preceding siblings ...)
  (?)
@ 2020-09-09 12:29       ` Gerald Schaefer
  -1 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-09 12:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport,
	Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On Tue, 8 Sep 2020 07:30:50 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> > code") introduced a subtle but severe bug on s390 with gup_fast, due to
> > dynamic page table folding.
> 
> Would it be fair to say that the "fake" page table entries s390
> allocates on the stack are what's causing the trouble here?  That might
> be a nice thing to open up with here.  "Dynamic page table folding"
> really means nothing to me.

Sorry, I guess my previous reply does not really explain "what the heck
is dynamic page table folding?".

On s390, we can have a different number of page table levels for
different processes / mms. We always start with 3 levels, and upgrade
dynamically to 4 or 5 levels on process demand, hence the dynamic
folding. Still, the PxD_SIZE/SHIFT values are defined statically, so
that e.g. pXd_addr_end() will not reflect this dynamic behavior.

For the various page table walkers that use pXd_addr_end() (w/o the
READ_ONCE logic) this is no problem. With static folding, iteration
over the folded levels will always happen at pgd level (top-level
folding). On s390, we instead stay at the respective level and iterate
there (dynamic middle-level folding), and only return to pgd level if
there really are 5 levels.
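
For comparison, the static folding case in the generic headers looks
roughly like this (abridged and quoted from memory, see
include/asm-generic/pgtable-nopud.h):

static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
{
	return (pud_t *)p4d;	/* just pass the pointer through */
}
#define pud_addr_end(addr, end)	(end)	/* single pass at the folded level */

So with static folding the walker makes exactly one pass at the folded
level and all real iteration happens at pgd level, while with our
dynamic middle-level folding the iteration stays at the respective
level as described above.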

This only works well as long as there are real pagetable pointers involved,
that can also be used for iteration. For gup_fast, or any other future
pagetable walkers using the READ_ONCE logic w/o lock, that is not true.
There are pointers involved to local pXd values on the stack, because of
the READ_ONCE logic, and our middle-level iteration will suddenly iterate
over such stack pointers instead of pagetable pointers.

This will be addressed by making the pXd_addr_end() dynamic, for which
we need to see the pXd value in order to determine its level / type.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-08 17:36       ` Gerald Schaefer
                           ` (2 preceding siblings ...)
  (?)
@ 2020-09-09 16:12         ` Gerald Schaefer
  -1 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-09 16:12 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Mike Rapoport, Peter Zijlstra, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Christian Borntraeger, Richard Weinberger, linux-x86,
	Russell King, Jason Gunthorpe, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Heiko Carstens, Arnd Bergmann, John Hubbard,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds

On Tue, 8 Sep 2020 19:36:50 +0200
Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:

[..]
> 
> It seems now that the generalization is very well accepted so far,
> apart from some apparent issues on arm. Also, merging 2 + 3 and
> putting them first seems to be acceptable, so we could do that for
> v3, if there are no objections.
> 
> Of course, we first need to address the few remaining issues for
> arm(32?), which do look quite confusing to me so far. BTW, sorry for
> the compile error with patch 3, I guess we did the cross-compile only
> for 1 + 2 applied, to see the bloat-o-meter changes. But I guess
> patch 3 already proved its usefulness by that :-)

Umm, replace "arm" with "power", sorry. No issues on arm so far, but
also no ack, I think.

Thanks to Christophe for the power change, and to Mike for volunteering
for some cross compilation and cross-arch testing. Will send v3 with
merged and re-ordered patches after some more testing.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-09 12:29       ` Gerald Schaefer
                           ` (2 preceding siblings ...)
  (?)
@ 2020-09-09 16:18         ` Dave Hansen
  -1 siblings, 0 replies; 313+ messages in thread
From: Dave Hansen @ 2020-09-09 16:18 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport,
	Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On 9/9/20 5:29 AM, Gerald Schaefer wrote:
> This only works well as long as there are real pagetable pointers involved,
> that can also be used for iteration. For gup_fast, or any other future
> pagetable walkers using the READ_ONCE logic w/o lock, that is not true.
> There are pointers involved to local pXd values on the stack, because of
> the READ_ONCE logic, and our middle-level iteration will suddenly iterate
> over such stack pointers instead of pagetable pointers.

By "There are pointers involved to local pXd values on the stack", did
you mean "locate" instead of "local"?  That sentence confused me.

Which code is it exactly that allocates these troublesome on-stack pXd
values, btw?

> This will be addressed by making the pXd_addr_end() dynamic, for which
> we need to see the pXd value in order to determine its level / type.

Thanks for the explanation!

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-09 16:18         ` Dave Hansen
@ 2020-09-09 17:25           ` Gerald Schaefer
  -1 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-09 17:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jason Gunthorpe, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport,
	Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On Wed, 9 Sep 2020 09:18:46 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 9/9/20 5:29 AM, Gerald Schaefer wrote:
> > This only works well as long there are real pagetable pointers involved,
> > that can also be used for iteration. For gup_fast, or any other future
> > pagetable walkers using the READ_ONCE logic w/o lock, that is not true.
> > There are pointers involved to local pXd values on the stack, because of
> > the READ_ONCE logic, and our middle-level iteration will suddenly iterate
> > over such stack pointers instead of pagetable pointers.
> 
> By "There are pointers involved to local pXd values on the stack", did
> you mean "locate" instead of "local"?  That sentence confused me.
> 
> Which code is it, exactly that allocates these troublesome on-stack pXd
> values, btw?

It is the gup_pXd_range() call sequence in mm/gup.c. It starts in
gup_pgd_range() with "pgdp = pgd_offset(current->mm, addr)", and then
"pgd_t pgd = READ_ONCE(*pgdp)" creates the first local stack
variable "pgd".

The next-level call to gup_p4d_range() gets this "pgd" value as
input, but not the original pgdp pointer it was read from.
This is already the essential difference from other pagetable walkers
like e.g. walk_pXd_range() in mm/pagewalk.c, where the original
pointer is passed through. With READ_ONCE, that pointer must not
be dereferenced again, so the value is passed on instead.

In gup_p4d_range() we then have "p4dp = p4d_offset(&pgd, addr)",
with &pgd being a pointer to the passed-in pgd value, so that's
the first pXd pointer that does not point directly to the pXd in
the page table, but to a local stack variable.

With folded p4d, p4d_offset(&pgd, addr) will simply return the
passed-in &pgd pointer, so p4dp now also points to that stack
location. That continues with "p4d_t p4d = READ_ONCE(*p4dp)", and
that second stack variable is then passed on to gup_pud_range(),
and so on. Due to inlining, all those variables will not really be
passed anywhere, but simply sit on the stack.
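
Maybe it helps to see the code. Roughly (trimmed, with the huge-page
and error handling left out, and the two functions shown in call order
rather than file order), the relevant part of mm/gup.c is:

static void gup_pgd_range(unsigned long addr, unsigned long end,
                          unsigned int flags, struct page **pages, int *nr)
{
        unsigned long next;
        pgd_t *pgdp;

        pgdp = pgd_offset(current->mm, addr);   /* real pagetable pointer */
        do {
                pgd_t pgd = READ_ONCE(*pgdp);   /* first local stack copy */

                next = pgd_addr_end(addr, end);
                if (pgd_none(pgd))
                        return;
                if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
                        return;
        } while (pgdp++, addr = next, addr != end);
}

static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
{
        unsigned long next;
        p4d_t *p4dp;

        p4dp = p4d_offset(&pgd, addr);  /* folded p4d: this is &pgd, a stack address */
        do {
                p4d_t p4d = READ_ONCE(*p4dp);   /* second local stack copy */

                next = p4d_addr_end(addr, end);
                if (p4d_none(p4d))
                        return 0;
                if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
                        return 0;
        } while (p4dp++, addr = next, addr != end);

        return 1;
}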

So far, IIUC, that would also happen on x86 (or everywhere else,
actually) for folded levels, i.e. some pXd_offset() calls would simply
return the passed-in pointer to the (stack) value. This works as
designed, and it will not lead to the "iteration over stack pointers"
problem for anybody but s390, because the pXd_addr_end() boundaries
usually make sure that you always return to pgd level for the
iteration, and that is the only level with a real pagetable pointer.
On s390, we stay at the first non-folded level and do the iteration
there, which is fine for other pagetable walkers using the original
pointers, but not for the READ_ONCE-style gup_fast.
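
For reference, the generic helpers behind that "return to pgd level"
behavior look roughly like this (the non-folded one from
include/linux/pgtable.h, the folded one from
include/asm-generic/pgtable-nop4d.h, if I am looking at the right
tree):

#define pgd_addr_end(addr, end)                                         \
({      unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;  \
        (__boundary - 1 < (end) - 1) ? __boundary : (end);              \
})

/* statically folded p4d: never crop, the p4d loop runs exactly once */
#define p4d_addr_end(addr, end)         (end)

With the folded (end) variant, the inner loops never iterate and the
range only ever gets cropped at the pgd level, i.e. on the real pgdp.
s390 defines all five levels and so uses the "real" boundary
calculation everywhere, which is why the cropping (and thus the
iteration) happens at a middle level for us.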

I actually had to draw myself a picture to get some hold of
this, or rather a walk-through with a certain pud-crossing
range in a folded 3-level scenario. Not sure if I would have
understood my explanation above w/o that, but I hope you can
make some sense out of it. Or draw yourself a picture :-)

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-09 17:25           ` Gerald Schaefer
@ 2020-09-09 18:03             ` Jason Gunthorpe
  -1 siblings, 0 replies; 313+ messages in thread
From: Jason Gunthorpe @ 2020-09-09 18:03 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport,
	Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On Wed, Sep 09, 2020 at 07:25:34PM +0200, Gerald Schaefer wrote:
> I actually had to draw myself a picture to get some hold of
> this, or rather a walk-through with a certain pud-crossing
> range in a folded 3-level scenario. Not sure if I would have
> understood my explanation above w/o that, but I hope you can
> make some sense out of it. Or draw yourself a picture :-)

What I don't understand is how does anything work with S390 today?

If the fix is only to change pxx_addr_end(), then generic code
like mm/pagewalk.c will iterate over a *different list* of page table
entries.

Its choice of entries to look at is entirely driven by pxx_addr_end().

Which suggests to me that mm/pagewalk.c also doesn't work properly
today on S390, and that this issue is not really about stack variables?

Fundamentally, pXX_offset() and pXX_addr_end() must be consistent with
each other: if pXX_offset() is folded, then pXX_addr_end() must cause a
single iteration of that level.

Jason

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-09 18:03             ` Jason Gunthorpe
@ 2020-09-10  9:39               ` Alexander Gordeev
  -1 siblings, 0 replies; 313+ messages in thread
From: Alexander Gordeev @ 2020-09-10  9:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Gerald Schaefer, Dave Hansen, John Hubbard, LKML, linux-mm,
	linux-arch, Andrew Morton, Linus Torvalds, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Wed, Sep 09, 2020 at 03:03:24PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 09, 2020 at 07:25:34PM +0200, Gerald Schaefer wrote:
> > I actually had to draw myself a picture to get some hold of
> > this, or rather a walk-through with a certain pud-crossing
> > range in a folded 3-level scenario. Not sure if I would have
> > understood my explanation above w/o that, but I hope you can
> > make some sense out of it. Or draw yourself a picture :-)
> 
> What I don't understand is how does anything work with S390 today?
> 
> If the fix is only to change pxx_addr_end(), then generic code
> like mm/pagewalk.c will iterate over a *different list* of page table
> entries.
> 
> Its choice of entries to look at is entirely driven by pxx_addr_end().
> 
> Which suggests to me that mm/pagewalk.c also doesn't work properly
> today on S390, and that this issue is not really about stack variables?
> 
> Fundamentally, pXX_offset() and pXX_addr_end() must be consistent with
> each other: if pXX_offset() is folded, then pXX_addr_end() must cause a
> single iteration of that level.

Your observation is correct.

Another way to describe the problem is that the existing pXd_addr_end
helpers could be applied to mismatching levels on s390 (e.g.
p4d_addr_end applied to a pud, or pgd_addr_end applied to a p4d). As
you noticed, all *_pXd_range iterators could be called with address
ranges that exceed a single pXd table.

However, when that happens with pointers to real page tables (passed to
the *_pXd_range iterators), we still operate on valid tables, which just
(luckily for us) happen to be folded. Thus we still reference correct
table entries.

It is only the gup_fast case that exposes the issue. It hits because
pointers to stack copies are passed to the gup_pXd_range iterators, not
pointers to the real page tables themselves.
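
Schematically (not actual kernel code, just to illustrate the
difference):

        /* pagewalk-style walker, pointer into the real (folded) table: */
        pud = pud_offset(p4d, addr);    /* on s390: (pud_t *)p4d, still a real table */
        do {
                next = pud_addr_end(addr, end);  /* may crop at the 2 GB boundary */
                /* ... visit *pud ... */
        } while (pud++, addr = next, addr != end);  /* lands on the next valid entry */

        /* gup_fast, pointer to a local READ_ONCE copy: */
        pudp = pud_offset(&p4d, addr);  /* &p4d is a stack variable */
        do {
                pud_t pud = READ_ONCE(*pudp);
                next = pud_addr_end(addr, end);
                /* ... */
        } while (pudp++, addr = next, addr != end);  /* steps past the stack copy */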

As Gerald mentioned, it is very difficult to explain in a clear way.
Hopefully, one could make sense out of it.

> Jason

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10  9:39               ` Alexander Gordeev
@ 2020-09-10 13:02                 ` Jason Gunthorpe
  -1 siblings, 0 replies; 313+ messages in thread
From: Jason Gunthorpe @ 2020-09-10 13:02 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Gerald Schaefer, Dave Hansen, John Hubbard, LKML, linux-mm,
	linux-arch, Andrew Morton, Linus Torvalds, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:

> As Gerald mentioned, it is very difficult to explain in a clear way.
> Hopefully, one could make sense out of it.

I would say the page table API requires this invariant:

        pud = pud_offset(p4d, addr);
        do {
                WARN_ON(pud != pud_offset(p4d, addr));
                next = pud_addr_end(addr, end);
        } while (pud++, addr = next, addr != end);

i.e. pud++ is supposed to be a shortcut for
  pud_offset(p4d, next)

S390, however, does not follow this. Fixing addr_end brings it into
alignment by preventing the pud++ from happening.

The only currently known side effect is that gup_fast crashes, but it
sure is an unexpected thing.

This suggests another fix, which is to say that pud++ is undefined and
pud_offset() must always be called, but I think that would cause worse
codegen on all other archs.

Jason


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-09 18:03             ` Jason Gunthorpe
@ 2020-09-10 13:11               ` Gerald Schaefer
  -1 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-10 13:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Hansen, John Hubbard, LKML, linux-mm, linux-arch,
	Andrew Morton, Linus Torvalds, Russell King, Mike Rapoport,
	Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Alexander Gordeev, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On Wed, 9 Sep 2020 15:03:24 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Wed, Sep 09, 2020 at 07:25:34PM +0200, Gerald Schaefer wrote:
> > I actually had to draw myself a picture to get some hold of
> > this, or rather a walk-through with a certain pud-crossing
> > range in a folded 3-level scenario. Not sure if I would have
> > understood my explanation above w/o that, but I hope you can
> > make some sense out of it. Or draw yourself a picture :-)
> 
> What I don't understand is how does anything work with S390 today?

That is totally understandable :-)

> If the fix is only to change pxx_addr_end() then than generic code
> like mm/pagewalk.c will iterate over a *different list* of page table
> entries. 
> 
> It's choice of entries to look at is entirely driven by pxx_addr_end().
> 
> Which suggest to me that mm/pagewalk.c also doesn't work properly
> today on S390 and this issue is not really about stack variables?

I guess you are confused by the fact that the generic change will indeed
change the logic for _all_ pagetable walkers on s390, not just for
the gup_fast case. But that doesn't mean that they were doing it wrong
before; we can simply do it both ways. However, we probably should
make that (in theory useless) change more explicit.

Let's compare before and after for mm/pagewalk.c on s390, with 3-level
pagetables, range crossing 2 GB pud boundary.

* Before (with pXd_addr_end always using static 5-level PxD_SIZE):

walk_pgd_range()
-> pgd_addr_end() will use static 2^53 PGDIR_SIZE, range is not cropped,
                  no iterations needed, passed over to next level

walk_p4d_range()
-> p4d_addr_end() will use static 2^42 P4D_SIZE, range still not cropped

walk_pud_range()
-> pud_addr_end() now we're cropping, with 2^31 PUD_SIZE, need two
                  iterations for range crossing pud boundary, doing
                  that right here on a pudp which is actually the
                  previously passed-through pgdp/p4dp (pointing to
                  correct pagetable entry)

* After (with dynamic pXd_addr_end using "correct" PxD_SIZE boundaries,
         should be similar to other archs' static "top-level folding"):

walk_pgd_range()
-> pgd_addr_end() will now determine "correct" boundary based on pgd
                  value, i.e. 2^31 PUD_SIZE, do cropping now, iteration
                  will now happen here

walk_p4d/pud_range()
->  operate on cropped range, will not iterate, instead return to pgd level,
    which will then use the same pointer for iteration as in the "Before"
    case, but not on the same level.

IMHO, our "Before" logic is more efficient, and also feels more natural.
After all, it is not really necessary to return to pgd level, and it will
surely cost some extra instructions. We are willing to take that cost
for the sake of doing it in a more generic way, hoping that will reduce
future issues. E.g. you already mentioned that you have plans for using
the READ_ONCE logic also in other places, and that would be such a
"future issue".

> Fundamentally if pXX_offset() and pXX_addr_end() must be consistent
> together, if pXX_offset() is folded then pXX_addr_end() must cause a
> single iteration of that level.

well, that sounds correct in theory, but I guess it depends on "how
you fold it". E.g. what does "if pXX_offset() is folded" mean?
Take pgd_offset() for the 3-level case above. From our previous
"middle-level folding/iteration" perspective, I would say that
pgd/p4d are folded into pud, so if you say "if pgd_offset() is folded
then pgd_addr_end() must cause a single iteration of that level",
we were doing it all correctly, i.e. only having a single iteration
on pgd/p4d level. You could even say that all others are doing /
using it wrong :-)

Now take pgd_offset() from the "top-level folding/iteration".
Here you would say that p4d/pud are folded into pgd, which again
does not sound like the natural / most efficient way to me,
but IIUC this has to be how it works for all other archs with
(static) pagetable folding. Now you'd say "if pud/p4d_offset()
is folded then pud/p4d_addr_end() must cause a single iteration
of that level", and that would sound correct. At least until
you look more closely, because e.g. p4d_addr_end() in
include/asm-generic/pgtable-nop4d.h is simply this:
#define p4d_addr_end(addr, end) (end)

How can that cause a single iteration? It clearly won't, it only
works because the previous pgd_addr_end already cropped the range
so that there will be only single iterations for p4d/pud.
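
For reference, the p4d_offset() side of that generic folding in
include/asm-generic/pgtable-nop4d.h is roughly this (simplified, just to
show the pattern that goes together with the p4d_addr_end() above):

typedef struct { pgd_t pgd; } p4d_t;

static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
{
	/* the folded "p4d table" has exactly one entry: the pgd itself */
	return (p4d_t *)pgd;
}

So p4d_offset() just re-labels the pgd pointer, and p4d_addr_end() never
crops anything itself; it only works because pgd_addr_end() has already
limited the range to a single PGDIR_SIZE step.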

The more I think of it, the more it sounds like s390 "middle-level
folding/iteration" was doing it "the right way", and everybody else
was wrong, or at least not in an optimally efficient way :-) Might
also be that only we could do this because we can determine the
pagetable level from a pagetable entry value.

Anyway, if you are not yet confused enough, I recommend looking
at the other option we had in mind, for fixing the gup_fast issue.
See "Patch 1" from here:
https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/

That would actually have kept that "middle-level iteration" also
for gup_fast, by additionally passing through the pXd pointers.
However, it also needed a gup-specific version of pXd_offset(),
in order to keep the READ_ONCE semantics. For s390, that would
have actually been the best solution, but a generic version of
that might not have been so easy. And doing it like everybody
else can not be so bad, at least I really hope so.
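
Purely as an illustration of what such a gup-specific pXd_offset() could
have looked like (made-up names, not the actual v1 patch; p4d_folded()
here is a hypothetical predicate on the entry value):

static inline pud_t *gup_pud_offset(p4d_t *p4dp, p4d_t p4d,
				    unsigned long addr)
{
	/* level decided from the value gup_fast already READ_ONCE'd ... */
	if (p4d_folded(p4d))
		/* ... but the original table pointer is passed through */
		return (pud_t *)p4dp;
	return (pud_t *)p4d_page_vaddr(p4d) + pud_index(addr);
}

i.e. the decision which level we are really on is made on the entry
value, while the pointer for the next level still refers to the real
pagetable and not to a stack copy.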

Of course, at some point in time, we might come up with some fancy
fundamental change that would "do it the right middle-level way
for everybody". At least I think I overheard Vasily and Alexander
discussing some wild ideas, but that is certainly beyond this scope
here...

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 13:02                 ` Jason Gunthorpe
                                     ` (2 preceding siblings ...)
  (?)
@ 2020-09-10 13:28                   ` Gerald Schaefer
  -1 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-10 13:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm,
	linux-arch, Andrew Morton, Linus Torvalds, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, 10 Sep 2020 10:02:33 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> 
> > As Gerald mentioned, it is very difficult to explain in a clear way.
> > Hopefully, one could make sense ot of it.  
> 
> I would say the page table API requires this invariant:
> 
>         pud = pud_offset(p4d, addr);
>         do {
> 		WARN_ON(pud != pud_offset(p4d, addr);
>                 next = pud_addr_end(addr, end);
>         } while (pud++, addr = next, addr != end);
> 
> ie pud++ is supposed to be a shortcut for 
>   pud_offset(p4d, next)
> 
> While S390 does not follow this. Fixing addr_end brings it into
> alignment by preventing pud++ from happening.
> 
> The only currently known side effect is that gup_fast crashes, but it
> sure is an unexpected thing.

It is only unexpected in a "top-level folding" world, see my other reply.
Consider it an optimization, which was possible because of how our dynamic
folding works, and e.g. because we can determine the correct pagetable
level from a pXd value in pXd_offset.
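
For context, the reason gup_fast is the one place where this bites:
gup_fast walks on local copies of the entries, roughly like this
(simplified from mm/gup.c):

	unsigned long next;
	pud_t *pudp;

	pudp = pud_offset(&p4d, addr);	/* &p4d is a local stack copy! */
	do {
		pud_t pud = READ_ONCE(*pudp);

		next = pud_addr_end(addr, end);
		...
	} while (pudp++, addr = next, addr != end);

With our dynamic folding, pud_offset() on a folded level passes the
pointer through, so pudp ends up pointing at that stack copy, and the
pud++ after the first iteration walks off it instead of moving to the
next real pagetable entry.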

> This suggests another fix, which is to say that pud++ is undefined and
> pud_offset() must always be called, but I think that would cause worse
> codegen on all other archs.

There really is nothing to fix for s390 outside of gup_fast, or other
potential future READ_ONCE pagetable walkers. We do take the side effect
of the generic change on all other pagetable walkers for s390, but it
really is a slight degradation rather than a fix.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 13:28                   ` Gerald Schaefer
                                       ` (2 preceding siblings ...)
  (?)
@ 2020-09-10 15:10                     ` Jason Gunthorpe
  -1 siblings, 0 replies; 313+ messages in thread
From: Jason Gunthorpe @ 2020-09-10 15:10 UTC (permalink / raw)
  To: Gerald Schaefer, Anshuman Khandual
  Cc: Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm,
	linux-arch, Andrew Morton, Linus Torvalds, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, Sep 10, 2020 at 03:28:03PM +0200, Gerald Schaefer wrote:
> On Thu, 10 Sep 2020 10:02:33 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> > 
> > > As Gerald mentioned, it is very difficult to explain in a clear way.
> > > Hopefully, one could make sense ot of it.  
> > 
> > I would say the page table API requires this invariant:
> > 
> >         pud = pud_offset(p4d, addr);
> >         do {
> > 		WARN_ON(pud != pud_offset(p4d, addr));
> >                 next = pud_addr_end(addr, end);
> >         } while (pud++, addr = next, addr != end);
> > 
> > ie pud++ is supposed to be a shortcut for 
> >   pud_offset(p4d, next)
> > 
> > While S390 does not follow this. Fixing addr_end brings it into
> > alignment by preventing pud++ from happening.
> > 
> > The only currently known side effect is that gup_fast crashes, but it
> > sure is an unexpected thing.
> 
> It only is unexpected in a "top-level folding" world, see my other reply.
> Consider it an optimization, which was possible because of how our dynamic
> folding works, and e.g. because we can determine the correct pagetable
> level from a pXd value in pXd_offset.

No, I disagree. The page walker API the arch presents has to have
well-defined semantics. For instance, there is an effort to define tests
and invariants for the page table accesses to bring this understanding
and uniformity:

 mm/debug_vm_pgtable.c

If we fix S390 using the pX_addr_end() change then the above should be
updated with an invariant to check it. I've added Anshuman for some
thoughts..

For better or worse, that invariant does exclude arches from using
other folding techniques.

The other solution would be to address the other side of != and adjust
the pud++

e.g. replace pud++ with something like:
  pud = pud_next_entry(p4d, pud, next)

Such that:
  pud_next_entry(p4d, pud, next) === pud_offset(p4d, next)

In which case the invariant changes to 'callers can never do pointer
arithmetic on the result of pXX_offset()', which is a bit harder to
enforce.
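
Just to spell out the identity, a trivial sketch of the generic fallback
for such a helper (illustrative only, not a proposal for actual code; an
arch like s390 would re-derive the pointer from p4d and next instead):

static inline pud_t *pud_next_entry(p4d_t *p4d, pud_t *pud,
				    unsigned long next)
{
	/* generic case: pud++ and pud_offset(p4d, next) are the same */
	return pud + 1;
}

The interesting part would only be the non-generic implementation, which
could simply return pud_offset(p4d, next) and thereby hide the dynamic
folding from the callers.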

Jason

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 15:10                     ` Jason Gunthorpe
                                         ` (2 preceding siblings ...)
  (?)
@ 2020-09-10 17:07                       ` Gerald Schaefer
  -1 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-10 17:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Anshuman Khandual, Alexander Gordeev, Dave Hansen, John Hubbard,
	LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On Thu, 10 Sep 2020 12:10:26 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Thu, Sep 10, 2020 at 03:28:03PM +0200, Gerald Schaefer wrote:
> > On Thu, 10 Sep 2020 10:02:33 -0300
> > Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >   
> > > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> > >   
> > > > As Gerald mentioned, it is very difficult to explain in a clear way.
> > > > Hopefully, one could make sense out of it.
> > > 
> > > I would say the page table API requires this invariant:
> > > 
> > >         pud = pud_offset(p4d, addr);
> > >         do {
> > > 		WARN_ON(pud != pud_offset(p4d, addr));
> > >                 next = pud_addr_end(addr, end);
> > >         } while (pud++, addr = next, addr != end);
> > > 
> > > ie pud++ is supposed to be a shortcut for 
> > >   pud_offset(p4d, next)
> > > 
> > > While S390 does not follow this. Fixing addr_end brings it into
> > > alignment by preventing pud++ from happening.
> > > 
> > > The only currently known side effect is that gup_fast crashes, but it
> > > sure is an unexpected thing.  
> > 
> > It only is unexpected in a "top-level folding" world, see my other reply.
> > Consider it an optimization, which was possible because of how our dynamic
> > folding works, and e.g. because we can determine the correct pagetable
> > level from a pXd value in pXd_offset.  
> 
> No, I disagree. The page walker API the arch presents has to have well-
> defined semantics. For instance, there is an effort to define tests
> and invariants for the page table accesses to bring this understanding
> and uniformity:
> 
>  mm/debug_vm_pgtable.c
> 
> If we fix S390 using the pX_addr_end() change then the above should be
> updated with an invariant to check it. I've added Anshuman for some
> thoughts..

We are very aware of those tests, and actually a big supporter of the
idea. s390 is already among the supported architectures, and the tests
have already helped us find and fix some s390 oddities.

However, we did not see any issues wrt our pagetable walking, neither
with the current version nor with the new generic approach. We do
currently see other issues; Anshuman will know what I mean :-)

> For better or worse, that invariant does exclude arches from using
> other folding techniques.
> 
> The other solution would be to address the other side of != and adjust
> the pud++
> 
> e.g. replace pud++ with something like:
>   pud = pud_next_entry(p4d, pud, next)
> 
> Such that:
>   pud_next_entry(p4d, pud, next) === pud_offset(p4d, next)
> 
> In which case the invariant changes to 'callers can never do pointer
> arithmetic on the result of pXX_offset()' which is a bit harder to
> enforce.

I might have lost track a bit. Are we still talking about possible
functional impacts of either our current pagetable walking with s390
(apart from gup_fast), or the proposed generic change (for s390, or
others?)?

Or is this rather some (other) generic issue / idea that you have,
in order to put "some more structure / enforcement" to generic
pagetable walkers?

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 17:07                       ` Gerald Schaefer
                                           ` (2 preceding siblings ...)
  (?)
@ 2020-09-10 17:19                         ` Jason Gunthorpe
  -1 siblings, 0 replies; 313+ messages in thread
From: Jason Gunthorpe @ 2020-09-10 17:19 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Anshuman Khandual, Alexander Gordeev, Dave Hansen, John Hubbard,
	LKML, linux-mm, linux-arch, Andrew Morton, Linus Torvalds,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390, Vasily Gorbik,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda

On Thu, Sep 10, 2020 at 07:07:57PM +0200, Gerald Schaefer wrote:

> I might have lost track a bit. Are we still talking about possible
> functional impacts of either our current pagetable walking with s390
> (apart from gup_fast), or the proposed generic change (for s390, or
> others?)?

I'm looking for a more understandable explanation of what is wrong
with the S390 implementation.

If the page operations require the invariant I described then it is
quite easy to explain the problem and understand the solution.

Jason

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10  9:39               ` Alexander Gordeev
                                   ` (3 preceding siblings ...)
  (?)
@ 2020-09-10 17:35                 ` Linus Torvalds
  -1 siblings, 0 replies; 313+ messages in thread
From: Linus Torvalds @ 2020-09-10 17:35 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Jason Gunthorpe, Gerald Schaefer, Dave Hansen, John Hubbard,
	LKML, linux-mm, linux-arch, Andrew Morton, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, Sep 10, 2020 at 2:40 AM Alexander Gordeev
<agordeev@linux.ibm.com> wrote:
>
> It is only the gup_fast case that exposes the issue. It hits because
> pointers to stack copies are passed to gup_pXd_range iterators, not
> pointers to real page tables themselves.

Can we possibly change fast-gup to not do the stack copies?

I'd actually rather do something like that, than the "addr_end" thing.

As you say, none of the other page table walking code does what the
GUP code does, and I don't think it's required.

The GUP code is kind of strange, and I'm not quite sure why. Some of it,
unusually, came from the powerpc code that handled their special odd
hugepage model, and that may be why it's so different.

How painful would it be to just pass the pmd (etc) _pointers_ around,
rather than do the odd "take the address of local copies"?
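
For reference, the stack-copy pattern in question looks roughly like this
in mm/gup.c (an abridged sketch from memory, with the huge-page and leaf
handling dropped); note that &p4d is the address of the caller's on-stack
copy, not of a real page table entry:

  static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
  			 unsigned int flags, struct page **pages, int *nr)
  {
  	unsigned long next;
  	pud_t *pudp;

  	/* address of the local copy, not of the page table itself */
  	pudp = pud_offset(&p4d, addr);
  	do {
  		pud_t pud = READ_ONCE(*pudp);

  		next = pud_addr_end(addr, end);
  		if (pud_none(pud))
  			return 0;
  		if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
  			return 0;
  	} while (pudp++, addr = next, addr != end);

  	return 1;
  }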

                  Linus

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 13:02                 ` Jason Gunthorpe
                                     ` (2 preceding siblings ...)
  (?)
@ 2020-09-10 17:57                   ` Gerald Schaefer
  -1 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-10 17:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm,
	linux-arch, Andrew Morton, Linus Torvalds, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, 10 Sep 2020 10:02:33 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> 
> > As Gerald mentioned, it is very difficult to explain in a clear way.
> > Hopefully, one could make sense out of it.
> 
> I would say the page table API requires this invariant:
> 
>         pud = pud_offset(p4d, addr);
>         do {
> 		WARN_ON(pud != pud_offset(p4d, addr));
>                 next = pud_addr_end(addr, end);
>         } while (pud++, addr = next, addr != end);
> 
> ie pud++ is supposed to be a shortcut for 
>   pud_offset(p4d, next)
> 

Hmm, IIUC, all architectures with static folding will simply return
the passed-in p4d pointer for pud_offset(p4d, addr), for 3-level
pagetables. There is no difference for s390. For gup_fast, that p4d
pointer is not really a pointer to a value in a pagetable, but
to some local copy of such a value, and not just for s390.
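
For reference, the generic statically folded variant (from
include/asm-generic/pgtable-nopud.h, quoted roughly from memory) really
just hands the p4d pointer back:

  static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
  {
  	return (pud_t *)p4d;
  }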

So, pud = p4d = pointer to copy, and increasing that pud pointer
cannot be the same as pud_offset(p4d, next). I do see your point
now, at least I think so :-) My problem is that I do not see where
we would have an s390-specific issue here. Maybe my understanding
of how it works for others with static folding is wrong. That
would explain my difficulties in getting your point...

> While S390 does not follow this. Fixing addr_end brings it into
> alignment by preventing pud++ from happening.

Exactly, only that nobody seems to follow it, IIUC. Fixing it up
with pXd_addr_end was my impression of what we need to do, in order to
have it work the same way as for others.

> The only currently known side effect is that gup_fast crashes, but it
> sure is an unexpected thing.

Well, from my understanding it feels more unexpected that something
that is supposed to be a pointer to an entry in a page table really
is just a pointer to some copy somewhere.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 17:35                 ` Linus Torvalds
                                     ` (2 preceding siblings ...)
  (?)
@ 2020-09-10 18:13                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 313+ messages in thread
From: Jason Gunthorpe @ 2020-09-10 18:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Gordeev, Gerald Schaefer, Dave Hansen, John Hubbard,
	LKML, linux-mm, linux-arch, Andrew Morton, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, Sep 10, 2020 at 10:35:38AM -0700, Linus Torvalds wrote:
> On Thu, Sep 10, 2020 at 2:40 AM Alexander Gordeev
> <agordeev@linux.ibm.com> wrote:
> >
> > It is only the gup_fast case that exposes the issue. It hits because
> > pointers to stack copies are passed to gup_pXd_range iterators, not
> > pointers to real page tables themselves.
> 
> Can we possibly change fast-gup to not do the stack copies?
>
> I'd actually rather do something like that, than the "addr_end" thing.

> As you say, none of the other page table walking code does what the
> GUP code does, and I don't think it's required.

As I understand it, the requirement is because fast-gup walks without
the page table spinlock or mmap_sem held, so it must READ_ONCE the
*pXX.

It then checks that it is a valid page table pointer, then calls
pXX_offset().

The arch implementation of pXX_offset() dereferences the passed pXX
pointer again. That defeats the READ_ONCE, and the 2nd load could
observe something that is no longer a page table pointer and crash.

Passing it the address of the stack value is a way to force
pXX_offset() to use the READ_ONCE result which has already been tested
to be a page table pointer.
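
For example, a typical non-folded implementation looks roughly like
this (simplified, the details vary per arch):

static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
	/* the second load of *pud: this one escapes the caller's
	 * READ_ONCE and can race with the table going away */
	return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}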

Other page walking code that holds the mmap_sem tends to use
pmd_trans_unstable() which solves this problem by injecting a
barrier. The load hidden in pte_offset() after a pmd_trans_unstable()
can't be re-ordered and will only see a page table entry under the
mmap_sem.

However, I think that logic would have been much clearer had it
followed the GUP model of READ_ONCE, rather than extra reads and a
hidden barrier. At least it took me a long time to work it out :(

I also think there are real bugs here where places are reading *pXX
multiple times without locking the page table. One was found recently
in the wild in the huge tlb code IIRC.

mm/pagewalk.c has these missing READ_ONCE bugs too.

So.. To change away from the stack option I think we'd have to pass
the READ_ONCE value to pXX_offset() as an extra argument instead of it
derefing the pointer internally.
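
Something with roughly this shape (name and exact signature made up
here, just to show the idea):

static inline pud_t *pud_offset_from_value(p4d_t *p4dp, p4d_t p4d,
					   unsigned long address)
{
	/* use the value the caller already READ_ONCE'd, never *p4dp */
	return (pud_t *)p4d_page_vaddr(p4d) + pud_index(address);
}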

Jason

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 18:13                   ` Jason Gunthorpe
                                       ` (2 preceding siblings ...)
  (?)
@ 2020-09-10 18:33                     ` Linus Torvalds
  -1 siblings, 0 replies; 313+ messages in thread
From: Linus Torvalds @ 2020-09-10 18:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alexander Gordeev, Gerald Schaefer, Dave Hansen, John Hubbard,
	LKML, linux-mm, linux-arch, Andrew Morton, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, Sep 10, 2020 at 11:13 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> So.. To change away from the stack option I think we'd have to pass
> the READ_ONCE value to pXX_offset() as an extra argument instead of it
> derefing the pointer internally.

Yeah, but I think that would actually be the better model than passing
an address to a random stack location.

It's also effectively what we do in some other places, eg the whole
logic with "orig" in the regular pte fault handling is basically doing
unlocked loads of the pte, various decisions on that, and then doing a
final "is this still the same pte" after it has gotten the page table
lock.

(And yes, those other pte fault handling cases are different, since
they _do_ hold the mmap lock, so they know the page *tables* are
stable, and it's only the last level that then gets re-checked against
the pte once the pte itself has also been stabilized with the page
table lock).
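
Roughly, that pattern looks like this (simplified sketch, not the
exact handle_pte_fault() code):

static vm_fault_t orig_pte_pattern_sketch(struct vm_fault *vmf)
{
	vmf->orig_pte = *vmf->pte;	/* unlocked load */

	/* ... decisions based on vmf->orig_pte, no lock held ... */

	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
	spin_lock(vmf->ptl);
	if (!pte_same(*vmf->pte, vmf->orig_pte)) {
		/* it changed under us: bail out, the fault will retry */
		spin_unlock(vmf->ptl);
		return 0;
	}
	/* ... safe to modify the pte now, still under the ptl ... */
	spin_unlock(vmf->ptl);
	return 0;
}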

So I think it would actually be a better conceptual match to make the
page table walking interface be "here, this is the value I read once
carefully, and this is the address, now give me the next address".

The folded case would then just return the address it was given, and
the non-folded case would return the inner page table based on the
value.
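
I.e. something like this (sketch only, hypothetical names):

/* folded: the level does not really exist, hand back what we were
 * given, the value is not needed */
static inline pud_t *pud_walk_folded(p4d_t *p4dp, p4d_t p4d,
				     unsigned long addr)
{
	return (pud_t *)p4dp;
}

/* non-folded: compute the next-level table from the value that was
 * already read once, not from a fresh dereference of p4dp */
static inline pud_t *pud_walk_unfolded(p4d_t *p4dp, p4d_t p4d,
				       unsigned long addr)
{
	return (pud_t *)p4d_page_vaddr(p4d) + pud_index(addr);
}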

I dunno. I don't actually feel all that strongly about this, so
whatever works, I guess.

                Linus

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 18:33                     ` Linus Torvalds
                                         ` (2 preceding siblings ...)
  (?)
@ 2020-09-10 19:10                       ` Gerald Schaefer
  -1 siblings, 0 replies; 313+ messages in thread
From: Gerald Schaefer @ 2020-09-10 19:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason Gunthorpe, Alexander Gordeev, Dave Hansen, John Hubbard,
	LKML, linux-mm, linux-arch, Andrew Morton, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, 10 Sep 2020 11:33:17 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Sep 10, 2020 at 11:13 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > So.. To change away from the stack option I think we'd have to pass
> > the READ_ONCE value to pXX_offset() as an extra argument instead of it
> > derefing the pointer internally.
> 
> Yeah, but I think that would actually be the better model than passing
> an address to a random stack location.
> 
> It's also effectively what we do in some other places, eg the whole
> logic with "orig" in the regular pte fault handling is basically doing
> unlocked loads of the pte, various decisions on that, and then doing a
> final "is this still the same pte" after it has gotten the page table
> lock.

That sounds a lot like the pXd_offset_orig() from Martin's first approach
in this thread:
https://lore.kernel.org/linuxppc-dev/20190418100218.0a4afd51@mschwideX1/

It is also the "Patch 1" option from the start of this thread:
https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schaefer@linux.ibm.com/

I guess I chose wrongly there, should have had more trust in Martin's
approach, and not tried so hard to do it like the others...

So, maybe we can start over again, from that patch option. It would of
course also initially introduce some gup-specific helpers, like with
the other approach. It seemed harder to generalize when I thought
about it back then, but I guess it should not be a lot harder than
the _addr_end stuff.

Or, maybe this time, just not to risk Christian getting a heart attack,
we could go for the gup-specific helper first, so that we would at
least have a fix for the possible s390 data corruption. Jason, would
you agree that we send a new RFC, this time with pXd_offset_orig()
approach, and have that accepted as short-term fix?

Or would you rather also wait for some proper generic change? I have
lost that option from my radar, so I cannot really judge how much more
effort it would be. I'm on vacation next week anyway, but Alexander
or Vasily (who did the option 1 patch) could look into this further.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 19:10                       ` Gerald Schaefer
                                           ` (2 preceding siblings ...)
  (?)
@ 2020-09-10 19:32                         ` Linus Torvalds
  -1 siblings, 0 replies; 313+ messages in thread
From: Linus Torvalds @ 2020-09-10 19:32 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Jason Gunthorpe, Alexander Gordeev, Dave Hansen, John Hubbard,
	LKML, linux-mm, linux-arch, Andrew Morton, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, Sep 10, 2020 at 12:11 PM Gerald Schaefer
<gerald.schaefer@linux.ibm.com> wrote:
>
> That sounds a lot like the pXd_offset_orig() from Martin's first approach
> in this thread:
> https://lore.kernel.org/linuxppc-dev/20190418100218.0a4afd51@mschwideX1/

I have to admit to finding that name horrible, but aside from that, yes.

I don't think "pXd_offset_orig()" makes any sense as a name. Yes,
"orig" may make sense as the variable name (as in "this was the
original value we read"), but a function name should describe what it
*does*, not what the arguments are.

Plus "original" doesn't make sense to me anyway, since we're not
modifying it. To me, "original" means that there's a final version
too, which this interface in no way implies. It's just "this is the
value we already read".

("orig" does make some sense in that fault path - because by
definition we *are* going to modify the page table entry, that's the
whole point of the fault - we need to do something to not keep
faulting. But here, we're not at all necessarily modifying the page
table contents, we're just following them and reading the values once)

Of course, I don't know what a better name would be to describe what
is actually going on, I'm just explaining why I hate that naming.

*Maybe* something like just "pXd_offset_value()" together with a
comment explaining that it's given the upper pXd pointer _and_ the
value behind it, and it needs to return the next level offset? I
dunno. "value" doesn't really seem horribly descriptive either, but at
least it doesn't feel actively misleading to me.
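
Roughly, i.e. just the shape plus the comment it would need
(hypothetical, nothing more than a sketch):

/*
 * @p4dp points at the upper-level entry, @p4d is the value that was
 * already read from it; return the next-level table to walk, based
 * on that value (or @p4dp itself if the level is folded).
 */
static inline pud_t *pud_offset_value(p4d_t *p4dp, p4d_t p4d,
				      unsigned long addr);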

Yeah, I get hung up on naming sometimes. I don't tend to care much
about private local variables ("i" is a perfectly fine variable name),
but these kinds of somewhat subtle cross-architecture definitions I
feel matter.

               Linus

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 18:13                   ` Jason Gunthorpe
                                       ` (2 preceding siblings ...)
  (?)
@ 2020-09-10 21:22                     ` John Hubbard
  -1 siblings, 0 replies; 313+ messages in thread
From: John Hubbard @ 2020-09-10 21:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Linus Torvalds
  Cc: Alexander Gordeev, Gerald Schaefer, Dave Hansen, LKML, linux-mm,
	linux-arch, Andrew Morton, Russell King, Mike Rapoport,
	Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On 9/10/20 11:13 AM, Jason Gunthorpe wrote:
> On Thu, Sep 10, 2020 at 10:35:38AM -0700, Linus Torvalds wrote:
>> On Thu, Sep 10, 2020 at 2:40 AM Alexander Gordeev
>> <agordeev@linux.ibm.com> wrote:
>>>
>>> It is only gup_fast case that exposes the issue. It hits because
>>> pointers to stack copies are passed to gup_pXd_range iterators, not
>>> pointers to real page tables itself.
>>
>> Can we possibly change fast-gup to not do the stack copies?
>>
>> I'd actually rather do something like that, than the "addr_end" thing.
> 
>> As you say, none of the other page table walking code does what the
>> GUP code does, and I don't think it's required.
> 
> As I understand it, the requirement is because fast-gup walks without
> the page table spinlock or mmap_sem held, so it must READ_ONCE the
> *pXX.
> 
> It then checks that it is a valid page table pointer, then calls
> pXX_offset().
> 
> The arch implementation of pXX_offset() dereferences the passed pXX
> pointer again. That defeats the READ_ONCE, and the 2nd load could
> observe something that is no longer a page table pointer and crash.

Just to be clear, though, that makes it sound a little wilder and more
reckless than it really is, right?

Because actually, the page tables cannot be freed while gup_fast is
walking them, due to either IPI blocking during the walk, or the moral
equivalent (MMU_GATHER_RCU_TABLE_FREE) for non-IPI architectures. So the
page tables can *change* underneath gup_fast, and for example pages can
be unmapped. But they remain valid page tables, it's just that their
contents are unstable. Even if pXd_none()==true.
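
Roughly, the protection is (simplified from the gup_fast setup code,
declarations omitted):

	local_irq_save(flags);
	/*
	 * IRQs off blocks the TLB-flush IPI that must complete before
	 * a page table page can be freed; non-IPI architectures get
	 * the equivalent guarantee from MMU_GATHER_RCU_TABLE_FREE.
	 * So the tables stay allocated for the whole walk, even
	 * though their contents can still change under us.
	 */
	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
	local_irq_restore(flags);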

Or am I way off here, and it really is possible (aside from the current
s390 situation) to observe something that "is no longer a page table"?


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 19:32                         ` Linus Torvalds
                                             ` (2 preceding siblings ...)
  (?)
@ 2020-09-10 21:59                           ` Jason Gunthorpe
  -1 siblings, 0 replies; 313+ messages in thread
From: Jason Gunthorpe @ 2020-09-10 21:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gerald Schaefer, Alexander Gordeev, Dave Hansen, John Hubbard,
	LKML, linux-mm, linux-arch, Andrew Morton, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, Sep 10, 2020 at 12:32:05PM -0700, Linus Torvalds wrote:
> Yeah, I get hung up on naming sometimes. I don't tend to care much
> about private local variables ("i" is a perfectly fine variable name),
> but these kinds of somewhat subtle cross-architecture definitions I
> feel matter.

One of the first replies to this patch was to ask "when would I use
_orig vs normal", so you are not alone. The name should convey it.

So, I suggest pXX_offset_unlocked()

Since it is safe to call without the page table lock, while pXX_offset()
requires the page table lock to be held as the internal *pXX is a data
race otherwise.

Patch 1 might be OK for a stable backport, but to get to a clear
pXX_offset_unlocked() all the arches would want to be changed to
implement that API and the generic code would provide the wrapper:

#define pXX_offset(pXXp, address) pXX_offset_unlocked(pXXp, *(pXXp), address)

Arches would not have a *pXX inside their code.
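
To make the shape concrete, a rough sketch of what one arch-level helper
could look like under this proposal, using the pud->pmd level as an
example. The _unlocked name and signature are only the suggestion above,
not an existing kernel API; pud_page_vaddr() and pmd_index() are the usual
building blocks:

	/*
	 * Hypothetical: the caller passes the pud value it already loaded
	 * (e.g. with READ_ONCE()), so *pudp is never re-read here. The
	 * pointer is kept only so the wrapper below stays source compatible.
	 */
	static inline pmd_t *pmd_offset_unlocked(pud_t *pudp, pud_t pud,
						 unsigned long address)
	{
		return (pmd_t *)pud_page_vaddr(pud) + pmd_index(address);
	}

	/* locked walkers keep the familiar two-argument form */
	#define pmd_offset(pudp, address) \
		pmd_offset_unlocked(pudp, *(pudp), address)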

Then we can talk about auditing call sites of pXX_offset and think
about using the _unlocked version in places where the page table lock
is not held.

For instance mm/pagewalk.c should be changed. So should
huge_pte_offset() and probably other places. These places might
already be existing data-race bugs.

It is code-as-documentation indicating an unlocked page table walk.

Now it is not just an s390 story but a change that makes the data
concurrency much clearer, so I think I prefer this version to the
addr_end one too.

Regards,
Jason

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 21:22                     ` John Hubbard
                                         ` (2 preceding siblings ...)
  (?)
@ 2020-09-10 22:11                       ` Jason Gunthorpe
  -1 siblings, 0 replies; 313+ messages in thread
From: Jason Gunthorpe @ 2020-09-10 22:11 UTC (permalink / raw)
  To: John Hubbard
  Cc: Linus Torvalds, Alexander Gordeev, Gerald Schaefer, Dave Hansen,
	LKML, linux-mm, linux-arch, Andrew Morton, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, Sep 10, 2020 at 02:22:37PM -0700, John Hubbard wrote:

> Or am I way off here, and it really is possible (aside from the current
> s390 situation) to observe something that "is no longer a page table"?

Yes, that is the issue. Remember there is no locking for GUP
fast. While a page table cannot be freed there is nothing preventing
the page table entry from being concurrently modified.

Without the stack variable it looks like this:

       pud_t pud = READ_ONCE(*pudp);
       if (!pud_present(pud))
            return;
       pmd_offset(pudp, address);

And pmd_offset() expands to

    return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);

Between the READ_ONCE(*pudp) and (*pud) inside pmd_offset() the value
of *pud can change, eg to !pud_present.

Then pud_page_vaddr(*pud) will crash. It is not a use-after-free; it
is using data that has not been validated.
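
For contrast, a condensed sketch of the stack-copy pattern the common
gup_fast code uses (paraphrased from mm/gup.c, not verbatim): the walk
only ever consults the value captured by READ_ONCE(), and the next level
is derived from the address of that stack copy, which is what the s390
folding issue in this thread is about:

	pud_t pud = READ_ONCE(*pudp);

	if (!pud_present(pud))
		return 0;
	/* the next level comes from the stack copy, never from *pudp again */
	pmdp = pmd_offset(&pud, address);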

Jason

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 22:11                       ` Jason Gunthorpe
                                           ` (2 preceding siblings ...)
  (?)
@ 2020-09-10 22:17                         ` John Hubbard
  -1 siblings, 0 replies; 313+ messages in thread
From: John Hubbard @ 2020-09-10 22:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Linus Torvalds, Alexander Gordeev, Gerald Schaefer, Dave Hansen,
	LKML, linux-mm, linux-arch, Andrew Morton, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On 9/10/20 3:11 PM, Jason Gunthorpe wrote:
> On Thu, Sep 10, 2020 at 02:22:37PM -0700, John Hubbard wrote:
> 
>> Or am I way off here, and it really is possible (aside from the current
>> s390 situation) to observe something that "is no longer a page table"?
> 
> Yes, that is the issue. Remember there is no locking for GUP
> fast. While a page table cannot be freed there is nothing preventing
> the page table entry from being concurrently modified.
> 

OK, then we are saying the same thing after all, good.

> Without the stack variable it looks like this:
> 
>         pud_t pud = READ_ONCE(*pudp);
>         if (!pud_present(pud))
>              return;
>         pmd_offset(pudp, address);
> 
> And pmd_offset() expands to
> 
>      return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
> 
> Between the READ_ONCE(*pudp) and (*pud) inside pmd_offset() the value
> of *pud can change, eg to !pud_present.
> 
> Then pud_page_vaddr(*pud) will crash. It is not a use-after-free; it
> is using data that has not been validated.
> 

Right, that matches what I had in mind, too: you can still have a problem
even though you're in the same page table. I just wanted to confirm that
there's not some odd way to launch out into completely non-page-table
memory.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 17:57                   ` Gerald Schaefer
                                       ` (2 preceding siblings ...)
  (?)
@ 2020-09-10 23:21                     ` Jason Gunthorpe
  -1 siblings, 0 replies; 313+ messages in thread
From: Jason Gunthorpe @ 2020-09-10 23:21 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Alexander Gordeev, Dave Hansen, John Hubbard, LKML, linux-mm,
	linux-arch, Andrew Morton, Linus Torvalds, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, Sep 10, 2020 at 07:57:49PM +0200, Gerald Schaefer wrote:
> On Thu, 10 Sep 2020 10:02:33 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> > 
> > > As Gerald mentioned, it is very difficult to explain in a clear way.
> > > Hopefully, one could make sense out of it.
> > 
> > I would say the page table API requires this invariant:
> > 
> >         pud = pud_offset(p4d, addr);
> >         do {
> > 		WARN_ON(pud != pud_offset(p4d, addr));
> >                 next = pud_addr_end(addr, end);
> >         } while (pud++, addr = next, addr != end);
> > 
> > ie pud++ is supposed to be a shortcut for 
> >   pud_offset(p4d, next)
> > 
> 
> Hmm, IIUC, all architectures with static folding will simply return
> the passed-in p4d pointer for pud_offset(p4d, addr), for 3-level
> pagetables.

It is probably moot now, but since other arches don't crash they also
return pud_addr_end() == end so the loop only does one iteration.

ie pud == pud_offset(p4d, addr) for all iterations as the pud++ never
happens.

Which is what this addr_end patch does for s390.
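
For reference, with a statically folded pud the generic header makes that
explicit: the folded helper reduces to roughly the following (paraphrased
from include/asm-generic/pgtable-nopud.h), so next == end on the first
pass and the loop body runs exactly once:

	#define pud_addr_end(addr, end)		(end)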

Jason

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 21:59                           ` Jason Gunthorpe
                                               ` (2 preceding siblings ...)
  (?)
@ 2020-09-11  7:09                             ` peterz
  -1 siblings, 0 replies; 313+ messages in thread
From: peterz @ 2020-09-11  7:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Linus Torvalds, Gerald Schaefer, Alexander Gordeev, Dave Hansen,
	John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, Sep 10, 2020 at 06:59:21PM -0300, Jason Gunthorpe wrote:
> So, I suggest pXX_offset_unlocked()

Urgh, no. Elsewhere in gup _unlocked() means it will take the lock
itself (get_user_pages_unlocked()) -- although often it seems to mean
the lock is already held (git grep _unlocked and marvel).

What we want is _lockless().

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-11  7:09                             ` peterz
                                                 ` (2 preceding siblings ...)
  (?)
@ 2020-09-11 11:19                               ` Jason Gunthorpe
  -1 siblings, 0 replies; 313+ messages in thread
From: Jason Gunthorpe @ 2020-09-11 11:19 UTC (permalink / raw)
  To: peterz
  Cc: Linus Torvalds, Gerald Schaefer, Alexander Gordeev, Dave Hansen,
	John Hubbard, LKML, linux-mm, linux-arch, Andrew Morton,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Fri, Sep 11, 2020 at 09:09:39AM +0200, peterz@infradead.org wrote:
> On Thu, Sep 10, 2020 at 06:59:21PM -0300, Jason Gunthorpe wrote:
> > So, I suggest pXX_offset_unlocked()
> 
> Urgh, no. Elsewhere in gup _unlocked() means it will take the lock
> itself (get_user_pages_unlocked()) -- although often it seems to mean
> the lock is already held (git grep _unlocked and marvel).
>
> What we want is _lockless().

This is clear to me!

Thanks,
Jason 

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-10 22:11                       ` Jason Gunthorpe
                                           ` (2 preceding siblings ...)
  (?)
@ 2020-09-11 12:19                         ` Alexander Gordeev
  -1 siblings, 0 replies; 313+ messages in thread
From: Alexander Gordeev @ 2020-09-11 12:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: John Hubbard, Linus Torvalds, Gerald Schaefer, Dave Hansen, LKML,
	linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport,
	Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Thu, Sep 10, 2020 at 07:11:16PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 10, 2020 at 02:22:37PM -0700, John Hubbard wrote:
> 
> > Or am I way off here, and it really is possible (aside from the current
> > s390 situation) to observe something that "is no longer a page table"?
> 
> Yes, that is the issue. Remember there is no locking for GUP
> fast. While a page table cannot be freed there is nothing preventing
> the page table entry from being concurrently modified.
> 
> Without the stack variable it looks like this:
> 
>        pud_t pud = READ_ONCE(*pudp);
>        if (!pud_present(pud))
>             return
>        pmd_offset(pudp, address);
> 
> And pmd_offset() expands to
> 
>     return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
> 
> Between the READ_ONCE(*pudp) and (*pud) inside pmd_offset() the value
> of *pud can change, eg to !pud_present.
> 
> Then pud_page_vaddr(*pud) will crash. It is not use after free, it
> is using data that has not been validated.

One thing I keep asking myself, and this is probably a good moment to
raise it:

What if the entry is still pud_present, but got remapped after
READ_ONCE(*pudp)? IOW, it is still valid, but points elsewhere?

> Jason

^ permalink raw reply	[flat|nested] 313+ messages in thread
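
A minimal sketch of the two patterns under discussion, assuming the
pmd_offset_lockless() helper proposed later in this thread (hypothetical
walker, not code from the patch):

	static int walk_pud_once(pud_t *pudp, unsigned long addr)
	{
		pud_t pud = READ_ONCE(*pudp);	/* single snapshot of the entry */
		pmd_t *pmdp;

		if (!pud_present(pud))
			return 0;

		/*
		 * Racy: pmd_offset(pudp, addr) dereferences *pudp again, so a
		 * concurrent modification can be observed after the check above.
		 * Safe: hand the validated snapshot down, so the helper never
		 * looks at the live entry a second time.
		 */
		pmdp = pmd_offset_lockless(pudp, pud, addr);
		return pmd_present(READ_ONCE(*pmdp));
	}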

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-11 12:19                         ` Alexander Gordeev
                                             ` (2 preceding siblings ...)
  (?)
@ 2020-09-11 16:45                           ` Linus Torvalds
  -1 siblings, 0 replies; 313+ messages in thread
From: Linus Torvalds @ 2020-09-11 16:45 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Jason Gunthorpe, John Hubbard, Gerald Schaefer, Dave Hansen,
	LKML, linux-mm, linux-arch, Andrew Morton, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Fri, Sep 11, 2020 at 5:20 AM Alexander Gordeev
<agordeev@linux.ibm.com> wrote:
>
> What if the entry is still pud_present, but got remapped after
> READ_ONCE(*pudp)? IOW, it is still valid, but points elsewhere?

That can't happen.

The GUP walk doesn't hold any locks, but it *is* done with interrupts
disabled, and anybody who is modifying the page tables needs to do the
TLB flush, and/or RCU-free them.

The interrupt disable means that on architectures where the TLB flush
involves an IPI, it will be delayed until afterwards, but it also acts
as a big RCU read lock hammer.

So the page tables can get modified under us, but the old pages won't
be released and re-used.

                Linus

^ permalink raw reply	[flat|nested] 313+ messages in thread
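
A rough sketch of the guard Linus is describing, loosely modelled on the
mm/gup.c fast path of this era (names simplified, not an exact copy):

	static int gup_fast_sketch(unsigned long start, unsigned long end,
				   unsigned int gup_flags, struct page **pages)
	{
		unsigned long irq_flags;
		int nr_pinned = 0;

		/*
		 * IRQs off: IPI-based TLB flushes from other CPUs are held off,
		 * and RCU-freed page tables cannot be reclaimed, so entries may
		 * change under us but the tables themselves stay valid memory.
		 */
		local_irq_save(irq_flags);
		gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
		local_irq_restore(irq_flags);

		return nr_pinned;
	}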

* [PATCH] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-11  7:09                             ` peterz
                                                 ` (2 preceding siblings ...)
  (?)
@ 2020-09-11 19:03                               ` Vasily Gorbik
  -1 siblings, 0 replies; 313+ messages in thread
From: Vasily Gorbik @ 2020-09-11 19:03 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard, Linus Torvalds
  Cc: Gerald Schaefer, Alexander Gordeev, Peter Zijlstra, Dave Hansen,
	LKML, linux-mm, linux-arch, Andrew Morton, Russell King,
	Mike Rapoport, Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda

Currently, to make sure that every page table entry is read just once,
the gup_fast walk performs a READ_ONCE and passes the pXd value down to
the next gup_pXd_range function by value, e.g.:

static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
...
        pudp = pud_offset(&p4d, addr);

This function passes a reference to that local value copy to pXd_offset,
and might get the very same pointer back in return. This happens when the
level is folded (on most architectures), and that pointer should not be
iterated over.

On s390, because each task might use 5-, 4- or 3-level address
translation, and hence have different levels folded, the logic is more
complex, and a non-iterable pointer to a local copy leads to severe
problems.

Here is an example of what happens with gup_fast on s390, for a task
with 3-level paging, crossing a 2 GB pud boundary:

// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
{
        unsigned long next;
        pud_t *pudp;

        // pud_offset returns &p4d itself (a pointer to a value on stack)
        pudp = pud_offset(&p4d, addr);
        do {
                // on the second iteration, reading a "random" stack value
                pud_t pud = READ_ONCE(*pudp);

                // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                next = pud_addr_end(addr, end);
                ...
        } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

        return 1;
}

This happens since s390 moved to common gup code with
commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
and commit 1a42010cdc26 ("s390/mm: convert to the generic
get_user_pages_fast code"). s390 tried to mimic static level folding by
changing pXd_offset primitives to always calculate top level page table
offset in pgd_offset and just return the value passed when pXd_offset
has to act as folded.

What is crucial for gup_fast, and what has been overlooked, is that
PxD_SIZE/MASK, and thus pXd_addr_end, should also change correspondingly.
And the latter is not possible with dynamic folding.

To fix the issue, in addition to the pXd values, pass the original
pXdp pointers down to the gup_pXd_range functions, and introduce
pXd_offset_lockless helpers which take an additional pXd entry value
parameter. This has already been discussed in
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1

Cc: <stable@vger.kernel.org> # 5.2+
Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
---
 arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
 include/linux/pgtable.h         | 10 ++++++++
 mm/gup.c                        | 18 +++++++-------
 3 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7eb01a5459cd..b55561cc8786 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
 
 #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
 
-static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
+static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address)
 {
-	if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
-		return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
-	return (p4d_t *) pgd;
+	if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
+		return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
+	return (p4d_t *) pgdp;
 }
+#define p4d_offset_lockless p4d_offset_lockless
 
-static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
+static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
 {
-	if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
-		return (pud_t *) p4d_deref(*p4d) + pud_index(address);
-	return (pud_t *) p4d;
+	return p4d_offset_lockless(pgdp, *pgdp, address);
+}
+
+static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address)
+{
+	if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
+		return (pud_t *) p4d_deref(p4d) + pud_index(address);
+	return (pud_t *) p4dp;
+}
+#define pud_offset_lockless pud_offset_lockless
+
+static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address)
+{
+	return pud_offset_lockless(p4dp, *p4dp, address);
 }
 #define pud_offset pud_offset
 
-static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
+static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address)
+{
+	if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
+		return (pmd_t *) pud_deref(pud) + pmd_index(address);
+	return (pmd_t *) pudp;
+}
+#define pmd_offset_lockless pmd_offset_lockless
+
+static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address)
 {
-	if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
-		return (pmd_t *) pud_deref(*pud) + pmd_index(address);
-	return (pmd_t *) pud;
+	return pmd_offset_lockless(pudp, *pudp, address);
 }
 #define pmd_offset pmd_offset
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8cbc2e795d5..e899d3506671 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
 #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
 #endif
 
+#ifndef p4d_offset_lockless
+#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&pgd, address)
+#endif
+#ifndef pud_offset_lockless
+#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&p4d, address)
+#endif
+#ifndef pmd_offset_lockless
+#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&pud, address)
+#endif
+
 /*
  * p?d_leaf() - true if this entry is a final mapping to a physical address.
  * This differs from p?d_huge() by the fact that they are always available (if
diff --git a/mm/gup.c b/mm/gup.c
index e5739a1974d5..578bf5bd8bf8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	return 1;
 }
 
-static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end,
 		unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	pmd_t *pmdp;
 
-	pmdp = pmd_offset(&pud, addr);
+	pmdp = pmd_offset_lockless(pudp, pud, addr);
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
@@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	return 1;
 }
 
-static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
+static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	pud_t *pudp;
 
-	pudp = pud_offset(&p4d, addr);
+	pudp = pud_offset_lockless(p4dp, p4d, addr);
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
@@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
 					 PUD_SHIFT, next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
+		} else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr))
 			return 0;
 	} while (pudp++, addr = next, addr != end);
 
 	return 1;
 }
 
-static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
+static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	p4d_t *p4dp;
 
-	p4dp = p4d_offset(&pgd, addr);
+	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
@@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
 					 P4D_SHIFT, next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
+		} else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
 			return 0;
 	} while (p4dp++, addr = next, addr != end);
 
@@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
 					 PGDIR_SHIFT, next, flags, pages, nr))
 				return;
-		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
+		} else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr))
 			return;
 	} while (pgdp++, addr = next, addr != end);
 }
-- 
⣿⣿⣿⣿⢋⡀⣀⠹⣿⣿⣿⣿
⣿⣿⣿⣿⠠⣶⡦⠀⣿⣿⣿⣿
⣿⣿⣿⠏⣴⣮⣴⣧⠈⢿⣿⣿
⣿⣿⡏⢰⣿⠖⣠⣿⡆⠈⣿⣿
⣿⢛⣵⣄⠙⣶⣶⡟⣅⣠⠹⣿
⣿⣜⣛⠻⢎⣉⣉⣀⠿⣫⣵⣿

^ permalink raw reply related	[flat|nested] 313+ messages in thread
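
A usage sketch of the fallback path, assuming an architecture that does
not override the new helpers, so the generic #defines from the patch
apply (illustrative only):

	/* generic fallback from the patch: */
	#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&p4d, address)

	/* in gup_pud_range() this expands to the old behaviour: */
	pudp = pud_offset(&p4d, addr);

	/*
	 * With static folding that is still safe: when the pud level is
	 * folded, pud_addr_end(addr, end) == end, so the loop runs exactly
	 * once and the pointer into the on-stack p4d copy is never advanced.
	 * Only s390, with its per-task dynamic folding, needs the pXdp/pXd
	 * pair to reach the real page table.
	 */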

* [PATCH] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-11 19:03                               ` Vasily Gorbik
  0 siblings, 0 replies; 313+ messages in thread
From: Vasily Gorbik @ 2020-09-11 19:03 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard, Linus Torvalds
  Cc: Peter Zijlstra, Benjamin Herrenschmidt, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Gerald Schaefer,
	Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Dave Hansen, LKML, Michael Ellerman, Andrew Morton, linux-power,
	Mike Rapoport

Currently, to make sure that every page table entry is read just once,
the gup_fast walk performs READ_ONCE and passes the pXd value down to the
next gup_pXd_range function by value, e.g.:

static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
...
        pudp = pud_offset(&p4d, addr);

This function passes a pointer to that local value copy to pXd_offset,
and might get the very same pointer back in return. This happens when
the level is folded (on most arches), and that pointer must not be
iterated.

On s390, each task might use 5-, 4- or 3-level address translation and
hence have different levels folded, so the logic is more complex, and a
non-iterable pointer to a local copy leads to severe problems.

Here is an example of what happens with gup_fast on s390, for a task
with 3-level paging, crossing a 2 GB pud boundary:

// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
{
        unsigned long next;
        pud_t *pudp;

        // pud_offset returns &p4d itself (a pointer to a value on stack)
        pudp = pud_offset(&p4d, addr);
        do {
                // on the second iteration this reads a "random" stack value
                pud_t pud = READ_ONCE(*pudp);

                // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                next = pud_addr_end(addr, end);
                ...
        } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

        return 1;
}

This happens since s390 moved to the common gup code with
commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
and commit 1a42010cdc26 ("s390/mm: convert to the generic
get_user_pages_fast code"). s390 tried to mimic static level folding by
changing the pXd_offset primitives to always calculate the top-level page
table offset in pgd_offset, and to just return the value passed in when
pXd_offset has to act as folded.

What is crucial for gup_fast, and what has been overlooked, is that
PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly,
and the latter is not possible with dynamic folding.

To fix the issue, pass the original pXdp pointers down to the
gup_pXd_range functions in addition to the pXd values, and introduce
pXd_offset_lockless helpers, which take an additional pXd entry value
parameter. This has already been discussed in
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1

Cc: <stable@vger.kernel.org> # 5.2+
Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
---
 arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
 include/linux/pgtable.h         | 10 ++++++++
 mm/gup.c                        | 18 +++++++-------
 3 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7eb01a5459cd..b55561cc8786 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
 
 #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
 
-static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
+static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address)
 {
-	if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
-		return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
-	return (p4d_t *) pgd;
+	if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
+		return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
+	return (p4d_t *) pgdp;
 }
+#define p4d_offset_lockless p4d_offset_lockless
 
-static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
+static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
 {
-	if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
-		return (pud_t *) p4d_deref(*p4d) + pud_index(address);
-	return (pud_t *) p4d;
+	return p4d_offset_lockless(pgdp, *pgdp, address);
+}
+
+static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address)
+{
+	if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
+		return (pud_t *) p4d_deref(p4d) + pud_index(address);
+	return (pud_t *) p4dp;
+}
+#define pud_offset_lockless pud_offset_lockless
+
+static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address)
+{
+	return pud_offset_lockless(p4dp, *p4dp, address);
 }
 #define pud_offset pud_offset
 
-static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
+static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address)
+{
+	if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
+		return (pmd_t *) pud_deref(pud) + pmd_index(address);
+	return (pmd_t *) pudp;
+}
+#define pmd_offset_lockless pmd_offset_lockless
+
+static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address)
 {
-	if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
-		return (pmd_t *) pud_deref(*pud) + pmd_index(address);
-	return (pmd_t *) pud;
+	return pmd_offset_lockless(pudp, *pudp, address);
 }
 #define pmd_offset pmd_offset
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8cbc2e795d5..e899d3506671 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
 #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
 #endif
 
+#ifndef p4d_offset_lockless
+#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&pgd, address)
+#endif
+#ifndef pud_offset_lockless
+#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&p4d, address)
+#endif
+#ifndef pmd_offset_lockless
+#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&pud, address)
+#endif
+
 /*
  * p?d_leaf() - true if this entry is a final mapping to a physical address.
  * This differs from p?d_huge() by the fact that they are always available (if
diff --git a/mm/gup.c b/mm/gup.c
index e5739a1974d5..578bf5bd8bf8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	return 1;
 }
 
-static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end,
 		unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	pmd_t *pmdp;
 
-	pmdp = pmd_offset(&pud, addr);
+	pmdp = pmd_offset_lockless(pudp, pud, addr);
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
@@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	return 1;
 }
 
-static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
+static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	pud_t *pudp;
 
-	pudp = pud_offset(&p4d, addr);
+	pudp = pud_offset_lockless(p4dp, p4d, addr);
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
@@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
 					 PUD_SHIFT, next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
+		} else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr))
 			return 0;
 	} while (pudp++, addr = next, addr != end);
 
 	return 1;
 }
 
-static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
+static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	p4d_t *p4dp;
 
-	p4dp = p4d_offset(&pgd, addr);
+	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
@@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
 					 P4D_SHIFT, next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
+		} else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
 			return 0;
 	} while (p4dp++, addr = next, addr != end);
 
@@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
 					 PGDIR_SHIFT, next, flags, pages, nr))
 				return;
-		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
+		} else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr))
 			return;
 	} while (pgdp++, addr = next, addr != end);
 }
-- 
⣿⣿⣿⣿⢋⡀⣀⠹⣿⣿⣿⣿
⣿⣿⣿⣿⠠⣶⡦⠀⣿⣿⣿⣿
⣿⣿⣿⠏⣴⣮⣴⣧⠈⢿⣿⣿
⣿⣿⡏⢰⣿⠖⣠⣿⡆⠈⣿⣿
⣿⢛⣵⣄⠙⣶⣶⡟⣅⣠⠹⣿
⣿⣜⣛⠻⢎⣉⣉⣀⠿⣫⣵⣿

^ permalink raw reply related	[flat|nested] 313+ messages in thread
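
To make the failure mode above easier to see outside the kernel, here is a
minimal user-space mock (made-up types and helpers, not the kernel
implementation): on a folded level the offset helper hands back whatever
pointer it was given, so feeding it the address of a local copy leaves the
walk iterating over the caller's stack, while a lockless variant that also
receives the original table pointer can return something that is safe to
iterate.

#include <stdio.h>

/* Hypothetical two-level mock of a folded page table level. */
typedef struct { unsigned long val; } p4d_t;
typedef struct { unsigned long val; } pud_t;

/* Folded level: reinterpret whatever entry pointer it was given. */
static pud_t *pud_offset(p4d_t *p4d, unsigned long addr)
{
        (void)addr;
        return (pud_t *)p4d;
}

/* Lockless variant: return the original table pointer, not the copy. */
static pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long addr)
{
        (void)p4d;
        (void)addr;
        return (pud_t *)p4dp;
}

int main(void)
{
        p4d_t table[4] = { { 10 }, { 11 }, { 12 }, { 13 } }; /* the "real" table */
        p4d_t copy = table[0];                        /* READ_ONCE-style local copy */

        pud_t *bad  = pud_offset(&copy, 0);           /* points at the stack copy */
        pud_t *good = pud_offset_lockless(&table[0], copy, 0);

        /* Incrementing 'bad' walks past the local copy into unrelated stack
         * memory; incrementing 'good' walks the actual table, which is what
         * gup_fast needs. */
        printf("bad  walks the table? %s\n", bad  == (pud_t *)&table[0] ? "yes" : "no");
        printf("good walks the table? %s\n", good == (pud_t *)&table[0] ? "yes" : "no");
        return 0;
}

Compiled with any C compiler this prints "no" for the pointer derived from
the stack copy and "yes" for the lockless one.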

* Re: [PATCH] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-11 19:03                               ` Vasily Gorbik
                                                   ` (2 preceding siblings ...)
  (?)
@ 2020-09-11 19:09                                 ` Linus Torvalds
  -1 siblings, 0 replies; 313+ messages in thread
From: Linus Torvalds @ 2020-09-11 19:09 UTC (permalink / raw)
  To: Vasily Gorbik
  Cc: Jason Gunthorpe, John Hubbard, Gerald Schaefer,
	Alexander Gordeev, Peter Zijlstra, Dave Hansen, LKML, linux-mm,
	linux-arch, Andrew Morton, Russell King, Mike Rapoport,
	Catalin Marinas, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, Jeff Dike,
	Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda

On Fri, Sep 11, 2020 at 12:04 PM Vasily Gorbik <gor@linux.ibm.com> wrote:
>
> Currently to make sure that every page table entry is read just once
> gup_fast walks perform READ_ONCE and pass pXd value down to the next
> gup_pXd_range function by value e.g.:
[ ... ]

Ack, this looks sane to me.

I was going to ask how horrible it would be to convert all the other
users, but a quick grep convinced me that yeah, it's only GUP that is
this special, and we don't want to make this interface be the real one
for everything else too..

                Linus

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-11 19:03                               ` Vasily Gorbik
                                                   ` (2 preceding siblings ...)
  (?)
@ 2020-09-11 19:40                                 ` Jason Gunthorpe
  -1 siblings, 0 replies; 313+ messages in thread
From: Jason Gunthorpe @ 2020-09-11 19:40 UTC (permalink / raw)
  To: Vasily Gorbik
  Cc: John Hubbard, Linus Torvalds, Gerald Schaefer, Alexander Gordeev,
	Peter Zijlstra, Dave Hansen, LKML, linux-mm, linux-arch,
	Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Fri, Sep 11, 2020 at 09:03:06PM +0200, Vasily Gorbik wrote:
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..e899d3506671 100644
> +++ b/include/linux/pgtable.h
> @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
>  #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
>  #endif
>  
> +#ifndef p4d_offset_lockless
> +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&pgd, address)
> +#endif
> +#ifndef pud_offset_lockless
> +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&p4d, address)
> +#endif
> +#ifndef pmd_offset_lockless
> +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&pud, address)

Needs brackets: &(pgd) 

These would probably be better as static inlines though, as only s390
compiles will type check pudp like this.

Jason

^ permalink raw reply	[flat|nested] 313+ messages in thread
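
The &(pgd) remark is standard macro hygiene: a macro parameter used without
its own parentheses can expand into a parse the caller did not intend. With
the current callers, which pass plain local variables, nothing would
actually misparse, so the brackets are cheap insurance rather than a bug
fix. A generic stand-alone illustration of the principle (not taken from
the patch):

#include <stdio.h>

/* Classic illustration of why macro parameters get parenthesized. */
#define DOUBLE_BAD(x)  (x + x)          /* parameter used bare */
#define DOUBLE_GOOD(x) ((x) + (x))      /* parameter parenthesized */

int main(void)
{
        /*
         * DOUBLE_BAD(1 << 1) expands to (1 << 1 + 1 << 1), and since '+'
         * binds tighter than '<<' this evaluates to 8, not the intended 4.
         * DOUBLE_GOOD(1 << 1) expands to ((1 << 1) + (1 << 1)) == 4.
         */
        printf("%d %d\n", DOUBLE_BAD(1 << 1), DOUBLE_GOOD(1 << 1));
        return 0;
}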

* Re: [PATCH] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-11 19:40                                 ` Jason Gunthorpe
                                                     ` (2 preceding siblings ...)
  (?)
@ 2020-09-11 20:05                                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 313+ messages in thread
From: Jason Gunthorpe @ 2020-09-11 20:05 UTC (permalink / raw)
  To: Vasily Gorbik
  Cc: John Hubbard, Linus Torvalds, Gerald Schaefer, Alexander Gordeev,
	Peter Zijlstra, Dave Hansen, LKML, linux-mm, linux-arch,
	Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Fri, Sep 11, 2020 at 04:40:00PM -0300, Jason Gunthorpe wrote:
> These would probably be better as static inlines though, as only s390
> compiles will type check pudp like this.

Never mind, it must be a macro - still need brackets though

Jason

^ permalink raw reply	[flat|nested] 313+ messages in thread
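
One plausible reading of "it must be a macro": the generic fallback takes
the address of its pXd argument, and with a macro that address names the
caller's own local variable, which stays valid for the whole walk. A static
inline receiving the pXd by value would instead take the address of its own
parameter and hand back a pointer that dies when the inline returns. A
small stand-alone sketch of that difference (hypothetical types, not the
kernel code):

#include <stdio.h>

typedef struct { unsigned long val; } pud_t;
typedef struct { unsigned long val; } pmd_t;

static pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
{
        (void)addr;
        return (pmd_t *)pud;            /* folded level: reuse the entry pointer */
}

/* Macro fallback: &(pud) names the CALLER's variable, alive for the walk. */
#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address)

/* Inline-style fallback: &pud is the address of this function's own
 * by-value parameter, which is gone as soon as the function returns. */
static pmd_t *pmd_offset_lockless_inline(pud_t *pudp, pud_t pud,
                                         unsigned long address)
{
        (void)pudp;
        return pmd_offset(&pud, address);       /* dangling once we return */
}

int main(void)
{
        pud_t pud = { 42 };
        pmd_t *from_macro  = pmd_offset_lockless(&pud, pud, 0);
        pmd_t *from_inline = pmd_offset_lockless_inline(&pud, pud, 0);

        printf("macro form points at the caller's pud: %s\n",
               from_macro == (pmd_t *)&pud ? "yes" : "no");
        /* from_inline must not be dereferenced: it no longer points at
         * valid storage. */
        (void)from_inline;
        return 0;
}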

* [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-11 20:05                                   ` Jason Gunthorpe
                                                       ` (2 preceding siblings ...)
  (?)
@ 2020-09-11 20:36                                     ` Vasily Gorbik
  -1 siblings, 0 replies; 313+ messages in thread
From: Vasily Gorbik @ 2020-09-11 20:36 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: Linus Torvalds, Gerald Schaefer, Alexander Gordeev,
	Peter Zijlstra, Dave Hansen, LKML, linux-mm, linux-arch,
	Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

Currently, to make sure that every page table entry is read just once,
the gup_fast walk performs READ_ONCE and passes the pXd value down to the
next gup_pXd_range function by value, e.g.:

static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
...
        pudp = pud_offset(&p4d, addr);

This function passes a pointer to that local value copy to pXd_offset,
and might get the very same pointer back in return. This happens when
the level is folded (on most arches), and that pointer must not be
iterated.

On s390, each task might use 5-, 4- or 3-level address translation and
hence have different levels folded, so the logic is more complex, and a
non-iterable pointer to a local copy leads to severe problems.

Here is an example of what happens with gup_fast on s390, for a task
with 3-level paging, crossing a 2 GB pud boundary:

// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
{
        unsigned long next;
        pud_t *pudp;

        // pud_offset returns &p4d itself (a pointer to a value on stack)
        pudp = pud_offset(&p4d, addr);
        do {
                // on the second iteration this reads a "random" stack value
                pud_t pud = READ_ONCE(*pudp);

                // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                next = pud_addr_end(addr, end);
                ...
        } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

        return 1;
}

This happens since s390 moved to the common gup code with
commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
and commit 1a42010cdc26 ("s390/mm: convert to the generic
get_user_pages_fast code"). s390 tried to mimic static level folding by
changing the pXd_offset primitives to always calculate the top-level page
table offset in pgd_offset, and to just return the value passed in when
pXd_offset has to act as folded.

What is crucial for gup_fast, and what has been overlooked, is that
PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly,
and the latter is not possible with dynamic folding.

To fix the issue, pass the original pXdp pointers down to the
gup_pXd_range functions in addition to the pXd values, and introduce
pXd_offset_lockless helpers, which take an additional pXd entry value
parameter. This has already been discussed in
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1

Cc: <stable@vger.kernel.org> # 5.2+
Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
---
v2: added brackets &pgd -> &(pgd)

 arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
 include/linux/pgtable.h         | 10 ++++++++
 mm/gup.c                        | 18 +++++++-------
 3 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7eb01a5459cd..b55561cc8786 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
 
 #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
 
-static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
+static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address)
 {
-	if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
-		return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
-	return (p4d_t *) pgd;
+	if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
+		return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
+	return (p4d_t *) pgdp;
 }
+#define p4d_offset_lockless p4d_offset_lockless
 
-static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
+static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
 {
-	if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
-		return (pud_t *) p4d_deref(*p4d) + pud_index(address);
-	return (pud_t *) p4d;
+	return p4d_offset_lockless(pgdp, *pgdp, address);
+}
+
+static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address)
+{
+	if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
+		return (pud_t *) p4d_deref(p4d) + pud_index(address);
+	return (pud_t *) p4dp;
+}
+#define pud_offset_lockless pud_offset_lockless
+
+static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address)
+{
+	return pud_offset_lockless(p4dp, *p4dp, address);
 }
 #define pud_offset pud_offset
 
-static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
+static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address)
+{
+	if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
+		return (pmd_t *) pud_deref(pud) + pmd_index(address);
+	return (pmd_t *) pudp;
+}
+#define pmd_offset_lockless pmd_offset_lockless
+
+static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address)
 {
-	if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
-		return (pmd_t *) pud_deref(*pud) + pmd_index(address);
-	return (pmd_t *) pud;
+	return pmd_offset_lockless(pudp, *pudp, address);
 }
 #define pmd_offset pmd_offset
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8cbc2e795d5..90654cb63e9e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
 #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
 #endif
 
+#ifndef p4d_offset_lockless
+#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
+#endif
+#ifndef pud_offset_lockless
+#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
+#endif
+#ifndef pmd_offset_lockless
+#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address)
+#endif
+
 /*
  * p?d_leaf() - true if this entry is a final mapping to a physical address.
  * This differs from p?d_huge() by the fact that they are always available (if
diff --git a/mm/gup.c b/mm/gup.c
index e5739a1974d5..578bf5bd8bf8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	return 1;
 }
 
-static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end,
 		unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	pmd_t *pmdp;
 
-	pmdp = pmd_offset(&pud, addr);
+	pmdp = pmd_offset_lockless(pudp, pud, addr);
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
@@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	return 1;
 }
 
-static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
+static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	pud_t *pudp;
 
-	pudp = pud_offset(&p4d, addr);
+	pudp = pud_offset_lockless(p4dp, p4d, addr);
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
@@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
 					 PUD_SHIFT, next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
+		} else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr))
 			return 0;
 	} while (pudp++, addr = next, addr != end);
 
 	return 1;
 }
 
-static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
+static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	p4d_t *p4dp;
 
-	p4dp = p4d_offset(&pgd, addr);
+	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
@@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
 					 P4D_SHIFT, next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
+		} else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
 			return 0;
 	} while (p4dp++, addr = next, addr != end);
 
@@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
 					 PGDIR_SHIFT, next, flags, pages, nr))
 				return;
-		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
+		} else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr))
 			return;
 	} while (pgdp++, addr = next, addr != end);
 }
-- 
⣿⣿⣿⣿⢋⡀⣀⠹⣿⣿⣿⣿
⣿⣿⣿⣿⠠⣶⡦⠀⣿⣿⣿⣿
⣿⣿⣿⠏⣴⣮⣴⣧⠈⢿⣿⣿
⣿⣿⡏⢰⣿⠖⣠⣿⡆⠈⣿⣿
⣿⢛⣵⣄⠙⣶⣶⡟⣅⣠⠹⣿
⣿⣜⣛⠻⢎⣉⣉⣀⠿⣫⣵⣿

^ permalink raw reply related	[flat|nested] 313+ messages in thread
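
For completeness, the two-iteration behaviour described in the commit
message can be reproduced with its own numbers. Below is a small
stand-alone sketch using a generic-style pud_addr_end() and a 2 GB pud
size, matching the 2 GB boundary mentioned above; the constants are
assumed for illustration and this is not the kernel code.

#include <stdio.h>

#define PUD_SHIFT 31                    /* 2 GB pud, as in the example above */
#define PUD_SIZE  (1ULL << PUD_SHIFT)
#define PUD_MASK  (~(PUD_SIZE - 1))

/* Generic-style pud_addr_end(), sketched in plain C. */
static unsigned long long pud_addr_end(unsigned long long addr,
                                       unsigned long long end)
{
        unsigned long long boundary = (addr + PUD_SIZE) & PUD_MASK;

        return boundary - 1 < end - 1 ? boundary : end;
}

int main(void)
{
        unsigned long long addr = 0x1007ffff000ULL;     /* from the commit message */
        unsigned long long end  = 0x10080001000ULL;
        int iteration = 0;

        do {
                unsigned long long next = pud_addr_end(addr, end);

                printf("iteration %d: 0x%llx .. 0x%llx\n", ++iteration, addr, next);
                addr = next;
        } while (addr != end);

        return 0;
}

The first iteration stops at the 2 GB boundary 0x10080000000 and the loop
runs a second time, which is exactly the pudp++ step that walks off the
stack copy when the level is dynamically folded.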

* [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-11 20:36                                     ` Vasily Gorbik
  0 siblings, 0 replies; 313+ messages in thread
From: Vasily Gorbik @ 2020-09-11 20:36 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard
  Cc: Linus Torvalds, Gerald Schaefer, Alexander Gordeev,
	Peter Zijlstra, Dave Hansen, LKML, linux-mm, linux-arch,
	Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

Currently to make sure that every page table entry is read just once
gup_fast walks perform READ_ONCE and pass pXd value down to the next
gup_pXd_range function by value e.g.:

static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
...
        pudp = pud_offset(&p4d, addr);

This function passes a reference on that local value copy to pXd_offset,
and might get the very same pointer in return. This happens when the
level is folded (on most arches), and that pointer should not be iterated.

On s390 due to the fact that each task might have different 5,4 or
3-level address translation and hence different levels folded the logic
is more complex and non-iteratable pointer to a local copy leads to
severe problems.

Here is an example of what happens with gup_fast on s390, for a task
with 3-levels paging, crossing a 2 GB pud boundary:

// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
{
        unsigned long next;
        pud_t *pudp;

        // pud_offset returns &p4d itself (a pointer to a value on stack)
        pudp = pud_offset(&p4d, addr);
        do {
                // on the second iteration, reading a "random" stack value
                pud_t pud = READ_ONCE(*pudp);

                // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                next = pud_addr_end(addr, end);
                ...
        } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

        return 1;
}

This happens since s390 moved to common gup code with
commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
and commit 1a42010cdc26 ("s390/mm: convert to the generic
get_user_pages_fast code"). s390 tried to mimic static level folding by
changing pXd_offset primitives to always calculate top level page table
offset in pgd_offset and just return the value passed when pXd_offset
has to act as folded.

What is crucial for gup_fast and what has been overlooked is
that PxD_SIZE/MASK and thus pXd_addr_end should also change
correspondingly. And the latter is not possible with dynamic folding.
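
To see why (an illustration using the values from the example above, not
part of the patch): the generic pud_addr_end() works with the compile-time
PUD_SIZE/PUD_MASK, roughly

#define pud_addr_end(addr, end)                                         \
({      unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;      \
        (__boundary - 1 < (end) - 1) ? __boundary : (end);              \
})

// addr = 0x1007ffff000, end = 0x10080001000, PUD_SIZE = 2 GB
// __boundary = 0x10080000000 -> next = __boundary != end, so the loop
// runs a second time, although a folded level should be traversed in a
// single iteration, i.e. next should have been end.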

To fix the issue, pass the original pXdp pointers down to the
gup_pXd_range functions in addition to the pXd values, and introduce
pXd_offset_lockless helpers, which take an additional pXd entry value
parameter. This has already been discussed in
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
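
For architectures that do not define these helpers the generic fallback (see
the include/linux/pgtable.h hunk below) simply takes the address of the local
copy again, so their behaviour does not change; only s390 overrides the
helpers to choose between the real next-level table and the passed-in
pointer. A sketch of the fallback:

#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)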

Cc: <stable@vger.kernel.org> # 5.2+
Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
---
v2: added brackets &pgd -> &(pgd)

 arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
 include/linux/pgtable.h         | 10 ++++++++
 mm/gup.c                        | 18 +++++++-------
 3 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7eb01a5459cd..b55561cc8786 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
 
 #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
 
-static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
+static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address)
 {
-	if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
-		return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
-	return (p4d_t *) pgd;
+	if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
+		return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
+	return (p4d_t *) pgdp;
 }
+#define p4d_offset_lockless p4d_offset_lockless
 
-static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
+static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
 {
-	if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
-		return (pud_t *) p4d_deref(*p4d) + pud_index(address);
-	return (pud_t *) p4d;
+	return p4d_offset_lockless(pgdp, *pgdp, address);
+}
+
+static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address)
+{
+	if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
+		return (pud_t *) p4d_deref(p4d) + pud_index(address);
+	return (pud_t *) p4dp;
+}
+#define pud_offset_lockless pud_offset_lockless
+
+static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address)
+{
+	return pud_offset_lockless(p4dp, *p4dp, address);
 }
 #define pud_offset pud_offset
 
-static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
+static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address)
+{
+	if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
+		return (pmd_t *) pud_deref(pud) + pmd_index(address);
+	return (pmd_t *) pudp;
+}
+#define pmd_offset_lockless pmd_offset_lockless
+
+static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address)
 {
-	if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
-		return (pmd_t *) pud_deref(*pud) + pmd_index(address);
-	return (pmd_t *) pud;
+	return pmd_offset_lockless(pudp, *pudp, address);
 }
 #define pmd_offset pmd_offset
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8cbc2e795d5..90654cb63e9e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
 #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
 #endif
 
+#ifndef p4d_offset_lockless
+#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
+#endif
+#ifndef pud_offset_lockless
+#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
+#endif
+#ifndef pmd_offset_lockless
+#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address)
+#endif
+
 /*
  * p?d_leaf() - true if this entry is a final mapping to a physical address.
  * This differs from p?d_huge() by the fact that they are always available (if
diff --git a/mm/gup.c b/mm/gup.c
index e5739a1974d5..578bf5bd8bf8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	return 1;
 }
 
-static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end,
 		unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	pmd_t *pmdp;
 
-	pmdp = pmd_offset(&pud, addr);
+	pmdp = pmd_offset_lockless(pudp, pud, addr);
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
@@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	return 1;
 }
 
-static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
+static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	pud_t *pudp;
 
-	pudp = pud_offset(&p4d, addr);
+	pudp = pud_offset_lockless(p4dp, p4d, addr);
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
@@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
 					 PUD_SHIFT, next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
+		} else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr))
 			return 0;
 	} while (pudp++, addr = next, addr != end);
 
 	return 1;
 }
 
-static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
+static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	p4d_t *p4dp;
 
-	p4dp = p4d_offset(&pgd, addr);
+	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
@@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
 					 P4D_SHIFT, next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
+		} else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
 			return 0;
 	} while (p4dp++, addr = next, addr != end);
 
@@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
 					 PGDIR_SHIFT, next, flags, pages, nr))
 				return;
-		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
+		} else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr))
 			return;
 	} while (pgdp++, addr = next, addr != end);
 }
-- 
⣿⣿⣿⣿⢋⡀⣀⠹⣿⣿⣿⣿
⣿⣿⣿⣿⠠⣶⡦⠀⣿⣿⣿⣿
⣿⣿⣿⠏⣴⣮⣴⣧⠈⢿⣿⣿
⣿⣿⡏⢰⣿⠖⣠⣿⡆⠈⣿⣿
⣿⢛⣵⣄⠙⣶⣶⡟⣅⣠⠹⣿
⣿⣜⣛⠻⢎⣉⣉⣀⠿⣫⣵⣿

^ permalink raw reply related	[flat|nested] 313+ messages in thread

* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-11 20:36                                     ` Vasily Gorbik
                                                         ` (2 preceding siblings ...)
  (?)
@ 2020-09-15 17:09                                       ` Vasily Gorbik
  -1 siblings, 0 replies; 313+ messages in thread
From: Vasily Gorbik @ 2020-09-15 17:09 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, John Hubbard
  Cc: Gerald Schaefer, Alexander Gordeev, Linus Torvalds,
	Peter Zijlstra, Dave Hansen, LKML, linux-mm, linux-arch,
	Russell King, Mike Rapoport, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc,
	linux-um, linux-s390, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda

On Fri, Sep 11, 2020 at 10:36:43PM +0200, Vasily Gorbik wrote:
> Currently, to make sure that every page table entry is read just once,
> gup_fast walks perform a READ_ONCE and pass the pXd value down to the
> next gup_pXd_range function by value, e.g.:
...snip...
> ---
> v2: added brackets &pgd -> &(pgd)
> 
>  arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
>  include/linux/pgtable.h         | 10 ++++++++
>  mm/gup.c                        | 18 +++++++-------
>  3 files changed, 49 insertions(+), 21 deletions(-)

Andrew, any chance you would pick this up?

There is an Ack from Linus. And I haven't seen any objections from Jason or John.
This seems to be as safe for other architectures as possible.

@Jason and John
Any acks/nacks?

Thank you,
Vasily

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-11 20:36                                     ` Vasily Gorbik
                                                         ` (2 preceding siblings ...)
  (?)
@ 2020-09-15 17:14                                       ` Jason Gunthorpe
  -1 siblings, 0 replies; 313+ messages in thread
From: Jason Gunthorpe @ 2020-09-15 17:14 UTC (permalink / raw)
  To: Vasily Gorbik
  Cc: John Hubbard, Linus Torvalds, Gerald Schaefer, Alexander Gordeev,
	Peter Zijlstra, Dave Hansen, LKML, linux-mm, linux-arch,
	Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Fri, Sep 11, 2020 at 10:36:43PM +0200, Vasily Gorbik wrote:
> Currently, to make sure that every page table entry is read just once,
> gup_fast walks perform a READ_ONCE and pass the pXd value down to the
> next gup_pXd_range function by value, e.g.:
> 
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                          unsigned int flags, struct page **pages, int *nr)
> ...
>         pudp = pud_offset(&p4d, addr);
> 
> This function passes a reference to that local value copy to pXd_offset,
> and might get the very same pointer back in return. This happens when the
> level is folded (on most arches), and such a pointer must not be iterated.
> 
> On s390 each task might use 5-, 4- or 3-level address translation, and
> hence have different levels folded, so the logic is more complex and a
> non-iterable pointer to a local copy leads to severe problems.
> 
> Here is an example of what happens with gup_fast on s390, for a task
> with 3-level paging, crossing a 2 GB pud boundary:
> 
> // addr = 0x1007ffff000, end = 0x10080001000
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                          unsigned int flags, struct page **pages, int *nr)
> {
>         unsigned long next;
>         pud_t *pudp;
> 
>         // pud_offset returns &p4d itself (a pointer to a value on stack)
>         pudp = pud_offset(&p4d, addr);
>         do {
>                 // on the second iteration, reading a "random" stack value
>                 pud_t pud = READ_ONCE(*pudp);
> 
>                 // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
>                 next = pud_addr_end(addr, end);
>                 ...
>         } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
> 
>         return 1;
> }
> 
> This happens since s390 moved to common gup code with
> commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
> and commit 1a42010cdc26 ("s390/mm: convert to the generic
> get_user_pages_fast code"). s390 tried to mimic static level folding by
> changing pXd_offset primitives to always calculate top level page table
> offset in pgd_offset and just return the value passed when pXd_offset
> has to act as folded.
> 
> What is crucial for gup_fast and what has been overlooked is
> that PxD_SIZE/MASK and thus pXd_addr_end should also change
> correspondingly. And the latter is not possible with dynamic folding.
> 
> To fix the issue, pass the original pXdp pointers down to the
> gup_pXd_range functions in addition to the pXd values, and introduce
> pXd_offset_lockless helpers, which take an additional pXd entry value
> parameter. This has already been discussed in
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> 
> Cc: <stable@vger.kernel.org> # 5.2+
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
> ---
> v2: added brackets &pgd -> &(pgd)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Regards,
Jason

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-15 17:14                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 313+ messages in thread
From: Jason Gunthorpe @ 2020-09-15 17:14 UTC (permalink / raw)
  To: Vasily Gorbik
  Cc: Peter Zijlstra, Benjamin Herrenschmidt, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Gerald Schaefer,
	Heiko Carstens, Arnd Bergmann, John Hubbard, Jeff Dike, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Dave Hansen, linux-power, LKML, Michael Ellerman, Andrew Morton,
	Linus Torvalds, Mike Rapoport

On Fri, Sep 11, 2020 at 10:36:43PM +0200, Vasily Gorbik wrote:
> Currently to make sure that every page table entry is read just once
> gup_fast walks perform READ_ONCE and pass pXd value down to the next
> gup_pXd_range function by value e.g.:
> 
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                          unsigned int flags, struct page **pages, int *nr)
> ...
>         pudp = pud_offset(&p4d, addr);
> 
> This function passes a reference on that local value copy to pXd_offset,
> and might get the very same pointer in return. This happens when the
> level is folded (on most arches), and that pointer should not be iterated.
> 
> On s390 due to the fact that each task might have different 5,4 or
> 3-level address translation and hence different levels folded the logic
> is more complex and non-iteratable pointer to a local copy leads to
> severe problems.
> 
> Here is an example of what happens with gup_fast on s390, for a task
> with 3-levels paging, crossing a 2 GB pud boundary:
> 
> // addr = 0x1007ffff000, end = 0x10080001000
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                          unsigned int flags, struct page **pages, int *nr)
> {
>         unsigned long next;
>         pud_t *pudp;
> 
>         // pud_offset returns &p4d itself (a pointer to a value on stack)
>         pudp = pud_offset(&p4d, addr);
>         do {
>                 // on second iteratation reading "random" stack value
>                 pud_t pud = READ_ONCE(*pudp);
> 
>                 // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
>                 next = pud_addr_end(addr, end);
>                 ...
>         } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
> 
>         return 1;
> }
> 
> This happens since s390 moved to common gup code with
> commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
> and commit 1a42010cdc26 ("s390/mm: convert to the generic
> get_user_pages_fast code"). s390 tried to mimic static level folding by
> changing pXd_offset primitives to always calculate top level page table
> offset in pgd_offset and just return the value passed when pXd_offset
> has to act as folded.
> 
> What is crucial for gup_fast and what has been overlooked is
> that PxD_SIZE/MASK and thus pXd_addr_end should also change
> correspondingly. And the latter is not possible with dynamic folding.
> 
> To fix the issue in addition to pXd values pass original
> pXdp pointers down to gup_pXd_range functions. And introduce
> pXd_offset_lockless helpers, which take an additional pXd
> entry value parameter. This has already been discussed in
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> 
> Cc: <stable@vger.kernel.org> # 5.2+
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
> ---
> v2: added brackets &pgd -> &(pgd)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Regards,
Jason

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-15 17:14                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 313+ messages in thread
From: Jason Gunthorpe @ 2020-09-15 17:14 UTC (permalink / raw)
  To: Vasily Gorbik
  Cc: Peter Zijlstra, Benjamin Herrenschmidt, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Gerald Schaefer,
	Heiko Carstens, Arnd Bergmann, John Hubbard, Jeff Dike, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Dave Hansen, linux-power, LKML, Michael Ellerman, Andrew Morton,
	Linus Torvalds, Mike Rapoport

On Fri, Sep 11, 2020 at 10:36:43PM +0200, Vasily Gorbik wrote:
> Currently to make sure that every page table entry is read just once
> gup_fast walks perform READ_ONCE and pass pXd value down to the next
> gup_pXd_range function by value e.g.:
> 
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                          unsigned int flags, struct page **pages, int *nr)
> ...
>         pudp = pud_offset(&p4d, addr);
> 
> This function passes a reference on that local value copy to pXd_offset,
> and might get the very same pointer in return. This happens when the
> level is folded (on most arches), and that pointer should not be iterated.
> 
> On s390 due to the fact that each task might have different 5,4 or
> 3-level address translation and hence different levels folded the logic
> is more complex and non-iteratable pointer to a local copy leads to
> severe problems.
> 
> Here is an example of what happens with gup_fast on s390, for a task
> with 3-levels paging, crossing a 2 GB pud boundary:
> 
> // addr = 0x1007ffff000, end = 0x10080001000
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                          unsigned int flags, struct page **pages, int *nr)
> {
>         unsigned long next;
>         pud_t *pudp;
> 
>         // pud_offset returns &p4d itself (a pointer to a value on stack)
>         pudp = pud_offset(&p4d, addr);
>         do {
>                 // on second iteratation reading "random" stack value
>                 pud_t pud = READ_ONCE(*pudp);
> 
>                 // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
>                 next = pud_addr_end(addr, end);
>                 ...
>         } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
> 
>         return 1;
> }
> 
> This happens since s390 moved to common gup code with
> commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
> and commit 1a42010cdc26 ("s390/mm: convert to the generic
> get_user_pages_fast code"). s390 tried to mimic static level folding by
> changing pXd_offset primitives to always calculate top level page table
> offset in pgd_offset and just return the value passed when pXd_offset
> has to act as folded.
> 
> What is crucial for gup_fast and what has been overlooked is
> that PxD_SIZE/MASK and thus pXd_addr_end should also change
> correspondingly. And the latter is not possible with dynamic folding.
> 
> To fix the issue in addition to pXd values pass original
> pXdp pointers down to gup_pXd_range functions. And introduce
> pXd_offset_lockless helpers, which take an additional pXd
> entry value parameter. This has already been discussed in
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> 
> Cc: <stable@vger.kernel.org> # 5.2+
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
> ---
> v2: added brackets &pgd -> &(pgd)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Regards,
Jason

_______________________________________________
linux-um mailing list
linux-um@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-um


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-11 20:36                                     ` Vasily Gorbik
                                                         ` (2 preceding siblings ...)
  (?)
@ 2020-09-15 17:18                                       ` Mike Rapoport
  -1 siblings, 0 replies; 313+ messages in thread
From: Mike Rapoport @ 2020-09-15 17:18 UTC (permalink / raw)
  To: Vasily Gorbik
  Cc: Jason Gunthorpe, John Hubbard, Linus Torvalds, Gerald Schaefer,
	Alexander Gordeev, Peter Zijlstra, Dave Hansen, LKML, linux-mm,
	linux-arch, Andrew Morton, Russell King, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On Fri, Sep 11, 2020 at 10:36:43PM +0200, Vasily Gorbik wrote:
> Currently, to make sure that every page table entry is read just once,
> gup_fast walks perform READ_ONCE and pass the pXd value down to the next
> gup_pXd_range function by value, e.g.:
> 
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                          unsigned int flags, struct page **pages, int *nr)
> ...
>         pudp = pud_offset(&p4d, addr);
> 
> This function passes a reference to that local value copy to pXd_offset,
> and might get the very same pointer back in return. This happens when the
> level is folded (on most arches), and such a pointer must not be iterated.
> 
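The hazard is easy to see outside the kernel, too. A deliberately silly
user-space sketch of the same pattern (the names here are made up, it only
isolates the "folded level hands back the pointer it was given" behaviour):

#include <stdio.h>

/* Stand-in for a folded pXd_offset(): it simply returns the pointer it
 * was handed, which here is the address of the caller's local copy. */
static unsigned long *folded_offset(unsigned long *entryp)
{
	return entryp;
}

int main(void)
{
	unsigned long table[2] = { 11, 22 };      /* the "real" table */
	unsigned long entry = table[0];           /* READ_ONCE()-style local copy */
	unsigned long *p = folded_offset(&entry); /* points at the stack, not table */
	int i;

	/* Iterating p, as gup_pud_range() does with pudp++, walks off the
	 * single local copy instead of moving through table[]: p[1] is
	 * whatever happens to sit next to "entry" on the stack. */
	for (i = 0; i < 2; i++)
		printf("iteration %d reads %lu\n", i, p[i]);

	return 0;
}
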
> On s390, each task might use 5-, 4- or 3-level address translation, and
> hence have different levels folded, so the logic is more complex, and a
> non-iterable pointer to a local copy leads to severe problems.
> 
> Here is an example of what happens with gup_fast on s390, for a task
> with 3-level paging, crossing a 2 GB pud boundary:
> 
> // addr = 0x1007ffff000, end = 0x10080001000
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                          unsigned int flags, struct page **pages, int *nr)
> {
>         unsigned long next;
>         pud_t *pudp;
> 
>         // pud_offset returns &p4d itself (a pointer to a value on stack)
>         pudp = pud_offset(&p4d, addr);
>         do {
>                 // on second iteration reading "random" stack value
>                 pud_t pud = READ_ONCE(*pudp);
> 
>                 // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
>                 next = pud_addr_end(addr, end);
>                 ...
>         } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
> 
>         return 1;
> }
> 
> This happens since s390 moved to common gup code with
> commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
> and commit 1a42010cdc26 ("s390/mm: convert to the generic
> get_user_pages_fast code"). s390 tried to mimic static level folding by
> changing the pXd_offset primitives to always calculate the top-level page
> table offset in pgd_offset and to simply return the value passed in when
> pXd_offset has to act as folded.
> 
> What is crucial for gup_fast, and what has been overlooked, is that
> PxD_SIZE/MASK and thus pXd_addr_end would also have to change
> correspondingly. And the latter is not possible with dynamic folding.
> 
> To fix the issue, in addition to the pXd values, pass the original
> pXdp pointers down to the gup_pXd_range functions, and introduce
> pXd_offset_lockless helpers, which take an additional pXd
> entry value parameter. This has already been discussed in
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> 
> Cc: <stable@vger.kernel.org> # 5.2+
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>

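The part that matters for everybody else is nicely small: with the generic
fallback macros added below, an architecture that does not provide its own
helpers gets exactly the pre-patch code back. A sketch of the expansion at
the pud level:

	/* mm/gup.c after this patch: */
	pudp = pud_offset_lockless(p4dp, p4d, addr);

	/*
	 * With the fallback
	 *   #define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
	 * this is just what gup_pud_range() did before:
	 */
	pudp = pud_offset(&p4d, addr);
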
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
> v2: added brackets &pgd -> &(pgd)
> 
>  arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
>  include/linux/pgtable.h         | 10 ++++++++
>  mm/gup.c                        | 18 +++++++-------
>  3 files changed, 49 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 7eb01a5459cd..b55561cc8786 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
>  
>  #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
>  
> -static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
> +static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address)
>  {
> -	if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> -		return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
> -	return (p4d_t *) pgd;
> +	if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> +		return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
> +	return (p4d_t *) pgdp;
>  }
> +#define p4d_offset_lockless p4d_offset_lockless
>  
> -static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> +static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
>  {
> -	if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> -		return (pud_t *) p4d_deref(*p4d) + pud_index(address);
> -	return (pud_t *) p4d;
> +	return p4d_offset_lockless(pgdp, *pgdp, address);
> +}
> +
> +static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address)
> +{
> +	if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> +		return (pud_t *) p4d_deref(p4d) + pud_index(address);
> +	return (pud_t *) p4dp;
> +}
> +#define pud_offset_lockless pud_offset_lockless
> +
> +static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address)
> +{
> +	return pud_offset_lockless(p4dp, *p4dp, address);
>  }
>  #define pud_offset pud_offset
>  
> -static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
> +static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address)
> +{
> +	if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> +		return (pmd_t *) pud_deref(pud) + pmd_index(address);
> +	return (pmd_t *) pudp;
> +}
> +#define pmd_offset_lockless pmd_offset_lockless
> +
> +static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address)
>  {
> -	if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> -		return (pmd_t *) pud_deref(*pud) + pmd_index(address);
> -	return (pmd_t *) pud;
> +	return pmd_offset_lockless(pudp, *pudp, address);
>  }
>  #define pmd_offset pmd_offset
>  
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..90654cb63e9e 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
>  #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
>  #endif
>  
> +#ifndef p4d_offset_lockless
> +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
> +#endif
> +#ifndef pud_offset_lockless
> +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
> +#endif
> +#ifndef pmd_offset_lockless
> +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address)
> +#endif
> +
>  /*
>   * p?d_leaf() - true if this entry is a final mapping to a physical address.
>   * This differs from p?d_huge() by the fact that they are always available (if
> diff --git a/mm/gup.c b/mm/gup.c
> index e5739a1974d5..578bf5bd8bf8 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>  	return 1;
>  }
>  
> -static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> +static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end,
>  		unsigned int flags, struct page **pages, int *nr)
>  {
>  	unsigned long next;
>  	pmd_t *pmdp;
>  
> -	pmdp = pmd_offset(&pud, addr);
> +	pmdp = pmd_offset_lockless(pudp, pud, addr);
>  	do {
>  		pmd_t pmd = READ_ONCE(*pmdp);
>  
> @@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  	return 1;
>  }
>  
> -static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
> +static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end,
>  			 unsigned int flags, struct page **pages, int *nr)
>  {
>  	unsigned long next;
>  	pud_t *pudp;
>  
> -	pudp = pud_offset(&p4d, addr);
> +	pudp = pud_offset_lockless(p4dp, p4d, addr);
>  	do {
>  		pud_t pud = READ_ONCE(*pudp);
>  
> @@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>  			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
>  					 PUD_SHIFT, next, flags, pages, nr))
>  				return 0;
> -		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
> +		} else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr))
>  			return 0;
>  	} while (pudp++, addr = next, addr != end);
>  
>  	return 1;
>  }
>  
> -static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
> +static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end,
>  			 unsigned int flags, struct page **pages, int *nr)
>  {
>  	unsigned long next;
>  	p4d_t *p4dp;
>  
> -	p4dp = p4d_offset(&pgd, addr);
> +	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
>  	do {
>  		p4d_t p4d = READ_ONCE(*p4dp);
>  
> @@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>  			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
>  					 P4D_SHIFT, next, flags, pages, nr))
>  				return 0;
> -		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
> +		} else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
>  			return 0;
>  	} while (p4dp++, addr = next, addr != end);
>  
> @@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>  			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
>  					 PGDIR_SHIFT, next, flags, pages, nr))
>  				return;
> -		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
> +		} else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr))
>  			return;
>  	} while (pgdp++, addr = next, addr != end);
>  }
> -- 
> ⣿⣿⣿⣿⢋⡀⣀⠹⣿⣿⣿⣿
> ⣿⣿⣿⣿⠠⣶⡦⠀⣿⣿⣿⣿
> ⣿⣿⣿⠏⣴⣮⣴⣧⠈⢿⣿⣿
> ⣿⣿⡏⢰⣿⠖⣠⣿⡆⠈⣿⣿
> ⣿⢛⣵⣄⠙⣶⣶⡟⣅⣠⠹⣿
> ⣿⣜⣛⠻⢎⣉⣉⣀⠿⣫⣵⣿

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-15 17:18                                       ` Mike Rapoport
  0 siblings, 0 replies; 313+ messages in thread
From: Mike Rapoport @ 2020-09-15 17:18 UTC (permalink / raw)
  To: Vasily Gorbik
  Cc: Peter Zijlstra, Benjamin Herrenschmidt, Dave Hansen, Dave Hansen,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Christian Borntraeger,
	Richard Weinberger, linux-x86, Russell King, Jason Gunthorpe,
	Ingo Molnar, Catalin Marinas, Andrey Ryabinin, Gerald Schaefer,
	Heiko Carstens, Arnd Bergmann, John Hubbard, Jeff Dike, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	linux-mm, linux-power, LKML, Michael Ellerman, Andrew Morton,
	Linus Torvalds

On Fri, Sep 11, 2020 at 10:36:43PM +0200, Vasily Gorbik wrote:
> Currently, to make sure that every page table entry is read just once,
> gup_fast walks perform READ_ONCE and pass the pXd value down to the next
> gup_pXd_range function by value, e.g.:
> 
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                          unsigned int flags, struct page **pages, int *nr)
> ...
>         pudp = pud_offset(&p4d, addr);
> 
> This function passes a reference to that local value copy to pXd_offset,
> and might get the very same pointer back in return. This happens when the
> level is folded (on most arches), and such a pointer must not be iterated.
> 
> On s390, each task might use 5-, 4- or 3-level address translation, and
> hence have different levels folded, so the logic is more complex, and a
> non-iterable pointer to a local copy leads to severe problems.
> 
> Here is an example of what happens with gup_fast on s390, for a task
> with 3-level paging, crossing a 2 GB pud boundary:
> 
> // addr = 0x1007ffff000, end = 0x10080001000
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                          unsigned int flags, struct page **pages, int *nr)
> {
>         unsigned long next;
>         pud_t *pudp;
> 
>         // pud_offset returns &p4d itself (a pointer to a value on stack)
>         pudp = pud_offset(&p4d, addr);
>         do {
>                 // on second iteration reading "random" stack value
>                 pud_t pud = READ_ONCE(*pudp);
> 
>                 // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
>                 next = pud_addr_end(addr, end);
>                 ...
>         } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
> 
>         return 1;
> }
> 
> This happens since s390 moved to common gup code with
> commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
> and commit 1a42010cdc26 ("s390/mm: convert to the generic
> get_user_pages_fast code"). s390 tried to mimic static level folding by
> changing the pXd_offset primitives to always calculate the top-level page
> table offset in pgd_offset and to simply return the value passed in when
> pXd_offset has to act as folded.
> 
> What is crucial for gup_fast, and what has been overlooked, is that
> PxD_SIZE/MASK and thus pXd_addr_end would also have to change
> correspondingly. And the latter is not possible with dynamic folding.
> 
> To fix the issue, in addition to the pXd values, pass the original
> pXdp pointers down to the gup_pXd_range functions, and introduce
> pXd_offset_lockless helpers, which take an additional pXd
> entry value parameter. This has already been discussed in
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> 
> Cc: <stable@vger.kernel.org> # 5.2+
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
> v2: added brackets &pgd -> &(pgd)
> 
>  arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
>  include/linux/pgtable.h         | 10 ++++++++
>  mm/gup.c                        | 18 +++++++-------
>  3 files changed, 49 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 7eb01a5459cd..b55561cc8786 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
>  
>  #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
>  
> -static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
> +static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address)
>  {
> -	if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> -		return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
> -	return (p4d_t *) pgd;
> +	if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> +		return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
> +	return (p4d_t *) pgdp;
>  }
> +#define p4d_offset_lockless p4d_offset_lockless
>  
> -static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> +static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
>  {
> -	if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> -		return (pud_t *) p4d_deref(*p4d) + pud_index(address);
> -	return (pud_t *) p4d;
> +	return p4d_offset_lockless(pgdp, *pgdp, address);
> +}
> +
> +static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address)
> +{
> +	if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> +		return (pud_t *) p4d_deref(p4d) + pud_index(address);
> +	return (pud_t *) p4dp;
> +}
> +#define pud_offset_lockless pud_offset_lockless
> +
> +static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address)
> +{
> +	return pud_offset_lockless(p4dp, *p4dp, address);
>  }
>  #define pud_offset pud_offset
>  
> -static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
> +static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address)
> +{
> +	if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> +		return (pmd_t *) pud_deref(pud) + pmd_index(address);
> +	return (pmd_t *) pudp;
> +}
> +#define pmd_offset_lockless pmd_offset_lockless
> +
> +static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address)
>  {
> -	if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> -		return (pmd_t *) pud_deref(*pud) + pmd_index(address);
> -	return (pmd_t *) pud;
> +	return pmd_offset_lockless(pudp, *pudp, address);
>  }
>  #define pmd_offset pmd_offset
>  
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..90654cb63e9e 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
>  #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
>  #endif
>  
> +#ifndef p4d_offset_lockless
> +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
> +#endif
> +#ifndef pud_offset_lockless
> +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
> +#endif
> +#ifndef pmd_offset_lockless
> +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address)
> +#endif
> +
>  /*
>   * p?d_leaf() - true if this entry is a final mapping to a physical address.
>   * This differs from p?d_huge() by the fact that they are always available (if
> diff --git a/mm/gup.c b/mm/gup.c
> index e5739a1974d5..578bf5bd8bf8 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>  	return 1;
>  }
>  
> -static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> +static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end,
>  		unsigned int flags, struct page **pages, int *nr)
>  {
>  	unsigned long next;
>  	pmd_t *pmdp;
>  
> -	pmdp = pmd_offset(&pud, addr);
> +	pmdp = pmd_offset_lockless(pudp, pud, addr);
>  	do {
>  		pmd_t pmd = READ_ONCE(*pmdp);
>  
> @@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  	return 1;
>  }
>  
> -static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
> +static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end,
>  			 unsigned int flags, struct page **pages, int *nr)
>  {
>  	unsigned long next;
>  	pud_t *pudp;
>  
> -	pudp = pud_offset(&p4d, addr);
> +	pudp = pud_offset_lockless(p4dp, p4d, addr);
>  	do {
>  		pud_t pud = READ_ONCE(*pudp);
>  
> @@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>  			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
>  					 PUD_SHIFT, next, flags, pages, nr))
>  				return 0;
> -		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
> +		} else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr))
>  			return 0;
>  	} while (pudp++, addr = next, addr != end);
>  
>  	return 1;
>  }
>  
> -static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
> +static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end,
>  			 unsigned int flags, struct page **pages, int *nr)
>  {
>  	unsigned long next;
>  	p4d_t *p4dp;
>  
> -	p4dp = p4d_offset(&pgd, addr);
> +	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
>  	do {
>  		p4d_t p4d = READ_ONCE(*p4dp);
>  
> @@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>  			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
>  					 P4D_SHIFT, next, flags, pages, nr))
>  				return 0;
> -		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
> +		} else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
>  			return 0;
>  	} while (p4dp++, addr = next, addr != end);
>  
> @@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>  			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
>  					 PGDIR_SHIFT, next, flags, pages, nr))
>  				return;
> -		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
> +		} else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr))
>  			return;
>  	} while (pgdp++, addr = next, addr != end);
>  }
> -- 
> ⣿⣿⣿⣿⢋⡀⣀⠹⣿⣿⣿⣿
> ⣿⣿⣿⣿⠠⣶⡦⠀⣿⣿⣿⣿
> ⣿⣿⣿⠏⣴⣮⣴⣧⠈⢿⣿⣿
> ⣿⣿⡏⢰⣿⠖⣠⣿⡆⠈⣿⣿
> ⣿⢛⣵⣄⠙⣶⣶⡟⣅⣠⠹⣿
> ⣿⣜⣛⠻⢎⣉⣉⣀⠿⣫⣵⣿

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-11 20:36                                     ` Vasily Gorbik
                                                         ` (2 preceding siblings ...)
  (?)
@ 2020-09-15 17:31                                       ` John Hubbard
  -1 siblings, 0 replies; 313+ messages in thread
From: John Hubbard @ 2020-09-15 17:31 UTC (permalink / raw)
  To: Vasily Gorbik, Jason Gunthorpe
  Cc: Linus Torvalds, Gerald Schaefer, Alexander Gordeev,
	Peter Zijlstra, Dave Hansen, LKML, linux-mm, linux-arch,
	Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On 9/11/20 1:36 PM, Vasily Gorbik wrote:
> Currently, to make sure that every page table entry is read just once,
> the gup_fast walk performs READ_ONCE and passes the pXd value down to
> the next gup_pXd_range function by value, e.g.:
> 
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                           unsigned int flags, struct page **pages, int *nr)
> ...
>          pudp = pud_offset(&p4d, addr);
> 
> This function passes a pointer to that local copy to pXd_offset(), and
> might get that very same pointer back in return. This happens when the
> level is folded (on most arches), and such a pointer must not be iterated.
> 
> On s390, each task may use 5-, 4- or 3-level address translation and
> hence have different levels folded, so the logic is more complex, and a
> non-iterable pointer to a local copy leads to severe problems.
> 
> Here is an example of what happens with gup_fast on s390, for a task
> with 3-level paging, crossing a 2 GB pud boundary:
> 
> // addr = 0x1007ffff000, end = 0x10080001000
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                           unsigned int flags, struct page **pages, int *nr)
> {
>          unsigned long next;
>          pud_t *pudp;
> 
>          // pud_offset returns &p4d itself (a pointer to a value on stack)
>          pudp = pud_offset(&p4d, addr);
>          do {
>                  // on the second iteration, reading a "random" stack value
>                  pud_t pud = READ_ONCE(*pudp);
> 
>                  // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
>                  next = pud_addr_end(addr, end);
>                  ...
>          } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
> 
>          return 1;
> }
> 
> This happens since s390 moved to common gup code with
> commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
> and commit 1a42010cdc26 ("s390/mm: convert to the generic
> get_user_pages_fast code"). s390 tried to mimic static level folding by
> changing pXd_offset primitives to always calculate top level page table
> offset in pgd_offset and just return the value passed when pXd_offset
> has to act as folded.
> 
> What is crucial for gup_fast and what has been overlooked is
> that PxD_SIZE/MASK and thus pXd_addr_end should also change
> correspondingly. And the latter is not possible with dynamic folding.
> 
> To fix the issue in addition to pXd values pass original
> pXdp pointers down to gup_pXd_range functions. And introduce
> pXd_offset_lockless helpers, which take an additional pXd
> entry value parameter. This has already been discussed in
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> 
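> Architectures that do not provide their own lockless helpers keep
> today's behaviour via the generic fallbacks added below; a sketch of
> how the call in gup_pud_range() then expands (names as in the
> include/linux/pgtable.h hunk, shown only for illustration):
> 
> //   pudp = pud_offset_lockless(p4dp, p4d, addr);
> //   /* with the fallback macro this becomes */
> //   pudp = pud_offset(&(p4d), addr);   /* pointer to the local copy */
> //
> // Only an override such as the s390 one below returns the real table
> // pointer (p4dp) when the level is folded, so pudp++ iterates over the
> // real page table instead of the stack.
> 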
> Cc: <stable@vger.kernel.org> # 5.2+
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
> ---

Looks cleaner than I'd dared hope for. :)

Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA

> v2: added brackets &pgd -> &(pgd)
> 
>   arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
>   include/linux/pgtable.h         | 10 ++++++++
>   mm/gup.c                        | 18 +++++++-------
>   3 files changed, 49 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 7eb01a5459cd..b55561cc8786 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
>   
>   #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
>   
> -static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
> +static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address)
>   {
> -	if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> -		return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
> -	return (p4d_t *) pgd;
> +	if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> +		return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
> +	return (p4d_t *) pgdp;
>   }
> +#define p4d_offset_lockless p4d_offset_lockless
>   
> -static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> +static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
>   {
> -	if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> -		return (pud_t *) p4d_deref(*p4d) + pud_index(address);
> -	return (pud_t *) p4d;
> +	return p4d_offset_lockless(pgdp, *pgdp, address);
> +}
> +
> +static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address)
> +{
> +	if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> +		return (pud_t *) p4d_deref(p4d) + pud_index(address);
> +	return (pud_t *) p4dp;
> +}
> +#define pud_offset_lockless pud_offset_lockless
> +
> +static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address)
> +{
> +	return pud_offset_lockless(p4dp, *p4dp, address);
>   }
>   #define pud_offset pud_offset
>   
> -static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
> +static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address)
> +{
> +	if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> +		return (pmd_t *) pud_deref(pud) + pmd_index(address);
> +	return (pmd_t *) pudp;
> +}
> +#define pmd_offset_lockless pmd_offset_lockless
> +
> +static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address)
>   {
> -	if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> -		return (pmd_t *) pud_deref(*pud) + pmd_index(address);
> -	return (pmd_t *) pud;
> +	return pmd_offset_lockless(pudp, *pudp, address);
>   }
>   #define pmd_offset pmd_offset
>   
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..90654cb63e9e 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
>   #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
>   #endif
>   
> +#ifndef p4d_offset_lockless
> +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
> +#endif
> +#ifndef pud_offset_lockless
> +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
> +#endif
> +#ifndef pmd_offset_lockless
> +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address)
> +#endif
> +
>   /*
>    * p?d_leaf() - true if this entry is a final mapping to a physical address.
>    * This differs from p?d_huge() by the fact that they are always available (if
> diff --git a/mm/gup.c b/mm/gup.c
> index e5739a1974d5..578bf5bd8bf8 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>   	return 1;
>   }
>   
> -static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> +static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end,
>   		unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	pmd_t *pmdp;
>   
> -	pmdp = pmd_offset(&pud, addr);
> +	pmdp = pmd_offset_lockless(pudp, pud, addr);
>   	do {
>   		pmd_t pmd = READ_ONCE(*pmdp);
>   
> @@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>   	return 1;
>   }
>   
> -static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
> +static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end,
>   			 unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	pud_t *pudp;
>   
> -	pudp = pud_offset(&p4d, addr);
> +	pudp = pud_offset_lockless(p4dp, p4d, addr);
>   	do {
>   		pud_t pud = READ_ONCE(*pudp);
>   
> @@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
>   					 PUD_SHIFT, next, flags, pages, nr))
>   				return 0;
> -		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
> +		} else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr))
>   			return 0;
>   	} while (pudp++, addr = next, addr != end);
>   
>   	return 1;
>   }
>   
> -static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
> +static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end,
>   			 unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	p4d_t *p4dp;
>   
> -	p4dp = p4d_offset(&pgd, addr);
> +	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
>   	do {
>   		p4d_t p4d = READ_ONCE(*p4dp);
>   
> @@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
>   					 P4D_SHIFT, next, flags, pages, nr))
>   				return 0;
> -		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
> +		} else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
>   			return 0;
>   	} while (p4dp++, addr = next, addr != end);
>   
> @@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
>   					 PGDIR_SHIFT, next, flags, pages, nr))
>   				return;
> -		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
> +		} else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr))
>   			return;
>   	} while (pgdp++, addr = next, addr != end);
>   }
> 


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-15 17:31                                       ` John Hubbard
  0 siblings, 0 replies; 313+ messages in thread
From: John Hubbard @ 2020-09-15 17:31 UTC (permalink / raw)
  To: Vasily Gorbik, Jason Gunthorpe
  Cc: Linus Torvalds, Gerald Schaefer, Alexander Gordeev,
	Peter Zijlstra, Dave Hansen, LKML, linux-mm, linux-arch,
	Andrew Morton, Russell King, Mike Rapoport, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Andrey Ryabinin, linux-x86, linux-arm,
	linux-power, linux-sparc, linux-um, linux-s390, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda

On 9/11/20 1:36 PM, Vasily Gorbik wrote:
> Currently, to make sure that every page table entry is read just once,
> the gup_fast walk performs READ_ONCE and passes the pXd value down to
> the next gup_pXd_range function by value, e.g.:
> 
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                           unsigned int flags, struct page **pages, int *nr)
> ...
>          pudp = pud_offset(&p4d, addr);
> 
> This function passes a pointer to that local copy to pXd_offset(), and
> might get that very same pointer back in return. This happens when the
> level is folded (on most arches), and such a pointer must not be iterated.
> 
> On s390, each task may use 5-, 4- or 3-level address translation and
> hence have different levels folded, so the logic is more complex, and a
> non-iterable pointer to a local copy leads to severe problems.
> 
> Here is an example of what happens with gup_fast on s390, for a task
> with 3-level paging, crossing a 2 GB pud boundary:
> 
> // addr = 0x1007ffff000, end = 0x10080001000
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                           unsigned int flags, struct page **pages, int *nr)
> {
>          unsigned long next;
>          pud_t *pudp;
> 
>          // pud_offset returns &p4d itself (a pointer to a value on stack)
>          pudp = pud_offset(&p4d, addr);
>          do {
>                  // on the second iteration, reading a "random" stack value
>                  pud_t pud = READ_ONCE(*pudp);
> 
>                  // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
>                  next = pud_addr_end(addr, end);
>                  ...
>          } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
> 
>          return 1;
> }
> 
> This happens since s390 moved to common gup code with
> commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
> and commit 1a42010cdc26 ("s390/mm: convert to the generic
> get_user_pages_fast code"). s390 tried to mimic static level folding by
> changing pXd_offset primitives to always calculate top level page table
> offset in pgd_offset and just return the value passed when pXd_offset
> has to act as folded.
> 
> What is crucial for gup_fast and what has been overlooked is
> that PxD_SIZE/MASK and thus pXd_addr_end should also change
> correspondingly. And the latter is not possible with dynamic folding.
> 
> To fix the issue in addition to pXd values pass original
> pXdp pointers down to gup_pXd_range functions. And introduce
> pXd_offset_lockless helpers, which take an additional pXd
> entry value parameter. This has already been discussed in
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> 
> Cc: <stable@vger.kernel.org> # 5.2+
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
> ---

Looks cleaner than I'd dared hope for. :)

Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA

> v2: added brackets &pgd -> &(pgd)
> 
>   arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
>   include/linux/pgtable.h         | 10 ++++++++
>   mm/gup.c                        | 18 +++++++-------
>   3 files changed, 49 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 7eb01a5459cd..b55561cc8786 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
>   
>   #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
>   
> -static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
> +static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address)
>   {
> -	if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> -		return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
> -	return (p4d_t *) pgd;
> +	if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> +		return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
> +	return (p4d_t *) pgdp;
>   }
> +#define p4d_offset_lockless p4d_offset_lockless
>   
> -static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> +static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
>   {
> -	if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> -		return (pud_t *) p4d_deref(*p4d) + pud_index(address);
> -	return (pud_t *) p4d;
> +	return p4d_offset_lockless(pgdp, *pgdp, address);
> +}
> +
> +static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address)
> +{
> +	if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> +		return (pud_t *) p4d_deref(p4d) + pud_index(address);
> +	return (pud_t *) p4dp;
> +}
> +#define pud_offset_lockless pud_offset_lockless
> +
> +static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address)
> +{
> +	return pud_offset_lockless(p4dp, *p4dp, address);
>   }
>   #define pud_offset pud_offset
>   
> -static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
> +static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address)
> +{
> +	if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> +		return (pmd_t *) pud_deref(pud) + pmd_index(address);
> +	return (pmd_t *) pudp;
> +}
> +#define pmd_offset_lockless pmd_offset_lockless
> +
> +static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address)
>   {
> -	if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> -		return (pmd_t *) pud_deref(*pud) + pmd_index(address);
> -	return (pmd_t *) pud;
> +	return pmd_offset_lockless(pudp, *pudp, address);
>   }
>   #define pmd_offset pmd_offset
>   
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..90654cb63e9e 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
>   #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
>   #endif
>   
> +#ifndef p4d_offset_lockless
> +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
> +#endif
> +#ifndef pud_offset_lockless
> +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
> +#endif
> +#ifndef pmd_offset_lockless
> +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address)
> +#endif
> +
>   /*
>    * p?d_leaf() - true if this entry is a final mapping to a physical address.
>    * This differs from p?d_huge() by the fact that they are always available (if
> diff --git a/mm/gup.c b/mm/gup.c
> index e5739a1974d5..578bf5bd8bf8 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>   	return 1;
>   }
>   
> -static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> +static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end,
>   		unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	pmd_t *pmdp;
>   
> -	pmdp = pmd_offset(&pud, addr);
> +	pmdp = pmd_offset_lockless(pudp, pud, addr);
>   	do {
>   		pmd_t pmd = READ_ONCE(*pmdp);
>   
> @@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>   	return 1;
>   }
>   
> -static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
> +static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end,
>   			 unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	pud_t *pudp;
>   
> -	pudp = pud_offset(&p4d, addr);
> +	pudp = pud_offset_lockless(p4dp, p4d, addr);
>   	do {
>   		pud_t pud = READ_ONCE(*pudp);
>   
> @@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
>   					 PUD_SHIFT, next, flags, pages, nr))
>   				return 0;
> -		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
> +		} else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr))
>   			return 0;
>   	} while (pudp++, addr = next, addr != end);
>   
>   	return 1;
>   }
>   
> -static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
> +static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end,
>   			 unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	p4d_t *p4dp;
>   
> -	p4dp = p4d_offset(&pgd, addr);
> +	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
>   	do {
>   		p4d_t p4d = READ_ONCE(*p4dp);
>   
> @@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
>   					 P4D_SHIFT, next, flags, pages, nr))
>   				return 0;
> -		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
> +		} else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
>   			return 0;
>   	} while (p4dp++, addr = next, addr != end);
>   
> @@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
>   					 PGDIR_SHIFT, next, flags, pages, nr))
>   				return;
> -		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
> +		} else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr))
>   			return;
>   	} while (pgdp++, addr = next, addr != end);
>   }
> 

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-15 17:31                                       ` John Hubbard
  0 siblings, 0 replies; 313+ messages in thread
From: John Hubbard @ 2020-09-15 17:31 UTC (permalink / raw)
  To: Vasily Gorbik, Jason Gunthorpe
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Richard Weinberger, linux-x86,
	Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Gerald Schaefer,
	Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Dave Hansen, linux-power, LKML, Andrew Morton, Linus Torvalds,
	Mike Rapoport

On 9/11/20 1:36 PM, Vasily Gorbik wrote:
> Currently, to make sure that every page table entry is read just once,
> the gup_fast walk performs READ_ONCE and passes the pXd value down to
> the next gup_pXd_range function by value, e.g.:
> 
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                           unsigned int flags, struct page **pages, int *nr)
> ...
>          pudp = pud_offset(&p4d, addr);
> 
> This function passes a pointer to that local copy to pXd_offset(), and
> might get that very same pointer back in return. This happens when the
> level is folded (on most arches), and such a pointer must not be iterated.
> 
> On s390, each task may use 5-, 4- or 3-level address translation and
> hence have different levels folded, so the logic is more complex, and a
> non-iterable pointer to a local copy leads to severe problems.
> 
> Here is an example of what happens with gup_fast on s390, for a task
> with 3-level paging, crossing a 2 GB pud boundary:
> 
> // addr = 0x1007ffff000, end = 0x10080001000
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                           unsigned int flags, struct page **pages, int *nr)
> {
>          unsigned long next;
>          pud_t *pudp;
> 
>          // pud_offset returns &p4d itself (a pointer to a value on stack)
>          pudp = pud_offset(&p4d, addr);
>          do {
>                  // on the second iteration, reading a "random" stack value
>                  pud_t pud = READ_ONCE(*pudp);
> 
>                  // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
>                  next = pud_addr_end(addr, end);
>                  ...
>          } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
> 
>          return 1;
> }
> 
> This happens since s390 moved to common gup code with
> commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
> and commit 1a42010cdc26 ("s390/mm: convert to the generic
> get_user_pages_fast code"). s390 tried to mimic static level folding by
> changing pXd_offset primitives to always calculate top level page table
> offset in pgd_offset and just return the value passed when pXd_offset
> has to act as folded.
> 
> What is crucial for gup_fast and what has been overlooked is
> that PxD_SIZE/MASK and thus pXd_addr_end should also change
> correspondingly. And the latter is not possible with dynamic folding.
> 
> To fix the issue in addition to pXd values pass original
> pXdp pointers down to gup_pXd_range functions. And introduce
> pXd_offset_lockless helpers, which take an additional pXd
> entry value parameter. This has already been discussed in
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> 
> Cc: <stable@vger.kernel.org> # 5.2+
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
> ---

Looks cleaner than I'd dared hope for. :)

Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA

> v2: added brackets &pgd -> &(pgd)
> 
>   arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
>   include/linux/pgtable.h         | 10 ++++++++
>   mm/gup.c                        | 18 +++++++-------
>   3 files changed, 49 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 7eb01a5459cd..b55561cc8786 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
>   
>   #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
>   
> -static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
> +static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address)
>   {
> -	if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> -		return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
> -	return (p4d_t *) pgd;
> +	if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> +		return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
> +	return (p4d_t *) pgdp;
>   }
> +#define p4d_offset_lockless p4d_offset_lockless
>   
> -static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> +static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
>   {
> -	if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> -		return (pud_t *) p4d_deref(*p4d) + pud_index(address);
> -	return (pud_t *) p4d;
> +	return p4d_offset_lockless(pgdp, *pgdp, address);
> +}
> +
> +static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address)
> +{
> +	if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> +		return (pud_t *) p4d_deref(p4d) + pud_index(address);
> +	return (pud_t *) p4dp;
> +}
> +#define pud_offset_lockless pud_offset_lockless
> +
> +static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address)
> +{
> +	return pud_offset_lockless(p4dp, *p4dp, address);
>   }
>   #define pud_offset pud_offset
>   
> -static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
> +static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address)
> +{
> +	if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> +		return (pmd_t *) pud_deref(pud) + pmd_index(address);
> +	return (pmd_t *) pudp;
> +}
> +#define pmd_offset_lockless pmd_offset_lockless
> +
> +static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address)
>   {
> -	if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> -		return (pmd_t *) pud_deref(*pud) + pmd_index(address);
> -	return (pmd_t *) pud;
> +	return pmd_offset_lockless(pudp, *pudp, address);
>   }
>   #define pmd_offset pmd_offset
>   
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..90654cb63e9e 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
>   #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
>   #endif
>   
> +#ifndef p4d_offset_lockless
> +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
> +#endif
> +#ifndef pud_offset_lockless
> +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
> +#endif
> +#ifndef pmd_offset_lockless
> +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address)
> +#endif
> +
>   /*
>    * p?d_leaf() - true if this entry is a final mapping to a physical address.
>    * This differs from p?d_huge() by the fact that they are always available (if
> diff --git a/mm/gup.c b/mm/gup.c
> index e5739a1974d5..578bf5bd8bf8 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>   	return 1;
>   }
>   
> -static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> +static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end,
>   		unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	pmd_t *pmdp;
>   
> -	pmdp = pmd_offset(&pud, addr);
> +	pmdp = pmd_offset_lockless(pudp, pud, addr);
>   	do {
>   		pmd_t pmd = READ_ONCE(*pmdp);
>   
> @@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>   	return 1;
>   }
>   
> -static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
> +static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end,
>   			 unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	pud_t *pudp;
>   
> -	pudp = pud_offset(&p4d, addr);
> +	pudp = pud_offset_lockless(p4dp, p4d, addr);
>   	do {
>   		pud_t pud = READ_ONCE(*pudp);
>   
> @@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
>   					 PUD_SHIFT, next, flags, pages, nr))
>   				return 0;
> -		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
> +		} else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr))
>   			return 0;
>   	} while (pudp++, addr = next, addr != end);
>   
>   	return 1;
>   }
>   
> -static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
> +static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end,
>   			 unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	p4d_t *p4dp;
>   
> -	p4dp = p4d_offset(&pgd, addr);
> +	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
>   	do {
>   		p4d_t p4d = READ_ONCE(*p4dp);
>   
> @@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
>   					 P4D_SHIFT, next, flags, pages, nr))
>   				return 0;
> -		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
> +		} else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
>   			return 0;
>   	} while (p4dp++, addr = next, addr != end);
>   
> @@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
>   					 PGDIR_SHIFT, next, flags, pages, nr))
>   				return;
> -		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
> +		} else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr))
>   			return;
>   	} while (pgdp++, addr = next, addr != end);
>   }
> 


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-15 17:31                                       ` John Hubbard
  0 siblings, 0 replies; 313+ messages in thread
From: John Hubbard @ 2020-09-15 17:31 UTC (permalink / raw)
  To: Vasily Gorbik, Jason Gunthorpe
  Cc: Peter Zijlstra, Benjamin Herrenschmidt, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Gerald Schaefer,
	Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Dave Hansen, linux-power, LKML, Michael Ellerman, Andrew Morton,
	Linus Torvalds, Mike Rapoport

On 9/11/20 1:36 PM, Vasily Gorbik wrote:
> Currently, to make sure that every page table entry is read just once,
> the gup_fast walk performs READ_ONCE and passes the pXd value down to
> the next gup_pXd_range function by value, e.g.:
> 
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                           unsigned int flags, struct page **pages, int *nr)
> ...
>          pudp = pud_offset(&p4d, addr);
> 
> This function passes a pointer to that local copy to pXd_offset(), and
> might get that very same pointer back in return. This happens when the
> level is folded (on most arches), and such a pointer must not be iterated.
> 
> On s390, each task may use 5-, 4- or 3-level address translation and
> hence have different levels folded, so the logic is more complex, and a
> non-iterable pointer to a local copy leads to severe problems.
> 
> Here is an example of what happens with gup_fast on s390, for a task
> with 3-level paging, crossing a 2 GB pud boundary:
> 
> // addr = 0x1007ffff000, end = 0x10080001000
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                           unsigned int flags, struct page **pages, int *nr)
> {
>          unsigned long next;
>          pud_t *pudp;
> 
>          // pud_offset returns &p4d itself (a pointer to a value on stack)
>          pudp = pud_offset(&p4d, addr);
>          do {
>                  // on the second iteration, reading a "random" stack value
>                  pud_t pud = READ_ONCE(*pudp);
> 
>                  // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
>                  next = pud_addr_end(addr, end);
>                  ...
>          } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
> 
>          return 1;
> }
> 
> This happens since s390 moved to common gup code with
> commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
> and commit 1a42010cdc26 ("s390/mm: convert to the generic
> get_user_pages_fast code"). s390 tried to mimic static level folding by
> changing pXd_offset primitives to always calculate top level page table
> offset in pgd_offset and just return the value passed when pXd_offset
> has to act as folded.
> 
> What is crucial for gup_fast and what has been overlooked is
> that PxD_SIZE/MASK and thus pXd_addr_end should also change
> correspondingly. And the latter is not possible with dynamic folding.
> 
> To fix the issue in addition to pXd values pass original
> pXdp pointers down to gup_pXd_range functions. And introduce
> pXd_offset_lockless helpers, which take an additional pXd
> entry value parameter. This has already been discussed in
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> 
> Cc: <stable@vger.kernel.org> # 5.2+
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
> ---

Looks cleaner than I'd dared hope for. :)

Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA

> v2: added brackets &pgd -> &(pgd)
> 
>   arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
>   include/linux/pgtable.h         | 10 ++++++++
>   mm/gup.c                        | 18 +++++++-------
>   3 files changed, 49 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 7eb01a5459cd..b55561cc8786 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
>   
>   #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
>   
> -static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
> +static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address)
>   {
> -	if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> -		return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
> -	return (p4d_t *) pgd;
> +	if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> +		return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
> +	return (p4d_t *) pgdp;
>   }
> +#define p4d_offset_lockless p4d_offset_lockless
>   
> -static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> +static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
>   {
> -	if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> -		return (pud_t *) p4d_deref(*p4d) + pud_index(address);
> -	return (pud_t *) p4d;
> +	return p4d_offset_lockless(pgdp, *pgdp, address);
> +}
> +
> +static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address)
> +{
> +	if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> +		return (pud_t *) p4d_deref(p4d) + pud_index(address);
> +	return (pud_t *) p4dp;
> +}
> +#define pud_offset_lockless pud_offset_lockless
> +
> +static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address)
> +{
> +	return pud_offset_lockless(p4dp, *p4dp, address);
>   }
>   #define pud_offset pud_offset
>   
> -static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
> +static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address)
> +{
> +	if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> +		return (pmd_t *) pud_deref(pud) + pmd_index(address);
> +	return (pmd_t *) pudp;
> +}
> +#define pmd_offset_lockless pmd_offset_lockless
> +
> +static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address)
>   {
> -	if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> -		return (pmd_t *) pud_deref(*pud) + pmd_index(address);
> -	return (pmd_t *) pud;
> +	return pmd_offset_lockless(pudp, *pudp, address);
>   }
>   #define pmd_offset pmd_offset
>   
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..90654cb63e9e 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
>   #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
>   #endif
>   
> +#ifndef p4d_offset_lockless
> +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
> +#endif
> +#ifndef pud_offset_lockless
> +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
> +#endif
> +#ifndef pmd_offset_lockless
> +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address)
> +#endif
> +
>   /*
>    * p?d_leaf() - true if this entry is a final mapping to a physical address.
>    * This differs from p?d_huge() by the fact that they are always available (if
> diff --git a/mm/gup.c b/mm/gup.c
> index e5739a1974d5..578bf5bd8bf8 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>   	return 1;
>   }
>   
> -static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> +static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end,
>   		unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	pmd_t *pmdp;
>   
> -	pmdp = pmd_offset(&pud, addr);
> +	pmdp = pmd_offset_lockless(pudp, pud, addr);
>   	do {
>   		pmd_t pmd = READ_ONCE(*pmdp);
>   
> @@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>   	return 1;
>   }
>   
> -static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
> +static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end,
>   			 unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	pud_t *pudp;
>   
> -	pudp = pud_offset(&p4d, addr);
> +	pudp = pud_offset_lockless(p4dp, p4d, addr);
>   	do {
>   		pud_t pud = READ_ONCE(*pudp);
>   
> @@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
>   					 PUD_SHIFT, next, flags, pages, nr))
>   				return 0;
> -		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
> +		} else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr))
>   			return 0;
>   	} while (pudp++, addr = next, addr != end);
>   
>   	return 1;
>   }
>   
> -static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
> +static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end,
>   			 unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	p4d_t *p4dp;
>   
> -	p4dp = p4d_offset(&pgd, addr);
> +	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
>   	do {
>   		p4d_t p4d = READ_ONCE(*p4dp);
>   
> @@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
>   					 P4D_SHIFT, next, flags, pages, nr))
>   				return 0;
> -		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
> +		} else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
>   			return 0;
>   	} while (p4dp++, addr = next, addr != end);
>   
> @@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
>   					 PGDIR_SHIFT, next, flags, pages, nr))
>   				return;
> -		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
> +		} else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr))
>   			return;
>   	} while (pgdp++, addr = next, addr != end);
>   }
> 



^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
@ 2020-09-15 17:31                                       ` John Hubbard
  0 siblings, 0 replies; 313+ messages in thread
From: John Hubbard @ 2020-09-15 17:31 UTC (permalink / raw)
  To: Vasily Gorbik, Jason Gunthorpe
  Cc: Peter Zijlstra, Benjamin Herrenschmidt, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Gerald Schaefer,
	Heiko Carstens, Arnd Bergmann, Jeff Dike, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Dave Hansen, linux-power, LKML, Michael Ellerman, Andrew Morton,
	Linus Torvalds, Mike Rapoport

On 9/11/20 1:36 PM, Vasily Gorbik wrote:
> Currently, to make sure that every page table entry is read just once,
> the gup_fast walk performs READ_ONCE and passes the pXd value down to
> the next gup_pXd_range function by value, e.g.:
> 
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                           unsigned int flags, struct page **pages, int *nr)
> ...
>          pudp = pud_offset(&p4d, addr);
> 
> This function passes a pointer to that local copy to pXd_offset(), and
> might get that very same pointer back in return. This happens when the
> level is folded (on most arches), and such a pointer must not be iterated.
> 
> On s390, each task may use 5-, 4- or 3-level address translation and
> hence have different levels folded, so the logic is more complex, and a
> non-iterable pointer to a local copy leads to severe problems.
> 
> Here is an example of what happens with gup_fast on s390, for a task
> with 3-level paging, crossing a 2 GB pud boundary:
> 
> // addr = 0x1007ffff000, end = 0x10080001000
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                           unsigned int flags, struct page **pages, int *nr)
> {
>          unsigned long next;
>          pud_t *pudp;
> 
>          // pud_offset returns &p4d itself (a pointer to a value on stack)
>          pudp = pud_offset(&p4d, addr);
>          do {
>                  // on the second iteration, reading a "random" stack value
>                  pud_t pud = READ_ONCE(*pudp);
> 
>                  // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
>                  next = pud_addr_end(addr, end);
>                  ...
>          } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
> 
>          return 1;
> }
> 
> This happens since s390 moved to common gup code with
> commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
> and commit 1a42010cdc26 ("s390/mm: convert to the generic
> get_user_pages_fast code"). s390 tried to mimic static level folding by
> changing pXd_offset primitives to always calculate top level page table
> offset in pgd_offset and just return the value passed when pXd_offset
> has to act as folded.
> 
> What is crucial for gup_fast and what has been overlooked is
> that PxD_SIZE/MASK and thus pXd_addr_end should also change
> correspondingly. And the latter is not possible with dynamic folding.
> 
> To fix the issue in addition to pXd values pass original
> pXdp pointers down to gup_pXd_range functions. And introduce
> pXd_offset_lockless helpers, which take an additional pXd
> entry value parameter. This has already been discussed in
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> 
> Cc: <stable@vger.kernel.org> # 5.2+
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
> ---

Looks cleaner than I'd dared hope for. :)

Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA

> v2: added brackets &pgd -> &(pgd)
> 
>   arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
>   include/linux/pgtable.h         | 10 ++++++++
>   mm/gup.c                        | 18 +++++++-------
>   3 files changed, 49 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 7eb01a5459cd..b55561cc8786 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
>   
>   #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
>   
> -static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
> +static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address)
>   {
> -	if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> -		return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
> -	return (p4d_t *) pgd;
> +	if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> +		return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
> +	return (p4d_t *) pgdp;
>   }
> +#define p4d_offset_lockless p4d_offset_lockless
>   
> -static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> +static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
>   {
> -	if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> -		return (pud_t *) p4d_deref(*p4d) + pud_index(address);
> -	return (pud_t *) p4d;
> +	return p4d_offset_lockless(pgdp, *pgdp, address);
> +}
> +
> +static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address)
> +{
> +	if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> +		return (pud_t *) p4d_deref(p4d) + pud_index(address);
> +	return (pud_t *) p4dp;
> +}
> +#define pud_offset_lockless pud_offset_lockless
> +
> +static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address)
> +{
> +	return pud_offset_lockless(p4dp, *p4dp, address);
>   }
>   #define pud_offset pud_offset
>   
> -static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
> +static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address)
> +{
> +	if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> +		return (pmd_t *) pud_deref(pud) + pmd_index(address);
> +	return (pmd_t *) pudp;
> +}
> +#define pmd_offset_lockless pmd_offset_lockless
> +
> +static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address)
>   {
> -	if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> -		return (pmd_t *) pud_deref(*pud) + pmd_index(address);
> -	return (pmd_t *) pud;
> +	return pmd_offset_lockless(pudp, *pudp, address);
>   }
>   #define pmd_offset pmd_offset
>   
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..90654cb63e9e 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
>   #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
>   #endif
>   
> +#ifndef p4d_offset_lockless
> +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
> +#endif
> +#ifndef pud_offset_lockless
> +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
> +#endif
> +#ifndef pmd_offset_lockless
> +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address)
> +#endif
> +
>   /*
>    * p?d_leaf() - true if this entry is a final mapping to a physical address.
>    * This differs from p?d_huge() by the fact that they are always available (if
> diff --git a/mm/gup.c b/mm/gup.c
> index e5739a1974d5..578bf5bd8bf8 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>   	return 1;
>   }
>   
> -static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> +static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end,
>   		unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	pmd_t *pmdp;
>   
> -	pmdp = pmd_offset(&pud, addr);
> +	pmdp = pmd_offset_lockless(pudp, pud, addr);
>   	do {
>   		pmd_t pmd = READ_ONCE(*pmdp);
>   
> @@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>   	return 1;
>   }
>   
> -static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
> +static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end,
>   			 unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	pud_t *pudp;
>   
> -	pudp = pud_offset(&p4d, addr);
> +	pudp = pud_offset_lockless(p4dp, p4d, addr);
>   	do {
>   		pud_t pud = READ_ONCE(*pudp);
>   
> @@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
>   					 PUD_SHIFT, next, flags, pages, nr))
>   				return 0;
> -		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
> +		} else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr))
>   			return 0;
>   	} while (pudp++, addr = next, addr != end);
>   
>   	return 1;
>   }
>   
> -static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
> +static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end,
>   			 unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	p4d_t *p4dp;
>   
> -	p4dp = p4d_offset(&pgd, addr);
> +	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
>   	do {
>   		p4d_t p4d = READ_ONCE(*p4dp);
>   
> @@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
>   					 P4D_SHIFT, next, flags, pages, nr))
>   				return 0;
> -		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
> +		} else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
>   			return 0;
>   	} while (p4dp++, addr = next, addr != end);
>   
> @@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
>   					 PGDIR_SHIFT, next, flags, pages, nr))
>   				return;
> -		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
> +		} else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr))
>   			return;
>   	} while (pgdp++, addr = next, addr != end);
>   }
> 




^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-11 20:36                                     ` Vasily Gorbik
                                                       ` (7 preceding siblings ...)
  (?)
@ 2020-09-17 15:53                                     ` Sasha Levin
  2020-09-18 15:15                                       ` Vasily Gorbik
  -1 siblings, 1 reply; 313+ messages in thread
From: Sasha Levin @ 2020-09-17 15:53 UTC (permalink / raw)
  To: Sasha Levin, Vasily Gorbik, Jason Gunthorpe, John Hubbard
  Cc: Linus Torvalds, stable

Hi

[This is an automated email]

This commit has been processed because it contains a "Fixes:" tag
fixing commit: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code").

The bot has tested the following trees: v5.8.9, v5.4.65.

v5.8.9: Build OK!
v5.4.65: Failed to apply! Possible dependencies:
    051a7a94aaa9 ("arm64: hibernate: use get_safe_page directly")
    13373f0e6580 ("arm64: hibernate: rename dst to page in create_safe_exec_page")
    48c963e31bc6 ("KVM: arm/arm64: Release kvm->mmu_lock in loop to prevent starvation")
    68ecabd0e680 ("arm64/mm: Use phys_to_page() to access pgtable memory")
    8a0af66b35f8 ("arm: mm: add p?d_leaf() definitions")
    974b9b2c68f3 ("mm: consolidate pte_index() and pte_offset_*() definitions")
    a2c2e67923ec ("arm64: hibernate: add trans_pgd public functions")
    a89d7ff933b0 ("arm64: hibernate: remove gotos as they are not needed")
    d234332c2815 ("arm64: hibernate: pass the allocated pgdp to ttbr0")
    e9f6376858b9 ("arm64: add support for folded p4d page tables")


NOTE: The patch will not be queued to stable trees until it is upstream.

How should we proceed with this patch?

-- 
Thanks
Sasha

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-17 15:53                                     ` Sasha Levin
@ 2020-09-18 15:15                                       ` Vasily Gorbik
  2020-09-18 15:15                                         ` [PATCH stable-5.4.y backport] " Vasily Gorbik
  0 siblings, 1 reply; 313+ messages in thread
From: Vasily Gorbik @ 2020-09-18 15:15 UTC (permalink / raw)
  To: Sasha Levin; +Cc: Jason Gunthorpe, John Hubbard, Linus Torvalds, stable

On Thu, Sep 17, 2020 at 03:53:33PM +0000, Sasha Levin wrote:
> Hi
> 
> [This is an automated email]
> 
> This commit has been processed because it contains a "Fixes:" tag
> fixing commit: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code").
> 
> The bot has tested the following trees: v5.8.9, v5.4.65.
> 
> v5.8.9: Build OK!
> v5.4.65: Failed to apply! Possible dependencies:
>     051a7a94aaa9 ("arm64: hibernate: use get_safe_page directly")
>     13373f0e6580 ("arm64: hibernate: rename dst to page in create_safe_exec_page")
>     48c963e31bc6 ("KVM: arm/arm64: Release kvm->mmu_lock in loop to prevent starvation")
>     68ecabd0e680 ("arm64/mm: Use phys_to_page() to access pgtable memory")
>     8a0af66b35f8 ("arm: mm: add p?d_leaf() definitions")
>     974b9b2c68f3 ("mm: consolidate pte_index() and pte_offset_*() definitions")
>     a2c2e67923ec ("arm64: hibernate: add trans_pgd public functions")
>     a89d7ff933b0 ("arm64: hibernate: remove gotos as they are not needed")
>     d234332c2815 ("arm64: hibernate: pass the allocated pgdp to ttbr0")
>     e9f6376858b9 ("arm64: add support for folded p4d page tables")
> 
> 
> NOTE: The patch will not be queued to stable trees until it is upstream.
> 
> How should we proceed with this patch?

The patch contains Cc: <stable@vger.kernel.org> [5.2+], so a stable backport to
v5.4.y is also desired once it is upstream. The patch is self-contained and has
no dependencies.

 arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
 include/asm-generic/pgtable.h   | 10 ++++++++
 mm/gup.c                        | 18 +++++++-------
 3 files changed, 49 insertions(+), 21 deletions(-)

There are 2 trivial merge conflicts, both in include/asm-generic/pgtable.h:
1. commit ca5999fde0a1 ("mm: introduce include/linux/pgtable.h")
   since v5.8
   renamed include/asm-generic/pgtable.h => include/linux/pgtable.h

2. commit 93fab1b22ef7 ("mm: add generic p?d_leaf() macros")
   since v5.6
   introduced the p?d_leaf() macros after mm_pmd_folded, so the bottom context
   of the hunk does not match here:

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8cbc2e795d5..90654cb63e9e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
 #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
 #endif
 
+#ifndef p4d_offset_lockless
+#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
+#endif
+#ifndef pud_offset_lockless
+#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
+#endif
+#ifndef pmd_offset_lockless
+#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address)
+#endif
+
 /*
  * p?d_leaf() - true if this entry is a final mapping to a physical address.
  * This differs from p?d_huge() by the fact that they are always available (if

The patch for a stable-5.4.y backport is sent as a follow-up.

Thank you,
Vasily


^ permalink raw reply related	[flat|nested] 313+ messages in thread

* [PATCH stable-5.4.y backport] mm/gup: fix gup_fast with dynamic page table folding
  2020-09-18 15:15                                       ` Vasily Gorbik
@ 2020-09-18 15:15                                         ` Vasily Gorbik
  0 siblings, 0 replies; 313+ messages in thread
From: Vasily Gorbik @ 2020-09-18 15:15 UTC (permalink / raw)
  To: Sasha Levin; +Cc: Jason Gunthorpe, John Hubbard, Linus Torvalds, stable

Currently, to make sure that every page table entry is read just once,
gup_fast walks perform READ_ONCE and pass the pXd value down to the next
gup_pXd_range function by value, e.g.:

static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
...
        pudp = pud_offset(&p4d, addr);

This function passes a reference to that local value copy to pXd_offset,
and might get the very same pointer back in return.  This happens when the
level is folded (on most arches), and that pointer should not be iterated.
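
For illustration, a folded-level pXd_offset is roughly the sketch below
(simplified, modeled on the generic no-p4d case; not the exact header text):
it hands the caller's pointer straight back, so the result may alias a local
variable of the caller and must not be stepped like an array:

static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
{
	/* folded level: the "p4d table" is the pgd entry itself, so the
	 * pointer passed in (possibly &local_copy) is returned unchanged */
	return (p4d_t *) pgd;
}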

On s390, because each task might use a different 5-, 4- or 3-level address
translation and hence have different levels folded, the logic is more
complex, and a non-iterable pointer to a local copy leads to severe
problems.

Here is an example of what happens with gup_fast on s390, for a task with
3-level paging, crossing a 2 GB pud boundary:

// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
{
        unsigned long next;
        pud_t *pudp;

        // pud_offset returns &p4d itself (a pointer to a value on stack)
        pudp = pud_offset(&p4d, addr);
        do {
                // on second iteration reading "random" stack value
                pud_t pud = READ_ONCE(*pudp);

                // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                next = pud_addr_end(addr, end);
                ...
        } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

        return 1;
}

This happens since s390 moved to common gup code with commit d1874a0c2805
("s390/mm: make the pxd_offset functions more robust") and commit
1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code").
s390 tried to mimic static level folding by changing the pXd_offset
primitives to always calculate the top-level page table offset in pgd_offset
and to just return the value passed in when pXd_offset has to act as folded.

What is crucial for gup_fast, and what has been overlooked, is that
PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly,
and the latter is not possible with dynamic folding.
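
For reference, the generic pud_addr_end() is roughly the build-time construct
below (a simplified sketch, not the exact header text); PUD_SIZE and PUD_MASK
are compile-time constants, so the step width cannot follow a level that is
folded per task at runtime:

#define pud_addr_end(addr, end)						\
({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
	(__boundary - 1 < (end) - 1) ? __boundary : (end);		\
})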

To fix the issue, in addition to the pXd values, pass the original pXdp
pointers down to the gup_pXd_range functions, and introduce
pXd_offset_lockless helpers, which take an additional pXd entry value
parameter.  This has already been discussed in
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1

Link: https://lkml.kernel.org/r/patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours
Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: <stable@vger.kernel.org>	[5.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
 include/asm-generic/pgtable.h   | 10 ++++++++
 mm/gup.c                        | 18 +++++++-------
 3 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 34a655ad7123..a03862579d6b 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1247,25 +1247,43 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
 #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
 #define pgd_offset_k(address) pgd_offset(&init_mm, address)
 
-static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
+static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address)
 {
-	if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
-		return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
-	return (p4d_t *) pgd;
+	if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
+		return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
+	return (p4d_t *) pgdp;
 }
+#define p4d_offset_lockless p4d_offset_lockless
 
-static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
+static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
 {
-	if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
-		return (pud_t *) p4d_deref(*p4d) + pud_index(address);
-	return (pud_t *) p4d;
+	return p4d_offset_lockless(pgdp, *pgdp, address);
 }
 
-static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
+static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address)
 {
-	if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
-		return (pmd_t *) pud_deref(*pud) + pmd_index(address);
-	return (pmd_t *) pud;
+	if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
+		return (pud_t *) p4d_deref(p4d) + pud_index(address);
+	return (pud_t *) p4dp;
+}
+#define pud_offset_lockless pud_offset_lockless
+
+static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address)
+{
+	return pud_offset_lockless(p4dp, *p4dp, address);
+}
+
+static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address)
+{
+	if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
+		return (pmd_t *) pud_deref(pud) + pmd_index(address);
+	return (pmd_t *) pudp;
+}
+#define pmd_offset_lockless pmd_offset_lockless
+
+static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address)
+{
+	return pmd_offset_lockless(pudp, *pudp, address);
 }
 
 static inline pte_t *pte_offset(pmd_t *pmd, unsigned long address)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 818691846c90..1423f08c6ba9 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1187,4 +1187,14 @@ static inline bool arch_has_pfn_modify_check(void)
 #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
 #endif
 
+#ifndef p4d_offset_lockless
+#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
+#endif
+#ifndef pud_offset_lockless
+#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
+#endif
+#ifndef pmd_offset_lockless
+#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address)
+#endif
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
diff --git a/mm/gup.c b/mm/gup.c
index 4a8e969a6e59..3ef769529548 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2184,13 +2184,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	return 1;
 }
 
-static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end,
 		unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	pmd_t *pmdp;
 
-	pmdp = pmd_offset(&pud, addr);
+	pmdp = pmd_offset_lockless(pudp, pud, addr);
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
@@ -2227,13 +2227,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	return 1;
 }
 
-static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
+static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	pud_t *pudp;
 
-	pudp = pud_offset(&p4d, addr);
+	pudp = pud_offset_lockless(p4dp, p4d, addr);
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
@@ -2248,20 +2248,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
 					 PUD_SHIFT, next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
+		} else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr))
 			return 0;
 	} while (pudp++, addr = next, addr != end);
 
 	return 1;
 }
 
-static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
+static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	p4d_t *p4dp;
 
-	p4dp = p4d_offset(&pgd, addr);
+	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
@@ -2273,7 +2273,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
 					 P4D_SHIFT, next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
+		} else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
 			return 0;
 	} while (p4dp++, addr = next, addr != end);
 
@@ -2301,7 +2301,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
 					 PGDIR_SHIFT, next, flags, pages, nr))
 				return;
-		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
+		} else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr))
 			return;
 	} while (pgdp++, addr = next, addr != end);
 }
-- 
⣿⣿⣿⣿⢋⡀⣀⠹⣿⣿⣿⣿
⣿⣿⣿⣿⠠⣶⡦⠀⣿⣿⣿⣿
⣿⣿⣿⠏⣴⣮⣴⣧⠈⢿⣿⣿
⣿⣿⡏⢰⣿⠖⣠⣿⡆⠈⣿⣿
⣿⢛⣵⣄⠙⣶⣶⡟⣅⣠⠹⣿
⣿⣜⣛⠻⢎⣉⣉⣀⠿⣫⣵⣿

^ permalink raw reply related	[flat|nested] 313+ messages in thread

end of thread, other threads:[~2020-09-18 15:15 UTC | newest]

Thread overview: 313+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-07 18:00 [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding Gerald Schaefer
2020-09-07 18:00 ` [RFC PATCH v2 1/3] " Gerald Schaefer
2020-09-08  5:06   ` Christophe Leroy
2020-09-08 12:09     ` Christian Borntraeger
2020-09-08 12:40       ` Christophe Leroy
2020-09-08 13:38         ` Gerald Schaefer
2020-09-08 14:30   ` Dave Hansen
2020-09-08 17:59     ` Gerald Schaefer
2020-09-09 12:29     ` Gerald Schaefer
2020-09-09 16:18       ` Dave Hansen
2020-09-09 17:25         ` Gerald Schaefer
2020-09-09 18:03           ` Jason Gunthorpe
2020-09-10  9:39             ` Alexander Gordeev
2020-09-10 13:02               ` Jason Gunthorpe
2020-09-10 13:28                 ` Gerald Schaefer
2020-09-10 15:10                   ` Jason Gunthorpe
2020-09-10 17:07                     ` Gerald Schaefer
2020-09-10 17:19                       ` Jason Gunthorpe
2020-09-10 17:57                 ` Gerald Schaefer
2020-09-10 23:21                   ` Jason Gunthorpe
2020-09-10 17:35               ` Linus Torvalds
2020-09-10 18:13                 ` Jason Gunthorpe
2020-09-10 18:33                   ` Linus Torvalds
2020-09-10 19:10                     ` Gerald Schaefer
2020-09-10 19:32                       ` Linus Torvalds
2020-09-10 21:59                         ` Jason Gunthorpe
2020-09-11  7:09                           ` peterz
2020-09-11 11:19                             ` Jason Gunthorpe
2020-09-11 19:03                             ` [PATCH] " Vasily Gorbik
2020-09-11 19:09                               ` Linus Torvalds
2020-09-11 19:40                               ` Jason Gunthorpe
2020-09-11 20:05                                 ` Jason Gunthorpe
2020-09-11 20:36                                   ` [PATCH v2] " Vasily Gorbik
2020-09-15 17:09                                     ` Vasily Gorbik
2020-09-15 17:14                                     ` Jason Gunthorpe
2020-09-15 17:18                                     ` Mike Rapoport
2020-09-15 17:31                                     ` John Hubbard
2020-09-17 15:53                                     ` Sasha Levin
2020-09-18 15:15                                       ` Vasily Gorbik
2020-09-18 15:15                                         ` [PATCH stable-5.4.y backport] " Vasily Gorbik
2020-09-10 21:22                   ` [RFC PATCH v2 1/3] " John Hubbard
2020-09-10 22:11                     ` Jason Gunthorpe
2020-09-10 22:17                       ` John Hubbard
2020-09-11 12:19                       ` Alexander Gordeev
2020-09-11 16:45                         ` Linus Torvalds
2020-09-10 13:11             ` Gerald Schaefer
2020-09-07 18:00 ` [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware Gerald Schaefer
2020-09-08  5:14   ` Christophe Leroy
2020-09-08  7:46     ` Alexander Gordeev
2020-09-08  8:16       ` Christophe Leroy
2020-09-08 14:15         ` Alexander Gordeev
2020-09-09  8:38           ` Christophe Leroy
2020-09-08 14:25     ` Alexander Gordeev
2020-09-08 13:26   ` Jason Gunthorpe
2020-09-08 14:33   ` Dave Hansen
2020-09-07 18:00 ` [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions Gerald Schaefer
2020-09-07 20:15   ` Mike Rapoport
2020-09-08  5:19   ` Christophe Leroy
2020-09-08 15:48     ` Alexander Gordeev
2020-09-08 17:20       ` Christophe Leroy
2020-09-08 16:05   ` kernel test robot
2020-09-07 20:12 ` [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding Mike Rapoport
2020-09-08  5:22   ` Christophe Leroy
2020-09-08 17:36     ` Gerald Schaefer
2020-09-09 16:12       ` Gerald Schaefer
2020-09-08  4:42 ` Christophe Leroy
